vollo_compiler
- class vollo_compiler.Config
A Vollo accelerator configuration.
Each Vollo bitstream contains a specific configuration of the Vollo accelerator, e.g. number of cores, size of each core, etc.
Programs need to be compiled for the accelerator configuration that they will be run on.
For the bitstreams included in the Vollo SDK, use the preset configs ia_420f_c6b32() and ia_840f_c3b64().
- static ia_420f_c6b32()
IA-420f configuration with 6 cores and block size 32
- static ia_840f_c3b64()
IA-840f configuration with 3 cores and block size 64
- save(json_path: str) None
Save a hardware configuration to a JSON file
- static load(json_path: str, check_version_matches: bool = True) Config
Load a hardware configuration from a JSON file
- Parameters:
json_path – Path to the JSON file
check_version_matches – if the provided JSON file is versioned, check that the version matches the current version
- block_size
Size of the tensor blocks
- num_cores
Number of cores
- tensor_ram_depth
Amount of tensor RAM per-core (in blocks)
- tensor_descriptor_count
Maximum number of tensors per core
- weight_store_depth
Amount of weight store per-core (in blocks)
- accum_store_depth
Amount of accumulator store per-core (in blocks)
- cell_state_depth
Amount of LSTM cell state per-core (in blocks)
- clamp_store_depth
Amount of clamp store per-core (i.e. the maximum number of different clamp configurations that can be used on a single core)
- max_read_size
Maximum size of data that instructions can perform operations on (in blocks)
- io_size
Minimum size of IO packet (in values)
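A minimal sketch of the save/load pair described above. The helper name `roundtrip_config` is hypothetical and not part of vollo_compiler; it relies only on the `save(json_path)` and static `load(json_path, check_version_matches=True)` signatures listed for Config, so it works with any object exposing them.

```python
def roundtrip_config(config, json_path):
    """Save `config` to a JSON file and load it back.

    Hypothetical helper: assumes only `config.save(json_path)` and a static
    `load(json_path, check_version_matches=True)` on the config's class.
    """
    config.save(json_path)
    # `load` is static, so it is looked up on the class rather than the instance.
    return type(config).load(json_path, check_version_matches=True)
```

With the SDK installed, this might be called as `roundtrip_config(vollo_compiler.Config.ia_420f_c6b32(), "config.json")`, after which attributes such as `num_cores` and `block_size` can be compared against the original.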
- class vollo_compiler.NNIR
Neural Network Intermediate Representation
The representation of neural networks in the Vollo compiler. It can be built from a PyTorch model using vollo_torch.fx.nnir.to_nnir(), or from an ONNX model.
- static from_onnx(onnx_path: str, override_input_shapes: Sequence[int] | Sequence[Sequence[int]] | None) NNIR
Load an ONNX model from a file and convert it to an NNIR graph
- Parameters:
onnx_path – The path to the ONNX file.
override_input_shapes (Optional[Sequence[int] | Sequence[Sequence[int]]]) – If the model has dynamic input shapes, you must pass fixed input shapes. Can either be a Sequence[int] for single-input models, or a Sequence[Sequence[int]] for multi-input models (also works for single-input models). If this argument is used, you must provide a shape for each model input.
- streaming_transform(streaming_axes: int | Sequence[int])
Performs the streaming transform, converting the NNIR to a streaming model
- Parameters:
streaming_axes (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models (also works for single-input models). There should be one axis given for each model input.
- Returns:
The transformed NNIR graph, and the output streaming axis or axes (an int for single-output models, or a tuple of ints for multi-output models).
- to_program(config: Config, name: str | None = None, *, optimize_transforms: bool = True, output_buffer_capacity: int = 64, write_queue_capacity: int = 32) Program
Compile an NNIR graph to a Program.
Note that the NNIR model given must be a streaming model.
- Parameters:
config – The hardware configuration to compile the program for
optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not
output_buffer_capacity – The size of the output buffer in the VM (only used when optimize_transforms is True)
write_queue_capacity – The size of the write queue in the VM (only used when optimize_transforms is True)
name – The name of the program
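The three NNIR methods above can be chained into one compile step. The following is a sketch under stated assumptions: `compile_onnx_to_program` is a hypothetical helper name, the model is assumed to have a single input, and `streaming_transform` is assumed to return a 2-tuple matching its Returns entry.

```python
def compile_onnx_to_program(onnx_path, input_shape, streaming_axis, name=None):
    """Compile a single-input ONNX model for the IA-420f preset.

    Hypothetical helper. Returns the compiled Program and the output
    streaming axis.
    """
    # Imported lazily so the function can be defined without the Vollo SDK.
    import vollo_compiler

    config = vollo_compiler.Config.ia_420f_c6b32()

    # override_input_shapes has no default in the documented signature, so it
    # is always passed; use None when the model has no dynamic input shapes.
    nnir = vollo_compiler.NNIR.from_onnx(onnx_path, input_shape)

    # to_program requires a streaming model, so apply the transform first.
    # Assumption: the result unpacks as (transformed NNIR, output axis).
    nnir, output_axis = nnir.streaming_transform(streaming_axis)

    # Arguments after `name` are keyword-only in the documented signature.
    program = nnir.to_program(config, name, optimize_transforms=True)
    return program, output_axis
```

For a multi-input model, `input_shape` would become a sequence of shapes and `streaming_axis` a sequence of ints, one per input.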
- __new__(**kwargs)
- class vollo_compiler.Program
A program which can be run on a VM, or can be used with the Vollo Runtime to put the program onto hardware.
- compute_duration_per_inference_us(clock_mhz: int = 320, write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0) float
Translate the program’s cycle count per inference to a figure in microseconds by dividing it by the clock speed.
- Parameters:
clock_mhz – Clock frequency of the Vollo accelerator in MHz.
- cycle_count_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0) int
The number of cycles the program takes in one inference.
- cycle_summary_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0)
A summary of the number of cycles the program takes, counted from the first input word.
- Returns:
A tuple containing (cycles_to_last_input, cycles_to_first_output, cycles_to_last_output)
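The cycle-to-microsecond conversion used by compute_duration_per_inference_us is a plain division, since a clock of N MHz completes N cycles per microsecond. The sketch below applies the same arithmetic to the summary tuple; both function names are hypothetical and not part of vollo_compiler.

```python
def cycles_to_us(cycles, clock_mhz=320):
    """Convert a cycle count to microseconds by dividing by the clock in MHz."""
    return cycles / clock_mhz


def summary_to_us(summary, clock_mhz=320):
    """Convert a (to_last_input, to_first_output, to_last_output) cycle tuple."""
    return tuple(cycles_to_us(c, clock_mhz) for c in summary)
```

For example, 640 cycles at the default 320 MHz corresponds to 2.0 microseconds.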
- static io_only_test(config: Config, input_values: int, output_values: int) Program
Make a new program that does no compute and arranges IO such that output only starts when all the input is available on the accelerator
- model_input_shape(model_index: int = 0, input_index: int = 0) Tuple[int]
Get the shape of the input at input_index in model at model_index
- model_input_streaming_dim(model_index: int = 0, input_index: int = 0) int | None
Get the streaming dimension of the input at input_index in model at model_index
- model_num_inputs(model_index: int = 0) int
Get the number of inputs that the model at model_index uses.
- model_num_outputs(model_index: int = 0) int
Get the number of outputs that the model at model_index uses.
- model_output_shape(model_index: int = 0, output_index: int = 0) Tuple[int]
Get the shape of the output at output_index in model at model_index
- model_output_streaming_dim(model_index: int = 0, output_index: int = 0) int | None
Get the streaming dimension of the output at output_index in model at model_index
- num_models() int
The number of models in the program.
- save(output_path: str)
- save_bytes() bytes
- to_vm(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, bit_accurate: bool = True, no_compute=False) VM
Construct a stateful Virtual Machine for simulating a Vollo Program.
- Parameters:
bit_accurate (bool) – Use a compute model that replicates the VOLLO accelerator with bit-accuracy. Disable to use single precision compute. Defaults to True.
no_compute (bool) – If True, the VM will not do any computations, and will just move data. (When set to True, the bit_accurate parameter is ignored.) This is useful for when you only need the cycle count of running a program as it is faster than doing the computations. Defaults to False.
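When only timing matters, the no_compute option described above can be combined with the VM's cycle counters. A minimal sketch, assuming a single inference is enough to measure; `cycle_count_without_compute` is a hypothetical helper name.

```python
def cycle_count_without_compute(program, inputs, clock_mhz=320, model_index=0):
    """Run one inference on a data-movement-only VM and report its cost.

    Hypothetical helper. Returns (cycles, duration in microseconds).
    """
    # no_compute=True skips the arithmetic; bit_accurate is then ignored.
    vm = program.to_vm(no_compute=True)
    # The returned arrays are discarded here: only the cycle counters are used.
    vm.run(inputs, model_index)
    return vm.cycle_count(), vm.compute_duration_us(clock_mhz)
```

Because the VM is stateful and its cycle count accumulates across inferences, a fresh VM is built on every call.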
- class vollo_compiler.ProgramBuilder
Tracks an internal list of NNIRs which can be compiled into a single multi-model program
- add_nnir(nnir: NNIR, name: str | None = None, *, optimize_transforms: bool = False, output_buffer_capacity: int = 64, write_queue_capacity: int = 32)
Adds a model program compiled from an NNIR to the ProgramBuilder
- Parameters:
nnir – The NNIR to add
optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not
output_buffer_capacity – The size of the output buffer in the VM (only used if optimize_transforms is True)
write_queue_capacity – The size of the write queue in the VM (only used if optimize_transforms is True)
name – The name of the model
- to_program()
Builds a program from the internal NNIRs
- Parameters:
config – The config describing resources available for the final program
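A small sketch of adding several models to a builder. How a ProgramBuilder is constructed is not shown above, so the hypothetical helper `add_models` takes an existing builder and uses only the documented add_nnir signature.

```python
def add_models(builder, nnirs_by_name):
    """Add each (name, NNIR) pair to an existing ProgramBuilder.

    Hypothetical helper: `builder` is any object with the documented
    `add_nnir(nnir, name=None, *, optimize_transforms=False, ...)` method.
    """
    for name, nnir in nnirs_by_name.items():
        # `name` may be passed positionally; the arguments after it in the
        # documented signature are keyword-only.
        builder.add_nnir(nnir, name, optimize_transforms=False)
    return builder
```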
- class vollo_compiler.Metrics
Static metrics of a program.
- clamp_store_depth
Total amount of clamp store available on each core
- clamp_store_used
Amount of clamp store used by the program on each core
- input_bytes
Number of bytes input per-inference for each model
- model_names
The name of each model if specified
- num_instrs
Number of instructions on each core
- output_bytes
Number of bytes output per-inference for each model
- tensor_ram_depth
Total amount of tensor ram available on each core
- tensor_ram_used
Tensor ram used by the program on each core
- weight_store_depth
Total amount of weight store available on each core
- weight_store_used
Amount of weight store used by the program on each core
- class vollo_compiler.VM
- compute_duration_us(clock_mhz: int = 320) float
Translate the VM’s cycle count to a figure in microseconds by dividing it by the clock speed.
- Parameters:
clock_mhz – Clock frequency of the Vollo accelerator in MHz.
- cycle_count() int
The number of cycles that have been performed so far on the VM across all inferences.
- run(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on a shaped input.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
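The f32_round option chooses between rounding and truncating when f32 inputs are narrowed to bf16. The sketch below shows the difference on raw bit patterns using only the standard library. It is an illustration, not the VM's implementation: the rounding mode is not stated above, so round-to-nearest-even is assumed, and all function names are hypothetical.

```python
import struct


def _f32_bits(x):
    """Reinterpret a Python float as its IEEE 754 single-precision bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits):
    """Reinterpret 32 bits as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_truncate(x):
    """Keep the top 16 bits of the f32 pattern and drop the rest."""
    return _bits_f32(_f32_bits(x) & 0xFFFF0000)


def to_bf16_round(x):
    """Round to the nearest bf16 value, ties to even (assumed mode).

    NaN inputs are not handled specially in this sketch.
    """
    bits = _f32_bits(x)
    # Add just under half of the discarded range, plus one more when the
    # lowest kept bit is set, so that exact ties land on an even mantissa.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return _bits_f32(bits & 0xFFFF0000)
```

For instance, 1.005859375 lies three quarters of the way from 1.0 to the next bf16 value, 1.0078125, so truncation yields 1.0 while rounding yields 1.0078125.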
- run_flat(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on a 1D input.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
- run_flat_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on multiple timesteps of inputs.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.
output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
- run_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on multiple timesteps with shaped inputs.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.
output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
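Conceptually, the timestep variants split the input along input_timestep_dims, run one inference per slice, and stack the results along output_timestep_dims. The sketch below spells this out with NumPy for a single-input, single-output model. It is an illustration of the described behaviour, not the VM's implementation; the helper name `run_timesteps_by_hand` is hypothetical, and each timestep is assumed to be the input with the timestep axis removed.

```python
import numpy as np


def run_timesteps_by_hand(run_step, inputs, input_timestep_dim, output_timestep_dim):
    """Split `inputs` into timesteps, run each one, and stack the outputs.

    `run_step` is any callable taking one timestep and returning one output
    array, e.g. `vm.run` for a single-input, single-output model.
    """
    # Move the timestep axis to the front so iteration yields one timestep
    # at a time, with that axis removed.
    steps = np.moveaxis(np.asarray(inputs), input_timestep_dim, 0)
    outputs = [np.asarray(run_step(step)) for step in steps]
    # Stack the per-timestep outputs along the requested output dimension.
    return np.stack(outputs, axis=output_timestep_dim)
```

Passing a stand-in such as `lambda x: 2 * x` for `run_step` makes it easy to check the axis handling before involving a real VM.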
- exception vollo_compiler.AllocationError
Failed to allocate memory during compilation.
This can happen if a model requires more space to store weights/activations, etc. than is available for the accelerator configuration.
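One way to use this exception is to check whether a model fits a given configuration before committing to it. A minimal sketch: `fits_on` is a hypothetical helper name, and it treats any other exception as a real error rather than a capacity problem.

```python
def fits_on(nnir, config):
    """Return True if `nnir` compiles for `config`, False on AllocationError.

    Hypothetical helper: `nnir` must already be a streaming model, as
    required by to_program.
    """
    # Imported lazily so the function can be defined without the Vollo SDK.
    import vollo_compiler

    try:
        nnir.to_program(config)
    except vollo_compiler.AllocationError:
        # The model needs more memory than this configuration provides.
        return False
    return True
```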
- exception vollo_compiler.SaveError
Failed to save program.
- exception vollo_compiler.LoadError
Failed to load program.