vollo_compiler
- class vollo_compiler.Config
A Vollo accelerator configuration.
Each Vollo bitstream contains a specific configuration of the Vollo accelerator, e.g. number of cores, size of each core, etc.
Programs need to be compiled for the accelerator configuration that they will be run on.
For the bitstreams included in the Vollo SDK, use the preset configs ia_420f_c6b32(), ia_840f_c3b64(), ia_840f_c2b64d(), v80_c6b32().
- static ia_420f_c6b32()
IA-420f configuration with 6 cores and block size 32
Supports up to 3M parameters
- static ia_840f_c3b64()
IA-840f configuration with 3 cores and block size 64
Supports up to 6M parameters
- static ia_840f_c2b64d()
IA-840f configuration with 2 cores, block size 64, and deeper weight stores
Supports up to 8M parameters
- static v80_c6b32()
V80 configuration with 6 cores and block size 32
Supports up to 24M parameters
- static ip_core()
Generate a Vollo IP Core configuration with a given number of cores and block size
- save(json_path: str) None
Save a hardware configuration to a JSON file
- static load(json_path: str, check_version_matches: bool = True) Config
Load a hardware configuration from a JSON file
- Parameters:
json_path – Path to the JSON file
check_version_matches – if the provided JSON file is versioned, check that the version matches the current version
- block_size
Size of the tensor blocks
- num_cores
Number of cores
- tensor_ram_depth
Amount of tensor RAM per-core (in blocks)
- tensor_descriptor_count
Maximum number of tensors per core
- weight_store_depth
Amount of weight store per-core (in blocks)
- accum_store_depth
Amount of accumulator store per-core (in blocks)
- cell_state_depth
Amount of LSTM cell state per-core (in blocks)
- clamp_store_depth
Amount of clamp store per-core (i.e. the maximum number of different clamp configurations that can be used on a single core)
- max_read_size
Maximum size of data that instructions can perform operations on (in blocks)
- io_size
Minimum size of IO packet (in values)
- fabric
Type of DSP used
- class vollo_compiler.NNIR
Neural Network Intermediate Representation
The representation of neural networks in the Vollo compiler. It can be built from a PyTorch model using vollo_torch.fx.nnir.to_nnir(), or from an ONNX model.
- static from_onnx(onnx_path: str, override_input_shapes: Sequence[int] | Sequence[Sequence[int]] | None) NNIR
Load an ONNX model from a file and convert it to an NNIR graph
- Parameters:
onnx_path – The path to the ONNX file.
override_input_shapes (Optional[Sequence[int] | Sequence[Sequence[int]]]) – If the model has dynamic input shapes, you must pass fixed input shapes. Can either be a Sequence[int] for single-input models, or a Sequence[Sequence[int]] for multi-input models (also works for single-input models). If this argument is used, you must provide a shape for each model input.
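The two accepted forms of override_input_shapes can be illustrated with a small normalizer. This is only a sketch: normalize_shapes is a hypothetical helper and not part of vollo_compiler. It shows how a flat sequence of ints can be read as the single shape of a single-input model, while a nested sequence gives one shape per input.

```python
from typing import List, Optional, Sequence, Union

Shape = Sequence[int]


def normalize_shapes(
    shapes: Optional[Union[Shape, Sequence[Shape]]],
) -> Optional[List[List[int]]]:
    """Return one shape per model input, or None if no override was given.

    Hypothetical helper for illustration only; not part of vollo_compiler.
    """
    if shapes is None:
        return None
    if all(isinstance(dim, int) for dim in shapes):
        # A flat sequence of ints is the single shape of a single-input model.
        return [list(shapes)]
    # Otherwise each element is the shape of one model input.
    return [list(shape) for shape in shapes]


single = normalize_shapes([1, 128])
multi = normalize_shapes([[1, 128], [1, 16]])
```

Both `[1, 128]` and `[[1, 128]]` normalize to the same single-input form, which mirrors the note that the nested form also works for single-input models.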
- streaming_transform(streaming_axes: int | Sequence[int])
Performs the streaming transform, converting the NNIR to a streaming model
- Parameters:
streaming_axes (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models (also works for single-input models). There should be one axis given for each model input.
- Returns:
The transformed NNIR graph, and the output streaming axis/axes. The latter will either be an int for single-output models, or a tuple of ints for multi-output models.
- to_program(config: Config, name: str | None = None, *, optimize_transforms: bool = True, output_buffer_capacity: int = 64, write_queue_capacity: int = 32) Program
Compile an NNIR graph to a Program.
Note that the NNIR model given must be a streaming model.
- Parameters:
config – The hardware configuration to compile the program for
optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not
output_buffer_capacity – The size of the output buffer in the VM (only used when optimize_transforms is True)
write_queue_capacity – The size of the write queue in the VM (only used when optimize_transforms is True)
name – The name of the program
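The calls documented so far can be chained into a compile flow. The following is a sketch rather than a definitive recipe: the function name, its argument names, and the program name "example" are hypothetical, and it assumes that streaming_transform returns its two documented results as a pair. The import is done inside the function so that the definition can be read and loaded without the Vollo SDK installed.

```python
def compile_onnx_for_ia420f(onnx_path, input_shape, streaming_axis):
    """Compile an ONNX file into a Program for the ia_420f_c6b32 preset.

    Hypothetical wrapper for illustration; only the vollo_compiler calls
    inside it come from this reference.
    """
    # Imported lazily: calling this function requires the Vollo SDK.
    import vollo_compiler

    config = vollo_compiler.Config.ia_420f_c6b32()
    # Fixed input shapes are required if the ONNX model has dynamic shapes.
    nnir = vollo_compiler.NNIR.from_onnx(onnx_path, input_shape)
    # Assumed to return (transformed NNIR, output streaming axis/axes).
    streaming_nnir, output_axis = nnir.streaming_transform(streaming_axis)
    # to_program requires a streaming model, hence the transform above.
    program = streaming_nnir.to_program(config, name="example")
    return program, output_axis
```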
- __new__(**kwargs)
- class vollo_compiler.Program
A program which can be run on a VM, or can be used with the Vollo Runtime to put the program onto hardware.
- compute_duration_per_inference_us(clock_mhz: int | None = None, write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0) float
Translate the program’s cycle count per inference to a figure in microseconds by dividing it by the clock speed.
- Parameters:
clock_mhz – Clock speed of the Vollo accelerator in MHz. If not given, the default depends on the hardware configuration: 320 MHz for Agilex fabrics and 166 MHz for Versal fabrics.
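The conversion itself is a single division: a clock of N MHz completes N cycles per microsecond. A minimal sketch of that arithmetic, where cycles_to_us and the dictionary keys are hypothetical names and the two clock speeds are the defaults quoted above:

```python
# Default clock speeds in MHz, as quoted in the parameter description.
# The dictionary keys are illustrative labels, not vollo_compiler values.
DEFAULT_CLOCK_MHZ = {"agilex": 320, "versal": 166}


def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds.

    A clock of N MHz completes N cycles per microsecond, so the duration
    in microseconds is cycles / N.
    """
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return cycles / clock_mhz


agilex_us = cycles_to_us(3200, DEFAULT_CLOCK_MHZ["agilex"])  # 10.0
versal_us = cycles_to_us(3320, DEFAULT_CLOCK_MHZ["versal"])  # 20.0
```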
- cycle_count_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0) int
The number of cycles the program takes in one inference.
- cycle_summary_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0)
A summary of the number of cycles the program takes, counted from the first input word.
- Returns:
A tuple containing (cycles_to_last_input, cycles_to_first_output, cycles_to_last_output)
- static io_only_test(config: Config, input_values: int, output_values: int) Program
Make a new program that does no compute and arranges IO such that output only starts when all the input is available on the accelerator
- model_input_shape(model_index: int = 0, input_index: int = 0) Tuple[int]
Get the shape of the input at input_index in model at model_index
- model_input_streaming_dim(model_index: int = 0, input_index: int = 0) int | None
Get the streaming dimension of the input at input_index in model at model_index
- model_num_inputs(model_index: int = 0) int
Get the number of inputs that the model at model_index uses.
- model_num_outputs(model_index: int = 0) int
Get the number of outputs that the model at model_index uses.
- model_output_shape(model_index: int = 0, output_index: int = 0) Tuple[int]
Get the shape of the output at output_index in model at model_index
- model_output_streaming_dim(model_index: int = 0, output_index: int = 0) int | None
Get the streaming dimension of the output at output_index in model at model_index
- num_models() int
The number of models in the program.
- save(output_path: str)
- save_bytes() bytes
- to_vm(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, bit_accurate: bool = True, no_compute=False) VM
Construct a stateful Virtual Machine for simulating a Vollo Program.
- Parameters:
bit_accurate (bool) – Use a compute model that replicates the VOLLO accelerator with bit-accuracy. Disable to use single precision compute. Defaults to True.
no_compute (bool) – If True, the VM will not do any computations, and will just move data. (When set to True, the bit_accurate parameter is ignored.) This is useful for when you only need the cycle count of running a program as it is faster than doing the computations. Defaults to False.
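When only timing is of interest, the no_compute option can be combined with the VM's cycle counters. A hedged sketch: measure_cycles is a hypothetical helper, `program` is assumed to be a Program and `example_input` an input accepted by VM.run; only to_vm, run, cycle_count and compute_duration_us are taken from this reference.

```python
def measure_cycles(program, example_input, clock_mhz=None):
    """Return (cycles, microseconds) for one run on a no-compute VM.

    Hypothetical helper for illustration; not part of vollo_compiler.
    """
    # no_compute=True skips the arithmetic and only moves data, which is
    # faster when only the cycle count is needed.
    vm = program.to_vm(no_compute=True)
    vm.run(example_input)
    # cycle_count() covers all inferences run on this VM so far; here, one.
    return vm.cycle_count(), vm.compute_duration_us(clock_mhz)
```

Because the VM is stateful and its cycle count accumulates across inferences, the sketch creates a fresh VM for each measurement.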
- class vollo_compiler.ProgramBuilder
Tracks an internal list of NNIRs which can be compiled into a single multi-model program
- add_nnir(nnir: NNIR, name: str | None = None, *, optimize_transforms: bool = False, output_buffer_capacity: int = 64, write_queue_capacity: int = 32)
Adds a model program compiled from an NNIR to the ProgramBuilder
- Parameters:
nnir – The NNIR to add
optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not
output_buffer_capacity – The size of the output buffer in the VM (only used if optimize_transforms is True)
write_queue_capacity – The size of the write queue in the VM (only used if optimize_transforms is True)
name – The name of the model
- to_program()
Builds a program from the internal NNIRs
- Parameters:
config – The config describing resources available for the final program
- class vollo_compiler.Metrics
Static metrics of a program.
- clamp_store_depth
Total amount of clamp store available on each core
- clamp_store_used
Amount of clamp store used by the program on each core
- input_bytes
Number of bytes input per-inference for each model
- model_names
The name of each model if specified
- num_instrs
Number of instructions on each core
- output_bytes
Number of bytes output per-inference for each model
- tensor_ram_depth
Total amount of tensor RAM available on each core
- tensor_ram_used
Amount of tensor RAM used by the program on each core
- weight_store_depth
Total amount of weight store available on each core
- weight_store_used
Amount of weight store used by the program on each core
- class vollo_compiler.VM
- compute_duration_us(clock_mhz: int | None = None) float
Translate the VM’s cycle count to a figure in microseconds by dividing it by the clock speed.
- Parameters:
clock_mhz – Clock speed of the Vollo accelerator in MHz. If not given, the default depends on the hardware configuration: 320 MHz for Agilex fabrics and 166 MHz for Versal fabrics.
- cycle_count() int
The number of cycles that have been performed so far on the VM across all inferences.
- run(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on a shaped input.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
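The difference between the two f32_round settings can be shown on the bit level. The sketch below is illustrative only: the helper names are hypothetical, it handles finite values only, and it assumes that "rounding" means round-to-nearest with ties to even, which this reference does not state. A bf16 value is the top 16 bits of the f32 encoding.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Decode a 32-bit single-precision encoding back to a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_to_bf16_truncate(x: float) -> float:
    """Convert to bf16 by dropping the low 16 bits of the f32 encoding."""
    return bits_to_f32(f32_bits(x) & 0xFFFF0000)


def f32_to_bf16_round(x: float) -> float:
    """Convert to bf16 by rounding to nearest, ties to even (assumed mode).

    Finite inputs only: NaN and overflow handling are omitted.
    """
    bits = f32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit makes exact ties round to even.
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    return bits_to_f32(rounded & 0xFFFF0000)


x = 1.005859375  # 1 + 2**-8 + 2**-9, exactly representable in f32
truncated = f32_to_bf16_truncate(x)  # 1.0
rounded = f32_to_bf16_round(x)       # 1.0078125
```

Truncation always moves the value towards zero, whereas rounding picks the nearer of the two neighbouring bf16 values.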
- run_flat(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on a 1D input.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
- run_flat_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on multiple timesteps of inputs.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.
output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
- run_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]
Run the VM on multiple timesteps with shaped inputs.
- Parameters:
inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.
output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.
- Returns:
Either an array for single-output models, or a tuple of arrays for multi-output models.
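The roles of input_timestep_dims and output_timestep_dims can be pictured with plain nested lists. This is a sketch of the splitting and stacking described above for a single 2D input and output, not the VM's implementation: all function names are hypothetical, `step_fn` stands in for one inference, and it assumes that the timestep dimension is removed from each per-timestep slice.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def split_timesteps(x: Matrix, timestep_dim: int) -> List[Vector]:
    """Split a 2D input into one 1D slice per timestep along timestep_dim."""
    if timestep_dim == 0:
        return [list(row) for row in x]
    if timestep_dim == 1:
        return [list(col) for col in zip(*x)]
    raise ValueError("timestep_dim must be 0 or 1 for a 2D input")


def stack_timesteps(steps: List[Vector], timestep_dim: int) -> Matrix:
    """Stack per-timestep 1D outputs into a 2D output along timestep_dim."""
    if timestep_dim == 0:
        return [list(step) for step in steps]
    if timestep_dim == 1:
        return [list(col) for col in zip(*steps)]
    raise ValueError("timestep_dim must be 0 or 1 for a 2D output")


def run_per_timestep(
    step_fn: Callable[[Vector], Vector],
    x: Matrix,
    input_timestep_dim: int,
    output_timestep_dim: int,
) -> Matrix:
    """Apply step_fn to each timestep of x and stack the results."""
    outputs = [step_fn(step) for step in split_timesteps(x, input_timestep_dim)]
    return stack_timesteps(outputs, output_timestep_dim)


# 3 timesteps of 2 features, with time along dimension 0 of the input.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Double every value; stack the outputs with time along dimension 1.
doubled = run_per_timestep(lambda step: [2 * v for v in step], x, 0, 1)
```

Here the input has shape (3, 2) with time first, and the output has shape (2, 3) with time last, because the outputs are stacked along dimension 1.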
- exception vollo_compiler.AllocationError
Failed to allocate memory during compilation.
This can happen if a model requires more space to store weights/activations, etc. than is available for the accelerator configuration.
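One way to react to this error is to try compiling against several configurations in turn. The sketch below is an assumption-laden illustration: compile_with_fallback is a hypothetical helper, it assumes the error is raised by NNIR.to_program, and the exception class is passed in as an argument so that the snippet does not need the SDK to be loaded.

```python
def compile_with_fallback(nnir, configs, allocation_error):
    """Return (config, program) for the first config the model fits on.

    Hypothetical helper for illustration; not part of vollo_compiler.
    `allocation_error` is the exception class to treat as "does not fit".
    """
    last_error = None
    for config in configs:
        try:
            return config, nnir.to_program(config)
        except allocation_error as err:
            # The model needs more memory than this config provides.
            last_error = err
    raise RuntimeError("model does not fit on any given config") from last_error
```

With the SDK installed, this would be called with vollo_compiler.AllocationError and a list of preset configs. Each preset corresponds to a different bitstream, so the config that succeeds also determines which bitstream the program must run on.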
- exception vollo_compiler.SaveError
Failed to save program.
- exception vollo_compiler.LoadError
Failed to load program.