vollo_compiler

class vollo_compiler.Config

A Vollo accelerator configuration.

Each Vollo bitstream contains a specific configuration of the Vollo accelerator, e.g. number of cores, size of each core, etc. Programs need to be compiled for the accelerator configuration that they will be run on.

For the bitstreams included in the Vollo SDK, use the preset configs ia_420f_c6b32(), ia_840f_c3b64(), ia_840f_c2b64d(), nt400d11_c6b32(), v80_c6b32(), and fb4cgg3_c3b32().

static ia_420f_c6b32() → Config

IA-420f configuration with 6 cores and block size 32

Supports up to 3M parameters

static ia_840f_c3b64() → Config

IA-840f configuration with 3 cores and block size 64

Supports up to 6M parameters

static ia_840f_c2b64d() → Config

IA-840f configuration with 2 cores, block size 64, and deeper weight stores

Supports up to 8M parameters

static nt400d11_c6b32() → Config

NT400D11 configuration with 6 cores and block size 32

Supports up to 3M parameters

static v80_c6b32() → Config

V80 configuration with 6 cores and block size 32, with deep weight stores

Supports up to 24M parameters

static fb4cgg3_c3b32() → Config

fb4CGg3 configuration with 3 cores and block size 32, with deep weight stores

Supports up to 12M parameters
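
The parameter limits listed for the presets can act as a rough first filter when choosing a target. The following is a minimal sketch, not part of vollo_compiler: the table is transcribed from the preset descriptions, "3M" is read as 3,000,000 (an assumption), and presets_that_may_fit is a hypothetical helper name. Fitting under the limit does not guarantee that compilation succeeds.

```python
# Approximate parameter limits transcribed from the preset descriptions.
# Reading "3M" as exactly 3_000_000 is an assumption made for this sketch.
PRESET_MAX_PARAMS = {
    "ia_420f_c6b32": 3_000_000,
    "ia_840f_c3b64": 6_000_000,
    "ia_840f_c2b64d": 8_000_000,
    "nt400d11_c6b32": 3_000_000,
    "v80_c6b32": 24_000_000,
    "fb4cgg3_c3b32": 12_000_000,
}


def presets_that_may_fit(num_params):
    """Return the names of presets whose documented limit is >= num_params.

    Hypothetical helper: it only compares parameter counts and knows nothing
    about tensor RAM, accumulator store or other per-core resources.
    """
    return sorted(
        name for name, limit in PRESET_MAX_PARAMS.items() if num_params <= limit
    )


print(presets_that_may_fit(7_500_000))
```

A model that passes this check can still raise AllocationError at compile time, so treat the result as a shortlist only.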

static ip_core(num_cores: int, block_size: int) → Config: Generate a Vollo IP Core configuration with a given number of cores and block size

save(json_path: str) → None: Save a hardware configuration to a JSON file

static load(json_path: str, check_version_matches: bool = True) → Config

Load a hardware configuration from a JSON file

Parameters:

json_path – Path to the JSON file
check_version_matches – if the provided JSON file is versioned, check that the version matches the current version

block_size: Size of the tensor blocks

num_cores: Number of cores

tensor_ram_depth: Amount of tensor RAM per-core (in blocks)

tensor_descriptor_count: Maximum number of tensors per core

weight_store_depth: Amount of weight store per-core (in blocks)

accum_store_depth: Amount of accumulator store per-core (in blocks)

cell_state_depth: Amount of LSTM cell state per-core (in blocks)

clamp_store_depth: Amount of clamp store per-core (i.e. the maximum number of different clamp configurations that can be used on a single core)

max_read_size: Maximum size of data that instructions can perform operations on (in blocks)

io_size: Minimum size of IO packet (in values)

fabric: Type of DSP used

class vollo_compiler.NNIR

Neural Network Intermediate Representation

The representation of neural networks in the Vollo compiler. It can be built from a PyTorch model using vollo_torch.fx.nnir.to_nnir(), or from an ONNX model.

static from_onnx(onnx_path: str, override_input_shapes: Sequence[int] | Sequence[Sequence[int]] | None) → NNIR

Load an ONNX model from a file and convert it to an NNIR graph

Parameters:

onnx_path – The path to the ONNX file.
override_input_shapes (Optional[Sequence[int] | Sequence[Sequence[int]]]) – If the model has dynamic input shapes, you must pass fixed input shapes. Can either be a Sequence[int] for single-input models, or a Sequence[Sequence[int]] for multi-input models (also works for single-input models). If this argument is used, you must provide a shape for each model input.
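
The two accepted forms of override_input_shapes can be illustrated with a small normalisation step. This is a sketch of the convention described above, not the library's implementation; normalize_input_shapes and its num_inputs argument are hypothetical names introduced for illustration.

```python
from typing import List, Optional, Sequence, Union

ShapeArg = Optional[Union[Sequence[int], Sequence[Sequence[int]]]]


def normalize_input_shapes(shapes: ShapeArg, num_inputs: int) -> Optional[List[List[int]]]:
    """Return one shape per model input, or None when no override is given.

    Hypothetical helper mirroring the documented convention: a flat sequence
    of ints is treated as the shape of a single-input model, and a sequence
    of sequences as one shape per input.
    """
    if shapes is None:
        return None
    if len(shapes) > 0 and all(isinstance(dim, int) for dim in shapes):
        normalized = [list(shapes)]  # flat form: one shape for one input
    else:
        normalized = [list(shape) for shape in shapes]  # nested form
    if len(normalized) != num_inputs:
        raise ValueError(
            f"expected {num_inputs} input shape(s), got {len(normalized)}"
        )
    return normalized


print(normalize_input_shapes([1, 128], num_inputs=1))
print(normalize_input_shapes([[1, 128], [1, 4]], num_inputs=2))
```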

streaming_transform(streaming_axes: int | Sequence[int])

Performs the streaming transform, converting the NNIR to a streaming model

Parameters:

streaming_axes (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models (also works for single-input models). There should be one axis given for each model input.

Returns:

- The transformed NNIR graph
- The output streaming axis/es. This will either be an int for single-output models, or a tuple of ints for multi-output models

to_program(config: Config, name: str | None = None, *, optimize_transforms: bool = True, output_buffer_capacity: int = 64, write_queue_capacity: int = 32) → Program

Compile an NNIR graph to a Program.

Note that the NNIR model given must be a streaming model.

Parameters:

config – The hardware configuration to compile the program for
optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not
output_buffer_capacity – The size of the output buffer in the VM (only used when optimize_transforms is True)
write_queue_capacity – The size of the write queue in the VM (only used when optimize_transforms is True)
name – The name of the program
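
The path from an ONNX file to a Program can be sketched using only calls listed in this reference. The function name, the choice of the ia_420f_c6b32() preset, and the file path in the usage comment are placeholders; the import is deferred so the definition loads even where the SDK is not installed. It also assumes streaming_transform returns the transformed graph and the output axis as a pair, as the Returns entry suggests.

```python
def compile_onnx_to_program(onnx_path, streaming_axes=0, name=None,
                            override_input_shapes=None):
    """Compile an ONNX model for one preset config (illustrative sketch).

    Hypothetical wrapper: it chains Config, NNIR.from_onnx,
    NNIR.streaming_transform and NNIR.to_program as documented here.
    """
    import vollo_compiler  # deferred import: requires the Vollo SDK

    # Placeholder target; pick the preset matching your bitstream.
    config = vollo_compiler.Config.ia_420f_c6b32()

    # Pass fixed shapes here if the ONNX model has dynamic input shapes.
    nnir = vollo_compiler.NNIR.from_onnx(onnx_path, override_input_shapes)

    # to_program requires a streaming model, so transform first.
    # Assumed return value: (transformed NNIR, output streaming axis/es).
    streaming_nnir, _output_axes = nnir.streaming_transform(streaming_axes)

    return streaming_nnir.to_program(config, name=name)


# Example usage (needs the SDK and a real model file):
# program = compile_onnx_to_program("model.onnx", streaming_axes=0, name="demo")
# program.save("model.vollo")
```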

__new__(**kwargs)

class vollo_compiler.Program

A program which can be run on a VM, or can be used with the Vollo Runtime to put the program onto hardware.

compute_duration_per_inference_us(clock_mhz: int | None = None, write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0, spaced: bool = True) → float

Translate the program’s cycle count per inference to a figure in microseconds by dividing it by the clock speed.

Parameters:

clock_mhz – Clock speed of the Vollo accelerator in MHz. The default depends on the hardware configuration: 320 MHz for Agilex fabrics, 300 MHz for Versal fabrics, and 120 MHz for UltraScale fabrics.
spaced (bool) – Whether to model the inference as having been run immediately after a previous inference (False) or with time between them (True).
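
The conversion itself is a single division, since a clock of 1 MHz performs one cycle per microsecond. A minimal sketch of that arithmetic follows; cycles_to_us is a hypothetical helper, and the dictionary keys are illustrative labels rather than values of Config.fabric.

```python
# Default clock speeds in MHz as stated in the parameter description.
# The keys are illustrative labels, not identifiers used by vollo_compiler.
DEFAULT_CLOCK_MHZ = {"agilex": 320, "versal": 300, "ultrascale": 120}


def cycles_to_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds (cycles / MHz)."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return cycles / clock_mhz


# 640 cycles at 320 MHz take 2 microseconds.
print(cycles_to_us(640, DEFAULT_CLOCK_MHZ["agilex"]))
```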

cycle_count_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0, spaced: bool = True) → int

The number of cycles the program takes in one inference.

Parameters:

spaced (bool) – Whether to model the inference as having been run immediately after a previous inference (False) or with time between them (True).

cycle_summary_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0, spaced: bool = True)

A summary with the number of cycles the program takes from the first input word

Parameters:

spaced (bool) – Whether to model the inference as having been run immediately after a previous inference (False) or with time between them (True).

hw_config() → Config

static io_only_test(config: Config, input_values: int, output_values: int) → Program: Make a new program that does no compute and arranges IO such that output only starts when all the input is available on the accelerator

static load(input_path: str) → Program

static load_bytes(data: bytes) → Program

metrics() → Metrics: Static Metrics

model_input_shape(model_index: int = 0, input_index: int = 0) → Tuple[int]: Get the shape of the input at input_index in model at model_index

model_input_streaming_dim(model_index: int = 0, input_index: int = 0) → int | None: Get the streaming dimension of the input at input_index in model at model_index

model_num_inputs(model_index: int = 0) → int: Get the number of inputs model at model_index uses.

model_num_outputs(model_index: int = 0) → int: Get the number of outputs model at model_index uses.

model_output_shape(model_index: int = 0, output_index: int = 0) → Tuple[int]: Get the shape of the output at output_index in model at model_index

model_output_streaming_dim(model_index: int = 0, output_index: int = 0) → int | None: Get the streaming dimension of the output at output_index in model at model_index

num_models() → int: The number of models in the program.

save(output_path: str)

save_bytes() → bytes

to_vm(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, bit_accurate: bool = True, no_compute=False) → VM

Construct a stateful Virtual Machine for simulating a Vollo Program.

Parameters:

bit_accurate (bool) – Use a compute model that replicates the Vollo accelerator with bit-accuracy. Disable to use single precision compute. Defaults to True.
no_compute (bool) – If True, the VM will not do any computations, and will just move data. (When set to True, the bit_accurate parameter is ignored.) This is useful for when you only need the cycle count of running a program as it is faster than doing the computations. Defaults to False.

transform_to_io_only_test() → Program: Make a new program that is IO compatible but does no compute, an IO only test

class vollo_compiler.ProgramBuilder

Tracks an internal list of NNIRs which can be compiled into a single multi-model program

add_nnir(nnir: NNIR, name: str | None = None, *, optimize_transforms: bool = True, output_buffer_capacity: int = 64, write_queue_capacity: int = 32)

Adds a model program compiled from an NNIR to the ProgramBuilder

Parameters:

nnir – The NNIR to add
optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not
output_buffer_capacity – The size of the output buffer in the VM (only used if optimize_transforms is True)
write_queue_capacity – The size of the write queue in the VM (only used if optimize_transforms is True)
name – The name of the model

to_program(): Builds a program from the internal NNIRs

class vollo_compiler.Metrics

Static metrics of a program.

clamp_store_depth: Total amount of clamp store available on each core

clamp_store_used: Amount of clamp store used by the program on each core

input_bytes: Number of bytes input per-inference for each model

model_names: The name of each model if specified

num_instrs: Number of instructions on each core

num_micro_instructions: Number of micro instructions on each core

output_bytes: Number of bytes output per-inference for each model

tensor_ram_depth: Total amount of tensor ram available on each core

tensor_ram_used: Tensor ram used by the program on each core

weight_store_depth: Total amount of weight store available on each core

weight_store_used: Amount of weight store used by the program on each core

class vollo_compiler.VM

compute_duration_us(clock_mhz: int | None = None) → float

Translate the VM’s cycle count to a figure in microseconds by dividing it by the clock speed.

Parameters:

clock_mhz – Clock speed of the Vollo accelerator in MHz. The default depends on the hardware configuration: 320 MHz for Agilex fabrics, 300 MHz for Versal fabrics, and 120 MHz for UltraScale fabrics.

Warning

This method has been deprecated. Use vollo_compiler.Program.compute_duration_per_inference_us() instead.

cycle_count() → int: The number of cycles that have been performed so far on the VM across all inferences.

Warning

This method has been deprecated. Use vollo_compiler.Program.cycle_count_per_inference() instead.

metrics() → Metrics: Get the static metrics of the program held by the VM

run(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) → numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on a shaped input.

Parameters:

inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.
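
The difference between the two f32_round settings can be shown on raw bit patterns. The sketch below uses only the standard library; it assumes that "rounding" means round-to-nearest with ties to even, which this reference does not state, and it does not treat NaN specially. All function names are hypothetical.

```python
import struct


def f32_bits(x):
    """Return the 32-bit pattern of x encoded as an IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bf16_truncate(x):
    """Convert to bf16 by dropping the low 16 bits of the float32 pattern."""
    return f32_bits(x) >> 16


def bf16_round(x):
    """Convert to bf16 by rounding to nearest, ties to even (assumed mode).

    NaN inputs are not handled specially in this sketch.
    """
    bits = f32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit implements ties-to-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_to_float(b):
    """Expand a 16-bit bf16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


x = 1.005859375  # 1 + 2**-8 + 2**-9, exactly representable in float32
print(bf16_to_float(bf16_truncate(x)))  # 1.0
print(bf16_to_float(bf16_round(x)))     # 1.0078125
```

Truncation always moves values toward zero, whereas rounding picks the nearest bf16 value, so the two settings can differ by one unit in the last place of the bf16 mantissa.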

run_flat(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) → numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on a 1D input.

Parameters:

inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.

run_flat_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) → numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on multiple timesteps of inputs.

Parameters:

inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.
output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.

run_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) → numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on multiple timesteps with shaped inputs.

Parameters:

inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).
input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.
output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.
model_index (int) – Which model to run. Defaults to 0.
f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.
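
The roles of input_timestep_dims and output_timestep_dims can be illustrated without the VM. The sketch below is a pure-Python stand-in for a single-input, single-output model operating on 2D nested lists instead of numpy arrays; step_fn plays the part of one inference, and every name in it is hypothetical.

```python
def split_timesteps(matrix, dim):
    """Split a 2D nested list into 1D timesteps along dimension dim (0 or 1)."""
    if dim == 0:
        return [list(row) for row in matrix]
    if dim == 1:
        return [list(col) for col in zip(*matrix)]
    raise ValueError("dim must be 0 or 1 for a 2D input")


def stack_timesteps(steps, dim):
    """Stack 1D per-timestep outputs into a 2D nested list along dim."""
    if dim == 0:
        return [list(step) for step in steps]
    if dim == 1:
        return [list(row) for row in zip(*steps)]
    raise ValueError("dim must be 0 or 1 for a 2D output")


def run_over_timesteps(step_fn, inputs, input_dim, output_dim):
    """Apply step_fn to each timestep and stack the results."""
    outputs = [step_fn(step) for step in split_timesteps(inputs, input_dim)]
    return stack_timesteps(outputs, output_dim)


def double(step):
    """Stand-in for one inference: doubles every value in the timestep."""
    return [2 * value for value in step]


inputs = [[1, 2, 3], [4, 5, 6]]  # two rows, three columns
print(run_over_timesteps(double, inputs, input_dim=0, output_dim=0))
print(run_over_timesteps(double, inputs, input_dim=0, output_dim=1))
```

With input_dim=0 each row is one timestep; choosing output_dim=1 then places each timestep's output in its own column of the result.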

exception vollo_compiler.AllocationError

Failed to allocate memory during compilation.

This can happen if a model requires more space to store weights/activations, etc. than is available for the accelerator configuration.

exception vollo_compiler.SaveError: Failed to save program.

exception vollo_compiler.LoadError: Failed to load program.