vollo_compiler

class vollo_compiler.Config

A Vollo accelerator configuration.

Each Vollo bitstream contains a specific configuration of the Vollo accelerator, e.g. the number of cores, the size of each core, etc. Programs need to be compiled for the accelerator configuration that they will be run on.

For the bitstreams included in the Vollo SDK, use the preset configs ia_420f_c6b32() and ia_840f_c3b64().

static ia_420f_c6b32()

IA-420f configuration with 6 cores and block size 32

static ia_840f_c3b64()

IA-840f configuration with 3 cores and block size 64
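
For example, a preset configuration can be constructed and its fields inspected directly (a minimal sketch; the printed values depend on the preset):

    import vollo_compiler

    config = vollo_compiler.Config.ia_420f_c6b32()
    print(config.num_cores)   # number of cores in this preset
    print(config.block_size)  # size of the tensor blocks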

save(json_path: str) None

Save a hardware configuration to a JSON file

static load(json_path: str, check_version_matches: bool = True) Config

Load a hardware configuration from a JSON file

Parameters:
  • json_path – Path to the JSON file

  • check_version_matches – if the provided JSON file is versioned, check that the version matches the current version
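
A minimal save/load round trip might look like the following sketch (the file name is illustrative):

    import vollo_compiler

    config = vollo_compiler.Config.ia_840f_c3b64()
    config.save("accelerator-config.json")

    # Load the configuration back, checking the version by default
    loaded = vollo_compiler.Config.load("accelerator-config.json")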

block_size

Size of the tensor blocks

num_cores

Number of cores

tensor_ram_depth

Amount of tensor RAM per-core (in blocks)

tensor_descriptor_count

Maximum number of tensors per core

weight_store_depth

Amount of weight store per-core (in blocks)

accum_store_depth

Amount of accumulator store per-core (in blocks)

cell_state_depth

Amount of LSTM cell state per-core (in blocks)

clamp_store_depth

Amount of clamp store per-core (i.e. the maximum number of different clamp configurations that can be used on a single core)

max_read_size

Maximum size of data that instructions can perform operations on (in blocks)

io_size

Minimum size of IO packet (in values)

class vollo_compiler.NNIR

Neural Network Intermediate Representation

The representation of neural networks in the Vollo compiler. It can be built from a PyTorch model using vollo_torch.fx.nnir.to_nnir(), or from an ONNX model.

static from_onnx(onnx_path: str, override_input_shapes: Sequence[int] | Sequence[Sequence[int]] | None) NNIR

Load an ONNX model from a file and convert it to an NNIR graph

Parameters:
  • onnx_path – The path to the ONNX file.

  • override_input_shapes (Optional[Sequence[int] | Sequence[Sequence[int]]]) – If the model has dynamic input shapes, you must pass fixed input shapes. Can either be a Sequence[int] for single-input models, or a Sequence[Sequence[int]] for multi-input models (also works for single-input models). If this argument is used, you must provide a shape for each model input.
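
A sketch of the two usual cases; the file names and shapes are placeholders for your own model:

    import vollo_compiler

    # ONNX model that already has fixed input shapes: no override needed
    nnir = vollo_compiler.NNIR.from_onnx("model.onnx", None)

    # Model with dynamic input shapes: provide a fixed shape for each input
    nnir = vollo_compiler.NNIR.from_onnx("dynamic_model.onnx", [[32, 128]])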

streaming_transform(streaming_axes: int | Sequence[int])

Performs the streaming transform, converting the NNIR to a streaming model

Parameters:

streaming_axes (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models (also works for single-input models). There should be one axis given for each model input.

Returns:
  • The transformed NNIR graph

  • The output streaming axis/es. This will either be an int for single-output models, or a tuple of ints for multi-output models
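
For example, for a single-input model that should stream over axis 0 of its input (a sketch; the axis depends on your model):

    # Split the input into timesteps along axis 0
    streaming_nnir, output_axis = nnir.streaming_transform(0)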

to_program(config: Config, name: str | None = None, *, optimize_transforms: bool = True, output_buffer_capacity: int = 64, write_queue_capacity: int = 32) Program

Compile an NNIR graph to a Program.

Note that the NNIR model given must be a streaming model.

Parameters:
  • config – The hardware configuration to compile the program for

  • optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not

  • output_buffer_capacity – The size of the output buffer in the VM (only used when optimize_transforms is True)

  • write_queue_capacity – The size of the write queue in the VM (only used when optimize_transforms is True)

  • name – The name of the program
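
A minimal compilation sketch, assuming streaming_nnir is a streaming NNIR such as the one produced by streaming_transform() above:

    config = vollo_compiler.Config.ia_420f_c6b32()
    program = streaming_nnir.to_program(config, name="my-model")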

__new__(**kwargs)

class vollo_compiler.Program

A program which can be run on a VM, or can be used with the Vollo Runtime to put the program onto hardware.

compute_duration_per_inference_us(clock_mhz: int = 320, write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0) float

Translate the program’s cycle count per inference to a figure in microseconds by dividing it by the clock speed.

Parameters:

clock_mhz – Clock frequency of the Vollo accelerator in MHz.

cycle_count_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0) int

The number of cycles the program takes in one inference.

cycle_summary_per_inference(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, model_index: int = 0)

A summary of the number of cycles the program takes in one inference, measured from the first input word.

Returns: A tuple containing (cycles_to_last_input, cycles_to_first_output, cycles_to_last_output)
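
A sketch of estimating per-inference latency, assuming program is a compiled Program and using the default 320 MHz clock and default capacities:

    cycles = program.cycle_count_per_inference()
    micros = program.compute_duration_per_inference_us()
    to_last_in, to_first_out, to_last_out = program.cycle_summary_per_inference()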

hw_config() Config
static io_only_test(config: Config, input_values: int, output_values: int) Program

Make a new program that does no compute and arranges IO such that output only starts once all of the input is available on the accelerator

static load(input_path: str) Program
static load_bytes(data: bytes) Program
metrics() Metrics

Get the static metrics of the program.

model_input_shape(model_index: int = 0, input_index: int = 0) Tuple[int]

Get the shape of the input at input_index in model at model_index

model_input_streaming_dim(model_index: int = 0, input_index: int = 0) int | None

Get the streaming dimension of the input at input_index in model at model_index, or None if the input does not stream

model_num_inputs(model_index: int = 0) int

Get the number of inputs that the model at model_index uses.

model_num_outputs(model_index: int = 0) int

Get the number of outputs that the model at model_index uses.

model_output_shape(model_index: int = 0, output_index: int = 0) Tuple[int]

Get the shape of the output at output_index in model at model_index

model_output_streaming_dim(model_index: int = 0, output_index: int = 0) int | None

Get the streaming dimension of the output at output_index in model at model_index, or None if the output does not stream

num_models() int

The number of models in the program.
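
These accessors can be combined to inspect a program's IO, as in this sketch (assuming program is a compiled Program):

    for m in range(program.num_models()):
        for i in range(program.model_num_inputs(m)):
            print("input", m, i, program.model_input_shape(m, i))
        for o in range(program.model_num_outputs(m)):
            print("output", m, o, program.model_output_shape(m, o))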

save(output_path: str)
save_bytes() bytes
to_vm(write_queue_capacity: int = 32, output_buffer_capacity: int = 64, bit_accurate: bool = True, no_compute: bool = False) VM

Construct a stateful Virtual Machine for simulating a Vollo Program.

Parameters:
  • bit_accurate (bool) – Use a compute model that replicates the Vollo accelerator with bit-accuracy. Disable it to use single-precision compute. Defaults to True.

  • no_compute (bool) – If True, the VM will not do any computations and will just move data (when set to True, the bit_accurate parameter is ignored). This is useful when you only need the cycle count of running a program, as it is faster than doing the computations. Defaults to False.
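
For example, a data-movement-only VM for cycle counting can be constructed like this (a sketch; no_compute skips the arithmetic but still models IO):

    vm = program.to_vm(no_compute=True)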

transform_to_io_only_test() Program

Make a new program that is IO-compatible with this program but does no compute, i.e. an IO-only test

class vollo_compiler.ProgramBuilder

Tracks an internal list of NNIRs which can be compiled into a single multi-model program

add_nnir(nnir: NNIR, name: str | None = None, *, optimize_transforms: bool = False, output_buffer_capacity: int = 64, write_queue_capacity: int = 32)

Adds a model program compiled from an NNIR to the ProgramBuilder

Parameters:
  • nnir – The NNIR to add

  • optimize_transforms – Whether to run the VM to decide whether to apply certain transformations or not

  • output_buffer_capacity – The size of the output buffer in the VM (only used if optimize_transforms is True)

  • write_queue_capacity – The size of the write queue in the VM (only used if optimize_transforms is True)

  • name – The name of the model

to_program(config: Config) Program

Builds a program from the internal NNIRs

Parameters:

config – The config describing resources available for the final program
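
A sketch of building a two-model program, assuming nnir_a and nnir_b are streaming NNIRs and that the builder takes no constructor arguments:

    builder = vollo_compiler.ProgramBuilder()
    builder.add_nnir(nnir_a, name="model-a")
    builder.add_nnir(nnir_b, name="model-b")
    program = builder.to_program(vollo_compiler.Config.ia_840f_c3b64())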

class vollo_compiler.Metrics

Static metrics of a program.

clamp_store_depth

Total amount of clamp store available on each core

clamp_store_used

Amount of clamp store used by the program on each core

input_bytes

Number of bytes input per-inference for each model

model_names

The name of each model if specified

num_instrs

Number of instructions on each core

output_bytes

Number of bytes output per-inference for each model

tensor_ram_depth

Total amount of tensor RAM available on each core

tensor_ram_used

Amount of tensor RAM used by the program on each core

weight_store_depth

Total amount of weight store available on each core

weight_store_used

Amount of weight store used by the program on each core
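
A sketch of inspecting the resources a compiled program uses, assuming program is a compiled Program (each field reports values per core or per model, as described above):

    m = program.metrics()
    print(m.model_names)         # name of each model, if specified
    print(m.num_instrs)          # instructions on each core
    print(m.weight_store_used)   # weight store used on each core (in blocks)
    print(m.weight_store_depth)  # weight store available on each core (in blocks)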

class vollo_compiler.VM

A stateful virtual machine for simulating a Vollo Program; see Program.to_vm().

compute_duration_us(clock_mhz: int = 320) float

Translate the VM’s cycle count to a figure in microseconds by dividing it by the clock speed.

Parameters:

clock_mhz – Clock frequency of the Vollo accelerator in MHz.

cycle_count() int

The number of cycles that have been performed so far on the VM across all inferences.

metrics() Metrics

Get the static metrics of the program held by the VM

run(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on a shaped input.

Parameters:
  • inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).

  • model_index (int) – Which model to run. Defaults to 0.

  • f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.
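
A minimal end-to-end sketch, assuming program was compiled from a single-input, single-output model and that model_input_shape() gives the shape expected for one inference:

    import numpy as np

    vm = program.to_vm()
    x = np.random.rand(*program.model_input_shape()).astype(np.float32)
    y = vm.run(x)  # single-output model: returns a single numpy array
    print(vm.cycle_count(), vm.compute_duration_us())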

run_flat(inputs: numpy.ndarray | Sequence[numpy.ndarray], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on a 1D input.

Parameters:
  • inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).

  • model_index (int) – Which model to run. Defaults to 0.

  • f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.

run_flat_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on multiple timesteps of flat (1D) inputs.

Parameters:
  • inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).

  • input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.

  • output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.

  • model_index (int) – Which model to run. Defaults to 0.

  • f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.

run_timesteps(inputs: numpy.ndarray | Sequence[numpy.ndarray], input_timestep_dims: int | Sequence[int], output_timestep_dims: int | Sequence[int], model_index: int = 0, f32_round: bool = True) numpy.ndarray | Tuple[numpy.ndarray, ...]

Run the VM on multiple timesteps with shaped inputs.

Parameters:
  • inputs (numpy.ndarray | Sequence[numpy.ndarray]) – Can either be a numpy array for single-input models, or a sequence of numpy arrays for multi-input models (also works for single-input models).

  • input_timestep_dims (int | Sequence[int]) – The dimensions over which to split the inputs into timesteps. Can either be an int for single-input models, or a sequence of ints for multi-input models. If it’s a sequence, then inputs should also be a sequence of the same length.

  • output_timestep_dims (int | Sequence[int]) – The dimension over which to build up the output timesteps, i.e. the timesteps are stacked along this dimension. Can either be an int for single-output models, or a sequence of ints for multi-output models.

  • model_index (int) – Which model to run. Defaults to 0.

  • f32_round (bool) – Only used for bit-accurate VMs. When True, f32 inputs will be converted to bf16 by rounding. When False, they will be converted by truncating. Defaults to True.

Returns:

Either an array for single-output models, or a tuple of arrays for multi-output models.
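
A sketch of running several timesteps at once through a single-input, single-output model, assuming model_input_shape() is the per-timestep shape and that both input and output stream over a new leading axis:

    import numpy as np

    # 10 timesteps stacked along axis 0 of the single input
    xs = np.random.rand(10, *program.model_input_shape()).astype(np.float32)
    ys = vm.run_timesteps(xs, input_timestep_dims=0, output_timestep_dims=0,
                          model_index=0)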

exception vollo_compiler.AllocationError

Failed to allocate memory during compilation.

This can happen if a model requires more space to store weights/activations, etc. than is available for the accelerator configuration.
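
A sketch of handling this error during compilation:

    try:
        program = streaming_nnir.to_program(config)
    except vollo_compiler.AllocationError as e:
        print("model does not fit in this accelerator configuration:", e)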

exception vollo_compiler.SaveError

Failed to save program.

exception vollo_compiler.LoadError

Failed to load program.