vollo_rt

class vollo_rt.VolloRTContext

A context for performing computation on Vollo

This wraps the C bindings for the Vollo Runtime. In order for contexts to be properly garbage collected, VolloRTContext is a context manager, and should be used in a with block. This ensures that the context id correctly destroyed after use. VolloRTContext should not be nested, only a single context should be open at any given time.

The general order of operations is as follows:

Initialise context in a with statement
add_accelerator()
load_program() / load_program_bytes()
Gather metadata about the models available in the loaded program
Run inference with either:
- run()
- Low level API:
  
  Add jobs with add_job(), add_job_f32() or add_job_bf16()
  
  poll() until a job is completed
  
  get_result() to retrieve the result of the computation

High level API example:

ctx.add_accelerator(0)
ctx.load_program("program.vollo")

input_arr = np.random.rand(100,50).astype(np.float32)
output = ctx.run(input_arr)

Low level API example:

ctx.add_accelerator(0)
ctx.load_program("program.vollo")

# add jobs
input_arr = np.random.rand(100,50).astype(np.float32)
user_ctx_f32 = ctx.add_job_f32(input_arr)

# loop until the job is complete
completed_jobs = []
while(user_ctx_f32 not in completed_jobs):
    completed_jobs = completed_jobs + ctx.poll()

# retrieve the results from each computation
output_f32 = ctx.get_result(user_ctx_f32)

accelerator_block_size(accelerator_index: int) → int

Get the block size of a Vollo accelerator.

If used on a VM before loading a program, it will return 0, because the VM hardware config is determined by the requirements of the loaded program.

accelerator_num_cores(accelerator_index: int) → int

Get the number of cores of a Vollo accelerator. For Vollo Trees bitstreams, this is the number of tree units.

If used on a VM before loading a program, it will return 0, because the VM hardware config is determined by the requirements of the loaded program.

add_accelerator(accelerator_index: int): Add an accelerator. The accelerator is specified by its index. The index refers to an accelerator in the sorted list of PCI addresses.

add_job(inputs, model_index: int = 0)

Sets up a computation on the vollo where the inputs and outputs have type numpy.float32, torch.float32, or torch.bfloat16. Returns a user context, user_ctx. The poll() function will return a list containing user_ctx once the job has been completed.

Note

The computation will be performed in bf16 but the driver will perform the conversion (if needed).
The computation is only started on the next call to poll. This way it is possible to set up several computations that are kicked off at the same time.

Arg:

inputs: One of:

A float32 torch tensor (for single-input models)

A float32 numpy array (for single-input models)

A bfloat16 torch tensor (for single-input models)

A Sequence of float32 torch tensors (for single-/multi-input models)

A Sequence of float32 numpy arrays (for single-/multi-input models)

A Sequence of bfloat16 torch tensors (for single-/multi-input models)

Returns:: A single tensor/array (for single-output models) or a tuple of tensors/arrays (for multi-output models). Whether they’re tensors or arrays, and the data type within them, is determined by the inputs.

add_job_bf16(inputs, model_index: int = 0)

Sets up a computation on the vollo where the inputs and outputs have type torch.bfloat16. Returns a user context, user_ctx. The poll function will return a list containing user_ctx once the job has been completed.

Note: The computation is only started on the next call to poll. This way it is possible to set up several computations that are kicked off at the same time.

Arg:

inputs: One of:

A bfloat16 torch tensor (for single-input models)

A Sequence of bfloat16 torch tensors (for single-/multi-input models)

Returns:: A single bfloat16 tensor (for single-output models) or a tuple of bfloat16 tensors (for multi-output models).

add_job_f32(inputs, model_index: int = 0) → int

Sets up a computation on the vollo where the inputs and outputs have type numpy.float32 or torch.float32. Returns a user context, user_ctx. The poll() function will return a list containing user_ctx once the job has been completed.

Note

The computation will still be performed in bf16 but the driver will perform the conversion.
The computation is only started on the next call to poll. This way it is possible to set up several computations that are kicked off at the same time.

Arg:

inputs: One of:

A float32 torch tensor (for single-input models)

A float32 numpy array (for single-input models)

A Sequence of float32 torch tensors (for single-/multi-input models)

A Sequence of float32 numpy arrays (for single-/multi-input models)

Returns:: A single float32 tensor/array (for single-output models) or a tuple of float32 tensors/arrays (for multi-output models). Whether they’re tensors or arrays is determined by the inputs.

add_vm(accelerator_index: int, bit_accurate: bool)

Add a VM, to run a program in software simulation rather than on hardware. Allows testing the API without needing an accelerator or license, giving correct results but much slower.

You can choose any accelerator_index to assign to the VM then use the rest of the API as though the VM is an accelerator. However, the VM hardware config is determined by the requirements of the loaded program, so until you call vollo_rt_load_program the values returned by accelerator_num_cores and accelerator_block_size will be 0.

Cannot currently be used with Vollo Trees programs.

This should be called before load_program.

Arg:

bit_accurate: Use a compute model that replicates the VOLLO accelerator with bit-accuracy. Disable to use single precision compute.

get_result(user_ctx: int): Retrieve the result of the computation corresponding to user_ctx

load_program(program: str | PathLike | Any)

Load a pre-compiled program onto the accelerator.

Arg:

program: One of:

A path to a Vollo program (typically .vollo)

A program straight from the Vollo Compiler or the Vollo Trees Compiler

load_program_bytes(program_bytes: bytes): Load a pre-compiled Vollo program from a bytes object

model_input_shape(model_index: int = 0, input_index: int = 0) → Tuple[int]: Get the shape of a model input

model_input_streaming_dim(model_index: int = 0, input_index: int = 0) → int | None: Get the input streaming dimension or None

model_name(model_index: int = 0) → str | None: Get the name of a model if set

model_num_inputs(model_index: int = 0) → int: Get the number of inputs the model uses.

model_num_outputs(model_index: int = 0) → int: Get the number of outputs the model uses.

model_output_shape(model_index: int = 0, output_index: int = 0) → Tuple[int]: Get the shape of a model output

model_output_streaming_dim(model_index: int = 0, output_index: int = 0) → int | None: Get the output streaming dimension or None

num_models() → int: The number of models in the loaded program.

poll() → List[int]

Poll the vollo for completion. Returns a list of user contexts corresponding to completed jobs.

Note: Polling also initiates transfers for new jobs, so you must poll before any progress on these new jobs can be made.

run(inputs, model_index: int = 0, max_poll_iterations: int = 10000, sleep_duration_ms: float | None = None)

Run a single inference of a model on Vollo

This is a simpler API to run single inferences without needing to use add_job(), poll() and get_result() manually

Arg:

sleep_duration_ms: The number of milliseconds to wait between each poll. There will be no wait if set to None.