vollo_rt
- class vollo_rt.VolloRTContext
A context for performing computation on Vollo
This wraps the C bindings for the Vollo Runtime. In order for contexts to be properly garbage collected, VolloRTContext is a context manager and should be used in a with block; this ensures that the context is correctly destroyed after use. VolloRTContext should not be nested: only a single context should be open at any given time.
- The general order of operations is as follows:
  - Initialise the context in a with statement
  - Gather metadata about the models available in the loaded program
  - Run inference with either:
    - Low level API: add jobs with add_job(), add_job_f32() or add_job_bf16(); poll() until a job is completed; get_result() to retrieve the result of the computation
    - High level API: run()
High level API example:

```python
import numpy as np
import vollo_rt

with vollo_rt.VolloRTContext() as ctx:
    ctx.add_accelerator(0)
    ctx.load_program("program.vollo")
    input_arr = np.random.rand(100, 50).astype(np.float32)
    output = ctx.run(input_arr)
```
Low level API example:

```python
import numpy as np
import vollo_rt

with vollo_rt.VolloRTContext() as ctx:
    ctx.add_accelerator(0)
    ctx.load_program("program.vollo")

    # add a job
    input_arr = np.random.rand(100, 50).astype(np.float32)
    user_ctx_f32 = ctx.add_job_f32(input_arr)

    # loop until the job is complete
    completed_jobs = []
    while user_ctx_f32 not in completed_jobs:
        completed_jobs += ctx.poll()

    # retrieve the result of the computation
    output_f32 = ctx.get_result(user_ctx_f32)
```
- accelerator_block_size(accelerator_index: int) int
Get the block size of a Vollo accelerator.
If used on a VM before loading a program, it will return 0, because the VM hardware config is determined by the requirements of the loaded program.
- accelerator_num_cores(accelerator_index: int) int
Get the number of cores of a Vollo accelerator. For Vollo Trees bitstreams, this is the number of tree units.
If used on a VM before loading a program, it will return 0, because the VM hardware config is determined by the requirements of the loaded program.
- add_accelerator(accelerator_index: int)
Add an accelerator. The accelerator is specified by its index. The index refers to an accelerator in the sorted list of PCI addresses.
- add_job(inputs, model_index: int = 0, output_formats=None)
Sets up a computation on the Vollo.
- Parameters:
inputs – A Tensor/Array or Sequence of Tensors/Arrays.
model_index – The index of the model to use.
output_formats – (Optional) A list of Vollo number formats (self.number_format_bf16 / self.number_format_fp32) specifying the desired output types. If None, the model's native output formats are used.
- Returns:
User context handle. The results (retrieved via poll) will match the types specified in output_formats (or the model’s native types).
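The add_job / poll / get_result cycle can be wrapped in a small helper. The sketch below is illustrative, not part of the vollo_rt API; the commented usage lines assume an open context `ctx` with a program already loaded.

```python
def wait_for(ctx, handle, max_polls=20_000_000):
    """Poll `ctx` until `handle` appears in the completed-job list.

    `ctx` is anything exposing poll() -> List[int], such as a
    VolloRTContext with jobs in flight.
    """
    completed = []
    for _ in range(max_polls):
        completed += ctx.poll()
        if handle in completed:
            return handle
    raise TimeoutError(f"job {handle} did not complete")

# Usage against a real context (requires an accelerator and a loaded
# program), e.g.:
#   handle = ctx.add_job(input_arr, output_formats=[ctx.number_format_fp32])
#   wait_for(ctx, handle)
#   output = ctx.get_result(handle)  # float32, as requested above
```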
- add_job_bf16(inputs, model_index: int = 0)
Sets up a computation where inputs and outputs are bfloat16.
- add_job_f32(inputs, model_index: int = 0) int
Sets up a computation where inputs and outputs are float32.
- add_vm(accelerator_index: int, bit_accurate: bool)
Add a VM to run a program in software simulation rather than on hardware. This allows testing the API without needing an accelerator or license, giving correct results but running much more slowly.
You can choose any accelerator_index to assign to the VM, then use the rest of the API as though the VM is an accelerator. However, the VM hardware config is determined by the requirements of the loaded program, so until you call load_program the values returned by accelerator_num_cores and accelerator_block_size will be 0.
Cannot currently be used with Vollo Trees programs.
This should be called before load_program.
Arg:
bit_accurate: Use a compute model that replicates the VOLLO accelerator with bit-accuracy. Disable to use single precision compute.
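A guarded sketch of hardware-free testing with a VM. The program path is a hypothetical example, and the whole block is skipped when no vollo_rt runtime (or numpy) is installed here.

```python
try:
    import numpy as np
    import vollo_rt

    with vollo_rt.VolloRTContext() as ctx:
        # Must be called before load_program; the VM's reported
        # num_cores / block_size stay 0 until a program is loaded.
        ctx.add_vm(0, bit_accurate=True)
        ctx.load_program("program.vollo")  # hypothetical program file
        output = ctx.run(np.random.rand(100, 50).astype(np.float32))
        status = "ok"
except Exception:
    # vollo_rt not installed, or no program file available here
    status = "skipped"

print(status)
```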
- get_result(user_ctx: int)
Retrieve the result of the computation corresponding to user_ctx.
- load_program(program: str | PathLike | Any)
Load a pre-compiled program onto the accelerator.
Arg:
program: One of:
- A path to a Vollo program (typically .vollo)
- A program straight from the Vollo Compiler or the Vollo Trees Compiler
- load_program_bytes(program_bytes: bytes)
Load a pre-compiled Vollo program from a bytes object
- model_input_shape(model_index: int = 0, input_index: int = 0) Tuple[int]
Get the shape of a model input
- model_input_streaming_dim(model_index: int = 0, input_index: int = 0) int | None
Get the input streaming dimension or None
- model_name(model_index: int = 0) str | None
Get the name of a model if set
- model_num_inputs(model_index: int = 0) int
Get the number of inputs the model uses.
- model_num_outputs(model_index: int = 0) int
Get the number of outputs the model uses.
- model_output_format(model_index: int = 0, output_index: int = 0)
Get the native number format (bf16 or fp32) of a specific output of the model. Returns the enum value (self.number_format_bf16 or self.number_format_fp32).
- model_output_shape(model_index: int = 0, output_index: int = 0) Tuple[int]
Get the shape of a model output
- model_output_streaming_dim(model_index: int = 0, output_index: int = 0) int | None
Get the output streaming dimension or None
- num_models() int
The number of models in the loaded program.
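The metadata getters above can be combined to summarise a loaded program. `describe_models` below is an illustrative helper, not part of vollo_rt; it works on any object exposing the metadata methods.

```python
def describe_models(ctx):
    """Return name and input/output shapes for each model in the program.

    `ctx` is any object exposing the vollo_rt metadata methods
    (num_models, model_name, model_num_inputs, model_input_shape,
    model_num_outputs, model_output_shape).
    """
    summary = []
    for m in range(ctx.num_models()):
        summary.append({
            "name": ctx.model_name(m),
            "inputs": [ctx.model_input_shape(m, i)
                       for i in range(ctx.model_num_inputs(m))],
            "outputs": [ctx.model_output_shape(m, o)
                        for o in range(ctx.model_num_outputs(m))],
        })
    return summary
```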
- poll() List[int]
Poll the Vollo for completion. Returns a list of user contexts corresponding to completed jobs.
Note: Polling also initiates transfers for new jobs, so you must poll before any progress on these new jobs can be made.
- run(inputs, model_index: int = 0, max_poll_iterations: int = 20000000000, sleep_duration_ms: float | None = None)
Run a single inference of a model on Vollo
This is a simpler API to run single inferences without needing to use add_job(), poll() and get_result() manually.
Arg:
sleep_duration_ms: The number of milliseconds to wait between each poll. There will be no wait if set to None.
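The semantics of run(), including sleep_duration_ms, can be sketched in terms of the low-level API. This is illustrative only (the real implementation lives in the C runtime), showing how sleeping between polls trades a little latency for less CPU spinning.

```python
import time

def run_with_sleep(ctx, inputs, model_index=0, sleep_duration_ms=None,
                   max_poll_iterations=20_000_000_000):
    """Rough equivalent of ctx.run() built on the low-level API (sketch)."""
    handle = ctx.add_job(inputs, model_index=model_index)
    completed = []
    for _ in range(max_poll_iterations):
        # poll() also initiates transfers for new jobs, so it must be
        # called before the job can make progress
        completed += ctx.poll()
        if handle in completed:
            return ctx.get_result(handle)
        if sleep_duration_ms is not None:
            time.sleep(sleep_duration_ms / 1000.0)
    raise TimeoutError("inference did not complete")
```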