C API
The Vollo runtime API is a C API with simple types and functions in order to be straight forward to use from any language with a C FFI.
- Header file:
$VOLLO_SDK/include/vollo-rt.h
- Dynamic library:
$VOLLO_SDK/lib/libvollo_rt.so
- Static library:
$VOLLO_SDK/lib/libvollo_rt.a
It links against GLIBC (from version 2.27), contact us if you have other requirements.
To compile against Vollo RT with a standard C compiler, you can use the following flags:
-I $VOLLO_SDK/include -L $VOLLO_SDK/lib -lvollo_rt
These are the main steps (in order) a program using vollo_rt
will follow:
- Initialise the Vollo runtime using
vollo_rt_init
- Add Vollo accelerators to the runtime using
vollo_rt_add_accelerator
(Note: the current release only supports one accelerator) - Load a Vollo program onto the Vollo accelerators with
vollo_rt_load_program
- Optionally, inspect the metadata about the models in the program using API calls such as
vollo_rt_num_models
andvollo_rt_model_num_inputs
- Queue and run inference jobs by first calling
vollo_rt_add_job_bf16
(orvollo_rt_add_job_fp32
) and then polling in a loop for their completion usingvollo_rt_poll
. You can queue several jobs before callingvollo_rt_poll
or add extra jobs at any point. - Finally call
vollo_rt_destroy
to release resources.
The API is designed to explicitly return errors when it can to let the user handle them as they see fit. The metadata functions will instead error out themselves if any of the documented pre-conditions they rely on aren't met. Any other crash is considered a bug and we would be very grateful if you could tell us about it.
Initialisation
A vollo context is created by calling vollo_rt_init
.
Add an accelerator by using the vollo_rt_add_accelerator
function.
/**
* Initialise the vollo-rt context. This must be called before any other vollo-rt functions.
*
* Logging level can be configured by setting the environment variable `VOLLO_RT_LOG` to one of:
* "error", "warn", "info", "debug", or "trace"
*/
vollo_rt_error_t vollo_rt_init(vollo_rt_context_t* context_ptr);
/**
* Destroy vollo-rt context, releasing its associated resources.
*/
void vollo_rt_destroy(vollo_rt_context_t vollo);
/**
* Add an accelerator.
* The accelerator is specified by its index. The index refers to an accelerator in the sorted list
* of PCI addresses. This should be called after `vollo_rt_init` but before `vollo_rt_load_program`
*/
vollo_rt_error_t vollo_rt_add_accelerator(vollo_rt_context_t vollo, size_t accelerator_index);
Loading a program
A program is loaded onto the Vollo accelerator using the vollo_rt_load_program
function.
/**
* Load a program onto the Vollo accelerators.
* This should be called after `vollo_rt_add_accelerator`
*
* A Vollo program is generated by the Vollo compiler, it is typically named
* "<program_name>.vollo".
* The program is intended for a specific hardware config (number of accelerators,
* cores and other configuration options), this function will return an
* error if any accelerator configuration is incompatible with the program.
* Once loaded, the program provides inference for several models concurrently.
*
* Note: This should only be called once per `vollo_rt_context_t`, as such if
* a program needs to be changed or reset, first `vollo_rt_destroy` the current
* context, then start a new context with `vollo_rt_init`.
*/
vollo_rt_error_t vollo_rt_load_program(vollo_rt_context_t vollo, const char* program_path);
Model metadata
Once a program is loaded, it provides inference for one or more models. Metadata about
a model is obtained with vollo_rt_model_*
functions.
Each model can have multiple distinct inputs and outputs. Each input and each output has a multi-dimensional shape associated with it. All of the metadata is defined by the program as supplied by the Vollo compiler. All the shapes are statically defined.
Some models can be compiled as streaming statefully over a dimension, that dimension is then erased from the inference shape but its possition can be recovered in the model metadata.
/**
* Inspect the number of models in the program loaded onto the vollo.
*
* Programs can contain multiple models, a `model_index` is used to select a
* specific model
*/
size_t vollo_rt_num_models(vollo_rt_context_t vollo);
/**
* Get the number of inputs of a model
*
* Each input has its own distinct shape
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
*/
size_t vollo_rt_model_num_inputs(vollo_rt_context_t vollo, size_t model_index);
/**
* Get the number of outputs of a model
*
* Each output has its own distinct shape
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
*/
size_t vollo_rt_model_num_outputs(vollo_rt_context_t vollo, size_t model_index);
/**
* Get the shape for input at a given index
*
* The return value is a 0 terminated array of dims containing the input shape
* The value lives for as long as the model
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `input_index < vollo_rt_model_num_inputs`
*/
const size_t* vollo_rt_model_input_shape(
vollo_rt_context_t vollo, size_t model_index, size_t input_index);
/**
* Get the shape for output at a given index
*
* The return value is a 0 terminated array of dims containing the output shape
* The value lives for as long as the model
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `output_index < vollo_rt_model_num_outputs`
*/
const size_t* vollo_rt_model_output_shape(
vollo_rt_context_t vollo, size_t model_index, size_t output_index);
/**
* Get the number of elements for input at a given index
*
* This is simply the product of the dimensions returned by `vollo_rt_model_input_shape`,
* it is provided to make it easier to allocate the correct number of elements.
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `input_index < vollo_rt_model_num_inputs`
*/
size_t vollo_rt_model_input_num_elements(
vollo_rt_context_t vollo, size_t model_index, size_t input_index);
/**
* Get the number of elements for output at a given index
*
* This is simply the product of the dimensions returned by `vollo_rt_model_output_shape`,
* it is provided to make it easier to allocate the correct number of elements.
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `output_index < vollo_rt_model_num_outputs`
*/
size_t vollo_rt_model_output_num_elements(
vollo_rt_context_t vollo, size_t model_index, size_t output_index);
/**
* In a streaming model, the streaming dimension is not part of the shape.
*
* - It returns -1 when there is no streaming dimension
* - It otherwise returns the dim index
* For example, for a shape `(a, b, c)` and streaming dim index 1, the full shape is:
* `(a, streaming_dim, b, c)`
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `input_index < vollo_rt_model_num_inputs`
*/
int vollo_rt_model_input_streaming_dim(
vollo_rt_context_t vollo, size_t model_index, size_t input_index);
/**
* In a streaming model, the streaming dimension is not part of the shape.
*
* - It returns -1 when there is no streaming dimension
* - It otherwise returns the dim index
* For example, for a shape `(a, b, c)` and streaming dim index 1, the full shape is:
* `(a, streaming_dim, b, c)`
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `output_index < vollo_rt_model_num_outputs`
*/
int vollo_rt_model_output_streaming_dim(
vollo_rt_context_t vollo, size_t model_index, size_t output_index);
Running inference
The interface returns results asynchronously so that inference requests can be made as fast
as the system can support, without blocking on output data being returned. This way, it also
supports running multiple requests concurrently.
Before any compute is started a job with associated input and output buffers needs to be
registered with the runtime using one of vollo_rt_add_job_bf16
or vollo_rt_add_job_fp32
.
The bf16
variant uses bfloat16
which is effectively a cropped version of single precision floating point format
fp32
(same exponent, smaller mantissa).
Note: do NOT use C floating point literals for bf16
as it is simply a uint16_t
in the API
A fp32
variant is also provided despite the Vollo accelerator expecting its
inputs and outputs to be in fp16
. If you are working with fp32
, prefer
this version instead of the bf16
variant as it is able to make the conversion
while copying to/from DMA buffers, avoiding an extra copy.
/**
* Sets up a computation on the vollo accelerator where the inputs and outputs are in brain-float 16
* format.
*
* Note: The computation is only started on the next call to vollo_rt_poll. This way it is possible
* to set up several computations that are kicked off at the same time.
*
* - vollo:
* the context that the computation should be run on
* - model_index:
* the model to run
* - user_ctx:
* a user context that will be returned on completion. This can be used to disambiguate when
* multiple models are running concurrently.
* NOTE: the jobs for a single model are guaranteed to come back in order, but the jobs for
* different models are not.
* - input_data:
* a pointer to the start of an array with pointers to the start of the data to each input the
* number of inputs is given by `vollo_rt_model_num_inputs` each input length is the product of
* the shape given by `vollo_rt_model_input_shape`
* (or more convenient: `vollo_rt_model_input_num_elements`)
* lifetime:
* - The outer array only needs to live until `vollo_rt_add_job_bf16` returns
* - The input buffers need to live until `vollo_rt_poll` returns with the completion for
* this job
* - output_data:
* a pointer to the start of an array with pointers to the start of the data to each output
* buffer the number of outputs is given by `vollo_rt_model_num_outputs` each output length is
* the product of the shape given by `vollo_rt_model_output_shape`
* (or more convenient: `vollo_rt_model_output_num_elements`)
* lifetime:
* - The outer array only needs to live until `vollo_rt_add_job_bf16` returns
* - The output buffers need to live until `vollo_rt_poll` returns with the completion for
* this job
*/
vollo_rt_error_t vollo_rt_add_job_bf16(
vollo_rt_context_t vollo,
size_t model_index,
uint64_t user_ctx,
const bf16* const* input_data,
bf16* const* output_data);
vollo_rt_error_t vollo_rt_add_job_fp32(
vollo_rt_context_t vollo,
size_t model_index,
uint64_t user_ctx,
const float* const* input_data,
float* const* output_data);
To actually start and later complete an inference you must use the vollo_rt_poll
function
multiple times. It is typically called in a loop (with a timeout) until some or all of the jobs
are completed.
/**
* Poll the vollo accelerator for completion.
*
* Note: Polling also initiates transfers for new jobs, so poll must be called
* before any progress on these new jobs can be made.
*
* num_completed: out: the number of completed user_ctx returned
* returned_user_ctx: buffer for the returned user_ctx of completed jobs, this will only be
* valid until the next call to vollo_rt_poll.
*/
vollo_rt_error_t vollo_rt_poll(
vollo_rt_context_t vollo, size_t* num_completed, const uint64_t** returned_user_ctx);