C API

The Vollo runtime API is a C API with simple types and functions in order to be straight forward to use from any language with a C FFI.

  • Header file: $VOLLO_SDK/include/vollo-rt.h
  • Dynamic library: $VOLLO_SDK/lib/libvollo_rt.so
  • Static library: $VOLLO_SDK/lib/libvollo_rt.a

It was built against GLIBC version 2.17.

To compile against Vollo RT with a standard C compiler, you can use the following flags: -I $VOLLO_SDK/include -L $VOLLO_SDK/lib -lvollo_rt

These are the main steps (in order) a program using vollo_rt will follow:

  1. Initialise the Vollo runtime using vollo_rt_init
  2. Add Vollo accelerators to the runtime using vollo_rt_add_accelerator (Note: the current release only supports one accelerator)
  3. Load a Vollo program onto the Vollo accelerators with vollo_rt_load_program
  4. Optionally, inspect the metadata about the models in the program using API calls such as vollo_rt_num_models and vollo_rt_model_num_inputs
  5. Queue and run inference jobs by first calling vollo_rt_add_job_bf16 (or vollo_rt_add_job_fp32) and then polling in a loop for their completion using vollo_rt_poll. You can queue several jobs before calling vollo_rt_poll or add extra jobs at any point.
  6. Finally call vollo_rt_destroy to release resources.

The API is designed to explicitly return errors when it can to let the user handle them as they see fit. The metadata functions will instead error out themselves if any of the documented pre-conditions they rely on aren't met. Any other crash is considered a bug and we would be very grateful if you could tell us about it.

Initialisation

A vollo context is created by calling vollo_rt_init. Add an accelerator by using the vollo_rt_add_accelerator function.

/**
 * Initialise the vollo-rt context. This must be called before any other vollo-rt functions.
 *
 * Logging level can be configured by setting the environment variable `VOLLO_RT_LOG` to one of:
 * "error", "warn", "info", "debug", or "trace"
 */
vollo_rt_error_t vollo_rt_init(vollo_rt_context_t* context_ptr);

/**
 * Destroy vollo-rt context, releasing its associated resources.
 */
void vollo_rt_destroy(vollo_rt_context_t vollo);

/**
 * Add an accelerator.
 * The accelerator is specified by its index. The index refers to an accelerator in the sorted list
 * of PCI addresses. This should be called after `vollo_rt_init` but before `vollo_rt_load_program`
 */
vollo_rt_error_t vollo_rt_add_accelerator(vollo_rt_context_t vollo, size_t accelerator_index);

Loading a program

A program is loaded onto the Vollo accelerator using the vollo_rt_load_program function.

/**
 * Load a program onto the Vollo accelerators.
 * This should be called after `vollo_rt_add_accelerator`
 *
 * A Vollo program is generated by the Vollo compiler, it is typically named
 * "<program_name>.vollo".
 * The program is intended for a specific hardware config (number of accelerators,
 * cores and other configuration options), this function will return an
 * error if any accelerator configuration is incompatible with the program.
 * Once loaded, the program provides inference for several models concurrently.
 *
 * Note: This should only be called once per `vollo_rt_context_t`, as such if
 * a program needs to be changed or reset, first `vollo_rt_destroy` the current
 * context, then start a new context with `vollo_rt_init`.
 */
vollo_rt_error_t vollo_rt_load_program(vollo_rt_context_t vollo, const char* program_path);

Model metadata

Once a program is loaded, it provides inference for one or more models. Metadata about a model is obtained with vollo_rt_model_* functions.

Each model can have multiple distinct inputs and outputs. Each input and each output has a multi-dimensional shape associated with it. All of the metadata is defined by the program as supplied by the Vollo compiler. All the shapes are statically defined.

Some models can be compiled as streaming statefully over a dimension, that dimension is then erased from the inference shape but its possition can be recovered in the model metadata.

/**
 * Inspect the number of models in the program loaded onto the vollo.
 *
 * Programs can contain multiple models, a `model_index` is used to select a
 * specific model
 */
size_t vollo_rt_num_models(vollo_rt_context_t vollo);

/**
 * Get the number of inputs of a model
 *
 * Each input has its own distinct shape
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 */
size_t vollo_rt_model_num_inputs(vollo_rt_context_t vollo, size_t model_index);

/**
 * Get the number of outputs of a model
 *
 * Each output has its own distinct shape
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 */
size_t vollo_rt_model_num_outputs(vollo_rt_context_t vollo, size_t model_index);

/**
 * Get the shape for input at a given index
 *
 * The return value is a 0 terminated array of dims containing the input shape
 * The value lives for as long as the model
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 * - `input_index < vollo_rt_model_num_inputs`
 */
const size_t* vollo_rt_model_input_shape(
  vollo_rt_context_t vollo, size_t model_index, size_t input_index);

/**
 * Get the shape for output at a given index
 *
 * The return value is a 0 terminated array of dims containing the output shape
 * The value lives for as long as the model
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 * - `output_index < vollo_rt_model_num_outputs`
 */
const size_t* vollo_rt_model_output_shape(
  vollo_rt_context_t vollo, size_t model_index, size_t output_index);

/**
 * Get the number of elements for input at a given index
 *
 * This is simply the product of the dimensions returned by `vollo_rt_model_input_shape`,
 * it is provided to make it easier to allocate the correct number of elements.
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 * - `input_index < vollo_rt_model_num_inputs`
 */
size_t vollo_rt_model_input_num_elements(
  vollo_rt_context_t vollo, size_t model_index, size_t input_index);

/**
 * Get the number of elements for output at a given index
 *
 * This is simply the product of the dimensions returned by `vollo_rt_model_output_shape`,
 * it is provided to make it easier to allocate the correct number of elements.
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 * - `output_index < vollo_rt_model_num_outputs`
 */
size_t vollo_rt_model_output_num_elements(
  vollo_rt_context_t vollo, size_t model_index, size_t output_index);

/**
 * In a streaming model, the streaming dimension is not part of the shape.
 *
 * - It returns -1 when there is no streaming dimension
 * - It otherwise returns the dim index
 *   For example, for a shape `(a, b, c)` and streaming dim index 1, the full shape is:
 *   `(a, streaming_dim, b, c)`
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 * - `input_index < vollo_rt_model_num_inputs`
 */
int vollo_rt_model_input_streaming_dim(
  vollo_rt_context_t vollo, size_t model_index, size_t input_index);

/**
 * In a streaming model, the streaming dimension is not part of the shape.
 *
 * - It returns -1 when there is no streaming dimension
 * - It otherwise returns the dim index
 *   For example, for a shape `(a, b, c)` and streaming dim index 1, the full shape is:
 *   `(a, streaming_dim, b, c)`
 *
 * Requirements (panics otherwise):
 * - a program was loaded with `vollo_rt_load_program`
 * - `model_index < vollo_rt_num_models`
 * - `output_index < vollo_rt_model_num_outputs`
 */
int vollo_rt_model_output_streaming_dim(
  vollo_rt_context_t vollo, size_t model_index, size_t output_index);

Running inference

The interface returns results asynchronously so that inference requests can be made as fast as the system can support, without blocking on output data being returned. This way, it also supports running multiple requests concurrently. Before any compute is started a job with associated input and output buffers needs to be registered with the runtime using one of vollo_rt_add_job_bf16 or vollo_rt_add_job_fp32.

The bf16 variant uses bfloat16 which is effectively a cropped version of single precision floating point format fp32 (same exponent, smaller mantissa). Note: do NOT use C floating point literals for bf16 as it is simply a uint16_t in the API

A fp32 variant is also provided despite the Vollo accelerator expecting its inputs and outputs to be in fp16. If you are working with fp32, prefer this version instead of the bf16 variant as it is able to make the conversion while copying to/from DMA buffers, avoiding an extra copy.

/**
 * Sets up a computation on the vollo accelerator where the inputs and outputs are in brain-float 16
 * format.
 *
 * Note: The computation is only started on the next call to vollo_rt_poll. This way it is possible
 * to set up several computations that are kicked off at the same time.
 *
 * - vollo:
 *     the context that the computation should be run on
 * - model_index:
 *     the model to run
 * - user_ctx:
 *     a user context that will be returned on completion. This can be used to disambiguate when
 *     multiple models are running concurrently.
 *     NOTE: the jobs for a single model are guaranteed to come back in order, but the jobs for
 *     different models are not.
 * - input_data:
 *     a pointer to the start of an array with pointers to the start of the data to each input the
 *     number of inputs is given by `vollo_rt_model_num_inputs` each input length is the product of
 *     the shape given by `vollo_rt_model_input_shape`
 *     (or more convenient: `vollo_rt_model_input_num_elements`)
 *     lifetime:
 *       - The outer array only needs to live until `vollo_rt_add_job_bf16` returns
 *       - The input buffers need to live until `vollo_rt_poll` returns with the completion for
 *         this job
 * - output_data:
 *     a pointer to the start of an array with pointers to the start of the data to each output
 *     buffer the number of outputs is given by `vollo_rt_model_num_outputs` each output length is
 *     the product of the shape given by `vollo_rt_model_output_shape`
 *     (or more convenient: `vollo_rt_model_output_num_elements`)
 *     lifetime:
 *       - The outer array only needs to live until `vollo_rt_add_job_bf16` returns
 *       - The output buffers need to live until `vollo_rt_poll` returns with the completion for
 *         this job
 */
vollo_rt_error_t vollo_rt_add_job_bf16(
  vollo_rt_context_t vollo,
  size_t model_index,
  uint64_t user_ctx,
  const bf16* const* input_data,
  bf16* const* output_data);

vollo_rt_error_t vollo_rt_add_job_fp32(
  vollo_rt_context_t vollo,
  size_t model_index,
  uint64_t user_ctx,
  const float* const* input_data,
  float* const* output_data);

To actually start and later complete an inference you must use the vollo_rt_poll function multiple times. It is typically called in a loop (with a timeout) until some or all of the jobs are completed.

/**
 * Poll the vollo accelerator for completion.
 *
 * Note: Polling also initiates transfers for new jobs, so poll must be called
 * before any progress on these new jobs can be made.
 *
 *   num_completed: out: the number of completed user_ctx returned
 *   returned_user_ctx: buffer for the returned user_ctx of completed jobs, this will only be
 *                      valid until the next call to vollo_rt_poll.
 */
vollo_rt_error_t vollo_rt_poll(
  vollo_rt_context_t vollo, size_t* num_completed, const uint64_t** returned_user_ctx);