- Add initial support for Alveo V80; further performance optimisations are still outstanding
- Add support for Napatech NT400D11
- Add support for vfio-pci (required for the V80); use `load-kernel-driver.sh vfio` to load it
- Add a lock to Vollo RT to prevent concurrent use of the accelerator
- Improve VM cycle count estimates for Agilex devices
- Additional model support:
  - Add support for broadcasting non-constant tensors, except along the data dimension
  - Add grouped convolution support to `vollo_torch.nn.PaddedConv1d`
  - Add support for reshape operations
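The broadcasting being referred to follows standard PyTorch/NumPy semantics; a minimal NumPy illustration (shapes are made up for the example, and NumPy is used only so the snippet stands alone):

```python
import numpy as np

# A per-channel scale (a non-constant tensor in the model) broadcast
# against a (channels, timesteps) activation tensor.
scale = np.arange(1.0, 4.0).reshape(3, 1)  # shape (3, 1)
x = np.ones((3, 5))                        # shape (3, 5)
y = x * scale                              # scale broadcasts along axis 1
assert y.shape == (3, 5)
```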
- Changes to the API of `vollo_torch.nn.Scan`: the `step` function now returns an output tensor and a separate state tensor instead of a single tensor; the `forward` method now takes both an `input_axis` and an `output_axis` instead of a single `axis` argument
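A sketch of the new `step` contract in plain Python (hypothetical names and values; the real API operates on tensors):

```python
# step now returns (output, state) instead of one tensor doing double duty.
def step(state, x):
    new_state = state + x   # carried forward to the next timestep
    output = new_state * 2  # emitted at this timestep
    return output, new_state

def scan(step, init_state, xs):
    """Apply step across a sequence, threading the state through."""
    state, outputs = init_state, []
    for x in xs:
        out, state = step(state, x)
        outputs.append(out)
    return outputs, state

outputs, final_state = scan(step, 0, [1, 2, 3])
```

Separating the output from the carried state lets the two have different shapes, which a single returned tensor could not express.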
- Update compatibility with newer IA420F boards (IA420F-0015)
- Allow weights to be shared in multi-model programs
- Add support for compiling models with multiple input tensors and multiple output tensors
- Improve accuracy of the LSTM unit
- Change the behaviour of `VOLLO_FP32_ROUND` in Vollo RT so that it is enabled by default; set it to 0 to truncate f32 inputs
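Rounding versus truncating matters when f32 inputs are narrowed to the accelerator's bf16 representation (the bit-accurate bf16 simulation is mentioned elsewhere in these notes). A standalone sketch of the two conversions, assuming round-to-nearest-even:

```python
import struct

def f32_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32_to_bf16_truncate(x: float) -> float:
    # Drop the low 16 mantissa bits outright.
    return bits_f32(f32_bits(x) & 0xFFFF0000)

def f32_to_bf16_round(x: float) -> float:
    # Round to nearest (ties to even), then drop the low 16 bits.
    b = f32_bits(x)
    b += 0x7FFF + ((b >> 16) & 1)
    return bits_f32(b & 0xFFFF0000)

x = 1.0 + 2**-8 + 2**-10        # representable in f32, not in bf16
assert f32_to_bf16_truncate(x) == 1.0
assert f32_to_bf16_round(x) == 1.0078125  # 1 + 2**-7, the nearer bf16 value
```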
- Change `vollo-tool reset --pci` to `vollo-tool reset-pci`
- Expand supported PyTorch stacking and concatenating operations: `concatenate`, `stack`, `vstack`, `hstack`, `row_stack`, `column_stack`, `dstack`
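These are the standard PyTorch stacking operations; NumPy has the same-named functions with the same semantics, so a quick self-contained illustration of how the variants differ on 1-D inputs:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

assert np.concatenate([a, b]).shape == (6,)     # join along an existing axis
assert np.stack([a, b]).shape == (2, 3)         # new leading axis
assert np.vstack([a, b]).shape == (2, 3)        # stack as rows
assert np.hstack([a, b]).shape == (6,)          # 1-D inputs just concatenate
assert np.column_stack([a, b]).shape == (3, 2)  # stack as columns
assert np.dstack([a, b]).shape == (1, 3, 2)     # stack along a third axis
```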
- Expand supported PyTorch transposition operations: `permute`, `swapdims`, `swapaxes`, `t`, `T`, `mT`
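Again these are standard PyTorch operations; the equivalent NumPy calls show what each does to tensor shapes:

```python
import numpy as np

x = np.zeros((2, 3, 4))
assert np.transpose(x, (2, 0, 1)).shape == (4, 2, 3)  # like torch.permute
assert np.swapaxes(x, 0, 2).shape == (4, 3, 2)        # swapaxes / swapdims

m = np.zeros((3, 5))
assert m.T.shape == (5, 3)  # like torch's t() / .T / .mT on a matrix
```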
- Initial support for `torch.nn.LSTM`
- Performance improvements in VM simulation, especially for LSTMs
- Improve error messages from `vollo_torch.fx.nnir.to_nnir` for unsupported field accesses (`getattr`) in PyTorch models
- Add an `f32_round` argument to the `vollo_compiler.VM.run` methods to choose whether to round or truncate f32 inputs (previously always rounded)
- Fix handling of non-contiguous input arrays/tensors in the Vollo RT Python bindings
- Fix bug in `streaming_transform` for tensor sum reductions
- Runtime/bitstream optimisation for small inputs (using MMIO instead of DMA)
- Scheduling and architecture optimisations
- Add a `reset` subcommand to `vollo-tool`
- Support ReLU via `torch.relu`
- Separate bitstreams from the Vollo SDK
- Add the `c2b64d` hardware config to support models up to 8M parameters (bitstream and compiler)
- Improve compiler error messages
- Fix the example build
- Fix incorrect `vollo_rt_accelerator_num_cores` introduced in 20.0.1
- Add `vollo_rt_add_vm` to test the `vollo-rt` API without an accelerator
- Add `vollo_rt_load_program_from_buffer` and `vollo_compiler.Program.{save,load}_bytes`
- Add `vollo_torch.nn.RecurrentStateLSTM` for modelling streaming LSTM state across forward passes
- Codegen fix for `vollo_torch.nn.Scan`
- Fix incorrect input validation for `torch.sum` layers
- Change the vollo-rt example to compile with older C compilers
- Add support for `LayerNorm`
- Add support for `RMSNorm`
- Add support for square-root operations (`torch.sqrt` and `torch.rsqrt`)
- Add support for summing over the data dimension
- Add `cycle_count_per_inference` and `compute_duration_per_inference_us` `Program` methods
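The two new metrics are related by the accelerator clock; a sketch of that relationship (the clock frequency below is hypothetical, purely for illustration):

```python
def compute_duration_us(cycle_count: int, clock_mhz: float) -> float:
    """Convert a cycle count into microseconds at a given clock.

    clock_mhz is cycles per microsecond, so the division is direct.
    """
    return cycle_count / clock_mhz

# e.g. 3200 cycles at a hypothetical 320 MHz accelerator clock:
assert compute_duration_us(3200, 320.0) == 10.0
```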
- Add support for a wider range of torch arithmetic operation aliases
- Downgrade glibc dependency to support systems with glibc >= 2.17
- Add support for `torch.div` and `torch.Tensor.div`
- Fix compiler code-generation bug for division
- Add support for scalars on the left of division
- Add support for the `Reciprocal` node in the ONNX frontend
- Add support for division by non-constant tensors
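The division-related items above cover the standard elementwise forms; in NumPy terms (same semantics as the torch operations, shown here so the snippet is self-contained):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 4.0, 8.0])

assert np.allclose(x / y, [0.5, 0.5, 0.5])    # tensor / non-constant tensor
assert np.allclose(2.0 / x, [2.0, 1.0, 0.5])  # scalar on the left
assert np.allclose(1.0 / x, np.reciprocal(x)) # the ONNX Reciprocal node
```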
- Fix slicing in the ONNX frontend
- Fix compiler bug in constant folding
- Add support for partial updates of input data on the accelerator
- The VM now simulates the Vollo accelerator bit-accurately: the `bf16_precision` argument has been renamed to `bit_accurate` and is enabled by default
- `vollo-tool` includes license self-service
- Performance improvements due to DMA optimisation
- Add an `optimize_transforms` option to the compiler to improve the program schedule in some cases
- Add a fallback to Vollo RT and vollo-tool for when AVX is not available
- Vollo RT support for using raw DMA buffers to skip the IO copy
- Vollo RT: remove redundant/noisy warnings on error; it is the user's responsibility to check returned errors
- Compiler optimisation for `Where` nodes
- Compiler scheduling optimisations
- Vollo IP Core public documentation
- Fix vollo-tool compatibility with older bitstreams
- New DMA engine that reduces IO latencies by ~1.3 µs
- Initial support for non-streaming LSTMs
- Vollo IP Core now available on request
- Add a C library for configuring the IP Core: `vollo-cfg`
- Support for slicing/concatenation in the middle of models
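Mid-model slicing and concatenation is the split-transform-rejoin pattern; a NumPy sketch of the shape behaviour (the tensors and the doubling are made up for illustration):

```python
import numpy as np

# Split a feature vector, transform one half, and re-join the halves.
x = np.arange(8.0)
lo, hi = x[:4], x[4:]          # slice in the middle of the model
y = np.concatenate([lo * 2.0, hi])  # rejoin after the transform
assert y.shape == (8,)
```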
- Support for `BatchNorm` nodes
- Support for `Scan`/`LSTMCell` nodes
- Add an `--io-only` option to `vollo-onnx`
- Add a `program-metadata` command to `vollo-tool`
- Fix compiler bug with transposing the streaming dimension
- Fix accelerator bug in the initial state of streaming models
- Support for filtering out dropout layers
- Instruction packing improvements
- LSTM performance improvement
- Improvements to weight sharing
- Support for multi-model programs
- Provide Python bindings to Vollo RT: `vollo_rt`
- Improved support and error messages for tensor indexing in the compiler
- The unweave transform is now automatic
- Support for LSTM nodes in the ONNX frontend
- Support for squeezing, unsqueezing and reduce-sum via the `unweave` transformation
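Squeeze, unsqueeze and reduce-sum are standard PyTorch shape operations; their NumPy equivalents show the shape effects concisely:

```python
import numpy as np

x = np.ones((1, 3, 1, 5))
assert np.squeeze(x, axis=0).shape == (3, 1, 5)            # squeeze
assert np.expand_dims(x, axis=0).shape == (1, 1, 3, 1, 5)  # unsqueeze
assert x.sum(axis=3).shape == (1, 3, 1)                    # reduce sum
```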
- Improved error reporting in `vollo_torch` lowering to NNIR
- `vollo-torch`: fix type hints being incompatible with Python 3.7/3.8
- `vollo-rt.h`: fix namespacing issue (`error_t` -> `vollo_rt_error_t`)
- Runtime optimisations
- Added IO-only benchmarks
- Initial support for ONNX models in the compiler
- Support for LSTM nodes
- Improved error reporting in the compiler
- Compiler API changes
- New runtime API with access to model metadata
- HW optimisations (pointwise operations)
- IA840F support
- Support for scalar (`int`, `float`) literals in pointwise operations in `vollo-torch`
- Architectural changes in the bitstream to support the compiler
- Reduced latency from reduced core-to-core communication in the bitstream
- Add a general model compiler and VM simulation with Python bindings in `vollo-python`
- Add a PyTorch frontend to the model compiler in `vollo-torch`