Example 2: CNN

Vollo supports streaming 1D convolutional neural networks (CNNs). If you are currently using a non-streaming 1D CNN, you may need to make some changes to your model.

A streaming convolution applies the convolutional kernel to the most recent window of the input sequence as the data points in the input sequence arrive. This differs from a non-streaming convolution, which expects to receive a complete input sequence and applies its convolutional kernel to each window of that input.
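
The difference can be illustrated in plain PyTorch. The sketch below is not Vollo code: the sizes, the zero-initialised window, and all variable names are assumptions chosen for illustration. It keeps only the most recent `kernel_size` data points as state and applies the kernel once per arriving data point.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the Vollo example)
in_channels, out_channels, kernel_size, seq_len = 4, 3, 3, 8
weight = torch.randn(out_channels, in_channels, kernel_size)
bias = torch.randn(out_channels)
sequence = torch.randn(1, in_channels, seq_len)

# Non-streaming: the complete sequence is available up front and the
# kernel is applied to every window of it in one call.
full_output = F.conv1d(sequence, weight, bias)

# Streaming: the only state is the most recent window of input,
# assumed here to start out as zeros.
window = torch.zeros(1, in_channels, kernel_size)
streamed = []
for t in range(seq_len):
    # Drop the oldest data point and append the newly arrived one
    window = torch.cat([window[:, :, 1:], sequence[:, :, t : t + 1]], dim=2)
    # A window of exactly kernel_size points yields one output point
    streamed.append(F.conv1d(window, weight, bias))
streamed_output = torch.cat(streamed, dim=2)

# After kernel_size - 1 warm-up steps the window holds only real data,
# and the streamed outputs match the non-streaming ones.
torch.testing.assert_close(
    streamed_output[:, :, kernel_size - 1 :], full_output, atol=1e-5, rtol=1e-4
)
```

Each streaming step touches a single window, which is why an output is available as soon as its last input data point arrives.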

Streaming convolutions will have much lower latency than non-streaming convolutions, but they have to maintain some state, namely the most recent window of input, which makes them unnatural to define in ML frameworks like PyTorch. To enable the use of streaming convolutions, the Vollo compiler includes a streaming_transform, which transforms a non-streaming CNN into a streaming CNN, as long as the non-streaming CNN meets certain constraints.

Using the streaming_transform

The model below is a non-streaming CNN taking an input sequence of length 5 and producing an output of length 1. (It can actually take any input sequence of length 5+n and produce an output of length 1+n, but we will only consider the minimal sequence length, since that is the length of the input context used by each of the output elements.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, in_channels, out_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # Reduces sequence length by (kernel_size - 1) = 2
        self.conv1 = nn.Conv1d(in_channels, hidden_channels, kernel_size)
        # Reduces sequence length by (kernel_size - 1) = 2
        self.conv2 = nn.Conv1d(hidden_channels, out_channels, kernel_size)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return x

# Instantiate the model
in_channels = 32
out_channels = 1
hidden_channels = 128
model = CNN(in_channels, out_channels, hidden_channels)
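
The sequence-length arithmetic described above can be checked directly. The following is a small sketch that uses an nn.Sequential stand-in with the same layers as the CNN class (the stand-in and the `shapes` list are illustrative names, not part of the Vollo example).

```python
import torch
import torch.nn as nn

# Same layers as the CNN class: two Conv1d layers with kernel_size=3,
# each followed by a ReLU
model = nn.Sequential(
    nn.Conv1d(32, 128, kernel_size=3),
    nn.ReLU(),
    nn.Conv1d(128, 1, kernel_size=3),
    nn.ReLU(),
)

shapes = []
with torch.no_grad():
    for n in range(3):
        x = torch.randn(1, 32, 5 + n)  # (batch, channels, sequence)
        y = model(x)
        shapes.append((x.shape[2], y.shape[2]))
        print(f"input length {x.shape[2]} -> output length {y.shape[2]}")
```

Each layer shortens the sequence by kernel_size - 1 = 2, so input lengths 5, 6 and 7 give output lengths 1, 2 and 3.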

In order to apply the streaming_transform, the torch.nn.Conv1d layers need to be replaced with vollo_torch.nn.PaddedConv1d layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

import vollo_torch.nn

class CNN(nn.Module):
    def __init__(self, in_channels, out_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.conv1 = vollo_torch.nn.PaddedConv1d(in_channels, hidden_channels, kernel_size)
        self.conv2 = vollo_torch.nn.PaddedConv1d(hidden_channels, out_channels, kernel_size)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return x

# Instantiate the model
in_channels = 32
out_channels = 1
hidden_channels = 128
model = CNN(in_channels, out_channels, hidden_channels)

These PaddedConv1d layers are identical to torch.nn.Conv1d, but with left padding pre-applied to the input so as not to reduce the sequence length.

This PaddedConv1d model is still a non-streaming model, which now takes an input sequence of length 5 and produces an output of length 5. It relates to the original Conv1d model as follows: given the same model parameters (weights, biases, etc.) and the same input sequence, the last element of the PaddedConv1d model's output sequence is equal to the single element of the Conv1d model's output sequence.
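
This relationship can be reproduced without vollo_torch. The sketch below uses a hypothetical `LeftPaddedConv1d` class as a stand-in for PaddedConv1d, assuming the padding is kernel_size - 1 zeros on the left of the sequence axis; it is an illustration, not the library's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeftPaddedConv1d(nn.Conv1d):
    """Hypothetical stand-in for PaddedConv1d (assumes left zero padding)."""

    def forward(self, x):
        # Pad kernel_size - 1 zeros on the left of the sequence axis so the
        # output sequence has the same length as the input sequence
        pad = self.kernel_size[0] - 1
        return super().forward(F.pad(x, (pad, 0)))

torch.manual_seed(0)
plain = nn.Sequential(
    nn.Conv1d(32, 128, 3), nn.ReLU(), nn.Conv1d(128, 1, 3), nn.ReLU()
)
padded = nn.Sequential(
    LeftPaddedConv1d(32, 128, 3), nn.ReLU(), LeftPaddedConv1d(128, 1, 3), nn.ReLU()
)
# Give both models identical weights and biases
padded.load_state_dict(plain.state_dict())

x = torch.randn(1, 32, 5)
with torch.no_grad():
    plain_out = plain(x)    # sequence length 1
    padded_out = padded(x)  # sequence length 5

# The last padded output sees only real input, so it matches the plain output
torch.testing.assert_close(padded_out[:, :, -1:], plain_out, atol=1e-5, rtol=1e-4)
```

The earlier elements of the padded output depend on the padding and have no counterpart in the unpadded model's output.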

The PaddedConv1d model can be lowered to NNIR and have the streaming_transform applied.

batch_size = 1
sequence_length = 5
input = torch.randn(batch_size, in_channels, sequence_length)
(model, expected_output) = vollo_torch.fx.prepare_shape(model, input)
nnir = vollo_torch.fx.nnir.to_nnir(model)

# Provide the streaming transform with the index of the sequence axis
(nnir, output_axis) = nnir.streaming_transform(2)

The resulting NNIR graph represents a streaming CNN: it contains state, takes a single data point of a sequence as input, and produces a single data point as output, updating its input window state in the process. Input sequences must be fed to the streaming CNN one data point at a time, e.g. in a loop. For example, using the VM:

import vollo_compiler

program = nnir.to_program(vollo_compiler.Config.ia_420f_c6b32())
vm = program.to_vm()

vm_outputs = []
for i in range(sequence_length):
    # Runs inference on one element of the input sequence, updating the
    # streaming CNN's state
    vm_outputs.append(vm.run(input[:, :, i].detach().numpy()))

torch.testing.assert_close(
    expected_output,
    torch.stack(
        [torch.from_numpy(output) for output in vm_outputs],
        dim=output_axis,
    ),
    atol=1e-2,
    rtol=1e-2,
)

The streaming CNN satisfies the following property: given an input sequence, the i-th element of the output sequence of the non-streaming CNN (the PaddedConv1d model) is equal to the output produced by the streaming CNN when it is fed the i-th data point of that sequence.
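
The same property can be emulated in plain PyTorch for the two-layer model. The sketch below is only an illustration: the `StreamingConv1d` helper is hypothetical, and it assumes zero left padding in the non-streaming model and a zero-initialised window in each streaming layer. It says nothing about how the Vollo compiler implements the transform.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamingConv1d:
    """Hypothetical streaming wrapper around an nn.Conv1d (illustration only)."""

    def __init__(self, conv):
        self.conv = conv
        # State: the most recent kernel_size inputs, assumed to start as zeros
        self.window = torch.zeros(1, conv.in_channels, conv.kernel_size[0])

    def step(self, x_t):
        # x_t has shape (1, in_channels): one data point of the sequence
        self.window = torch.cat([self.window[:, :, 1:], x_t.unsqueeze(2)], dim=2)
        return self.conv(self.window).squeeze(2)  # shape (1, out_channels)

torch.manual_seed(0)
conv1 = nn.Conv1d(32, 128, 3)
conv2 = nn.Conv1d(128, 1, 3)
x = torch.randn(1, 32, 5)

with torch.no_grad():
    # Non-streaming reference: left zero padding before each convolution
    hidden = F.relu(conv1(F.pad(x, (2, 0))))
    reference = F.relu(conv2(F.pad(hidden, (2, 0))))

    # Streaming: one data point per step, each layer keeping its own window
    layer1, layer2 = StreamingConv1d(conv1), StreamingConv1d(conv2)
    steps = []
    for i in range(x.shape[2]):
        hidden_t = F.relu(layer1.step(x[:, :, i]))
        steps.append(F.relu(layer2.step(hidden_t)))
    streamed = torch.stack(steps, dim=2)

# The output of step i equals element i of the non-streaming output
torch.testing.assert_close(streamed, reference, atol=1e-5, rtol=1e-4)
```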

The streaming CNN can be saved and run on the accelerator like any other program:

program.save('cnn.vollo')