## Long Short-Term Memory (LSTM) networks
We benchmark an LSTM model consisting of a stack of LSTMs followed by a linear layer.
```python
import torch.nn as nn

import vollo_torch


class LSTM(nn.Module):
    def __init__(self, num_layers, input_size, hidden_size, output_size):
        super().__init__()
        assert num_layers >= 1
        # Stack of LSTM layers; vollo_torch.nn.LSTM returns the output
        # sequence directly rather than an (output, state) tuple.
        self.lstm = vollo_torch.nn.LSTM(
            input_size, hidden_size, num_layers=num_layers
        )
        # Linear layer mapping the hidden state to the output size
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.lstm(x)
        x = self.fc(x)
        return x
```
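As a usage sketch, the model can be instantiated and run on a single timestep like any other `nn.Module`. The configuration below matches `lstm_tiny` from the tables that follow; the batch-first `(batch, sequence, feature)` input layout is an assumption for illustration.

```python
import torch

# Hypothetical instantiation matching lstm_tiny below:
# 2 layers, hidden size 128, input size == hidden size, output size 32.
model = LSTM(num_layers=2, input_size=128, hidden_size=128, output_size=32)

# Single timestep: batch size 1, sequence length 1 (assumed
# batch-first layout).
x = torch.randn(1, 1, 128)
y = model(x)
print(y.shape)  # torch.Size([1, 1, 32])
```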
For all the benchmarked models, the input size is equal to the hidden size, and the output size is set to 32. Layers refers to the number of LSTM layers in the stack. The batch size and sequence length are both set to 1, i.e. we benchmark a single timestep. Consecutive inferences are spaced apart in time so that each one is measured in isolation, minimising its latency.
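The following is a minimal sketch of how mean and 99th percentile latencies like those in the tables below could be computed; it is illustrative only and is not the harness used to produce these numbers, which are measured on the accelerator rather than on a host-side PyTorch model. The loop count and the gap between inferences are assumptions.

```python
import statistics
import time

# Reuses `model` and `x` from the sketch above.
latencies_us = []
for _ in range(10_000):
    start = time.perf_counter()
    model(x)
    latencies_us.append((time.perf_counter() - start) * 1e6)
    time.sleep(0.001)  # space out consecutive inferences

latencies_us.sort()
mean = statistics.mean(latencies_us)
p99 = latencies_us[int(0.99 * len(latencies_us))]
print(f"mean: {mean:.1f} us, 99th percentile: {p99:.1f} us")
```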
We have also had LSTM models benchmarked and audited as part of a STAC-ML submission, in which we held the lowest latency across all models. Please refer to our STAC-ML submissions for more details.
Note that Vollo's current performance, as shown in the tables below, is significantly improved over the STAC-ML submissions.
### V80: 6 core, block size 32

V80 PCIe optimisations are underway; further improvements are coming in the next release.
| Model | Layers | Hidden size | Parameters | Mean latency (μs) | 99th percentile latency (μs) |
|---|---|---|---|---|---|
| lstm_tiny | 2 | 128 | 268K | 2.8 | 2.8 |
| lstm_small | 3 | 256 | 1.6M | 3.3 | 3.4 |
| lstm_med | 3 | 480 | 5.6M | 4.3 | 4.5 |
| lstm_med_deep | 6 | 320 | 4.9M | 4.5 | 4.8 |
| lstm_large | 3 | 960 | 22.2M | 8.5 | 8.7 |
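The parameter counts can be cross-checked by hand. Below is a minimal sketch, assuming `vollo_torch.nn.LSTM` is parameterised like `torch.nn.LSTM` (an input-hidden weight, a hidden-hidden weight, and two bias vectors per layer, each with four gates); it reproduces the 268K figure for `lstm_tiny` above.

```python
def lstm_stack_params(num_layers, input_size, hidden_size, output_size):
    # Per torch.nn.LSTM layer: weight_ih, weight_hh and two bias
    # vectors, each covering the four gates.
    total = 0
    in_size = input_size
    for _ in range(num_layers):
        total += 4 * hidden_size * (in_size + hidden_size) + 8 * hidden_size
        in_size = hidden_size
    # Final linear layer: weight + bias.
    total += hidden_size * output_size + output_size
    return total

print(lstm_stack_params(2, 128, 128, 32))   # 268320, i.e. ~268K (lstm_tiny)
print(lstm_stack_params(3, 256, 256, 32))   # 1587232, i.e. ~1.6M (lstm_small)
```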
### IA-840F: 3 core, block size 64

| Model | Layers | Hidden size | Parameters | Mean latency (μs) | 99th percentile latency (μs) |
|---|---|---|---|---|---|
| lstm_tiny | 2 | 128 | 266K | 1.9 | 2.0 |
| lstm_small | 3 | 256 | 1.6M | 3.0 | 3.1 |
| lstm_med | 3 | 480 | 5.5M | 4.2 | 4.4 |
| lstm_med_deep | 6 | 320 | 4.9M | 4.3 | 4.5 |
The lstm_large model is not supported on the IA-840F accelerator card as it is too large to fit in the accelerator's memory.
### IA-420F: 6 core, block size 32

| Model | Layers | Hidden size | Parameters | Mean latency (μs) | 99th percentile latency (μs) |
|---|---|---|---|---|---|
| lstm_tiny | 2 | 128 | 266K | 2.2 | 2.3 |
| lstm_small | 3 | 256 | 1.6M | 4.2 | 4.4 |
The medium and large models (lstm_med, lstm_med_deep, lstm_large) are not supported on the IA-420F accelerator card as they are too large to fit in the accelerator's memory.