IO Round Trip
The following IO round trip times were sampled from the Vollo runtime using a Vollo accelerator program that performs no compute.
More specifically, this program waits for the last input byte to arrive before it sends the first output byte back. The measurement therefore includes some of the overheads associated with IO (such as copying to the DMA buffer in the Vollo runtime), and the test is set up to show how the round trip time scales with different input and output sizes.
The following tables show the round trip times in μs on the V80 board (similar times
were observed on other boards) for different input and output sizes. Rows give the
output size and columns give the input size. Using fewer than 64 bytes gives the same
times as 64 bytes.
To reproduce these values on your own hardware, run the provided benchmark
script with the environment variable RUN_IO_TEST=1.
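Each table below reports two statistics per cell: the mean and the 99th percentile (p99). As a minimal sketch of how such figures could be derived from raw round trip samples, the following uses a nearest-rank percentile. The `summarize` helper and the sample values are hypothetical, and the percentile method is an assumption, since the benchmark's exact method is not stated here.

```python
import statistics


def summarize(samples_us):
    """Return (mean, p99) for a list of round trip times in microseconds."""
    if not samples_us:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_us)
    n = len(ordered)
    # Nearest-rank p99: the smallest sample with at least 99% of all samples
    # at or below it. Integer arithmetic computes ceil(0.99 * n) exactly.
    rank = (99 * n + 99) // 100
    return statistics.mean(ordered), ordered[rank - 1]


# Hypothetical data: 99 fast round trips and one slow outlier.
samples = [1.5] * 99 + [3.0]
mean_us, p99_us = summarize(samples)
print(f"mean={mean_us:.3f} us, p99={p99_us:.2f} us")
```

In this made-up data set the single outlier shifts the mean but not the p99, which is why the two statistics are worth reading side by side.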
User buffers
These times include copying data to and from the DMA buffers.
mean:
| out\in | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B | 8192 B | 16384 B |
|---|---|---|---|---|---|---|---|---|---|
| 64 B | 1.47 | 1.49 | 1.54 | 1.55 | 1.61 | 1.69 | 1.93 | 2.15 | 2.77 |
| 128 B | 1.54 | 1.54 | 1.59 | 1.65 | 1.74 | 1.75 | 1.93 | 2.28 | 2.90 |
| 256 B | 1.60 | 1.63 | 1.68 | 1.69 | 1.85 | 1.88 | 1.94 | 2.30 | 2.91 |
| 512 B | 1.70 | 1.74 | 1.76 | 1.87 | 1.93 | 1.93 | 2.06 | 2.32 | 2.96 |
| 1024 B | 1.71 | 1.85 | 1.80 | 1.87 | 1.93 | 1.93 | 2.08 | 2.42 | 3.01 |
| 2048 B | 1.84 | 1.92 | 1.93 | 1.93 | 1.93 | 2.01 | 2.17 | 2.45 | 3.10 |
| 4096 B | 2.01 | 2.05 | 2.09 | 2.13 | 2.20 | 2.31 | 2.39 | 2.68 | 3.38 |
| 8192 B | 2.51 | 2.53 | 2.58 | 2.58 | 2.62 | 2.66 | 2.90 | 3.20 | 3.88 |
| 16384 B | 3.45 | 3.44 | 3.51 | 3.55 | 3.67 | 3.72 | 3.81 | 3.89 | 4.83 |
p99:
| out\in | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B | 8192 B | 16384 B |
|---|---|---|---|---|---|---|---|---|---|
| 64 B | 1.75 | 1.77 | 1.70 | 1.81 | 1.78 | 1.95 | 2.07 | 2.31 | 2.92 |
| 128 B | 1.71 | 1.79 | 1.90 | 1.91 | 1.97 | 2.05 | 2.13 | 2.52 | 3.11 |
| 256 B | 1.87 | 1.94 | 1.97 | 1.99 | 2.00 | 2.08 | 2.15 | 2.54 | 3.09 |
| 512 B | 1.94 | 1.98 | 1.99 | 2.00 | 2.00 | 2.10 | 2.32 | 2.50 | 3.26 |
| 1024 B | 2.03 | 2.00 | 2.02 | 2.04 | 2.08 | 2.11 | 2.33 | 2.73 | 3.28 |
| 2048 B | 2.09 | 2.08 | 2.10 | 2.08 | 2.06 | 2.28 | 2.44 | 2.71 | 3.35 |
| 4096 B | 2.29 | 2.32 | 2.35 | 2.38 | 2.44 | 2.52 | 2.67 | 2.96 | 3.65 |
| 8192 B | 2.75 | 2.92 | 2.77 | 2.79 | 3.10 | 3.16 | 3.08 | 3.64 | 4.13 |
| 16384 B | 3.83 | 3.83 | 3.87 | 3.91 | 3.92 | 4.04 | 4.15 | 4.20 | 5.35 |
Raw DMA buffers
These measurements use buffers allocated with vollo_rt_get_raw_buffer, which lets the runtime skip the IO copy.
mean:
| out\in | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B | 8192 B | 16384 B |
|---|---|---|---|---|---|---|---|---|---|
| 64 B | 1.45 | 1.48 | 1.54 | 1.54 | 1.59 | 1.66 | 1.93 | 2.15 | 2.76 |
| 128 B | 1.47 | 1.49 | 1.54 | 1.54 | 1.59 | 1.66 | 1.93 | 2.15 | 2.76 |
| 256 B | 1.53 | 1.54 | 1.54 | 1.57 | 1.61 | 1.71 | 1.93 | 2.15 | 2.77 |
| 512 B | 1.54 | 1.56 | 1.61 | 1.64 | 1.66 | 1.75 | 1.93 | 2.21 | 2.83 |
| 1024 B | 1.59 | 1.61 | 1.65 | 1.66 | 1.72 | 1.78 | 1.93 | 2.22 | 2.90 |
| 2048 B | 1.66 | 1.66 | 1.72 | 1.74 | 1.78 | 1.93 | 2.01 | 2.32 | 2.91 |
| 4096 B | 1.88 | 1.93 | 1.93 | 1.93 | 1.93 | 2.00 | 2.15 | 2.45 | 3.09 |
| 8192 B | 2.08 | 2.12 | 2.15 | 2.15 | 2.21 | 2.26 | 2.42 | 2.77 | 3.35 |
| 16384 B | 2.65 | 2.69 | 2.71 | 2.77 | 2.77 | 2.91 | 3.00 | 3.33 | 3.92 |
p99:
| out\in | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B | 8192 B | 16384 B |
|---|---|---|---|---|---|---|---|---|---|
| 64 B | 1.73 | 1.75 | 1.74 | 1.67 | 1.85 | 1.83 | 2.10 | 2.38 | 2.98 |
| 128 B | 1.74 | 1.75 | 1.71 | 1.67 | 1.85 | 1.89 | 2.10 | 2.37 | 2.97 |
| 256 B | 1.78 | 1.77 | 1.64 | 1.84 | 1.76 | 1.99 | 2.06 | 2.38 | 2.94 |
| 512 B | 1.61 | 1.84 | 1.80 | 1.91 | 1.75 | 2.02 | 2.02 | 2.44 | 3.10 |
| 1024 B | 1.85 | 1.82 | 1.88 | 1.77 | 1.97 | 2.04 | 1.99 | 2.60 | 3.11 |
| 2048 B | 1.74 | 1.94 | 1.94 | 2.02 | 2.05 | 2.07 | 2.28 | 2.54 | 3.01 |
| 4096 B | 2.09 | 2.08 | 2.05 | 2.04 | 2.00 | 2.28 | 2.34 | 2.71 | 3.35 |
| 8192 B | 2.29 | 2.40 | 2.33 | 2.34 | 2.46 | 2.42 | 2.55 | 2.92 | 3.63 |
| 16384 B | 2.90 | 2.96 | 2.92 | 2.95 | 2.92 | 3.10 | 3.28 | 3.50 | 4.20 |
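One way to read the user buffer and raw DMA buffer tables together is to subtract them cell by cell, which gives a rough estimate of the cost of the IO copy. The sketch below does this for the diagonal of the two mean tables (equal input and output size). The numbers are copied from the tables above; the variable names are illustrative only, and the difference is an estimate that also contains measurement noise.

```python
# Mean round trip times in microseconds for equal input and output sizes,
# copied from the diagonals of the two "mean" tables above.
sizes_bytes = [64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]
user_buffers_us = [1.47, 1.54, 1.68, 1.87, 1.93, 2.01, 2.39, 3.20, 4.83]
raw_dma_us = [1.45, 1.49, 1.54, 1.64, 1.72, 1.93, 2.15, 2.77, 3.92]

# Estimated copy overhead: user buffer time minus raw DMA buffer time.
copy_overhead_us = {
    size: round(user - raw, 2)
    for size, user, raw in zip(sizes_bytes, user_buffers_us, raw_dma_us)
}

for size, overhead in copy_overhead_us.items():
    print(f"{size:>6} B in/out: ~{overhead:.2f} us spent copying")
```

On these figures the estimated overhead is about 0.02 μs at 64 B and about 0.91 μs at 16384 B, so raw DMA buffers matter most for large transfers.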
MMIO
This mode skips DMA for the input; raw DMA buffers are still used for the output.
It is configured via the VOLLO_MMIO_MAX_SIZE environment variable.
mean:
| out\in | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B | 8192 B | 16384 B |
|---|---|---|---|---|---|---|---|---|---|
| 64 B | 0.89 | 0.89 | 0.88 | 0.91 | 0.97 | 1.09 | 1.35 | 1.85 | 2.88 |
| 128 B | 0.89 | 0.90 | 0.89 | 0.91 | 0.98 | 1.11 | 1.36 | 1.87 | 2.89 |
| 256 B | 0.91 | 0.92 | 0.91 | 0.94 | 1.01 | 1.13 | 1.39 | 1.89 | 2.91 |
| 512 B | 0.98 | 0.97 | 0.98 | 0.99 | 1.06 | 1.19 | 1.44 | 1.94 | 2.97 |
| 1024 B | 1.00 | 1.01 | 1.00 | 1.03 | 1.09 | 1.22 | 1.48 | 1.98 | 3.00 |
| 2048 B | 1.07 | 1.08 | 1.09 | 1.10 | 1.17 | 1.29 | 1.55 | 2.05 | 3.07 |
| 4096 B | 1.22 | 1.22 | 1.23 | 1.24 | 1.31 | 1.43 | 1.68 | 2.19 | 3.22 |
| 8192 B | 1.49 | 1.50 | 1.50 | 1.52 | 1.58 | 1.72 | 1.97 | 2.49 | 3.51 |
| 16384 B | 2.07 | 2.07 | 2.08 | 2.10 | 2.17 | 2.29 | 2.55 | 3.06 | 4.09 |
p99:
| out\in | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B | 8192 B | 16384 B |
|---|---|---|---|---|---|---|---|---|---|
| 64 B | 0.91 | 0.90 | 0.90 | 0.92 | 0.98 | 1.10 | 1.37 | 1.86 | 2.90 |
| 128 B | 0.91 | 0.91 | 0.91 | 0.93 | 0.99 | 1.11 | 1.37 | 1.88 | 2.91 |
| 256 B | 0.93 | 0.93 | 0.93 | 0.95 | 1.02 | 1.14 | 1.40 | 1.90 | 2.93 |
| 512 B | 1.00 | 0.99 | 0.99 | 1.01 | 1.07 | 1.20 | 1.46 | 1.96 | 2.98 |
| 1024 B | 1.02 | 1.02 | 1.02 | 1.04 | 1.10 | 1.23 | 1.49 | 1.99 | 3.02 |
| 2048 B | 1.09 | 1.09 | 1.11 | 1.12 | 1.18 | 1.31 | 1.56 | 2.06 | 3.10 |
| 4096 B | 1.24 | 1.24 | 1.25 | 1.25 | 1.32 | 1.44 | 1.70 | 2.21 | 3.24 |
| 8192 B | 1.51 | 1.51 | 1.51 | 1.53 | 1.60 | 1.73 | 1.98 | 2.50 | 3.53 |
| 16384 B | 2.09 | 2.09 | 2.09 | 2.11 | 2.18 | 2.31 | 2.56 | 3.08 | 4.11 |
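Comparing the MMIO and raw DMA buffer tables shows that MMIO is faster for small inputs but not for the largest ones. The sketch below finds the crossover using the mean times from the 64 B output row of each table. The numbers are copied from the tables; the variable names are illustrative. Using the result to choose a value for VOLLO_MMIO_MAX_SIZE assumes that the variable acts as an upper bound on the input size sent via MMIO, which its name suggests but this section does not define.

```python
# Mean round trip times in microseconds for a 64 B output,
# copied from the first row of the raw DMA and MMIO "mean" tables.
input_sizes_bytes = [64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]
raw_dma_us = [1.45, 1.48, 1.54, 1.54, 1.59, 1.66, 1.93, 2.15, 2.76]
mmio_us = [0.89, 0.89, 0.88, 0.91, 0.97, 1.09, 1.35, 1.85, 2.88]

# Input sizes for which MMIO has the lower mean round trip time.
mmio_wins = [
    size
    for size, raw, mmio in zip(input_sizes_bytes, raw_dma_us, mmio_us)
    if mmio < raw
]

largest_mmio_win = max(mmio_wins)
print(f"MMIO is faster on average up to {largest_mmio_win} B inputs")
```

For this row MMIO wins up to 8192 B inputs and loses at 16384 B (2.88 μs against 2.76 μs), so the best setting is worth measuring for your own input sizes.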