IO Round Trip

The following IO round trip times were measured from the Vollo runtime, using a program that performs no compute on the Vollo accelerator.

More specifically, this Vollo accelerator program waits for the last input byte to arrive before it sends the first output byte back. This method accounts for some of the overheads associated with IO (such as copying to the DMA buffer in the Vollo runtime), and the test is set up to show how the round trip time scales with different sizes of inputs and outputs.
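As a rough illustration of this kind of measurement, the sketch below times a blocking round trip repeatedly and summarises the samples as a mean and a p99. It is not the benchmark used for the tables: `echo` is a hypothetical stand-in for a real Vollo inference call, the helper names are made up for this example, and the nearest-rank percentile is an assumed definition of p99.

```python
import time
from statistics import fmean


def measure_round_trips(round_trip, payload, iterations=1000):
    """Time `round_trip(payload)` repeatedly; return latencies in microseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        round_trip(payload)  # assumed to block until the full output is back
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return samples


def summarise(samples_us):
    """Return (mean, p99) using the nearest-rank percentile definition."""
    ordered = sorted(samples_us)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based rank
    return fmean(ordered), ordered[rank - 1]


def echo(payload):
    # Hypothetical stand-in for a Vollo inference call; it only copies bytes.
    return bytes(payload)


if __name__ == "__main__":
    mean_us, p99_us = summarise(measure_round_trips(echo, bytes(64)))
    print(f"mean={mean_us:.2f} us  p99={p99_us:.2f} us")
```

Timing on the host side, around the whole call, is what lets runtime overheads such as buffer copies show up in the result.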

The following tables show the round trip times in μs on the V80 board (similar times were observed on other boards) for different input and output sizes. Rows are output sizes and columns are input sizes. Using fewer than 64 bytes gives the same times as 64 bytes.

To reproduce these values on your own hardware, run the provided benchmark script with the environment variable RUN_IO_TEST=1.
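If the benchmark is launched from another program rather than a shell, the variable can be set on the child process only. The sketch below shows one way to do that in Python; the script path in the final comment is a placeholder (the actual path is not given here), and the helper names are hypothetical.

```python
import os
import subprocess


def io_test_env(base_env=None):
    """Return a copy of the environment with RUN_IO_TEST=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RUN_IO_TEST"] = "1"
    return env


def run_io_test(command):
    """Run the benchmark command (a list of arguments) with the IO test enabled."""
    return subprocess.run(command, env=io_test_env(), check=True)


# Example, with a placeholder path that must be replaced by the real script:
# run_io_test(["./path/to/benchmark-script"])
```

Copying the environment keeps the setting local to the benchmark run, so other processes started from the same program are unaffected.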

User buffers

These times include copying data to and from the DMA buffers.

mean:

out \ in       64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B   8192 B  16384 B
64 B           1.47     1.49     1.54     1.55     1.61     1.69     1.93     2.15     2.77
128 B          1.54     1.54     1.59     1.65     1.74     1.75     1.93     2.28     2.90
256 B          1.60     1.63     1.68     1.69     1.85     1.88     1.94     2.30     2.91
512 B          1.70     1.74     1.76     1.87     1.93     1.93     2.06     2.32     2.96
1024 B         1.71     1.85     1.80     1.87     1.93     1.93     2.08     2.42     3.01
2048 B         1.84     1.92     1.93     1.93     1.93     2.01     2.17     2.45     3.10
4096 B         2.01     2.05     2.09     2.13     2.20     2.31     2.39     2.68     3.38
8192 B         2.51     2.53     2.58     2.58     2.62     2.66     2.90     3.20     3.88
16384 B        3.45     3.44     3.51     3.55     3.67     3.72     3.81     3.89     4.83

p99:

out \ in       64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B   8192 B  16384 B
64 B           1.75     1.77     1.70     1.81     1.78     1.95     2.07     2.31     2.92
128 B          1.71     1.79     1.90     1.91     1.97     2.05     2.13     2.52     3.11
256 B          1.87     1.94     1.97     1.99     2.00     2.08     2.15     2.54     3.09
512 B          1.94     1.98     1.99     2.00     2.00     2.10     2.32     2.50     3.26
1024 B         2.03     2.00     2.02     2.04     2.08     2.11     2.33     2.73     3.28
2048 B         2.09     2.08     2.10     2.08     2.06     2.28     2.44     2.71     3.35
4096 B         2.29     2.32     2.35     2.38     2.44     2.52     2.67     2.96     3.65
8192 B         2.75     2.92     2.77     2.79     3.10     3.16     3.08     3.64     4.13
16384 B        3.83     3.83     3.87     3.91     3.92     4.04     4.15     4.20     5.35

Raw DMA buffers

These times use buffers allocated with vollo_rt_get_raw_buffer, which lets the runtime skip the IO copy.

mean:

out \ in       64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B   8192 B  16384 B
64 B           1.45     1.48     1.54     1.54     1.59     1.66     1.93     2.15     2.76
128 B          1.47     1.49     1.54     1.54     1.59     1.66     1.93     2.15     2.76
256 B          1.53     1.54     1.54     1.57     1.61     1.71     1.93     2.15     2.77
512 B          1.54     1.56     1.61     1.64     1.66     1.75     1.93     2.21     2.83
1024 B         1.59     1.61     1.65     1.66     1.72     1.78     1.93     2.22     2.90
2048 B         1.66     1.66     1.72     1.74     1.78     1.93     2.01     2.32     2.91
4096 B         1.88     1.93     1.93     1.93     1.93     2.00     2.15     2.45     3.09
8192 B         2.08     2.12     2.15     2.15     2.21     2.26     2.42     2.77     3.35
16384 B        2.65     2.69     2.71     2.77     2.77     2.91     3.00     3.33     3.92

p99:

out \ in       64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B   8192 B  16384 B
64 B           1.73     1.75     1.74     1.67     1.85     1.83     2.10     2.38     2.98
128 B          1.74     1.75     1.71     1.67     1.85     1.89     2.10     2.37     2.97
256 B          1.78     1.77     1.64     1.84     1.76     1.99     2.06     2.38     2.94
512 B          1.61     1.84     1.80     1.91     1.75     2.02     2.02     2.44     3.10
1024 B         1.85     1.82     1.88     1.77     1.97     2.04     1.99     2.60     3.11
2048 B         1.74     1.94     1.94     2.02     2.05     2.07     2.28     2.54     3.01
4096 B         2.09     2.08     2.05     2.04     2.00     2.28     2.34     2.71     3.35
8192 B         2.29     2.40     2.33     2.34     2.46     2.42     2.55     2.92     3.63
16384 B        2.90     2.96     2.92     2.95     2.92     3.10     3.28     3.50     4.20
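One way to read the user buffer and raw DMA buffer results together is to subtract the mean times for the same sizes, which gives a rough estimate of the cost of the IO copy. The sketch below does this for the cells where the input and output sizes are equal, using values copied from the two mean tables above. The variable and function names are made up for this example, and the differences are only indicative since both sets of numbers are measured means.

```python
# Mean round trip times in microseconds for equal input and output sizes,
# copied from the "User buffers" and "Raw DMA buffers" mean tables.
USER_MEAN_US = {64: 1.47, 128: 1.54, 256: 1.68, 512: 1.87, 1024: 1.93,
                2048: 2.01, 4096: 2.39, 8192: 3.20, 16384: 4.83}
RAW_MEAN_US = {64: 1.45, 128: 1.49, 256: 1.54, 512: 1.64, 1024: 1.72,
               2048: 1.93, 4096: 2.15, 8192: 2.77, 16384: 3.92}


def copy_overhead():
    """Return {size_in_bytes: user_mean - raw_mean} rounded to 0.01 us."""
    return {size: round(USER_MEAN_US[size] - RAW_MEAN_US[size], 2)
            for size in USER_MEAN_US}


if __name__ == "__main__":
    for size, delta in copy_overhead().items():
        print(f"{size:>6} B in/out: copy adds about {delta:.2f} us")
```

In these measurements the difference is negligible at 64 B and grows to a little under 1 μs at 16384 B in each direction.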

MMIO

Here the input is sent over MMIO instead of DMA (raw DMA buffers are still used for the output). This is configured via the VOLLO_MMIO_MAX_SIZE environment variable.

mean:

out \ in       64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B   8192 B  16384 B
64 B           0.89     0.89     0.88     0.91     0.97     1.09     1.35     1.85     2.88
128 B          0.89     0.90     0.89     0.91     0.98     1.11     1.36     1.87     2.89
256 B          0.91     0.92     0.91     0.94     1.01     1.13     1.39     1.89     2.91
512 B          0.98     0.97     0.98     0.99     1.06     1.19     1.44     1.94     2.97
1024 B         1.00     1.01     1.00     1.03     1.09     1.22     1.48     1.98     3.00
2048 B         1.07     1.08     1.09     1.10     1.17     1.29     1.55     2.05     3.07
4096 B         1.22     1.22     1.23     1.24     1.31     1.43     1.68     2.19     3.22
8192 B         1.49     1.50     1.50     1.52     1.58     1.72     1.97     2.49     3.51
16384 B        2.07     2.07     2.08     2.10     2.17     2.29     2.55     3.06     4.09

p99:

out \ in       64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B   8192 B  16384 B
64 B           0.91     0.90     0.90     0.92     0.98     1.10     1.37     1.86     2.90
128 B          0.91     0.91     0.91     0.93     0.99     1.11     1.37     1.88     2.91
256 B          0.93     0.93     0.93     0.95     1.02     1.14     1.40     1.90     2.93
512 B          1.00     0.99     0.99     1.01     1.07     1.20     1.46     1.96     2.98
1024 B         1.02     1.02     1.02     1.04     1.10     1.23     1.49     1.99     3.02
2048 B         1.09     1.09     1.11     1.12     1.18     1.31     1.56     2.06     3.10
4096 B         1.24     1.24     1.25     1.25     1.32     1.44     1.70     2.21     3.24
8192 B         1.51     1.51     1.51     1.53     1.60     1.73     1.98     2.50     3.53
16384 B        2.09     2.09     2.09     2.11     2.18     2.31     2.56     3.08     4.11
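
Since the MMIO measurements also use raw DMA buffers for the output, they can be compared directly with the raw DMA buffer tables to see for which input sizes MMIO helps. The sketch below does this for the 64 B output row of the two mean tables. The values are copied from those tables, the names are made up for this example, and the result describes only these measurements rather than a general rule.

```python
INPUT_SIZES = [64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]

# Mean round trip times in microseconds for a 64 B output, copied from the
# "Raw DMA buffers" and "MMIO" mean tables.
RAW_DMA_US = [1.45, 1.48, 1.54, 1.54, 1.59, 1.66, 1.93, 2.15, 2.76]
MMIO_US = [0.89, 0.89, 0.88, 0.91, 0.97, 1.09, 1.35, 1.85, 2.88]


def mmio_saving():
    """Return {input_size: raw_dma_mean - mmio_mean}; positive means MMIO is faster."""
    return {size: round(dma - mmio, 2)
            for size, dma, mmio in zip(INPUT_SIZES, RAW_DMA_US, MMIO_US)}


if __name__ == "__main__":
    for size, saving in mmio_saving().items():
        verdict = "faster" if saving > 0 else "slower"
        print(f"{size:>6} B input: MMIO is {abs(saving):.2f} us {verdict}")
```

For this row, MMIO saves roughly 0.6 μs for inputs up to 4096 B, about 0.3 μs at 8192 B, and is slightly slower than DMA at 16384 B, which may be a useful starting point when choosing a value for VOLLO_MMIO_MAX_SIZE.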