I've been investigating our runai-streamer model loading performance some more. We manage to load Kimi-K2.5 in ~35-45 seconds; that's 620GB worth of weights at ~9GB/s. We are connected to a VAST cluster via NFS 4.1 over RDMA on ConnectX-7, for a theoretical maximum of 50GB/s. In testing with fio I can reach ~42GB/s with a massively parallel read over all files at once. For example, 64 different files, 1 job per shard, 4 MiB blocks, iodepth 32, direct=1 (time-based) gives ~37 GiB/s (~40 GB/s).
So the streamer is still a fair bit off what I can achieve with fio.
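For anyone who wants to try the same layout on their own storage, here is a minimal sketch that generates a fio job file with one job per shard. The global options are the ones from the benchmark described above; the function name `build_fio_job` and the shard paths are hypothetical and only there for illustration.

```python
# Global options taken from the benchmark described in the post.
GLOBAL_OPTS = {
    "ioengine": "io_uring",
    "direct": 1,
    "rw": "read",
    "bs": "4m",
    "iodepth": 32,
    "time_based": 1,
    "runtime": 20,
    "group_reporting": 1,
}


def build_fio_job(shard_paths):
    """Return fio job-file text with one job section per shard path."""
    lines = ["[global]"]
    lines += [f"{key}={value}" for key, value in GLOBAL_OPTS.items()]
    # Zero-pad job names so 64 shards become f01 .. f64.
    width = max(2, len(str(len(shard_paths))))
    for index, path in enumerate(shard_paths, start=1):
        lines.append("")
        lines.append(f"[f{index:0{width}d}]")
        lines.append(f"filename={path}")
    return "\n".join(lines) + "\n"
```

Calling `build_fio_job` with the list of shard files and writing the result to a `.fio` file gives one reader per shard, so all files are read concurrently.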
Logs from vLLM (deduplicated, since every Ray worker prints the same line):
INFO file_streamer.py:66: [RunAI Streamer] Overall time to stream 69.3 GiB of all files to cpu: 41.52s, 1.7 GiB/s
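As a sanity check on that log line, a small sketch that recomputes the rate from the reported size and duration and converts it from GiB/s to GB/s, so it can be compared with the GB/s figures above. The regular expression and the helper name `streamer_rate` are my own; I'm assuming the line reports one process's share rather than the whole model.

```python
import re

GIB = 2**30
GB = 10**9

# Matches the size and duration in the RunAI Streamer summary line.
LOG_PATTERN = re.compile(
    r"stream (?P<size>[\d.]+) GiB of all files to cpu: (?P<secs>[\d.]+)s"
)


def streamer_rate(log_line):
    """Return (GiB/s, GB/s) computed from the size and duration in a log line."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("not a RunAI Streamer summary line")
    size_gib = float(match.group("size"))
    seconds = float(match.group("secs"))
    gib_per_s = size_gib / seconds
    return gib_per_s, gib_per_s * GIB / GB


line = (
    "INFO file_streamer.py:66: [RunAI Streamer] Overall time to stream "
    "69.3 GiB of all files to cpu: 41.52s, 1.7 GiB/s"
)
gib_s, gb_s = streamer_rate(line)
print(f"{gib_s:.2f} GiB/s = {gb_s:.2f} GB/s")
```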
fio benchmark details here
```
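# Write the shared [global] section.
cat > /tmp/multi64_4m_direct_1x_rt20.fio <<EOF
[global]
ioengine=io_uring
direct=1
rw=read
bs=4m
iodepth=32
time_based=1
runtime=20
group_reporting=1
EOF
# Append one job section per shard (f01..f64); shard path elided.
for i in $(seq -w 1 64); do
cat >> /tmp/multi64_4m_direct_1x_rt20.fio <<EOF
[f$i]
filename=...
EOF
done
fio /tmp/multi64_4m_direct_1x_rt20.fio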
Defaulted container "model-container" out of: model-container, batch-k8s-preflight-checks (init)
f01: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=io_uring, iodepth=32
[...] (abbreviated)
f64: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=io_uring, iodepth=32
fio-3.36
Starting 64 processes
f01: (groupid=0, jobs=64): err= 0: pid=71578: Wed Mar 18 23:36:21 2026
read: IOPS=9710, BW=37.9GiB/s (40.7GB/s)(769GiB/20276msec)
slat (nsec): min=270, max=2178.2k, avg=1876.29, stdev=6372.09
clat (msec): min=4, max=2252, avg=188.31, stdev=241.88
lat (msec): min=4, max=2252, avg=188.32, stdev=241.88
clat percentiles (msec):
| 1.00th=[ 19], 5.00th=[ 46], 10.00th=[ 55], 20.00th=[ 66],
| 30.00th=[ 86], 40.00th=[ 108], 50.00th=[ 126], 60.00th=[ 182],
| 70.00th=[ 222], 80.00th=[ 271], 90.00th=[ 305], 95.00th=[ 342],
| 99.00th=[ 1821], 99.50th=[ 1989], 99.90th=[ 2198], 99.95th=[ 2232],
| 99.99th=[ 2265]
bw ( MiB/s): min=10696, max=85688, per=100.00%, avg=47119.23, stdev=345.39, samples=2090
iops : min= 2674, max=21422, avg=11779.81, stdev=86.35, samples=2090
lat (msec) : 10=0.04%, 20=1.27%, 50=5.84%, 100=28.95%, 250=40.47%
lat (msec) : 500=21.37%, 750=0.30%, 2000=1.32%, >=2000=0.44%
cpu : usr=0.01%, sys=8.15%, ctx=223803, majf=0, minf=668
IO depths : 1=0.1%, 2=0.3%, 4=0.6%, 8=1.2%, 16=1.9%, 32=95.8%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
issued rwts: total=196896,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=37.9GiB/s (40.7GB/s), 37.9GiB/s-37.9GiB/s (40.7GB/s-40.7GB/s), io=769GiB (826GB), run=20276-20276msec
```
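To make the fio numbers easier to compare with the rest of the post, here is a rough cross-check of the summary above. It recomputes the bandwidth from total I/O and runtime, checks it against IOPS times block size, and uses Little's law (requests in flight = throughput × latency) to estimate how many requests were outstanding on average. All inputs are copied from the output; the variable names are mine.

```python
GIB = 2**30
GB = 10**9
MIB = 2**20

# Values copied from the fio summary.
io_gib = 769            # io=769GiB
runtime_s = 20.276      # run=20276msec
iops = 9710             # read: IOPS=9710
block_bytes = 4 * MIB   # bs=4m
avg_lat_s = 0.18832     # lat avg=188.32 msec
jobs, iodepth = 64, 32  # 64 jobs, iodepth=32
link_gb_s = 50          # theoretical maximum quoted in the post

bw_gib_s = io_gib / runtime_s                  # bandwidth from total I/O
bw_gb_s = bw_gib_s * GIB / GB                  # same value in decimal GB/s
bw_from_iops_gib_s = iops * block_bytes / GIB  # bandwidth from IOPS * block size
in_flight = iops * avg_lat_s                   # Little's law estimate
link_fraction = bw_gb_s / link_gb_s            # share of the 50 GB/s ceiling

print(f"bandwidth: {bw_gib_s:.1f} GiB/s = {bw_gb_s:.1f} GB/s")
print(f"from IOPS: {bw_from_iops_gib_s:.1f} GiB/s")
print(f"in flight: {in_flight:.0f} of {jobs * iodepth} possible")
print(f"link use:  {link_fraction:.0%}")
```

The two bandwidth figures agree, and the estimated number of requests in flight stays below the configured ceiling of 64 × 32, which is consistent with the queues being kept mostly full.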
Following the fio investigation, I did a bit of analysis with an LLM internally: https://gist.github.com/bbartels/947d464888895445499ba64ed85a1f90
It could be complete nonsense, but the points in https://gist.github.com/bbartels/947d464888895445499ba64ed85a1f90#highest-value-improvement-areas seemed fairly plausible.
I would love to dig into this a bit more. One of the reasons I became interested in investigating again is https://github.com/scitix/InstantTensor/tree/main, which supposedly shows big improvements over runai-streamer that I was unable to reproduce.