Distributed Streamer Performance over a networked fileshare

I've been investigating our runai-streamer model loading performance further. We manage to load Kimi-K2.5 in ~35-45 seconds; that's 620GB worth of weights at ~9GB/s. We are connected to a VAST cluster via NFSv4.1 over RDMA on ConnectX-7, for a theoretical maximum of 50GB/s. In testing with `fio` I am able to reproduce 42GB/s when doing a massively parallel read over all files at once. For example, 64 different files, 1 job per shard, 4 MiB blocks, `iodepth=32`, `direct=1` (time-based) gives ~37 GiB/s (~40 GB/s).
So the streamer is still a fair bit off what I can achieve with `fio`.

Logs from vLLM (deduplicated due to Ray):
```
INFO file_streamer.py:66: [RunAI Streamer] Overall time to stream 69.3 GiB of all files to cpu: 41.52s, 1.7 GiB/s
```
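As a quick sanity check on the log line above, the reported rate can be recomputed from the size and duration, then converted from GiB/s to GB/s so it is comparable with the GB/s figures quoted elsewhere in this issue. This is a minimal Python sketch: the regular expression and the helper names (`gib_to_gb`, `parse_streamer_line`) are my own and only assume the log wording shown above, not any runai-streamer API.

```python
import re

GIB = 2**30  # bytes in one GiB
GB = 10**9   # bytes in one GB

# Assumption: the log line keeps the exact wording shown in the issue.
LINE_RE = re.compile(
    r"stream (?P<size>[\d.]+) GiB .*?: (?P<secs>[\d.]+)s, (?P<rate>[\d.]+) GiB/s"
)


def gib_to_gb(value_gib: float) -> float:
    """Convert a quantity in GiB (or GiB/s) to GB (or GB/s)."""
    return value_gib * GIB / GB


def parse_streamer_line(line: str) -> dict:
    """Extract size and duration, then recompute the throughput from them."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("line does not match the expected streamer format")
    size_gib = float(match["size"])
    seconds = float(match["secs"])
    rate_gib_s = size_gib / seconds
    return {
        "size_gib": size_gib,
        "seconds": seconds,
        "reported_gib_s": float(match["rate"]),
        "rate_gib_s": rate_gib_s,
        "rate_gb_s": gib_to_gb(rate_gib_s),
    }


line = (
    "INFO file_streamer.py:66: [RunAI Streamer] Overall time to stream "
    "69.3 GiB of all files to cpu: 41.52s, 1.7 GiB/s"
)
stats = parse_streamer_line(line)
print(f"{stats['rate_gib_s']:.2f} GiB/s = {stats['rate_gb_s']:.2f} GB/s")
```

Note that this is the rate for the single deduplicated log line; how the per-process lines add up to an aggregate figure depends on the Ray worker layout, which the log does not show.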


<details>
<summary>fio benchmark details</summary>

```
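# ${base} must point at the directory containing the model shards
cat > /tmp/multi64_4m_direct_1x_rt20.fio <<EOF
[global]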
ioengine=io_uring
direct=1
rw=read
bs=4m
iodepth=32
time_based=1
runtime=20
group_reporting=1
EOF
for i in $(seq -w 1 64); do
  cat >> /tmp/multi64_4m_direct_1x_rt20.fio <<EOF
[f${i}]
filename=${base}/model-000${i}-of-000064.safetensors
numjobs=1
EOF
  echo >> /tmp/multi64_4m_direct_1x_rt20.fio
done
fio /tmp/multi64_4m_direct_1x_rt20.fio'
Defaulted container "model-container" out of: model-container, batch-k8s-preflight-checks (init)
f01: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=io_uring, iodepth=32
[...] (abbreviated)
f64: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=io_uring, iodepth=32
fio-3.36
Starting 64 processes
f01: (groupid=0, jobs=64): err= 0: pid=71578: Wed Mar 18 23:36:21 2026
  read: IOPS=9710, BW=37.9GiB/s (40.7GB/s)(769GiB/20276msec)
    slat (nsec): min=270, max=2178.2k, avg=1876.29, stdev=6372.09
    clat (msec): min=4, max=2252, avg=188.31, stdev=241.88
     lat (msec): min=4, max=2252, avg=188.32, stdev=241.88
    clat percentiles (msec):
     |  1.00th=[   19],  5.00th=[   46], 10.00th=[   55], 20.00th=[   66],
     | 30.00th=[   86], 40.00th=[  108], 50.00th=[  126], 60.00th=[  182],
     | 70.00th=[  222], 80.00th=[  271], 90.00th=[  305], 95.00th=[  342],
     | 99.00th=[ 1821], 99.50th=[ 1989], 99.90th=[ 2198], 99.95th=[ 2232],
     | 99.99th=[ 2265]
   bw (  MiB/s): min=10696, max=85688, per=100.00%, avg=47119.23, stdev=345.39, samples=2090
   iops        : min= 2674, max=21422, avg=11779.81, stdev=86.35, samples=2090
  lat (msec)   : 10=0.04%, 20=1.27%, 50=5.84%, 100=28.95%, 250=40.47%
  lat (msec)   : 500=21.37%, 750=0.30%, 2000=1.32%, >=2000=0.44%
  cpu          : usr=0.01%, sys=8.15%, ctx=223803, majf=0, minf=668
  IO depths    : 1=0.1%, 2=0.3%, 4=0.6%, 8=1.2%, 16=1.9%, 32=95.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
     issued rwts: total=196896,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=37.9GiB/s (40.7GB/s), 37.9GiB/s-37.9GiB/s (40.7GB/s-40.7GB/s), io=769GiB (826GB), run=20276-20276msec
```
</details>
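The headline numbers in the fio output can be cross-checked against each other: issued I/Os times block size should give the total read, and dividing that by the runtime should give the bandwidth. Below is a small Python sketch of that arithmetic, using only figures from the output above. The variable names are mine, and the 50GB/s ceiling is the theoretical maximum quoted at the top of this issue.

```python
MIB = 2**20  # bytes in one MiB
GIB = 2**30  # bytes in one GiB
GB = 10**9   # bytes in one GB

issued_reads = 196_896   # "issued rwts: total=196896"
block_size = 4 * MIB     # bs=4m
runtime_s = 20.276       # run=20276msec
link_max_gb_s = 50.0     # theoretical maximum stated in the issue

total_bytes = issued_reads * block_size
bw_bytes_s = total_bytes / runtime_s

total_gib = total_bytes / GIB             # fio reports io=769GiB
bw_gib_s = bw_bytes_s / GIB               # fio reports 37.9GiB/s
bw_gb_s = bw_bytes_s / GB                 # fio reports 40.7GB/s
iops = issued_reads / runtime_s           # fio reports IOPS=9710
link_utilization = bw_gb_s / link_max_gb_s

print(f"total read       : {total_gib:.1f} GiB")
print(f"bandwidth        : {bw_gib_s:.1f} GiB/s = {bw_gb_s:.1f} GB/s")
print(f"IOPS             : {iops:.1f}")
print(f"link utilization : {link_utilization:.1%}")
```

The recomputed values agree with what fio printed, so the 37.9GiB/s and 40.7GB/s figures are the same measurement in binary and decimal units, at roughly four fifths of the stated link maximum.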


Following the `fio` investigation, I did a bit of analysis with an LLM internally: https://gist.github.com/bbartels/947d464888895445499ba64ed85a1f90
It could be complete nonsense, but the points in https://gist.github.com/bbartels/947d464888895445499ba64ed85a1f90#highest-value-improvement-areas seemed fairly plausible.

I would love to dig into this a bit more. One of the reasons I became interested in investigating again is https://github.com/scitix/InstantTensor/tree/main, which supposedly shows big improvements over runai-streamer that I was unable to reproduce.

Distributed Streamer Performance over a networked fileshare #135
