Better tmp file handling (#2) by Prgrmman · Pull Request #600 · llm-d/llm-d-kv-cache

Prgrmman · 2026-05-20T20:45:49Z

Give fs connector more robust temporary file handling

For our distributed file system, we found that O_TMPFILE incurs less synchronization overhead when creating files (because the inode is anonymous) and gives better offloading performance. Additionally, we moved away from C++ ofstream user-space buffering because:

- The C++ standard library does not expose the file descriptors needed for O_TMPFILE.
- We achieve better performance by skipping an intermediate user-space buffer; this is justified because KV cache offload files are typically several MiB or larger.
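As a rough illustration of the O_TMPFILE path described above, the sketch below creates an anonymous file in the target directory, writes the payload with plain write() calls, and then gives the inode a name with linkat(), following the pattern documented in open(2). The helper name write_via_otmpfile and its errno-style return value are assumptions made for this example, not code from this PR; the connector itself may differ, for instance in how it handles an existing destination, since linkat() fails with EEXIST instead of overwriting.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the PR): write `len` bytes into an anonymous
// O_TMPFILE inode under `dir`, then link it into the namespace as `final_path`.
// Returns 0 on success or an errno value on failure.
int write_via_otmpfile(const std::string& dir, const std::string& final_path,
                       const char* data, size_t len) {
    assert(data != nullptr || len == 0);

    // O_TMPFILE takes a directory; the new file has no name yet.
    int fd = open(dir.c_str(), O_TMPFILE | O_WRONLY | O_CLOEXEC, 0644);
    if (fd < 0) return errno;  // e.g. EOPNOTSUPP if the file system lacks support

    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted: retry
            int err = errno;
            close(fd);  // the anonymous inode is reclaimed on close
            return err;
        }
        done += static_cast<size_t>(n);
    }

    // Publish the inode under its final name via the /proc/self/fd path,
    // the unprivileged form described in open(2).
    char proc_path[64];
    std::snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    int rc = linkat(AT_FDCWD, proc_path, AT_FDCWD, final_path.c_str(),
                    AT_SYMLINK_FOLLOW);
    int err = (rc == 0) ? 0 : errno;
    close(fd);
    return err;
}
```

A caller would typically fall back to a named temporary file plus rename when this returns a nonzero value, which is one reason to keep O_TMPFILE optional.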

Because this replaces the default write operation, I wrote a custom micro-benchmark (which I can share upon request) to check that the regular path (without O_TMPFILE) does not regress.
The micro-benchmark results:

# c++ ofstream is implemented as a copy of the current llm-d implementation
# 5 iterations means each thread writes 5 files
# The chunk size means that for code using the write() system call, we break the write down into 1MiB write() calls
# stdio buffering is the size of the custom user space buffer passed to fwrite (see setvbuf()) and c++ ofstream (see pubsetbuf())
# The test was completed on IBM's distributed file system (GPFS) on a single client


I/O PERFORMANCE BENCHMARK
=============================================================================
Output directory: /ibm/fs1-remote/kv-cache
File size: 50 MiB
Iterations: 5
Chunk size: 1048576 bytes (1 MiB; stdio buffering is also 1MiB)
=============================================================================

SINGLE-THREADED BASELINE
-----------------------------------------------------------------------------
Method                  Latency        Throughput
-----------------------------------------------------------------------------
C++ ofstream            0.0121 s       4129.39 MB/s
C fwrite                0.0122 s       4108.06 MB/s
Unbuffered write        0.0112 s       4457.81 MB/s

=============================================================================
MULTI-THREADED SCALING
=============================================================================

Running 12 tests...
  [1/12] Testing C++ ofstream with 1 thread(s)... 3987.14 MB/s
  [2/12] Testing C++ ofstream with 4 thread(s)... 9250.89 MB/s
  [3/12] Testing C++ ofstream with 16 thread(s)... 46759.7 MB/s
  [4/12] Testing C++ ofstream with 64 thread(s)... 76768 MB/s
  [5/12] Testing C fwrite with 1 thread(s)... 4174.02 MB/s
  [6/12] Testing C fwrite with 4 thread(s)... 9420.42 MB/s
  [7/12] Testing C fwrite with 16 thread(s)... 46842 MB/s
  [8/12] Testing C fwrite with 64 thread(s)... 51335.1 MB/s
  [9/12] Testing Unbuffered write with 1 thread(s)... 4467 MB/s
  [10/12] Testing Unbuffered write with 4 thread(s)... 17311.7 MB/s
  [11/12] Testing Unbuffered write with 16 thread(s)... 51224.4 MB/s
  [12/12] Testing Unbuffered write with 64 thread(s)... 83523.4 MB/s
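For readers unfamiliar with the "Unbuffered write" variant, the notes above describe it as breaking each file write into 1 MiB write() calls. A minimal sketch of such a loop is shown below; the function name write_chunked and its signature are hypothetical, and the benchmark's actual code is not part of this thread.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical helper: write `len` bytes to `fd` using write() calls of at
// most `chunk` bytes each (1 MiB by default, matching the benchmark notes).
// Returns the number of bytes written, or -1 with errno set on failure.
ssize_t write_chunked(int fd, const char* data, size_t len,
                      size_t chunk = 1 << 20) {
    assert(chunk > 0);
    size_t done = 0;
    while (done < len) {
        size_t want = std::min(chunk, len - done);
        ssize_t n = write(fd, data + done, want);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted: retry the same chunk
            return -1;
        }
        // write() may accept fewer bytes than requested; advance by what it took.
        done += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(done);
}
```

No user-space staging buffer is involved here: each call hands the caller's memory directly to the kernel, which is the property the PR description cites as the reason for dropping ofstream.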

In addition, I ran the pytest throughput tests with O_TMPFILE enabled and saw a small improvement in the average write_buffer_to_file latency (lower is better) on a single GPU:

jterner3@css-host-191:~/dev/write_offloading_options$ cat buffered_cpp_throughput.txt | ./write_buffer_to_file_avg.sh
24.8448
jterner3@css-host-191:~/dev/write_offloading_options$ cat no_buffer_1MiB_chunk.txt | ./write_buffer_to_file_avg.sh
21.3199
jterner3@css-host-191:~/dev/write_offloading_options$ cat no_buffer_otmp_1MiB_chunk.txt | ./write_buffer_to_file_avg.sh
19.7344

Code for write_buffer_to_file_avg.sh:

#!/bin/bash
grep write_buffer_to_file | perl -ne '/took (\d+\.\d+)/ && print "$1\n";'  | awk '{ sum += $1 } END { print (NR > 0 ? sum / NR : 0) }'

These files were produced with:

python -m tests.performance.test_throughput --model Qwen/Qwen3-8b --num-requests 300 --num-tokens 12000 --tp-size 1 --log-level trace --cpu-block-size 256 2>&1 --storage-path /ibm/fs1-remote/kvc-cache

Give fs connector more robust temporary file handling

This commit contains three main parts:
1. An RAII pattern for tmpfile handling to prevent leaking of temporary files on standard error paths.
2. Optional use of the O_TMPFILE flag for supported file systems.
3. Moving write offloading away from buffered C++ streams to direct write calls for better performance.

For our distributed file system, we found that using O_TMPFILE provides less synchronization overhead when creating files (because the inode is anonymous) and allows better offloading performance. Additionally, we moved away from C++ ofstream user-space buffering because:
- C++ std does not expose file descriptors needed for O_TMPFILE
- We achieve better performance by skipping an intermediate user space buffer; this is justified because kv cache offload files typically exceed multiple MiB.
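The RAII pattern mentioned in the commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name TempFileGuard, its methods, and the choice of O_EXCL are hypothetical and not taken from the PR's diff. The idea is that the destructor removes the named temporary file on every exit path unless the file was explicitly committed by renaming it into place.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical RAII guard (illustrative names, not the PR's): owns a named
// temporary file and unlinks it on destruction unless commit() succeeded.
class TempFileGuard {
public:
    explicit TempFileGuard(std::string tmp_path)
        : path_(std::move(tmp_path)),
          fd_(open(path_.c_str(), O_CREAT | O_EXCL | O_WRONLY | O_CLOEXEC, 0644)) {}

    TempFileGuard(const TempFileGuard&) = delete;
    TempFileGuard& operator=(const TempFileGuard&) = delete;

    ~TempFileGuard() {
        if (fd_ >= 0) {
            close(fd_);
            // Only remove a file this guard created and never published.
            if (!committed_) unlink(path_.c_str());
        }
    }

    bool ok() const { return fd_ >= 0; }
    int fd() const { return fd_; }

    // Atomically publish the temporary file under its final name.
    bool commit(const std::string& final_path) {
        assert(!final_path.empty());
        if (fd_ < 0 || committed_) return false;
        if (std::rename(path_.c_str(), final_path.c_str()) != 0) return false;
        committed_ = true;
        return true;
    }

private:
    std::string path_;  // declared before fd_ so it is initialized first
    int fd_;
    bool committed_ = false;
};
```

With a guard like this, an early return or exception between opening the file and committing it leaves no stray temporary file behind, which is the leak the commit message refers to.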

github-actions · 2026-05-20T20:46:02Z

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

kfirtoledo · 2026-05-21T04:04:27Z

Thanks @Prgrmman for the PR. Can you show the evaluation of the throughput test for different models (e.g., qwen3-32b, llama3.1-8b, llama3.1-70b, gpt-oss-120b, etc.): before the change, and after it, with temp_file set to both false and true?

Prgrmman · 2026-05-26T14:43:26Z

jterner3@css-host-191:~/dev/write_offloading_options$ for file in *summary.txt; do printf "\t\t--- $file ---\n"; cat $file; echo "";  done
                --- gpt-oss-120b_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/gpt-oss-120b-output.txt 30.6197
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/gpt-oss-120b-output.txt 30.952
>>> custom_wheels/base_cpp_buffered/gpt-oss-120b-output.txt 31.9179
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/gpt-oss-120b-output_otmp.txt 32.0764
>>> custom_wheels/simple_otmp_fstream/gpt-oss-120b-output.txt 32.1085
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/gpt-oss-120b-output_otmp.txt 34.8719
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/gpt-oss-120b-output.txt 35.9172
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/gpt-oss-120b-output_otmp.txt 52.0499
>>> custom_wheels/simple_otmp_fstream/gpt-oss-120b-output_otmp.txt 301.65

                --- Llama-3.1-70B_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-70B-output.txt 52.1441
>>> custom_wheels/base_cpp_buffered/Llama-3.1-70B-output.txt 58.3339
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-70B-output.txt 58.3658
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-70B-output_otmp.txt 60.1421
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-70B-output_otmp.txt 61.9429
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-70B-output.txt 79.6349
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-70B-output_otmp.txt 89.2605
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-70B-output.txt 107.536
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-70B-output_otmp.txt 201.758

                --- Llama-3.1-8B_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-8B-output_otmp.txt 16.6497
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-8B-output_otmp.txt 16.7494
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-8B-output_otmp.txt 19.763
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-8B-output.txt 26.6713
>>> custom_wheels/base_cpp_buffered/Llama-3.1-8B-output.txt 42.061
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-8B-output.txt 43.6438
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-8B-output_otmp.txt 64.3323
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-8B-output.txt 120.543
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-8B-output.txt 170.466

                --- Qwen3-32b_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-32b-output.txt 27.5556
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-32b-output_otmp.txt 34.0773
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-32b-output_otmp.txt 38.9106
>>> custom_wheels/base_cpp_buffered/Qwen3-32b-output.txt 38.9718
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-32b-output_otmp.txt 39.7922
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-32b-output.txt 46.8794
>>> custom_wheels/simple_otmp_fstream/Qwen3-32b-output_otmp.txt 47.2785
>>> custom_wheels/simple_otmp_fstream/Qwen3-32b-output.txt 82.5099
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-32b-output.txt 95.3136

                --- Qwen3-8b_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-8b-output_otmp.txt 29.559
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-8b-output_otmp.txt 50.8258
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-8b-output.txt 59.7917
>>> custom_wheels/simple_otmp_fstream/Qwen3-8b-output.txt 64.0272
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-8b-output.txt 67.5594
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-8b-output_otmp.txt 84.2553
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-8b-output.txt 84.6751
>>> custom_wheels/simple_otmp_fstream/Qwen3-8b-output_otmp.txt 126.973
>>> custom_wheels/base_cpp_buffered/Qwen3-8b-output.txt 277.001

I had a custom script that ran through these various wheels with the following settings:

export BACKEND="storage"
export NUM_TOKENS=12000
export NUM_REQUESTS=300

export STORAGE_BLOCK_SIZE=256
export THREADS_PER_GPU=64
export STORAGE_LOG_LEVEL="trace"
...
run_test()
{
        local _wheel=$1
        local _model_name=$2
        local _o_tmp=${3:-$NO_O_TMP} # hint: set to --o-tmpfile if you want it
        local _tensor_parallel=${4:-1}

        # set connector install to new install
        uninstall_connector
        install_connector "$_wheel"
        (
        cd "$LLMD_DIR"
        run_cmd $PYTHON -m "tests.performance.test_throughput" \
                --model "$_model_name" \
                --num-requests "$NUM_REQUESTS" \
                --num-tokens "$NUM_TOKENS" \
                --tp-size "$_tensor_parallel" \
                --log-level "$STORAGE_LOG_LEVEL" \
                --threads-per-gpu "$THREADS_PER_GPU" \
                --storage-block-size "$STORAGE_BLOCK_SIZE" \
                $_o_tmp \
                --storage-path "$KVC_DIR" 2>&1
        )
}
export -f run_test

The script that measures write latency from the logs is just:

#!/bin/bash
grep write_buffer_to_file | perl -ne '/took (\d+\.\d+)/ && print "$1\n";'  | awk '{ sum += $1 } END { print (NR > 0 ? sum / NR : 0) }'

There are a couple of things to point out:

- The benchmark is more read-oriented, and it does not necessarily wait for the writes to complete, which I think introduces some noise. It also uses neither fsync nor O_DIRECT, so the write timings can have high variance.
- Writing in 1 MiB or 4 MiB chunks tends to beat the C++ ofstream implementation. This agrees with the micro-benchmark that I wrote.
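One way to reduce the page-cache noise described above is to include an fsync() in the timed region, so the measurement covers data reaching the file system and not only the copy into kernel memory. The sketch below is an assumption about how such a measurement could look; timed_write_fsync is a hypothetical name, and it is not the benchmark used for the numbers in this thread.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <chrono>

// Hypothetical measurement helper: time writing `len` bytes to `path`
// followed by fsync(). Returns elapsed seconds, or -1.0 on any error.
double timed_write_fsync(const char* path, const char* data, size_t len) {
    assert(path != nullptr);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC, 0644);
    if (fd < 0) return -1.0;

    auto start = std::chrono::steady_clock::now();
    size_t done = 0;
    bool ok = true;
    while (ok && done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) {
            if (errno != EINTR) ok = false;  // retry only on EINTR
        } else {
            done += static_cast<size_t>(n);
        }
    }
    // Include the flush in the timed region so cached writes are not
    // reported as faster than the storage actually is.
    if (ok && fsync(fd) != 0) ok = false;
    auto stop = std::chrono::steady_clock::now();

    close(fd);
    return ok ? std::chrono::duration<double>(stop - start).count() : -1.0;
}
```

Whether fsync belongs in the production write path is a separate question; here it only serves to make repeated measurements more comparable.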

I'll update my micro-benchmark to add rename overhead as well.

Edit: another testing artifact: for each wheel, I run the regular std::rename configuration before the O_TMPFILE one.
It is possible that the kernel is still flushing data from the previous run, which interferes with the O_TMPFILE measurements.

Prgrmman requested review from dannyharnik, kfirtoledo, liu-cong and vMaroon as code owners May 20, 2026 20:45

github-actions Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 20, 2026

github-actions Bot requested review from hyeongyun0916, sagearc and yankay May 20, 2026 20:46

add fix for existing files with O_TMP

2b4f8ca

Prgrmman wants to merge 2 commits into llm-d:main from Prgrmman:o_tmp_unbuffered_gold_master
