Better tmp file handling (#2) #600

Open
Prgrmman wants to merge 2 commits into llm-d:main from Prgrmman:o_tmp_unbuffered_gold_master

@Prgrmman

Give fs connector more robust temporary file handling

For our distributed file system, we found that creating files with O_TMPFILE incurs less synchronization overhead (because the inode is anonymous) and gives better offloading performance. Additionally, we moved away from C++ ofstream user-space buffering because:

  • The C++ standard library does not expose the file descriptors needed for O_TMPFILE
  • We achieve better performance by skipping an intermediate user-space buffer; this is justified because KV cache offload files are typically larger than several MiB.
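
For readers unfamiliar with O_TMPFILE, here is a minimal Linux-only sketch of the general pattern: open an anonymous inode in the target directory, write to it, then give it a name with linkat(). The helper names `open_anonymous_tmp` and `publish_anonymous_tmp` are hypothetical and are not taken from this PR; the /proc/self/fd form of linkat() is the one documented in open(2), and it assumes /proc is mounted.

```cpp
// Linux-only sketch. g++ defines _GNU_SOURCE by default, which exposes O_TMPFILE.
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical helper: open an anonymous (nameless) file inside `dir`.
// Returns a file descriptor, or -1 with errno set, for example when the
// file system or kernel does not support O_TMPFILE.
int open_anonymous_tmp(const std::string& dir) {
    return ::open(dir.c_str(), O_TMPFILE | O_WRONLY | O_CLOEXEC, 0644);
}

// Hypothetical helper: give the anonymous inode behind `fd` a name.
// linkat() fails (EEXIST) if final_path already exists, so a caller that
// wants overwrite semantics needs an extra step that is not shown here.
bool publish_anonymous_tmp(int fd, const std::string& final_path) {
    assert(fd >= 0);
    const std::string proc_path = "/proc/self/fd/" + std::to_string(fd);
    return ::linkat(AT_FDCWD, proc_path.c_str(),
                    AT_FDCWD, final_path.c_str(), AT_SYMLINK_FOLLOW) == 0;
}
```

If the process dies before `publish_anonymous_tmp` runs, the kernel reclaims the inode when the descriptor closes, so no stray temporary file is left behind. A caller would typically fall back to a named temporary file plus rename when `open_anonymous_tmp` returns -1.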

Because we replaced the default write operation, I wrote a custom micro-benchmark (which I can share upon request) to make sure there was no performance degradation in the regular path (not using O_TMPFILE).
The micro-benchmark results:

# c++ ofstream is implemented as a copy of the current llm-d implementation
# 5 iterations means each thread writes 5 files
# The chunk size means that for code using the write() system call, we break the write down into 1MiB write() calls
# stdio buffering is the size of the custom user space buffer passed to fwrite (see setvbuf()) and c++ ofstream (see pubsetbuf())
# The test was completed on IBM's distributed file system (GPFS) on a single client


I/O PERFORMANCE BENCHMARK
=============================================================================
Output directory: /ibm/fs1-remote/kv-cache
File size: 50 MiB
Iterations: 5
Chunk size: 1048576 bytes (1 MiB; stdio buffering is also 1MiB)
=============================================================================

SINGLE-THREADED BASELINE
-----------------------------------------------------------------------------
Method                       Latency           Throughput
-----------------------------------------------------------------------------
C++ ofstream           0.0121 s       4129.39 MB/s
C fwrite               0.0122 s       4108.06 MB/s
Unbuffered write       0.0112 s       4457.81 MB/s

=============================================================================
MULTI-THREADED SCALING
=============================================================================

Running 12 tests...
  [1/12] Testing C++ ofstream with 1 thread(s)... 3987.14 MB/s
  [2/12] Testing C++ ofstream with 4 thread(s)... 9250.89 MB/s
  [3/12] Testing C++ ofstream with 16 thread(s)... 46759.7 MB/s
  [4/12] Testing C++ ofstream with 64 thread(s)... 76768 MB/s
  [5/12] Testing C fwrite with 1 thread(s)... 4174.02 MB/s
  [6/12] Testing C fwrite with 4 thread(s)... 9420.42 MB/s
  [7/12] Testing C fwrite with 16 thread(s)... 46842 MB/s
  [8/12] Testing C fwrite with 64 thread(s)... 51335.1 MB/s
  [9/12] Testing Unbuffered write with 1 thread(s)... 4467 MB/s
  [10/12] Testing Unbuffered write with 4 thread(s)... 17311.7 MB/s
  [11/12] Testing Unbuffered write with 16 thread(s)... 51224.4 MB/s
  [12/12] Testing Unbuffered write with 64 thread(s)... 83523.4 MB/s
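
The "Unbuffered write" method in the notes above breaks each file into 1 MiB write() calls. A minimal sketch of such a loop follows; `write_all_chunked` is a hypothetical name, and this is not the benchmark's or the PR's actual code.

```cpp
#include <unistd.h>
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical helper: write `len` bytes from `data` to `fd`, issuing write()
// calls of at most `chunk` bytes (1 MiB by default, matching the benchmark notes).
// Returns true on success; on failure errno holds the cause.
bool write_all_chunked(int fd, const char* data, std::size_t len,
                       std::size_t chunk = 1 << 20) {
    assert(data != nullptr || len == 0);
    if (chunk == 0) {  // a zero-sized chunk would never make progress
        errno = EINVAL;
        return false;
    }
    std::size_t off = 0;
    while (off < len) {
        const std::size_t want = std::min(chunk, len - off);
        const ssize_t n = ::write(fd, data + off, want);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted before writing anything: retry
            return false;
        }
        off += static_cast<std::size_t>(n);  // short writes are legal: advance by what was written
    }
    return true;
}
```

The loop advances by the number of bytes actually written, so it stays correct even when the kernel accepts less than a full chunk.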

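In addition, I ran the pytest throughput tests with O_TMPFILE enabled on a single GPU and saw a small improvement: the average reported write_buffer_to_file time fell from 24.84 (buffered C++) to 21.32 (unbuffered, 1MiB chunks) and 19.73 (unbuffered, 1MiB chunks, O_TMPFILE):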

jterner3@css-host-191:~/dev/write_offloading_options$ cat buffered_cpp_throughput.txt | ./write_buffer_to_file_avg.sh
24.8448
jterner3@css-host-191:~/dev/write_offloading_options$ cat no_buffer_1MiB_chunk.txt | ./write_buffer_to_file_avg.sh
21.3199
jterner3@css-host-191:~/dev/write_offloading_options$ cat no_buffer_otmp_1MiB_chunk.txt | ./write_buffer_to_file_avg.sh
19.7344

Code for write_buffer_to_file_avg.sh:

#!/bin/bash
grep write_buffer_to_file | perl -ne '/took (\d+\.\d+)/ && print "$1\n";'  | awk '{ sum += $1 } END { print (NR > 0 ? sum / NR : 0) }'

where these files were produced with python -m tests.performance.test_throughput --model Qwen/Qwen3-8b --num-requests 300 --num-tokens 12000 --tp-size 1 --log-level trace --cpu-block-size 256 2>&1 --storage-path /ibm/fs1-remote/kvc-cache

Give fs connector more robust temporary file handling

This commit contains three main parts:
1. An RAII pattern for tmpfile handling to prevent leaking temporary files on standard error paths.
2. Optional use of the O_TMPFILE flag on supported file systems.
3. Moving write offloading away from buffered C++ streams to direct write calls for better performance.

For our distributed file system, we found that creating files with O_TMPFILE incurs less synchronization overhead (because the inode is anonymous) and gives better offloading performance.
Additionally, we moved away from C++ ofstream user-space buffering because:
- The C++ standard library does not expose the file descriptors needed for O_TMPFILE
- We achieve better performance by skipping an intermediate user-space buffer; this is justified because KV cache offload files are typically larger than several MiB.
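
The RAII idea in part 1 can be illustrated with a small guard that removes a named temporary file unless it has been committed. This is only a sketch under assumptions: `TmpFileGuard` and its members are hypothetical names, and the PR's actual class may differ.

```cpp
#include <unistd.h>
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical RAII guard: owns a named temporary file and removes it on
// destruction unless commit() has renamed it to its final path.
class TmpFileGuard {
public:
    explicit TmpFileGuard(std::string tmp_path) : tmp_path_(std::move(tmp_path)) {
        assert(!tmp_path_.empty());
    }
    TmpFileGuard(const TmpFileGuard&) = delete;
    TmpFileGuard& operator=(const TmpFileGuard&) = delete;

    ~TmpFileGuard() {
        // Best-effort cleanup on every exit path, including early returns and exceptions.
        if (!committed_) ::unlink(tmp_path_.c_str());
    }

    // Move the temporary file to final_path; both must be on the same file system.
    bool commit(const std::string& final_path) {
        if (std::rename(tmp_path_.c_str(), final_path.c_str()) != 0) return false;
        committed_ = true;
        return true;
    }

    const std::string& path() const { return tmp_path_; }

private:
    std::string tmp_path_;
    bool committed_ = false;
};
```

Because cleanup lives in the destructor, any error path that leaves the scope before `commit()` succeeds also removes the temporary file.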
@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 20, 2026
@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

@kfirtoledo
Collaborator

Thanks @Prgrmman for the PR. Can you show the throughput test results for different models (e.g., qwen3-32b, llama3.1-8b, llama3.1-70b, gpt-oss-120b), both before and after the change, with temp_file set to false and to true?

@Prgrmman
Author

Prgrmman commented May 26, 2026

jterner3@css-host-191:~/dev/write_offloading_options$ for file in *summary.txt; do printf "\t\t--- $file ---\n"; cat $file; echo "";  done
                --- gpt-oss-120b_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/gpt-oss-120b-output.txt 30.6197
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/gpt-oss-120b-output.txt 30.952
>>> custom_wheels/base_cpp_buffered/gpt-oss-120b-output.txt 31.9179
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/gpt-oss-120b-output_otmp.txt 32.0764
>>> custom_wheels/simple_otmp_fstream/gpt-oss-120b-output.txt 32.1085
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/gpt-oss-120b-output_otmp.txt 34.8719
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/gpt-oss-120b-output.txt 35.9172
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/gpt-oss-120b-output_otmp.txt 52.0499
>>> custom_wheels/simple_otmp_fstream/gpt-oss-120b-output_otmp.txt 301.65

                --- Llama-3.1-70B_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-70B-output.txt 52.1441
>>> custom_wheels/base_cpp_buffered/Llama-3.1-70B-output.txt 58.3339
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-70B-output.txt 58.3658
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-70B-output_otmp.txt 60.1421
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-70B-output_otmp.txt 61.9429
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-70B-output.txt 79.6349
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-70B-output_otmp.txt 89.2605
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-70B-output.txt 107.536
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-70B-output_otmp.txt 201.758

                --- Llama-3.1-8B_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-8B-output_otmp.txt 16.6497
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-8B-output_otmp.txt 16.7494
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-8B-output_otmp.txt 19.763
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Llama-3.1-8B-output.txt 26.6713
>>> custom_wheels/base_cpp_buffered/Llama-3.1-8B-output.txt 42.061
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-8B-output.txt 43.6438
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Llama-3.1-8B-output_otmp.txt 64.3323
>>> custom_wheels/simple_otmp_fstream/Llama-3.1-8B-output.txt 120.543
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Llama-3.1-8B-output.txt 170.466

                --- Qwen3-32b_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-32b-output.txt 27.5556
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-32b-output_otmp.txt 34.0773
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-32b-output_otmp.txt 38.9106
>>> custom_wheels/base_cpp_buffered/Qwen3-32b-output.txt 38.9718
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-32b-output_otmp.txt 39.7922
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-32b-output.txt 46.8794
>>> custom_wheels/simple_otmp_fstream/Qwen3-32b-output_otmp.txt 47.2785
>>> custom_wheels/simple_otmp_fstream/Qwen3-32b-output.txt 82.5099
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-32b-output.txt 95.3136

                --- Qwen3-8b_summary.txt ---
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-8b-output_otmp.txt 29.559
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-8b-output_otmp.txt 50.8258
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-8b-output.txt 59.7917
>>> custom_wheels/simple_otmp_fstream/Qwen3-8b-output.txt 64.0272
>>> custom_wheels/simple_otmp_no_buffered_1MiB_chunks/Qwen3-8b-output.txt 67.5594
>>> custom_wheels/simle_omtp_no_buffer_no_chunk/Qwen3-8b-output_otmp.txt 84.2553
>>> custom_wheels/simple_otmp_no_buffered_4MiB_chunks/Qwen3-8b-output.txt 84.6751
>>> custom_wheels/simple_otmp_fstream/Qwen3-8b-output_otmp.txt 126.973
>>> custom_wheels/base_cpp_buffered/Qwen3-8b-output.txt 277.001

I had a custom script that ran through these various wheels with the following settings:

export BACKEND="storage"
export NUM_TOKENS=12000
export NUM_REQUESTS=300

export STORAGE_BLOCK_SIZE=256
export THREADS_PER_GPU=64
export STORAGE_LOG_LEVEL="trace"
...
run_test()
{
        local _wheel=$1
        local _model_name=$2
        local _o_tmp=${3:-$NO_O_TMP} # hint: set to --o-tmpfile if you want it
        local _tensor_parallel=${4:-1}

        # set connector install to new install
        uninstall_connector
        install_connector "$_wheel"
        (
        cd "$LLMD_DIR"
        run_cmd $PYTHON -m "tests.performance.test_throughput" \
                --model "$_model_name" \
                --num-requests "$NUM_REQUESTS" \
                --num-tokens "$NUM_TOKENS" \
                --tp-size "$_tensor_parallel" \
                --log-level "$STORAGE_LOG_LEVEL" \
                --threads-per-gpu "$THREADS_PER_GPU" \
                --storage-block-size "$STORAGE_BLOCK_SIZE" \
                $_o_tmp \
                --storage-path "$KVC_DIR" 2>&1
        )
}
export -f run_test

The script that measures write latency from the logs is just:

#!/bin/bash
grep write_buffer_to_file | perl -ne '/took (\d+\.\d+)/ && print "$1\n";'  | awk '{ sum += $1 } END { print (NR > 0 ? sum / NR : 0) }'

There are a couple of things to point out:

  1. The benchmark is more read-oriented and does not necessarily wait for writes to complete, which I think introduces some noise. There is also no fsync/O_DIRECT here, so write latencies can show high variance.
  2. Writing in 1MiB or 4MiB chunks tends to beat the C++ implementation, which agrees with the micro-benchmark that I wrote.
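
To illustrate the fsync point in item 1, here is a hedged sketch of a timing helper that can optionally include fsync() in the measured interval, so page-cache writeback is counted instead of showing up as noise in a later run. `timed_write_seconds` is a hypothetical name; it is not part of the throughput test or the micro-benchmark.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper: create or truncate `path`, write `len` bytes, and return
// the elapsed seconds. With durable=true the timing also covers fsync().
// Returns a negative value on any error.
double timed_write_seconds(const std::string& path, const char* data,
                           std::size_t len, bool durable) {
    assert(data != nullptr || len == 0);
    const int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC | O_CLOEXEC, 0644);
    if (fd < 0) return -1.0;

    const auto start = std::chrono::steady_clock::now();
    bool ok = true;
    std::size_t off = 0;
    while (ok && off < len) {
        const ssize_t n = ::write(fd, data + off, len - off);
        if (n < 0) {
            if (errno != EINTR) ok = false;  // retry only when interrupted
            continue;
        }
        off += static_cast<std::size_t>(n);
    }
    if (ok && durable && ::fsync(fd) != 0) ok = false;
    const auto stop = std::chrono::steady_clock::now();

    ::close(fd);
    return ok ? std::chrono::duration<double>(stop - start).count() : -1.0;
}
```

Comparing the durable and non-durable timings for the same buffer gives a rough idea of how much of the write cost is deferred to writeback.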

I'll update my micro-benchmark to add rename overhead as well.

Edit: another testing artifact: I run the regular std::rename iteration before the O_TMPFILE iteration.
It is possible that the kernel is still flushing data from the previous run, which causes some interference.
