Better tmp file handling (#2)#600
Conversation
Give fs connector more robust temporary file handling This commit contains two main parts: 1. An RAII pattern for tmpfile handling to prevent leaking of temporary files on standard error paths 2. Optional use of the O_TMPFILE flag for supported file systems. 3. Moving write offloading away from buffered C++ to using direct write calls for better performance. For our distributed file system, we found that using O_TMPFILE provides less synchronization overhead when creating files (because the inode is anonymous) and allows better offloading performance. Additionally, we moved away from C++ ofstream user-space buffering because: - C++ std does not expose file descriptors needed for O_TMPFILE - We achieve better performance by skipping an intermediate user space buffer; this is justified because kv cache offload files typically exceed multiple MiB.
|
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation. |
|
Thanks @Prgrmman for the PR. Can you show the evaluation of the throughput test for different models (e.g., qwen3-32b, llama3.1-8b, llama3.1-70b, gpt-oss-120b, etc.): before the change, and after it, with temp_file set to both false and true? |
I had a custom script that ran through these various wheels with the following settings: export BACKEND="storage"
export NUM_TOKENS=12000
export NUM_REQUESTS=300
export STORAGE_BLOCK_SIZE=256
export THREADS_PER_GPU=64
export STORAGE_LOG_LEVEL="trace"
...
run_test()
{
local _wheel=$1
local _model_name=$2
local _o_tmp=${3:-$NO_O_TMP} # hint: set to --o-tmpfile if you want it
local _tensor_parallel=${4:-1}
# set connector install to new install
uninstall_connector
install_connector "$_wheel"
(
cd "$LLMD_DIR"
run_cmd $PYTHON -m "tests.performance.test_throughput" \
--model "$_model_name" \
--num-requests "$NUM_REQUESTS" \
--num-tokens "$NUM_TOKENS" \
--tp-size "$_tensor_parallel" \
--log-level "$STORAGE_LOG_LEVEL" \
--threads-per-gpu "$THREADS_PER_GPU" \
--storage-block-size "$STORAGE_BLOCK_SIZE" \
$_o_tmp \
--storage-path "$KVC_DIR" 2>&1
)
}
export -f run_testThe script that measures write latency from the logs is just: #!/bin/bash
grep write_buffer_to_file | perl -ne '/took (\d+.\d+)/ && print "$1\n";' | awk '{ sum += $1 } END { print (NR > 0 ? sum / NR : 0) }'There's a couple of things to point out:
I'll update my micro-benchmark to add rename overhead as well. Edit: another testing artifact: I test the regular |
Give fs connector more robust temporary file handling
For our distributed file system, we found that using O_TMPFILE provides less synchronization overhead when creating files (because the inode is anonymous) and allows better offloading performance. Additionally, we moved away from C++ ofstream user-space buffering because:
Because we switched out the default write operation, I wrote a custom micro-benchmark (which I can share upon request) to make sure that there was no performance degradation in the regular path (not using O_TMPFILE)
To share the micobenchmarking results:
In addition, I tested the pytest throughput tests with O_TMP enabled, and saw a small improvement when testing on a single gpu:
Code for write_buffer_to_file_avg.sh:
where these files were produced with
python -m tests.performance.test_throughput --model Qwen/Qwen3-8b --num-requests 300 --num-tokens 12000 --tp-size 1 --log-level trace --cpu-block-size 256 2>&1 --storage-path /ibm/fs1-remote/kvc-cache