Modular refactor by hazemawadalla · Pull Request #252 · mlcommons/storage

hazemawadalla · 2026-02-20T23:12:34Z

Pushed a fix for the benchmark stall that occurs when running NVMe-only configurations (cpu=0, gpu=0) with preconditioning enabled. The root cause was a thread race in eviction that double-decremented the memory tracker until it reported the disk as empty. That disabled further eviction, so the filesystem filled up and the threads stalled. Two things compounded the problem: capacity guards that assumed a next tier always exists for cascade, and no failure handling in the preconditioning loop. The fix adds an existence check before decrementing, detects when NVMe is the terminal tier and relaxes the guards accordingly, and bails out of preconditioning after 50 consecutive failures.

I also fixed a performance issue in the eviction loop. Previously, every time we needed to evict one entry, we re-scanned and re-sorted the entire cache to find the oldest item, so evicting 100 entries meant 100 full sorts of 60k entries. Now we sort once, walk through the list with an index, and skip any entries another thread already removed. Same eviction order, ~100x less CPU work at scale. Also fixed a TOCTOU race in NVMe file deletion and switched to os.statvfs for more accurate capacity detection.
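For readers who want the shape of the eviction change without opening the diff, here is a minimal sketch of the "sort once, walk, skip already-removed entries" idea combined with the existence check before decrementing. It is not the code in kv_cache/cache.py: the `TierTracker` class, its `add` and `evict_bytes` methods, and the entry layout are hypothetical names made up for illustration, and the snapshot refresh on exhaustion is left out for brevity.

```python
import threading


class TierTracker:
    """Hypothetical stand-in for one cache tier; not the PR's actual class."""

    def __init__(self):
        self.metadata_lock = threading.Lock()
        self.cache_entries = {}  # key -> {"size": int, "last_access": float}
        self.memory_used = 0

    def add(self, key, size, last_access):
        with self.metadata_lock:
            self.cache_entries[key] = {"size": size, "last_access": last_access}
            self.memory_used += size

    def evict_bytes(self, bytes_needed):
        """Evict oldest-first until bytes_needed is freed; return bytes freed."""
        # Sort once: snapshot the keys in oldest-first order under the lock.
        with self.metadata_lock:
            snapshot = sorted(
                self.cache_entries,
                key=lambda k: self.cache_entries[k]["last_access"],
            )
        freed = 0
        for key in snapshot:  # single walk over the snapshot, no re-sort
            if freed >= bytes_needed:
                break
            with self.metadata_lock:
                # Existence check and removal happen atomically under the lock.
                entry = self.cache_entries.pop(key, None)
                if entry is None:
                    continue  # another thread already evicted this key
                # Decrement by the live size, exactly once per entry.
                self.memory_used -= entry["size"]
            freed += entry["size"]
        return freed
```

Because the pop and the decrement share one critical section, two threads that race on the same key cannot both subtract its size, which is the property that keeps the tracker from drifting toward zero.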

All 206 tests pass, including new test classes covering 3-tier cascade (GPU→CPU→NVMe→delete), NVMe-only eviction with concurrent threads, and an educational 7-part test that traces the full request flow from user simulation through KV cache sizing, the 4-level latency hierarchy, .npy file I/O, and waterfall eviction.

pytest tests/test_kv_cache.py -v -k "TestThreeTierEvictionCascade"
pytest tests/test_kv_cache.py -v -k "TestNVMeOnlyEviction"
pytest tests/test_kv_cache.py -v -s --log-cli-level=DEBUG -k "TestVisualizeUserRequestFlow"

Benchmark stalls when all I/O targets NVMe (cpu=0, gpu=0) with preconditioning enabled. Three root causes fixed, plus an O(n²) eviction optimization:

1. Thread race in eviction: concurrent threads evict the same LRU entry, double-decrementing nvme_memory_used until it hits ~0. Fix: check entry existence under metadata_lock before decrementing; use live size from cache_entries; clean up entry_locks for evicted keys.
2. Eviction guards reject writes on the terminal tier: the 95% size cap, 80% target, and low-data bailout all assume a next tier exists. Fix: detect terminal tier (is_last_tier) and relax all three guards.
3. Preconditioning spins forever: failed allocations never increment written_bytes. Fix: consecutive-failure bailout (50) with backoff.
4. O(n²) LRU scan: each eviction re-scanned and re-sorted the full entry list. Fix: single sorted snapshot with index walk; refresh only if exhausted (2 scans max instead of thousands).

Supporting fixes:
- os.statvfs for NVMe capacity (f_bavail excludes reserved blocks)
- path.unlink(missing_ok=True) for NVMe delete TOCTOU race
- Fallback "all tiers full" path now tracks nvme_memory_used

Tests: New test classes TestThreeTierEvictionCascade (3 tests: GPU→CPU→NVMe→delete cascade via fake GPU backend), TestNVMeOnlyEviction (4 tests: allocation, file deletion, no negative drift, concurrent threads), TestVisualizeUserRequestFlow (7 tests: educational trace of full request pipeline). Model config count updated 5→9 with deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b.

Docs: Move MLPerf proposal and sources.md into docs/ subdirectory.

Files changed:
- kv_cache/cache.py: eviction logic, capacity detection, fallback tracking
- kv_cache/benchmark.py: preconditioning stall protection
- kv_cache/backends.py: NVMe delete race fix
- tests/test_kv_cache.py: model configs, 3 new test classes
- docs/: moved from project root
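The two standard-library calls named under the supporting fixes can be illustrated roughly as follows. This is a sketch, not the code in kv_cache/cache.py or kv_cache/backends.py: the helper names `available_bytes` and `delete_entry_file` are hypothetical, and multiplying by `f_frsize` is the usual way to turn blocks into bytes rather than something the commit message states. `os.statvfs` is Unix-only, and `missing_ok` needs Python 3.8 or later.

```python
import os
from pathlib import Path


def available_bytes(path):
    """Bytes an unprivileged process can still write under path (Unix only)."""
    st = os.statvfs(path)
    # f_bavail counts blocks available to non-root users, so it leaves out
    # the filesystem's reserved blocks; f_bfree would include them.
    return st.f_bavail * st.f_frsize


def delete_entry_file(path):
    """Delete a cache file, tolerating a concurrent delete by another thread."""
    # Checking exists() and then calling unlink() leaves a window in which
    # another thread can remove the file first. missing_ok=True collapses
    # the check and the delete into one call that does not raise
    # FileNotFoundError if the file is already gone.
    Path(path).unlink(missing_ok=True)
```

With this shape, a second thread that loses the delete race returns quietly instead of raising, so the caller needs no extra error handling for that case.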

github-actions · 2026-02-20T23:12:42Z

MLCommons CLA bot:
Thank you very much for your submission, we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please use this [Google form](https://forms.gle/Ew1KkBVpyeJDuRw67) to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
❌ @HazemAwadallah
You can retrigger this bot by commenting recheck in this Pull Request.

HazemAwadallah added 2 commits February 20, 2026 14:48

Add back proposal and sources to docs/

b71af0d

hazemawadalla requested a review from a team February 20, 2026 23:12

hazemawadalla requested a review from a team as a code owner February 20, 2026 23:12


Modular refactor #252
hazemawadalla wants to merge 2 commits into mlcommons:TF_KVCache from hazemawadalla:modular-refactor
