Benchmark stalls when all I/O targets NVMe (cpu=0, gpu=0) with preconditioning enabled. Three root causes fixed, plus an O(n²) eviction optimization:

1. Thread race in eviction: concurrent threads evict the same LRU entry, double-decrementing nvme_memory_used until it hits ~0. Fix: check entry existence under metadata_lock before decrementing; use live size from cache_entries; clean up entry_locks for evicted keys.
2. Eviction guards reject writes on the terminal tier: the 95% size cap, 80% target, and low-data bailout all assume a next tier exists. Fix: detect terminal tier (is_last_tier) and relax all three guards.
3. Preconditioning spins forever: failed allocations never increment written_bytes. Fix: consecutive-failure bailout (50) with backoff.
4. O(n²) LRU scan: each eviction re-scanned and re-sorted the full entry list. Fix: single sorted snapshot with index walk; refresh only if exhausted (2 scans max instead of thousands).

Supporting fixes:
- os.statvfs for NVMe capacity (f_bavail excludes reserved blocks)
- path.unlink(missing_ok=True) for NVMe delete TOCTOU race
- Fallback "all tiers full" path now tracks nvme_memory_used

Tests: New test classes:
- TestThreeTierEvictionCascade (3 tests: GPU→CPU→NVMe→delete cascade via fake GPU backend)
- TestNVMeOnlyEviction (4 tests: allocation, file deletion, no negative drift, concurrent threads)
- TestVisualizeUserRequestFlow (7 tests: educational trace of full request pipeline)

Model config count updated 5→9 with deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b.

Docs: Move MLPerf proposal and sources.md into docs/ subdirectory.

Files changed:
- kv_cache/cache.py: eviction logic, capacity detection, fallback tracking
- kv_cache/benchmark.py: preconditioning stall protection
- kv_cache/backends.py: NVMe delete race fix
- tests/test_kv_cache.py: model configs, 3 new test classes
- docs/: moved from project root
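To illustrate the thread-race fix described above, here is a minimal sketch of checking entry existence under a lock before decrementing the usage counter. It is not the code from kv_cache/cache.py: the `TierTracker` class, its `add`/`evict` methods, and `demo_concurrent_evict` are hypothetical, and only the names metadata_lock, cache_entries, entry_locks, and nvme_memory_used are borrowed from the description.

```python
import threading


class TierTracker:
    """Hypothetical stand-in for the NVMe tier bookkeeping (illustration only)."""

    def __init__(self):
        self.metadata_lock = threading.Lock()
        self.cache_entries = {}    # key -> {"size": int}
        self.entry_locks = {}      # key -> per-entry lock
        self.nvme_memory_used = 0

    def add(self, key, size):
        with self.metadata_lock:
            self.cache_entries[key] = {"size": size}
            self.entry_locks[key] = threading.Lock()
            self.nvme_memory_used += size

    def evict(self, key):
        """Return True only for the single caller that actually removed the entry."""
        with self.metadata_lock:
            # Existence check and removal happen atomically under the lock,
            # so a second thread racing on the same key sees None here.
            entry = self.cache_entries.pop(key, None)
            if entry is None:
                return False
            # Use the live size recorded in cache_entries, not a stale copy.
            self.nvme_memory_used -= entry["size"]
            # Drop the per-entry lock so evicted keys do not accumulate.
            self.entry_locks.pop(key, None)
            return True


def demo_concurrent_evict(num_threads=8):
    """Have several threads try to evict the same key at once."""
    tracker = TierTracker()
    tracker.add("a", 100)
    tracker.add("b", 50)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(tracker.evict("a")))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tracker, results
```

In this sketch exactly one thread gets `True` back and the counter drops by the entry size once, so it cannot drift toward zero or go negative regardless of how many threads target the same key.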
Pushed a fix for the benchmark stall that occurs when running NVMe-only configurations (cpu=0, gpu=0) with preconditioning enabled. The root cause was a thread race in eviction that double-decremented the memory tracker until it reported the disk as nearly empty. That disabled further eviction, so the filesystem filled up and the threads stalled. Two other problems compounded it: the capacity guards assumed a next tier always exists to cascade into, and the preconditioning loop had no failure handling. The fix adds an existence check before decrementing, detects when NVMe is the terminal tier and relaxes the guards accordingly, and bails out of preconditioning after 50 consecutive failures.

I also fixed a performance issue in the eviction loop. Previously, every time we needed to evict one entry, we re-scanned and re-sorted the entire cache to find the oldest item, so evicting 100 entries meant 100 full sorts of 60k entries. Now we sort once, walk through the list with an index, and skip any entries another thread already removed. Same eviction order, ~100x less CPU work at scale. Also fixed a TOCTOU race in NVMe file deletion and switched to os.statvfs for more accurate capacity detection.
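As a rough sketch of the sort-once eviction walk described above (not the actual code in kv_cache/cache.py): the function name `evict_until`, the `(last_access, size)` tuple layout, and the `max_scans` parameter are assumptions made for illustration, and locking is omitted to keep the example short.

```python
def evict_until(entries, bytes_needed, max_scans=2):
    """Evict least-recently-used entries until bytes_needed is freed.

    entries: dict mapping key -> (last_access, size); mutated in place.
    Returns (evicted_keys, bytes_freed, scans_performed).
    """
    freed = 0
    evicted = []
    scans = 0
    snapshot = []   # keys sorted oldest-first, built lazily
    idx = 0         # position of the next candidate in the snapshot

    while freed < bytes_needed:
        if idx >= len(snapshot):
            # Snapshot exhausted: refresh it, but only a bounded number of times.
            if scans >= max_scans or not entries:
                break
            snapshot = sorted(entries, key=lambda k: entries[k][0])
            idx = 0
            scans += 1
            continue

        key = snapshot[idx]
        idx += 1
        entry = entries.pop(key, None)
        if entry is None:
            # Entry was removed (e.g. by another thread) after the snapshot
            # was taken; skip it instead of re-sorting.
            continue
        freed += entry[1]
        evicted.append(key)

    return evicted, freed, scans
```

For example, with `{"a": (3, 10), "b": (1, 10), "c": (2, 10)}` and `bytes_needed=15`, the walk evicts `"b"` then `"c"` after a single sort, leaving `"a"` in place.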
All 206 tests pass, including new test classes covering 3-tier cascade (GPU→CPU→NVMe→delete), NVMe-only eviction with concurrent threads, and an educational 7-part test that traces the full request flow from user simulation through KV cache sizing, the 4-level latency hierarchy, .npy file I/O, and waterfall eviction.
pytest tests/test_kv_cache.py -v -k "TestThreeTierEvictionCascade"
pytest tests/test_kv_cache.py -v -k "TestNVMeOnlyEviction"
pytest tests/test_kv_cache.py -v -s --log-cli-level=DEBUG -k "TestVisualizeUserRequestFlow"