llmd fsconnector metadata cache #621
Closed
saikat-royc wants to merge 2 commits into
fix CUDA version mismatch and dev headers symlink

- Update default CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 to match the CUDA 13.0 base image and prevent a PyTorch compilation version mismatch.
- Explicitly parse and update the standard /usr/local/cuda symlink after GKE package installation to resolve missing dev headers (cusparse.h) during compilation.

Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>
8edee4e to b156c39
a2ce2c6 to 124f262
1. metadata cache implementation
2. metadata cache metrics instrumentation
3. unit tests

Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>
124f262 to 6577720
/cc @miroslavln Requesting a first-pass review of these changes. Note: a commit in this PR will be removed once #620 is submitted.
/cc @kfirtoledo Requesting a review of this PR.
Closing this PR in favor of vllm-project/vllm#44193 (metadata cache for multi tier offloading connector).
1. Overview
This PR implements a client-side metadata cache for llmd_fs_backend that replaces direct filesystem calls (os.path.exists) during the scheduler's lookup phase. At scale, such a cache can reduce the lookup latency imposed by the underlying storage system. Because files can be evicted asynchronously by an external process (e.g., pvc_evictor), cached entries can go stale; to bound that eventual-consistency window, positive entries expire under a hard Time-To-Live (TTL) policy.
2. Key Design and Features
A. Tiered Positive Caching & Filesystem Fallback (metadata_cache.py, manager.py)
• Metadata Cache Layer: Introduces an in-memory MetadataCache backed by an OrderedDict that indexes block keys confirmed to exist on the filesystem.
• Fallback Resolution Workflow:
  • Lookups check the cache first (Tier 1 hit).
  • On a cache miss, the manager falls back to a physical check (os.path.exists). If the file exists, the manager back-fills the positive cache to accelerate subsequent queries.
• Stateless I/O and Safety: Store paths (prepare/complete store actions) insert their keys into the cache. Load paths bypass the cache entirely and read directly from the on-disk paths, so a stale cache entry cannot corrupt an active read.
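The fallback workflow above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `lookup_block`, the plain `set` standing in for the positive cache, and the injectable `exists` parameter are all hypothetical names chosen for the example.

```python
import os
from typing import Callable, Set


def lookup_block(
    block_path: str,
    positive_cache: Set[str],  # hypothetical stand-in for MetadataCache
    exists: Callable[[str], bool] = os.path.exists,
) -> bool:
    """Return True if the block is believed to exist on the filesystem."""
    # Tier 1: a positive cache hit avoids touching the storage system.
    if block_path in positive_cache:
        return True
    # Cache miss: fall back to a physical check on the filesystem.
    if exists(block_path):
        # Back-fill so the next lookup for this key is served from memory.
        positive_cache.add(block_path)
        return True
    # Misses are not cached; only confirmed-present keys are stored.
    return False
```

Note that only positive results are stored, so a block that appears later is found on the next lookup. The reverse case, a cached key whose file has since been evicted, is what the TTL policy in the next section bounds.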
B. Bounded Hard TTL Expiration Strategy
• Eventual Consistency Window: Each positive entry stores a monotonic timestamp (time.monotonic()). During contains and batch_contains queries, entries older than the configured TTL are popped and reported as a cache miss, which triggers the fallback to the filesystem tier.
• Hard TTL Boundaries: Re-inserting a key that is already cached preserves its original timestamp instead of resetting or extending its lifetime. This prevents long-lived hot keys from extending their expiration window indefinitely, which could hide files deleted by an external evictor.
• Infinite TTL (-1): Setting metadata_cache_ttl_secs = -1 disables time-based expiration entirely, so positive keys stay cached until the standard LRU size bound evicts them. This suits setups with no external eviction, i.e. where the background evictor is completely disabled.
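A minimal sketch of the hard-TTL behavior described above, assuming an OrderedDict that maps each key to its insertion timestamp. The class name `MetadataCacheSketch`, the `max_entries` parameter, the injectable `clock`, and the choice to refresh LRU order on hits and re-inserts are assumptions made for illustration; the actual MetadataCache may differ.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Iterable, List


class MetadataCacheSketch:
    """Positive-only cache: LRU-bounded, with a hard (non-refreshing) TTL."""

    def __init__(
        self,
        max_entries: int,
        ttl_secs: float,  # -1 disables time-based expiration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_entries = max_entries
        self._ttl_secs = ttl_secs
        self._clock = clock
        self._entries: "OrderedDict[Hashable, float]" = OrderedDict()

    def insert(self, key: Hashable) -> None:
        if key in self._entries:
            # Hard TTL: keep the original timestamp; only refresh LRU order.
            self._entries.move_to_end(key)
            return
        self._entries[key] = self._clock()
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def contains(self, key: Hashable) -> bool:
        inserted_at = self._entries.get(key)
        if inserted_at is None:
            return False
        expired = (
            self._ttl_secs >= 0
            and self._clock() - inserted_at > self._ttl_secs
        )
        if expired:
            del self._entries[key]  # pop and report a miss
            return False
        self._entries.move_to_end(key)
        return True

    def batch_contains(self, keys: Iterable[Hashable]) -> List[bool]:
        return [self.contains(k) for k in keys]
```

With `ttl_secs=10`, a key inserted at t=0 and re-inserted at t=6 still expires once its age exceeds 10 seconds, because the re-insert does not touch the stored timestamp.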
C. Prometheus Metrics Instrumentation (metrics.py, manager.py, metadata_cache.py)
• vllm_llmd_fs_metadata_cache_lookup_duration_seconds (Histogram): Tracks single-key metadata lookup latency.
• vllm_llmd_fs_metadata_cache_lookup_blocks (Counter, labeled result=["mem_hit", "fs_hit", "fs_miss"]): Counts manager lookup outcomes by category.
• vllm_llmd_fs_metadata_cache_entries (Gauge): Tracks the current number of entries in the in-memory positive cache.
• vllm_llmd_fs_metadata_cache_evictions (Counter, labeled type=["lru", "ttl"]): Distinguishes LRU capacity evictions from TTL expirations.
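For orientation, the four metrics above could be declared with the `prometheus_client` library roughly as shown below. This is a sketch under assumptions: the help strings, the private `CollectorRegistry`, and the `record_lookup` helper are hypothetical, and the PR's metrics.py may wire these up differently. Also note that `prometheus_client` exposes Counter samples with a `_total` suffix.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch isolated from the global default.
registry = CollectorRegistry()

LOOKUP_DURATION = Histogram(
    "vllm_llmd_fs_metadata_cache_lookup_duration_seconds",
    "Single-key metadata lookup latency",
    registry=registry,
)
LOOKUP_BLOCKS = Counter(
    "vllm_llmd_fs_metadata_cache_lookup_blocks",
    "Manager lookup outcomes",
    ["result"],  # mem_hit | fs_hit | fs_miss
    registry=registry,
)
CACHE_ENTRIES = Gauge(
    "vllm_llmd_fs_metadata_cache_entries",
    "Entries currently held in the positive cache",
    registry=registry,
)
EVICTIONS = Counter(
    "vllm_llmd_fs_metadata_cache_evictions",
    "Cache evictions by cause",
    ["type"],  # lru | ttl
    registry=registry,
)


def record_lookup(result: str, started_at: float) -> None:
    """Hypothetical helper: record one lookup's latency and outcome."""
    LOOKUP_DURATION.observe(time.monotonic() - started_at)
    LOOKUP_BLOCKS.labels(result=result).inc()
```

A caller would take `started_at = time.monotonic()` before the lookup and pass the outcome label afterwards, e.g. `record_lookup("mem_hit", started_at)`.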
Verification: