fix/issue 656 default block size factor #1
Closed
miroslavln wants to merge 22 commits into
…lm-d#613) The test stubs vllm.v1.kv_offload.base to load the real manager module in isolation. Commit 8cf550e added get_offload_block_hash to manager.py's imports, but the stub wasn't updated, so collection of tests/test_storage_events.py failed with "cannot import name 'get_offload_block_hash' from 'vllm.v1.kv_offload.base' (unknown location)". Add an identity stub so the import resolves and the existing assertions (which pass plain ints as keys) still hold. Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
* pvc_evictor: walk current FileMapper layout in crawler After llm-d#585 collapsed the FileMapper layout to `<root>/<safe_model_name>_<sha256-12>_r<rank>/<hhh>/<hh>_g<group_idx>/*.bin`, the crawler still walked the pre-llm-d#585 deep tree (`block_size_*/tp_*_pp_size_*/rank_*/<dtype>/...`). Those paths no longer exist on disk, so the crawler discovered zero files and the evictor never freed any blocks. This rewrites `stream_cache_files_with_mapper` to: * find rank directories by their `_r<digits>` suffix anywhere under the cache root, instead of pattern-matching the deprecated deep tree; * iterate the first-level hex bucket ({hhh}) and apply the existing hex_modulo_range sharding; * yield .bin files from any second-level bucket underneath, leaving the `_g<group_idx>` encoding opaque so the walker doesn't need to understand kv-cache groups. The new walker no longer instantiates `FileMapper` (it only inspects on-disk names), so the `FILEMAPPER_AVAILABLE` import guard and the matching early-exit in `crawler_process` are removed. This also unblocks running the evictor container without vllm installed, since the walker was the only consumer of the `from llmd_fs_backend.file_mapper import FileMapper` import — which transitively required vllm. `parse_filemapper_params` is dropped along with its sole caller. Signed-off-by: Miro <mironikolov@google.com> * feat: parameterize first-level hex bucket directory length in pvc crawler Signed-off-by: Miro <mironikolov@google.com> --------- Signed-off-by: Miro <mironikolov@google.com>
* parse hma kv event metadata Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Co-authored-by: Kapil Jain <16477749+kapiljain1989@users.noreply.github.com> * reduce noise Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> --------- Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Co-authored-by: Kapil Jain <16477749+kapiljain1989@users.noreply.github.com>
fix CUDA version mismatch and dev headers symlink - Update default CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 to match the CUDA 13.0 base image and prevent PyTorch compilation version mismatch. - Explicitly parse and update the standard /usr/local/cuda symlink after GKE package installation to resolve missing dev headers (cusparse.h) during compilation Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>
* feat: Emit BlockRemoved events in PVC evictor Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * test: Add deleter and BlockRemoved tests Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * chore: Refactor after FileMapper changes Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * chore: Minor changes Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * chore: Fix lint issues Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * fix: Use only valid paths for events Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * fix: Emit events on shutdown Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> --------- Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
* ci: Wire fs_backend Python tests into CI Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * chore: Clean up branch Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * fix: Use venv for llmd_fs_backend test Python deps Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * chore: Support PVC evictor events Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> --------- Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
…pdates (llm-d#630) Bumps the go-dependencies group with 10 updates in the / directory: | Package | From | To | | --- | --- | --- | | [github.com/alicebob/miniredis/v2](https://github.com/alicebob/miniredis) | `2.35.0` | `2.38.0` | | [github.com/dgraph-io/ristretto/v2](https://github.com/dgraph-io/ristretto) | `2.3.0` | `2.4.0` | | [github.com/docker/docker](https://github.com/docker/docker) | `28.5.1+incompatible` | `28.5.2+incompatible` | | [github.com/fxamacker/cbor/v2](https://github.com/fxamacker/cbor) | `2.7.0` | `2.9.2` | | [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) | `1.22.0` | `1.23.2` | | [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) | `9.7.3` | `9.20.0` | | [github.com/testcontainers/testcontainers-go](https://github.com/testcontainers/testcontainers-go) | `0.40.0` | `0.42.0` | | [go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc](https://github.com/open-telemetry/opentelemetry-go-contrib) | `0.63.0` | `0.69.0` | | [go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc](https://github.com/open-telemetry/opentelemetry-go) | `1.39.0` | `1.44.0` | | [go.uber.org/zap](https://github.com/uber-go/zap) | `1.27.0` | `1.28.0` | Updates `github.com/alicebob/miniredis/v2` from 2.35.0 to 2.38.0 - [Release notes](https://github.com/alicebob/miniredis/releases) - [Changelog](https://github.com/alicebob/miniredis/blob/master/CHANGELOG.md) - [Commits](alicebob/miniredis@v2.35.0...v2.38.0) Updates `github.com/dgraph-io/ristretto/v2` from 2.3.0 to 2.4.0 - [Release notes](https://github.com/dgraph-io/ristretto/releases) - [Changelog](https://github.com/dgraph-io/ristretto/blob/main/CHANGELOG.md) - [Commits](dgraph-io/ristretto@v2.3.0...v2.4.0) Updates `github.com/docker/docker` from 28.5.1+incompatible to 28.5.2+incompatible - [Release notes](https://github.com/docker/docker/releases) - [Commits](moby/moby@v28.5.1...v28.5.2) Updates `github.com/fxamacker/cbor/v2` from 2.7.0 to 
2.9.2 - [Release notes](https://github.com/fxamacker/cbor/releases) - [Commits](fxamacker/cbor@v2.7.0...v2.9.2) Updates `github.com/prometheus/client_golang` from 1.22.0 to 1.23.2 - [Release notes](https://github.com/prometheus/client_golang/releases) - [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md) - [Commits](prometheus/client_golang@v1.22.0...v1.23.2) Updates `github.com/prometheus/client_model` from 0.6.1 to 0.6.2 - [Release notes](https://github.com/prometheus/client_model/releases) - [Commits](prometheus/client_model@v0.6.1...v0.6.2) Updates `github.com/redis/go-redis/v9` from 9.7.3 to 9.20.0 - [Release notes](https://github.com/redis/go-redis/releases) - [Changelog](https://github.com/redis/go-redis/blob/master/RELEASE-NOTES.md) - [Commits](redis/go-redis@v9.7.3...v9.20.0) Updates `github.com/testcontainers/testcontainers-go` from 0.40.0 to 0.42.0 - [Release notes](https://github.com/testcontainers/testcontainers-go/releases) - [Commits](testcontainers/testcontainers-go@v0.40.0...v0.42.0) Updates `go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc` from 0.63.0 to 0.69.0 - [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go-contrib@zpages/v0.63.0...zpages/v0.69.0) Updates `go.opentelemetry.io/otel` from 1.43.0 to 1.44.0 - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go@v1.43.0...v1.44.0) Updates `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` from 1.39.0 to 1.44.0 - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - 
[Commits](open-telemetry/opentelemetry-go@v1.39.0...v1.44.0) Updates `go.opentelemetry.io/otel/sdk` from 1.43.0 to 1.44.0 - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go@v1.43.0...v1.44.0) Updates `go.opentelemetry.io/otel/trace` from 1.43.0 to 1.44.0 - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go@v1.43.0...v1.44.0) Updates `go.uber.org/zap` from 1.27.0 to 1.28.0 - [Release notes](https://github.com/uber-go/zap/releases) - [Changelog](https://github.com/uber-go/zap/blob/master/CHANGELOG.md) - [Commits](uber-go/zap@v1.27.0...v1.28.0) Updates `google.golang.org/grpc` from 1.77.0 to 1.81.1 - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](grpc/grpc-go@v1.77.0...v1.81.1) Updates `google.golang.org/protobuf` from 1.36.10 to 1.36.11 --- updated-dependencies: - dependency-name: github.com/alicebob/miniredis/v2 dependency-version: 2.38.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/dgraph-io/ristretto/v2 dependency-version: 2.4.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/docker/docker dependency-version: 28.5.2+incompatible dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: github.com/fxamacker/cbor/v2 dependency-version: 2.9.2 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/prometheus/client_golang dependency-version: 1.23.2 dependency-type: direct:production update-type: 
version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/prometheus/client_model dependency-version: 0.6.2 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: github.com/redis/go-redis/v9 dependency-version: 9.20.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/testcontainers/testcontainers-go dependency-version: 0.42.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc dependency-version: 0.69.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: go.opentelemetry.io/otel dependency-version: 1.44.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc dependency-version: 1.44.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: go.opentelemetry.io/otel/sdk dependency-version: 1.44.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: go.opentelemetry.io/otel/trace dependency-version: 1.44.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: go.uber.org/zap dependency-version: 1.28.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: google.golang.org/grpc dependency-version: 1.81.1 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - 
dependency-name: google.golang.org/protobuf dependency-version: 1.36.11 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…lm-d#589) The 16 MB floor on staging buffers caused every per-file write to pad the buffer via cudaHostAlloc, then write the full padded size to disk. For small-block models (e.g. Llama 3.1 8B where block size is ~7 MB) this more than doubles the on-disk footprint (7 MB -> 16 MB per file) and degrades TTFT from 2.8 s to 6.5 s at 40k-token prefix length. Remove the constant so allocate_staging_buffer uses exactly the size that calc_staging_bytes computes. Fixes llm-d#454 Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com>
The vllm 0.22 KV-offload Python API (base, worker, OffloadPromMetrics) is unchanged from 0.21, so no import/signature changes are needed. The one functional change is in spec.py: adopt the OffloadingSpec base class's $self.hash_block_size$ (resolved via resolve_kv_cache_block_sizes) instead of $vllm_config.cache_config.block_size$ when computing gpu_blocks_per_file. The two are equal for standard single-group models, but cache_config.block_size can be larger on hybrid models, so the base value is the correct hash granularity. Single-group only (no HMA). Bump pins: vllm 0.21.0 -> 0.22.0, package version 0.21 -> 0.22. README, Dockerfile.dev base image, and vllm-storage.yaml deployment example updated accordingly. Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
…lm-d#607) * csrc: batch KV block copies via cudaMemcpyBatchAsync Submit all per-(block, layer) copies in one driver call instead of N cudaMemcpyAsync calls. Enabled by default; toggle off with USE_BATCH_MEMCPY_READ / USE_BATCH_MEMCPY_WRITE=0. Requires CUDA 12.8+. Speeds up KV-cache offload writes/reads when per-layer DMA sizes are small enough that driver dispatch dominates. Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> * csrc: fall back to per-call DMA on CUDA < 12.8 cudaMemcpyBatchAsync was introduced in CUDA 12.8 — guard the batch path with #if CUDA_VERSION >= 12080 and route to the per-call cudaMemcpyAsync loop below that. Default USE_BATCH_MEMCPY_* off on older toolchains so the env knob still makes sense. Also drop thread_local on the attrs/attrs_idx inputs (never mutated, no per-thread duplication needed) and move the copy_blocks dispatcher below the helpers it dispatches to. Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> --------- Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
…lm-d#632) llmd-fs-connector==0.22 (llm-d v0.8 / vLLM v0.22) is the final release of the standalone llm-d FS connector. The filesystem offloading logic is now upstreamed into vLLM as the FS tier of the multi-tier offloading connector (TieringOffloadingSpec); all new features and support continue there. Add an [!IMPORTANT] banner to the connector README and a short note in the root README's Connectors & Utilities list, linking the vLLM KV offloading guide (vllm-project/vllm#44415). Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
Signed-off-by: Dong Ma <winterma.dong@gmail.com>
Add a `latest` input (default `false`) to the shared docker-build-and-push action and wire it up from `ci-release.yaml` and `ci-release-uds-tokenizer.yaml`. The callers set it to `true` only when the triggering event is a non-prerelease GitHub Release. Previously the release workflows only pushed the immutable `vX.Y.Z` tag (and `vllm-v*` for the UDS tokenizer), so the floating `:latest` tag on ghcr.io was never refreshed after the initial manual push. That left `ghcr.io/llm-d/llm-d-uds-tokenizer:latest` pointing at a 29-day-old build while `v0.8.0` had been published 9 days earlier, causing version skew with sibling components whose `latest` tag stayed current. Dev / PR / pre-release / workflow_dispatch builds intentionally keep the default (`false`) so the floating tag is never bumped by a non-release artifact. Signed-off-by: Kay Yan <kay.yan@daocloud.io>
…lm-d#619) * feat(evictor): background empty directory cleanup Empty cache directories accumulate as files are evicted. This adds a background folder-cleaner process (P(N+3), gated by ENABLE_DIR_CLEANUP) that removes them. How it works: - The crawler detects empty rank/{hhh}/{hh} directories during its sweep and the deleter offers each freshly-emptied parent directory after a batch delete. Both feed a shared folder_queue. - The folder cleaner pulls paths off the queue and removes them with os.rmdir, which is inherently safe: it is a no-op if a file has landed in the directory in the meantime. Safety: - queue_folder skips directories modified within DIR_CLEANUP_TTL_SECONDS (default 120s) so we don't race a writer that just created a bucket and is about to populate it. This is defense-in-depth on top of rmdir's empty-only semantics. Config / Helm: - New ENABLE_DIR_CLEANUP (default true) and DIR_CLEANUP_TTL_SECONDS (default 120) env vars, wired through config.py, the Helm values and Deployment template, and documented in CONFIGURATION.md. Reporting: - The folder cleaner reports folders_purged via a new folder_cleaner_stats channel surfaced in the aggregated log. The crawler's per-sweep counter is named empty_folders_queued to reflect that it counts directories handed to the cleaner, not directories it deleted itself. The deleter's progress/done result-queue protocol is left unchanged. 
Signed-off-by: Miro <mironikolov@google.com> * Fix pvc_evictor unit tests due to delete_file_batch signature update Signed-off-by: Miro <mironikolov@google.com> * test: merge empty directory cleanup unit tests from PR llm-d#625 Signed-off-by: Miro <mironikolov@google.com> * fix(test): use moby container package instead of docker docker package to fix testcontainers-go build error Signed-off-by: Miro <mironikolov@google.com> * fix(ci): update golangci-lint configuration format Signed-off-by: Miro <mironikolov@google.com> * fix(kvblock): resolve ZRevRange deprecation and lll warnings Signed-off-by: Miro <mironikolov@google.com> * fix(test): resolve lll and unused gocritic lint warnings in uds_e2e_suite_test.go Signed-off-by: Miro <mironikolov@google.com> * Fix the linting errors Signed-off-by: Miro <mironikolov@google.com> --------- Signed-off-by: Miro <mironikolov@google.com>
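The queue-then-rmdir flow described for the folder cleaner can be sketched roughly as follows. This is a minimal single-process illustration, not the evictor's code: `queue_folder` is named in the commit message, but its signature here, the `purge_folders` helper, and the use of a `deque` in place of the shared `folder_queue` are assumptions. Note that Python's `os.rmdir` raises `OSError` on a non-empty directory rather than silently doing nothing, so the sketch catches that error to get the "leave it alone" behavior.

```python
import os
import time
from collections import deque


def queue_folder(path, folder_queue, ttl_seconds=120.0, now=None):
    """Offer a directory for cleanup unless it was modified too recently.

    Hypothetical signature; the 120 s default mirrors DIR_CLEANUP_TTL_SECONDS.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return False  # already gone, nothing to queue
    if now - mtime < ttl_seconds:
        return False  # a writer may be about to populate this bucket
    folder_queue.append(path)
    return True


def purge_folders(folder_queue):
    """Remove queued directories; return how many were actually removed."""
    purged = 0
    while folder_queue:
        path = folder_queue.popleft()
        try:
            os.rmdir(path)  # succeeds only if the directory is empty
            purged += 1
        except OSError:
            pass  # non-empty or already removed: leave it alone
    return purged
```

The TTL check and the empty-only semantics of `os.rmdir` are independent guards, which is the defense-in-depth the commit message refers to.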
* add hma group identity to kvblock entries Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * learn hma groups from kv events Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * expose hma group catalog Expose the learned group catalog so scorer follow-up work can use the event-derived metadata. Co-authored-by: Kapil Jain <kapiljain1989@gmail.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * fix redis pod entry encoding Handle JSON encoding errors and store PodEntry directly for runtime Redis index state. Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * test namings Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * use only redis field Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * encode decode func namings for redis Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * remove redundant TestPodEntryString after redis keys change Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * remove noise from git diff Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> --------- Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Co-authored-by: Kapil Jain <kapiljain1989@gmail.com>
feat: Add HMA support to the fs connector --------- Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
…m-d#618) * test(pvc_evictor): add crawler tests, CI, and docs after llm-d#611 Add pytest for stream_cache_files_with_mapper, pvc-evictor CI workflow, dev Makefile/requirements, Dockerfile comments for llm-d#605 storage events, and docs for the flat fs_backend layout. Keeps llmd_fs_backend in the image for upcoming llm-d#605; crawler stays path-only per llm-d#611. Follow-up to llm-d#601 / llm-d#611. Signed-off-by: Guy Girmonsky <guygir@gmail.com> * fix(pvc_evictor): ruff import order in test conftest Signed-off-by: Guy Girmonsky <guygir@gmail.com> --------- Signed-off-by: Guy Girmonsky <guygir@gmail.com>
When block_size is absent from kv_connector_extra_config, the fs
backend defaults offloaded_block_size to 256 tokens, but vLLM's
OffloadingSpec base class only derives block_size_factor when
block_size is explicitly present, leaving it at 1. The scheduler then
emits one offload key per GPU block while the worker consumes one key
per file (gpu_blocks_per_file blocks), so on hybrid models (e.g. Gemma
with sliding-window + full-attention KV cache groups) the second
group's key slice lands inside the first group's keys and every
transfer fails with:
AssertionError: Expected group_idx=1 but key encodes 0
Set block_size_factor = gpu_blocks_per_file in
SharedStorageOffloadingSpec so the scheduler and worker always agree on
the per-file key granularity, regardless of whether block_size is
configured explicitly. On single-group models the old mismatch did not
assert but silently named files after the wrong block hashes, crippling
the offload hit rate.
Fixes llm-d#656
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Miro <mironikolov@google.com>
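The granularity mismatch described above can be reproduced with a toy model. This is an illustrative sketch only, not vLLM's or the fs backend's code: `scheduler_keys` and `worker_slices` are hypothetical helpers, keys are modeled as `(group_idx, index)` tuples, and the 16-token GPU block size (giving 256 / 16 = 16 GPU blocks per file) is an assumption.

```python
def scheduler_keys(gpu_blocks_per_group, block_size_factor):
    """Emit one key per `block_size_factor` GPU blocks, group after group."""
    keys = []
    for group_idx, num_blocks in enumerate(gpu_blocks_per_group):
        keys += [(group_idx, i) for i in range(num_blocks // block_size_factor)]
    return keys


def worker_slices(keys, gpu_blocks_per_group, gpu_blocks_per_file):
    """Slice the flat key list assuming one key per file for each group."""
    slices, start = [], 0
    for num_blocks in gpu_blocks_per_group:
        count = num_blocks // gpu_blocks_per_file
        slices.append(keys[start:start + count])
        start += count
    return slices


groups = [32, 32]          # two KV cache groups, 32 GPU blocks each (assumed)
gpu_blocks_per_file = 16   # 256-token files / 16-token GPU blocks (assumed)

# Old behavior: factor left at 1, so group 1's slice falls inside group 0's keys.
bad = worker_slices(scheduler_keys(groups, 1), groups, gpu_blocks_per_file)
print([g for g, _ in bad[1]])

# Fixed behavior: factor equals gpu_blocks_per_file, so the slices line up.
good = worker_slices(scheduler_keys(groups, gpu_blocks_per_file), groups, gpu_blocks_per_file)
print([g for g, _ in good[1]])
```

In the mismatched case the second slice contains keys that encode group 0, which corresponds to the `Expected group_idx=1 but key encodes 0` failure quoted in the commit message.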
vLLM's OffloadingSpec base class derives block_size_factor from
extra_config["block_size"] and asserts that all KV cache groups share
one GPU block size to do so. Hybrid models like Gemma 4 have groups
with different block sizes, so explicitly configuring "block_size"
crashed at startup with:
AssertionError: If 'block_size' is specified in
kv_connector_extra_config, there must be at least one KV cache
group, and all groups must have the same block size.
The fs backend does not need that uniformity: it sizes files in
hash_block_size (GCD of group block sizes) granularity and already
derives block_size_factor itself. Hide "block_size" from the base
class during super().__init__() (restoring it afterwards) so the
uniformity assert is never reached and explicit and default block_size
configurations behave identically.
Fixes llm-d#657
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Miro <mironikolov@google.com>
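The hide-and-restore pattern described above could look roughly like this. It is a sketch under stated assumptions: `BaseSpec` is a hypothetical stand-in that only mimics the uniformity assert quoted in the commit message, `FsSpec` stands in for SharedStorageOffloadingSpec, and the constructor arguments are invented for illustration. The 256-token default and the GCD-based `hash_block_size` come from the commit messages.

```python
import math
from functools import reduce

_MISSING = object()  # sentinel: distinguishes "key absent" from any real value


class BaseSpec:
    """Hypothetical stand-in for the base class's block_size handling."""

    def __init__(self, extra_config, group_block_sizes):
        self.block_size_factor = 1
        if "block_size" in extra_config:
            assert len(set(group_block_sizes)) == 1, (
                "all groups must have the same block size"
            )
            self.block_size_factor = extra_config["block_size"] // group_block_sizes[0]


class FsSpec(BaseSpec):
    """Hides "block_size" from the base class, then derives the factor itself."""

    def __init__(self, extra_config, group_block_sizes):
        hidden = extra_config.pop("block_size", _MISSING)
        try:
            super().__init__(extra_config, group_block_sizes)
        finally:
            if hidden is not _MISSING:
                extra_config["block_size"] = hidden  # restore for later readers
        offloaded_block_size = extra_config.get("block_size", 256)
        hash_block_size = reduce(math.gcd, group_block_sizes)
        self.gpu_blocks_per_file = offloaded_block_size // hash_block_size
        self.block_size_factor = self.gpu_blocks_per_file
```

Restoring the key in a `finally` block keeps the caller's config intact even if the base constructor raises for an unrelated reason.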
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
fs_backend: add get_offload_block_hash to storage_events test stub (llm-d/llm-d-kv-cache#613)
The test stubs vllm.v1.kv_offload.base to load the real manager
module in isolation. Commit 8cf550e added get_offload_block_hash
to manager.py's imports, but the stub wasn't updated, so collection
of tests/test_storage_events.py failed with "cannot import name
'get_offload_block_hash' from 'vllm.v1.kv_offload.base' (unknown
location)". Add an identity stub so the import resolves and the
existing assertions (which pass plain ints as keys) still hold.
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
fix(pvc_evictor): walk current FileMapper layout in crawler (llm-d/llm-d-kv-cache#611)
After llm-d/llm-d-kv-cache#585 (fs_backend: collapse FileMapper path + add config.json audit) collapsed the FileMapper layout to
`<root>/<safe_model_name>_<sha256-12>_r<rank>/<hhh>/<hh>_g<group_idx>/*.bin`,
the crawler still walked the pre-#585 deep tree
(`block_size_*/tp_*_pp_size_*/rank_*/<dtype>/...`). Those paths no
longer exist on disk, so the crawler discovered zero files and the
evictor never freed any blocks.
This rewrites `stream_cache_files_with_mapper` to:
* find rank directories by their `_r<digits>` suffix anywhere under
  the cache root, instead of pattern-matching the deprecated deep tree;
* iterate the first-level hex bucket ({hhh}) and apply the existing
  hex_modulo_range sharding;
* yield .bin files from any second-level bucket underneath, leaving
  the `_g<group_idx>` encoding opaque so the walker doesn't need to
  understand kv-cache groups.
The new walker no longer instantiates `FileMapper` (it only inspects
on-disk names), so the `FILEMAPPER_AVAILABLE` import guard and the
matching early-exit in `crawler_process` are removed. This also
unblocks running the evictor container without vllm installed, since
the walker was the only consumer of the
`from llmd_fs_backend.file_mapper import FileMapper` import, which
transitively required vllm.
`parse_filemapper_params` is dropped along with its sole caller.
Signed-off-by: Miro mironikolov@google.com
* feat: parameterize first-level hex bucket directory length in pvc crawler
Signed-off-by: Miro mironikolov@google.com
---------
Signed-off-by: Miro mironikolov@google.com
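A path-only walker along the lines described above might be sketched as follows. This is not the code of `stream_cache_files_with_mapper`: the function names, the `hex_len` parameter, and in particular the `modulo`/`shard` arguments are hypothetical, and the real hex_modulo_range sharding may select buckets differently.

```python
import os
import re
from typing import Iterator

RANK_DIR_RE = re.compile(r"_r\d+$")  # e.g. "<safe_model_name>_<sha256-12>_r0"


def _walk_rank_dir(rank_dir: str, hex_len: int, modulo: int, shard: int) -> Iterator[str]:
    for bucket in sorted(os.listdir(rank_dir)):
        bucket_path = os.path.join(rank_dir, bucket)
        if len(bucket) != hex_len or not os.path.isdir(bucket_path):
            continue
        try:
            value = int(bucket, 16)  # first-level bucket name is hex ({hhh})
        except ValueError:
            continue
        if value % modulo != shard:  # assumed sharding rule, for illustration
            continue
        # Second-level names (<hh>_g<group_idx>) are treated as opaque.
        for sub in sorted(os.listdir(bucket_path)):
            sub_path = os.path.join(bucket_path, sub)
            if not os.path.isdir(sub_path):
                continue
            for fname in sorted(os.listdir(sub_path)):
                if fname.endswith(".bin"):
                    yield os.path.join(sub_path, fname)


def stream_bin_files(root: str, hex_len: int = 3, modulo: int = 1, shard: int = 0) -> Iterator[str]:
    """Yield .bin files under any directory whose name ends in _r<digits>."""
    for dirpath, dirnames, _ in os.walk(root):
        for name in list(dirnames):
            if RANK_DIR_RE.search(name):
                dirnames.remove(name)  # handled below; keep os.walk out of it
                yield from _walk_rank_dir(os.path.join(dirpath, name), hex_len, modulo, shard)
```

Because the walker only inspects directory and file names, it needs nothing beyond the standard library, which matches the commit's point about running the evictor without vllm installed.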
fix(pvc_evictor): remove unused re import after #611 (llm-d/llm-d-kv-cache#615)
Signed-off-by: Guy Girmonsky guygir@gmail.com
feat(kvevents): parse HMA KV event metadata (llm-d/llm-d-kv-cache#612)
* parse hma kv event metadata
Signed-off-by: Sage Ahrac sagiahrak@gmail.com
Co-authored-by: Kapil Jain 16477749+kapiljain1989@users.noreply.github.com
* reduce noise
Signed-off-by: Sage Ahrac sagiahrak@gmail.com
---------
Signed-off-by: Sage Ahrac sagiahrak@gmail.com
Co-authored-by: Kapil Jain 16477749+kapiljain1989@users.noreply.github.com
fix Dockerfile.dev breakage (llm-d/llm-d-kv-cache#620: fix Dockerfile.dev breakage for llmd_fs_backend)
fix CUDA version mismatch and dev headers symlink
- Update default CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 to
  match the CUDA 13.0 base image and prevent PyTorch compilation
  version mismatch.
- Explicitly parse and update the standard /usr/local/cuda symlink
  after GKE package installation to resolve missing dev headers
  (cusparse.h) during compilation
Signed-off-by: Saikat Roychowdhury saikat.royc85@gmail.com
feat: Add PVC evictor BlockRemoved events (llm-d/llm-d-kv-cache#605)
* feat: Emit BlockRemoved events in PVC evictor
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* test: Add deleter and BlockRemoved tests
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* chore: Refactor after FileMapper changes
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* chore: Minor changes
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* chore: Fix lint issues
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* fix: Use only valid paths for events
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* fix: Emit events on shutdown
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
---------
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
ci: Wire fs_backend Python tests into CI (llm-d/llm-d-kv-cache#578)
* ci: Wire fs_backend Python tests into CI
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* chore: Clean up branch
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* fix: Use venv for llmd_fs_backend test Python deps
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
* chore: Support PVC evictor events
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
---------
Signed-off-by: Alberto Perdomo aperdomo@redhat.com
deps(go): bump the go-dependencies group across 1 directory with 16 updates (llm-d/llm-d-kv-cache#630)
Bumps the go-dependencies group with 10 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| github.com/alicebob/miniredis/v2 | `2.35.0` | `2.38.0` |
| github.com/dgraph-io/ristretto/v2 | `2.3.0` | `2.4.0` |
| github.com/docker/docker | `28.5.1+incompatible` | `28.5.2+incompatible` |
| github.com/fxamacker/cbor/v2 | `2.7.0` | `2.9.2` |
| github.com/prometheus/client_golang | `1.22.0` | `1.23.2` |
| github.com/redis/go-redis/v9 | `9.7.3` | `9.20.0` |
| github.com/testcontainers/testcontainers-go | `0.40.0` | `0.42.0` |
| go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc | `0.63.0` | `0.69.0` |
| go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc | `1.39.0` | `1.44.0` |
| go.uber.org/zap | `1.27.0` | `1.28.0` |

Updates `github.com/alicebob/miniredis/v2` from 2.35.0 to 2.38.0
Updates `github.com/dgraph-io/ristretto/v2` from 2.3.0 to 2.4.0
Updates `github.com/docker/docker` from 28.5.1+incompatible to 28.5.2+incompatible
Updates `github.com/fxamacker/cbor/v2` from 2.7.0 to 2.9.2
Updates `github.com/prometheus/client_golang` from 1.22.0 to 1.23.2
Updates `github.com/prometheus/client_model` from 0.6.1 to 0.6.2
Updates `github.com/redis/go-redis/v9` from 9.7.3 to 9.20.0
Updates `github.com/testcontainers/testcontainers-go` from 0.40.0 to 0.42.0
Updates `go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc` from 0.63.0 to 0.69.0
Updates `go.opentelemetry.io/otel` from 1.43.0 to 1.44.0
Updates `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` from 1.39.0 to 1.44.0
Updates `go.opentelemetry.io/otel/sdk` from 1.43.0 to 1.44.0
Updates `go.opentelemetry.io/otel/trace` from 1.43.0 to 1.44.0
Updates `go.uber.org/zap` from 1.27.0 to 1.28.0
Updates `google.golang.org/grpc` from 1.77.0 to 1.81.1
Updates `google.golang.org/protobuf` from 1.36.10 to 1.36.11
---
updated-dependencies:
- dependency-name: github.com/alicebob/miniredis/v2
  dependency-version: 2.38.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/dgraph-io/ristretto/v2
  dependency-version: 2.4.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/docker/docker
  dependency-version: 28.5.2+incompatible
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: github.com/fxamacker/cbor/v2
  dependency-version: 2.9.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/prometheus/client_golang
  dependency-version: 1.23.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/prometheus/client_model
  dependency-version: 0.6.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: github.com/redis/go-redis/v9
  dependency-version: 9.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/testcontainers/testcontainers-go
  dependency-version: 0.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
  dependency-version: 0.69.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: go.opentelemetry.io/otel
  dependency-version: 1.44.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
  dependency-version: 1.44.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: go.opentelemetry.io/otel/sdk
  dependency-version: 1.44.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: go.opentelemetry.io/otel/trace
  dependency-version: 1.44.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: go.uber.org/zap
  dependency-version: 1.28.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: google.golang.org/grpc
  dependency-version: 1.81.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: google.golang.org/protobuf
  dependency-version: 1.36.11
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
...
Signed-off-by: dependabot[bot] support@github.com
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
fix: remove MIN_STAGING_BUFFER_SIZE that inflates KV cache on disk (llm-d/llm-d-kv-cache#589)
The 16 MB floor on staging buffers caused every per-file write to pad
the buffer via cudaHostAlloc, then write the full padded size to disk.
For small-block models (e.g. Llama 3.1 8B where block size is ~7 MB)
this more than doubles the on-disk footprint (7 MB -> 16 MB per file)
and degrades TTFT from 2.8 s to 6.5 s at 40k-token prefix length.
Remove the constant so allocate_staging_buffer uses exactly the size
that calc_staging_bytes computes.
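The footprint arithmetic above can be checked with a small sketch. This is an illustration only: `staged_bytes` is a hypothetical helper, not the connector's `calc_staging_bytes`, and the 7 MiB block size is the approximate Llama 3.1 8B figure quoted in this message.

```python
MIB = 1024 * 1024
MIN_STAGING_BUFFER_SIZE = 16 * MIB  # the floor this commit removes


def staged_bytes(block_bytes: int, floor: int = 0) -> int:
    """Bytes written per file when the staging buffer is padded up to `floor`.

    Hypothetical helper for illustration; with floor=0 the buffer is exactly
    as large as the data it stages.
    """
    return max(block_bytes, floor)


block = 7 * MIB  # approximate per-file payload for a small-block model
with_floor = staged_bytes(block, MIN_STAGING_BUFFER_SIZE)
without_floor = staged_bytes(block)

# Per-file size in MiB with and without the floor, and the inflation ratio.
print(with_floor // MIB, without_floor // MIB, round(with_floor / without_floor, 2))
# prints: 16 7 2.29
```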
Fixes llm-d/llm-d-kv-cache#454
Signed-off-by: Jonathan Wrede wrede.jonathan00@gmail.com
style(fs-backend): collapse double blank line in thread_pool.cpp (llm-d/llm-d-kv-cache#634)
clang-format (v21) flags the double blank line left after llm-d/llm-d-kv-cache#589 removed
MIN_STAGING_BUFFER_SIZE. Fixes the repo-wide pre-commit lint gate that
currently fails on all PRs.
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
fs_backend: migrate kv_offload API to vllm 0.22.0 (llm-d/llm-d-kv-cache#628)
The vllm 0.22 KV-offload Python API (base, worker, OffloadPromMetrics) is
unchanged from 0.21, so no import/signature changes are needed.
The one functional change is in spec.py: adopt the OffloadingSpec base
class's self.hash_block_size (resolved via resolve_kv_cache_block_sizes)
instead of vllm_config.cache_config.block_size when computing
gpu_blocks_per_file. The two are equal for standard single-group models,
but cache_config.block_size can be larger on hybrid models, so the base
value is the correct hash granularity. Single-group only (no HMA).
Bump pins: vllm 0.21.0 -> 0.22.0, package version 0.21 -> 0.22. README,
Dockerfile.dev base image, and vllm-storage.yaml deployment example
updated accordingly.
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector (llm-d/llm-d-kv-cache#607)
Submit all per-(block, layer) copies in one driver call instead of N
cudaMemcpyAsync calls. Enabled by default; toggle off with
USE_BATCH_MEMCPY_READ / USE_BATCH_MEMCPY_WRITE=0. Requires CUDA 12.8+.
Speeds up KV-cache offload writes/reads when per-layer DMA sizes are
small enough that driver dispatch dominates.
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
cudaMemcpyBatchAsync was introduced in CUDA 12.8 — guard the batch
path with #if CUDA_VERSION >= 12080 and route to the per-call
cudaMemcpyAsync loop below that. Default USE_BATCH_MEMCPY_* off on
older toolchains so the env knob still makes sense.
Also drop thread_local on the attrs/attrs_idx inputs (never mutated,
no per-thread duplication needed) and move the copy_blocks dispatcher
below the helpers it dispatches to.
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
docs(fs-backend): note upstreaming into vLLM multi-tier offloading (llm-d/llm-d-kv-cache#632)
llmd-fs-connector==0.22 (llm-d v0.8 / vLLM v0.22) is the final release of
the standalone llm-d FS connector. The filesystem offloading logic is now
upstreamed into vLLM as the FS tier of the multi-tier offloading connector
(TieringOffloadingSpec); all new features and support continue there.
Add an [!IMPORTANT] banner to the connector README and a short note in the
root README's Connectors & Utilities list, linking the vLLM KV offloading
guide (vllm-project/vllm#44415).
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
feat(kvblock): trace index add and evict (llm-d/llm-d-kv-cache#637)
Signed-off-by: Dong Ma winterma.dong@gmail.com
ci: refresh :latest tag only on stable GitHub Release (llm-d/llm-d-kv-cache#624)
Add a latest input (default false) to the shared docker-build-and-push
action and wire it up from ci-release.yaml and
ci-release-uds-tokenizer.yaml. The callers set it to true only when the
triggering event is a non-prerelease GitHub Release.
Previously the release workflows only pushed the immutable vX.Y.Z tag
(and vllm-v* for the UDS tokenizer), so the floating :latest tag on
ghcr.io was never refreshed after the initial manual push. That left
ghcr.io/llm-d/llm-d-uds-tokenizer:latest pointing at a 29-day-old build
while v0.8.0 had been published 9 days earlier, causing version skew with
sibling components whose latest tag stayed current.
Dev / PR / pre-release / workflow_dispatch builds intentionally keep the
default (false) so the floating tag is never bumped by a non-release
artifact.
Signed-off-by: Kay Yan kay.yan@daocloud.io
feat(evictor): implement background empty directory cleanup process (llm-d/llm-d-kv-cache#619)
Empty cache directories accumulate as files are evicted. This adds a
background folder-cleaner process (P(N+3), gated by ENABLE_DIR_CLEANUP)
that removes them.
How it works:
and the deleter offers each freshly-emptied parent directory after a
batch delete. Both feed a shared folder_queue.
os.rmdir, which is inherently safe: it is a no-op if a file has landed
in the directory in the meantime.
Safety:
(default 120s) so we don't race a writer that just created a bucket and
is about to populate it. This is defense-in-depth on top of rmdir's
empty-only semantics.
Config / Helm:
(default 120) env vars, wired through config.py, the Helm values and
Deployment template, and documented in CONFIGURATION.md.
Reporting:
channel surfaced in the aggregated log. The crawler's per-sweep counter
is named empty_folders_queued to reflect that it counts directories
handed to the cleaner, not directories it deleted itself. The deleter's
progress/done result-queue protocol is left unchanged.
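A minimal sketch of the removal step described above, assuming a minimum-age check on the directory's mtime followed by os.rmdir. The function name and the `min_age_seconds` parameter are hypothetical and are not taken from the evictor's code; only os.rmdir's empty-only behavior and the 120 s default come from this message.

```python
import os
import time
from typing import Optional


def remove_if_empty_and_old(
    path: str, min_age_seconds: float = 120.0, now: Optional[float] = None
) -> bool:
    """Remove `path` if it is empty and older than the grace period.

    Returns True only when the directory was actually removed.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
        if age < min_age_seconds:
            # Too fresh: a writer may have just created this bucket.
            return False
        # os.rmdir only removes empty directories; it raises OSError
        # (ENOTEMPTY) if a file landed here in the meantime.
        os.rmdir(path)
        return True
    except OSError:
        # Directory is non-empty, already gone, or otherwise not removable:
        # leave the filesystem untouched.
        return False
```

Because os.rmdir itself refuses non-empty directories, the age check is only a second line of defense against racing a writer, which matches the defense-in-depth framing above.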
Signed-off-by: Miro mironikolov@google.com
feat(kvevents): Track HMA group identity in kv cache index (llm-d/llm-d-kv-cache#627)
Signed-off-by: Sage Ahrac sagiahrak@gmail.com
Expose the learned group catalog so scorer follow-up work can use the event-derived metadata.
Co-authored-by: Kapil Jain kapiljain1989@gmail.com
Signed-off-by: Sage Ahrac sagiahrak@gmail.com
Handle JSON encoding errors and store PodEntry directly for runtime Redis index state.
Signed-off-by: Sage Ahrac sagiahrak@gmail.com
Co-authored-by: Kapil Jain kapiljain1989@gmail.com
feat: Add HMA support to FS connector (llm-d/llm-d-kv-cache#476)
feat: Add HMA support to the fs connector
Signed-off-by: Kfir Toledo kfir.toledo@ibm.com
PVC Evictor: add crawler tests, CI, and docs after layout changes (llm-d/llm-d-kv-cache#618)
Add pytest for stream_cache_files_with_mapper, pvc-evictor CI workflow,
dev Makefile/requirements, Dockerfile comments for llm-d/llm-d-kv-cache#605 storage events,
and docs for the flat fs_backend layout.
Keeps llmd_fs_backend in the image for upcoming llm-d/llm-d-kv-cache#605; crawler stays
path-only per llm-d/llm-d-kv-cache#611. Follow-up to llm-d/llm-d-kv-cache#601 / llm-d/llm-d-kv-cache#611.
Signed-off-by: Guy Girmonsky guygir@gmail.com
fix (fix(make): use linux platform in Docker build targets llm-d/llm-d-kv-cache#631)
Signed-off-by: Alex alex.tech.lab@outlook.com
fs_backend: sync block_size_factor with default offloaded block size
When block_size is absent from kv_connector_extra_config, the fs
backend defaults offloaded_block_size to 256 tokens, but vLLM's
OffloadingSpec base class only derives block_size_factor when
block_size is explicitly present, leaving it at 1. The scheduler then
emits one offload key per GPU block while the worker consumes one key
per file (gpu_blocks_per_file blocks), so on hybrid models (e.g. Gemma
with sliding-window + full-attention KV cache groups) the second
group's key slice lands inside the first group's keys and every
transfer fails with:
Set block_size_factor = gpu_blocks_per_file in
SharedStorageOffloadingSpec so the scheduler and worker always agree on
the per-file key granularity, regardless of whether block_size is
configured explicitly. On single-group models the old mismatch did not
assert but silently named files after the wrong block hashes, crippling
the offload hit rate.
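The key-count mismatch can be illustrated with a small calculation. This is a sketch under stated assumptions: the two helper functions, the 16-token GPU block size, and the 64-block request are hypothetical, and gpu_blocks_per_file is assumed to be the offloaded block size divided by the GPU block size. Only the 256-token default and the factor values (1 before, gpu_blocks_per_file after) come from this message.

```python
def keys_emitted(num_gpu_blocks: int, block_size_factor: int) -> int:
    """Offload keys the scheduler emits for one KV cache group (hypothetical model)."""
    return num_gpu_blocks // block_size_factor


def keys_consumed(num_gpu_blocks: int, gpu_blocks_per_file: int) -> int:
    """Offload keys the worker consumes: one per file (hypothetical model)."""
    return num_gpu_blocks // gpu_blocks_per_file


gpu_block_size = 16          # assumed GPU block size in tokens
offloaded_block_size = 256   # fs backend default when block_size is absent
gpu_blocks_per_file = offloaded_block_size // gpu_block_size  # assumed relation
num_gpu_blocks = 64          # assumed request size for one group

# Before the fix: block_size_factor stays at 1, so the two sides disagree.
before = (keys_emitted(num_gpu_blocks, 1), keys_consumed(num_gpu_blocks, gpu_blocks_per_file))

# After the fix: block_size_factor = gpu_blocks_per_file, so they agree.
after = (
    keys_emitted(num_gpu_blocks, gpu_blocks_per_file),
    keys_consumed(num_gpu_blocks, gpu_blocks_per_file),
)

print(before, after)
# prints: (64, 4) (4, 4)
```

With more than one KV cache group, a worker that expects 4 keys per group but receives 64 would slice the second group's keys out of the first group's range, which is the failure mode described above.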
Fixes llm-d/llm-d-kv-cache#656
Co-Authored-By: Claude Fable 5 noreply@anthropic.com
Signed-off-by: Miro mironikolov@google.com
fs_backend: accept explicit block_size on hybrid models
vLLM's OffloadingSpec base class derives block_size_factor from
extra_config["block_size"] and asserts that all KV cache groups share
one GPU block size to do so. Hybrid models like Gemma 4 have groups
with different block sizes, so explicitly configuring "block_size"
crashed at startup with:
The fs backend does not need that uniformity: it sizes files in
hash_block_size (GCD of group block sizes) granularity and already
derives block_size_factor itself. Hide "block_size" from the base
class during super().__init__() (restoring it afterwards) so the
uniformity assert is never reached and explicit and default block_size
configurations behave identically.
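The hide-and-restore pattern described above might look roughly like this. It is a sketch, not the actual change: `BaseSpec` is a hypothetical stand-in for vLLM's OffloadingSpec that only mimics the uniformity assert, `FsSpec` stands in for SharedStorageOffloadingSpec, and the constructor signatures are invented for illustration.

```python
import math
from functools import reduce

_MISSING = object()  # sentinel: distinguishes "key absent" from any real value


class BaseSpec:
    """Hypothetical stand-in for the base class's block_size handling."""

    def __init__(self, extra_config: dict, group_block_sizes: list):
        self.block_size_factor = 1
        if "block_size" in extra_config:
            # Mimics the uniformity assert that hybrid models trip.
            assert len(set(group_block_sizes)) == 1, "groups must share one block size"
            self.block_size_factor = extra_config["block_size"] // group_block_sizes[0]


class FsSpec(BaseSpec):
    """Hypothetical subclass that derives block_size_factor itself."""

    def __init__(self, extra_config: dict, group_block_sizes: list):
        # Hide "block_size" so the base class never reaches its assert.
        hidden = extra_config.pop("block_size", _MISSING)
        try:
            super().__init__(extra_config, group_block_sizes)
        finally:
            # Restore the key even if the base constructor raises.
            if hidden is not _MISSING:
                extra_config["block_size"] = hidden
        hash_block_size = reduce(math.gcd, group_block_sizes)  # GCD of group sizes
        offloaded_block_size = 256 if hidden is _MISSING else hidden
        self.block_size_factor = offloaded_block_size // hash_block_size


config = {"block_size": 256}
spec = FsSpec(config, [16, 32])  # hybrid model: two different group block sizes
print(spec.block_size_factor, config)
# prints: 16 {'block_size': 256}
```

Restoring the key in a `finally` block keeps the caller's config unchanged on both the success and failure paths, so later readers of the config still see the explicit block_size.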
Fixes llm-d/llm-d-kv-cache#657
Co-Authored-By: Claude Fable 5 noreply@anthropic.com
Signed-off-by: Miro mironikolov@google.com