Conversation

@cheese-head
Contributor

@cheese-head cheese-head commented Jan 27, 2026

Overview:

Add infrastructure to track and report block IDs that fail during asynchronous KV cache load operations in the KVBM connector. This enables vLLM to identify failed blocks and take corrective action (e.g., mark for recomputation).

Details:

  • Add block_ids field to WorkerTransferRequest protocol for tracking which blocks belong to each transfer operation
  • Add failed_block_ids field and get_block_ids_with_load_errors() method to KvConnectorWorker
  • Expose the method through Python bindings (connector_worker.py, dynamo_connector.py)
  • Add failure_tx/failure_rx channel between Scheduler and WorkerSchedulerClient to propagate transfer failures
  • Add drain_failures() method to collect pending failure notifications
  • Track request_id → uuid → block_ids mapping in worker to convert failure notifications to block IDs (see the sketch after this list)
  • Update handle_immediate_result() to send failure notifications when result.status.is_err()
  • Clean up tracking state when requests complete
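
The sketch below illustrates the shape of these additions. The field and map names (block_ids, request_to_blocks, failed_block_ids, the #[serde(default)] attribute) come from this PR; the surrounding struct fields and exact types are simplified stand-ins, not the actual definitions.

```rust
use std::collections::{HashMap, HashSet};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// Simplified stand-in for the real protocol struct: only the fields needed to
// show the new block_ids plumbing are included here.
#[derive(Serialize, Deserialize)]
struct WorkerTransferRequest {
    request_id: String,    // assumed: identifies the originating request
    uuid: Uuid,            // assumed: identifies this transfer operation
    #[serde(default)]      // older senders that omit the field still deserialize
    block_ids: Vec<usize>, // new: blocks covered by this transfer
}

// Worker-side tracking state (names per the PR, container types assumed):
struct LoadErrorTracking {
    // request_id -> (operation uuid -> block ids), recorded when a load is enqueued
    request_to_blocks: HashMap<String, HashMap<Uuid, Vec<usize>>>,
    // accumulated ids of blocks whose loads failed; drained and cleared by
    // get_block_ids_with_load_errors()
    failed_block_ids: HashSet<u32>,
}
```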

Where should the reviewer start?

  1. lib/llm/src/block_manager/connector/scheduler.rs - Core failure notification channel implementation
  2. lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs - Block ID tracking and error aggregation logic
  3. lib/llm/src/block_manager/connector/protocol.rs - Protocol changes (block_ids field)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features
    • Added error tracking for block load operations so the system can report which blocks fail to load, improving visibility into failures and supporting diagnostics and recovery.


Add infrastructure to track block IDs that fail during async KV cache
loading, enabling vLLM to identify and handle failed blocks gracefully.

Changes:
- protocol.rs: Add block_ids field to WorkerTransferRequest
- scheduler.rs: Update tests with new block_ids field
- slot.rs: Populate block_ids for onboard/offload operations
- worker.rs: Add failed_block_ids tracking and get_block_ids_with_load_errors method
- connector_worker.py: Expose get_block_ids_with_load_errors to Python
- dynamo_connector.py: Implement KVConnectorBase_V1 interface method

Signed-off-by: Patrick Riel <priel@nvidia.com>
…acking

- Add failure_tx/failure_rx channel between Scheduler and WorkerSchedulerClient
- Add drain_failures() method to collect pending failure notifications
- Track request_id -> uuid -> block_ids mapping in KvConnectorWorker
- Populate failed_block_ids from scheduler failure notifications
- Clean up block_ids tracking when requests complete

Signed-off-by: Patrick Riel <priel@nvidia.com>
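
A minimal sketch of the failure-notification channel described in the commits above. The names failure_tx, failure_rx, and drain_failures() are from the PR; the channel flavor, payload shape, and signatures are assumptions for illustration only.

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;
use uuid::Uuid;

// Scheduler side: per the PR description, handle_immediate_result() sends a
// notification on this channel when a transfer's result.status is an error.
struct Scheduler {
    failure_tx: mpsc::UnboundedSender<(String, Uuid)>, // (request_id, operation uuid)
}

impl Scheduler {
    fn notify_failure(&self, request_id: String, uuid: Uuid) {
        // Ignore send errors: if the worker side is gone, there is nobody to notify.
        let _ = self.failure_tx.send((request_id, uuid));
    }
}

// Worker side: the client drains whatever failure notifications are pending.
struct WorkerSchedulerClient {
    failure_rx: mpsc::UnboundedReceiver<(String, Uuid)>,
}

impl WorkerSchedulerClient {
    /// Collect all failure notifications queued so far, grouped by request.
    fn drain_failures(&mut self) -> HashMap<String, Vec<Uuid>> {
        let mut failures: HashMap<String, Vec<Uuid>> = HashMap::new();
        while let Ok((request_id, uuid)) = self.failure_rx.try_recv() {
            failures.entry(request_id).or_default().push(uuid);
        }
        failures
    }
}
```

The worker then maps each (request_id, uuid) pair back to block IDs via its request_id → uuid → block_ids table and adds them to failed_block_ids, which is what get_block_ids_with_load_errors() hands back to vLLM.
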
@copy-pr-bot

copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Jan 27, 2026
@cheese-head cheese-head marked this pull request as ready for review January 27, 2026 00:39
@cheese-head cheese-head requested review from a team as code owners January 27, 2026 00:39
Signed-off-by: Patrick Riel <priel@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Jan 27, 2026

Walkthrough

This pull request introduces a block load error tracking system that propagates failed block IDs through the KV connector stack. Methods are added to Python and Rust layers to expose and retrieve failed block IDs, with supporting infrastructure to capture, track, and drain failure notifications across workers and schedulers.

Changes

Cohort / File(s) | Summary
Python connector API layer
lib/bindings/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py, lib/bindings/kvbm/python/kvbm/vllm_integration/connector_worker.py
Added public get_block_ids_with_load_errors() methods to expose failed block IDs; methods delegate to underlying worker/connector or return empty set if unavailable.
Rust protocol
lib/llm/src/block_manager/connector/protocol.rs
Added block_ids: Vec<usize> field to WorkerTransferRequest struct with #[serde(default)] annotation for error tracking during transfers.
Rust worker tracking
lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs
Implemented error tracking state (request_to_blocks, failed_block_ids); added get_block_ids_with_load_errors() trait method; extended onboarding flow to record block IDs per request; exposed method to Python binding. Block IDs now propagated through transfer requests during onboard/offload operations.
Rust scheduler and failure channels
lib/llm/src/block_manager/connector/scheduler.rs
Added failure notification channel infrastructure (failure_rx, failure_tx); new drain_failures() method aggregates pending failures; immediate-transfer failures now propagated via channel; test updates for block_ids field in WorkerTransferRequest.
Rust slot management
lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs
Captures block IDs from onboarding/offload paths and includes them in WorkerTransferRequest payloads for downstream error correlation.
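
For the slot-management cohort, the change amounts to carrying the operation's block IDs on the request it builds. A rough illustration follows; the real slot.rs code differs, every name other than block_ids and WorkerTransferRequest is invented, and the struct is a simplified stand-in rather than the actual protocol type.

```rust
// Hypothetical helper showing the shape of the change: whatever block ids an
// onboard/offload operation covers are copied onto the transfer request so the
// worker can later correlate a failed operation back to concrete blocks.
struct WorkerTransferRequest {
    request_id: String,
    uuid: u128,            // stand-in for the real operation uuid type
    block_ids: Vec<usize>, // field added by this PR
}

fn build_transfer_request(request_id: String, uuid: u128, blocks: &[usize]) -> WorkerTransferRequest {
    WorkerTransferRequest {
        request_id,
        uuid,
        block_ids: blocks.to_vec(),
    }
}
```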

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit hops through the transfer lane,
Tracking blocks through bytes and pain,
When loads go wrong, we note with care,
Each failed ID stored with flair!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main feature: adding infrastructure to track and report block IDs that fail during asynchronous KV cache load operations.
Description check | ✅ Passed | The description follows the template structure with all required sections (Overview, Details, Where should the reviewer start, Related Issues) and provides comprehensive information about the changes.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.




Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs`:
- Around line 293-305: When enqueuing onboarding_operations we currently stash
request->blocks in request_to_blocks but later remove that mapping in
get_finished before we call get_block_ids_with_load_errors, which can drop the
failure->block correlation; fix by ensuring failures are drained/translated into
block IDs before cleanup: call get_block_ids_with_load_errors (or a shared
helper that drains scheduler failures and maps them to block IDs using
request_to_blocks) prior to removing the request's entry from request_to_blocks
in get_finished; factor the draining/mapping logic into a reusable helper (e.g.,
drain_and_map_load_failures) and use it from both the onboarding loop code paths
and the get_finished cleanup to preserve failure→block mappings.
🧹 Nitpick comments (1)
lib/llm/src/block_manager/connector/protocol.rs (1)

152-154: Confirm whether #[serde(default)] should mask missing block_ids.
Defaulting to empty accepts older senders but silently drops error tracking. If deployments are atomic and you want strictness, consider making the field required so deserialization fails fast when it’s absent.

💡 Possible tweak to keep the protocol explicit
-    #[serde(default)]
-    pub block_ids: Vec<usize>,
+    pub block_ids: Vec<usize>,
Based on learnings, keep the spec explicit if backward compatibility isn't required.
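
A small, self-contained illustration of the trade-off flagged above (simplified struct, not the real one): with #[serde(default)] a payload from an older sender that omits block_ids still deserializes, just with an empty vector, so error tracking is silently disabled; without the attribute, deserialization fails fast.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct WithDefault {
    request_id: String,
    #[serde(default)]
    block_ids: Vec<usize>,
}

#[derive(Deserialize)]
struct Strict {
    request_id: String,
    block_ids: Vec<usize>,
}

fn main() {
    let old_sender = r#"{ "request_id": "req-1" }"#; // no block_ids field

    // Accepted: block_ids defaults to [], so the failure->block mapping is empty.
    let lenient: WithDefault = serde_json::from_str(old_sender).unwrap();
    assert!(lenient.block_ids.is_empty());

    // Rejected: surfaces the protocol mismatch immediately.
    assert!(serde_json::from_str::<Strict>(old_sender).is_err());
}
```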

Comment on lines 293 to +305
// immediately enqueue the onboarding operations
for operation in onboarding_operations {
    let request_id = operation.request_id.clone();
    let uuid = operation.uuid;
    let block_ids = operation.block_ids.clone();

    // Store block_ids mapping for error tracking (only for load operations)
    if !block_ids.is_empty() {
        self.request_to_blocks
            .entry(request_id.clone())
            .or_default()
            .insert(uuid, block_ids);
    }

⚠️ Potential issue | 🟠 Major

Avoid losing failure → block mappings on request cleanup.
request_to_blocks is removed in get_finished, but failures are only mapped when get_block_ids_with_load_errors() drains the scheduler. If get_finished runs first, failure notifications for that request can no longer be correlated to block IDs.

🛠️ Suggested fix: drain failures before cleanup and reuse a helper
+    fn absorb_failure_notifications(&mut self) {
+        let failures = self.connector.drain_failures();
+        for (request_id, failed_uuids) in failures {
+            if let Some(uuid_to_blocks) = self.request_to_blocks.get(&request_id) {
+                for uuid in failed_uuids {
+                    if let Some(block_ids) = uuid_to_blocks.get(&uuid) {
+                        for &block_id in block_ids {
+                            self.failed_block_ids.insert(block_id as u32);
+                        }
+                        tracing::warn!(
+                            request_id = %request_id,
+                            operation_id = %uuid,
+                            block_ids = ?block_ids,
+                            "load operation failed; marking blocks as failed"
+                        );
+                    }
+                }
+            }
+        }
+    }
@@
-        for request_id in &is_finished_onboarding {
+        // Drain failures while block-id mappings still exist.
+        self.absorb_failure_notifications();
+        for request_id in &is_finished_onboarding {
             self.maybe_finished_onboarding.remove(request_id);
             // Clean up block_ids tracking for this request
             self.request_to_blocks.remove(request_id);
@@
-    fn get_block_ids_with_load_errors(&mut self) -> HashSet<u32> {
-        // Drain failures from the scheduler and convert (request_id, uuid) -> block_ids
-        let failures = self.connector.drain_failures();
-        for (request_id, failed_uuids) in failures {
-            if let Some(uuid_to_blocks) = self.request_to_blocks.get(&request_id) {
-                for uuid in failed_uuids {
-                    if let Some(block_ids) = uuid_to_blocks.get(&uuid) {
-                        for &block_id in block_ids {
-                            self.failed_block_ids.insert(block_id as u32);
-                        }
-                        tracing::warn!(
-                            request_id = %request_id,
-                            operation_id = %uuid,
-                            block_ids = ?block_ids,
-                            "load operation failed; marking blocks as failed"
-                        );
-                    }
-                }
-            }
-        }
-
-        // Return and clear the failed block IDs
-        std::mem::take(&mut self.failed_block_ids)
-    }
+    fn get_block_ids_with_load_errors(&mut self) -> HashSet<u32> {
+        self.absorb_failure_notifications();
+        // Return and clear the failed block IDs
+        std::mem::take(&mut self.failed_block_ids)
+    }

Also applies to: 473-509


