Skip to content

feat(multimodal): configurable tensor transport + vLLM SHM and video#1818

Open
slin1237 wants to merge 3 commits into
mainfrom
feat/multimodal-tensor-transport
Open

feat(multimodal): configurable tensor transport + vLLM SHM and video#1818
slin1237 wants to merge 3 commits into
mainfrom
feat/multimodal-tensor-transport

Conversation

@slin1237

@slin1237 slin1237 commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Description

Problem

The multimodal tensor transport was TokenSpeed-only and tuned exclusively via
SMG_TOKENSPEED_MM_* environment variables. There was no first-class
(CLI/config) way to select it, no per-worker control, and vLLM workers always
received large preprocessed tensors inline over gRPC. vLLM also rejected video
inputs entirely — which is exactly the case where avoiding the gRPC copy matters
most.

Solution

Make the transport engine-agnostic and configurable, extend the shared-memory
(/dev/shm) transport to vLLM, and wire up vLLM video. The multimodal subsystem
receives resolved values and never depends on RouterConfig.

Changes

Transport config

  • New --multimodal-tensor-transport (inline | shm | auto) and
    --multimodal-shm-min-bytes CLI flags + RouterConfig fields, plus per-worker
    WorkerSpec overrides (multimodal_tensor_transport,
    multimodal_shm_min_bytes).
  • Precedence: worker spec → router config → SMG_MM_* env → built-in
    default
    . Legacy SMG_TOKENSPEED_MM_* names are honored as fallback aliases.

vLLM shared memory

  • Hoist ShmHandle / RemoteTensorHandle into common.proto and migrate
    TokenSpeed to them; add the transport oneof to vLLM TensorData (inline
    stays field 1 for wire compatibility).
  • Gateway writes large tensors to /dev/shm via shared, engine-neutral
    mm_shm I/O (smg-mm- prefix, orphan sweep), with build- and send-path
    cleanup mirroring TokenSpeed.
  • vLLM servicer reads payloads via a shared mm_shm helper and advertises
    shm_namespace_id (GetServerInfo) so auto can verify /dev/shm
    co-location.

vLLM video

  • Assemble video inputs for vLLM (is_video) instead of rejecting them; the
    servicer routes the encoder tensor to pixel_values_videos with video field
    configs. (sglang/TRT remain image-only.)

Docs

  • docs/reference/configuration.md updated for the flags, precedence, per-worker
    override, and vLLM support.

Test Plan

Ran locally (macOS):

  • cargo +nightly fmt --all — clean
  • cargo clippy -p smg -p smg-grpc-client -p openai-protocol --all-targets -- -D warnings — clean (default features; --all-features pulls opencv)
  • cargo test -p smg — green (the pre-existing local_shm_namespace_id_resolves_on_linux is Linux-only and skipped on macOS), cargo test -p openai-protocol — green
  • cargo check --workspace --tests, cargo build -p smg-python -p smg-golang — clean
  • Python: mm_shm unit tests (/dev/shm round-trip), servicer py_compile + ruff, proto regen field assertions

New tests:

  • Rust: vLLM into_proto inline / below-threshold / image vs video modality
  • Python: grpc_servicer/tests/test_mm_shm.py (shared SHM reader)
  • E2E (e2e_test/chat_completions/test_multimodal_shm.py): Qwen3-VL image and
    video over vLLM gRPC with --multimodal-tensor-transport shm, asserting correct
    output and that smg_mm_tensors_total{path="shm",runtime="vllm"} increased
    (so a silent inline fallback fails the test). Requires a GPU runner; runs in CI.
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets -- -D warnings passes (default features)
  • Documentation updated

Summary by CodeRabbit

  • New Features
    • Added engine-agnostic multimodal tensor transport (inline, shm, auto) plus --multimodal-shm-min-bytes to control SHM eligibility.
    • Added multimodal routing for both images and videos, including SHM namespace identity reporting (shm_namespace_id) and unified SHM lifecycle cleanup behavior.
    • Introduced gateway-level, per-worker, and environment/CLI precedence with legacy environment alias support.
  • Bug Fixes
    • Improved multimodal SHM cleanup on request/build failures and ensured the decode step strips multimodal inputs.
  • Tests
    • Added end-to-end SHM verification for single-image and single-video requests on vLLM.
  • Documentation
    • Updated configuration reference for both TokenSpeed and vLLM, including precedence and SHM-related environment variables.

@github-actions github-actions Bot added documentation Improvements or additions to documentation grpc gRPC client and router changes tests Test changes protocols Protocols crate changes model-gateway Model gateway crate changes labels Jun 22, 2026
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds shared multimodal tensor transport over shared memory, expands vLLM and TokenSpeed schemas and routing config, threads transport selection through router and proto conversion paths, adds servicer-side SHM readers, and introduces unit and e2e coverage for image and video SHM flows.

Changes

Multimodal SHM Tensor Transport

Layer / File(s) Summary
Shared proto tensor transport contracts
crates/grpc_client/proto/common.proto, crates/grpc_client/proto/tokenspeed_scheduler.proto, crates/grpc_client/proto/vllm_engine.proto
Defines shared ShmHandle and RemoteTensorHandle, rewires TokenSpeed tensor payload variants to use the shared handle types, and extends the vLLM tensor and server-info schema with transport and video fields.
Multimodal transport config surface
model_gateway/src/config/types.rs, model_gateway/src/config/builder.rs, model_gateway/src/main.rs, crates/protocols/src/worker.rs, docs/reference/configuration.md, bindings/python/src/lib.rs, bindings/python/src/smg/router_args.py, crates/grpc_client/python/pyproject.toml, grpc_servicer/pyproject.toml
Adds multimodal transport fields to router config, CLI, Python bindings, worker overrides, package metadata, and documentation, including precedence and SHM threshold behavior.
Transport config resolution and router wiring
model_gateway/src/routers/grpc/multimodal.rs, model_gateway/src/routers/grpc/router.rs, model_gateway/src/routers/grpc/pd_router.rs, model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs, model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs, model_gateway/src/routers/grpc/common/stages/request_execution.rs
Introduces MmTransportConfig, stores it on multimodal components, resolves it during router initialization, threads it into multimodal assembly call sites, and strips multimodal inputs from the vLLM decode leg.
Assembly flow and SHM thresholding
model_gateway/src/routers/grpc/multimodal.rs
Updates multimodal assembly to allow vLLM image or video inputs, computes per-request SHM enablement and byte thresholds, applies the shared SHM threshold to TokenSpeed serialization, and updates assembly tests for the new signatures.
Proto conversion and SHM lifecycle
model_gateway/src/routers/grpc/proto_wrapper.rs, model_gateway/src/routers/grpc/client.rs
Generalizes multimodal proto conversion and SHM lifecycle handling to shared handle types, adds shared payload selection and SHM write helpers, unifies orphan cleanup, adds vLLM request finalization cleanup, updates proto-wrapper tests for SHM threshold and video handling, and switches gRPC client request cleanup to shared multimodal helpers.
Servicer SHM reading and vLLM deserialization
grpc_servicer/smg_grpc_servicer/mm_shm.py, grpc_servicer/smg_grpc_servicer/vllm/servicer.py, grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
Adds mm_shm.py utilities for SHM name validation, byte reads with offset and length checking, optional unlinking, and namespace identity; updates vLLM tensor payload decoding, modality-specific preprocessing, and server-info namespace exposure; updates TokenSpeed SHM handle annotation to the common proto type.
Unit, e2e, and CI validation
grpc_servicer/tests/test_mm_shm.py, e2e_test/chat_completions/test_multimodal_shm.py, scripts/ci_install_e2e_deps.sh
Adds mm_shm unit tests for name validation and payload reads, adds e2e multimodal image and video SHM tests, and ensures ffmpeg is available for video test runs.

Sequence Diagram(s)

sequenceDiagram
  participant Router
  participant MultimodalAssembly
  participant ProtoWrapper
  participant GrpcClient
  participant VllmServicer
  participant mm_shm

  Router->>MultimodalAssembly: assemble_multimodal_data(..., transport)
  MultimodalAssembly->>ProtoWrapper: build TensorData / SHM payloads
  ProtoWrapper->>GrpcClient: generate request with SHM handles
  GrpcClient->>GrpcClient: finish_vllm_request / cleanup_mm_shm_handles on failure
  GrpcClient->>VllmServicer: GenerateRequest with TensorData
  VllmServicer->>mm_shm: tensor_payload_bytes_from_shm(handle)
  mm_shm-->>VllmServicer: bytes
  VllmServicer-->>Router: response stream
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

  • lightseekorg/smg#588: Touches the same gRPC multimodal assembly pipeline and assemble_multimodal_data(...) call chain that this PR extends with transport handling.
  • lightseekorg/smg#611: Shares the same crates/grpc_client/python/pyproject.toml version bump, so it is directly related at the package-metadata level.
  • lightseekorg/smg#1515: Introduced the TokenSpeed multimodal tensor schema that this PR generalizes into shared common handle types.

Suggested labels

multimodal

Suggested reviewers

  • key4ng
  • CatherineSue

Poem

🐇 Through shared memory I hop and zoom,
Inline or SHM, tensors find room.
A ShmHandle here, a video there,
The bunny claps with protocol flair!
No backend bounds can fence me in—
Multimodal hops begin again.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: configurable multimodal tensor transport plus vLLM SHM and video support.
Docstring Coverage ✅ Passed Docstring coverage is 90.65% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/multimodal-tensor-transport

Comment @coderabbitai help to get the list of available commands.

"TokenSpeed multimodal tensor transport configured"
shm_min_bytes,
dev_writable = mm_shm_dev_writable(),
"TokenSpeed multimodal tensor transport configured (global default; worker specs may override)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: This log message still says "TokenSpeed" but log_mm_transport_config_once is now the engine-agnostic config logger (called for vLLM too). Should read something like "Multimodal tensor transport configured (global default; worker specs may override)".

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request unifies and extends the multimodal tensor transport mechanism to support both TokenSpeed and vLLM workers, allowing large tensor payloads to be shared via /dev/shm (shared memory) instead of being sent inline over gRPC. It introduces global configuration options, per-worker overrides in WorkerSpec, and support for video modality in vLLM. Feedback on the changes highlights a security vulnerability in the shared memory filename validation that could allow arbitrary file deletion in /dev/shm, as well as a potential KeyError crash in the vLLM servicer when processing flat keys.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +30 to +35
def validated_shm_name(name: str) -> str:
"""Reject path traversal / absolute names before touching the filesystem."""
name = name.lstrip("/")
if not name or "/" in name or name in (".", "..") or "\x00" in name:
raise ValueError(f"Invalid shm tensor name: {name!r}")
return name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The validated_shm_name function strips leading slashes and checks for path traversal characters, but it does not verify that the shared memory file name starts with the expected prefix (smg-mm- or smg-tokenspeed-). Since the worker may unlink the file after reading it (when UNLINK_MM_SHM_AFTER_READ is enabled), a compromised or malicious router could potentially supply an arbitrary file name in /dev/shm to read or delete it. Restricting the name to the expected prefixes mitigates this risk.

Suggested change
def validated_shm_name(name: str) -> str:
"""Reject path traversal / absolute names before touching the filesystem."""
name = name.lstrip("/")
if not name or "/" in name or name in (".", "..") or "\x00" in name:
raise ValueError(f"Invalid shm tensor name: {name!r}")
return name
def validated_shm_name(name: str) -> str:
"""Reject path traversal / absolute names before touching the filesystem."""
name = name.lstrip("/")
if not name or "/" in name or name in (".", "..") or "\x00" in name:
raise ValueError(f"Invalid shm tensor name: {name!r}")
if not (name.startswith("smg-mm-") or name.startswith("smg-tokenspeed-")):
raise ValueError(f"Invalid shm tensor name prefix: {name!r}")
return name

Comment on lines 674 to 677
elif key in flat:
sizes_key = flat[key]
if sizes_key not in flat_sizes_cache:
flat_sizes_cache[sizes_key] = hf_dict[sizes_key].flatten().to(torch.int64)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If a client sends a flat_keys map containing a sizes_key that is not present in model_specific_tensors (and thus not in hf_dict), accessing hf_dict[sizes_key] will raise a KeyError. Checking if sizes_key exists in hf_dict and raising a clear ValueError prevents an unexpected crash and provides better error reporting.

            elif key in flat:
                sizes_key = flat[key]
                if sizes_key not in hf_dict:
                    raise ValueError(f"Flat sizes key {sizes_key!r} for {key!r} not found in model_specific_tensors")
                if sizes_key not in flat_sizes_cache:
                    flat_sizes_cache[sizes_key] = hf_dict[sizes_key].flatten().to(torch.int64)

/// for this worker (e.g. force `shm` for a co-located worker, `inline` for a
/// remote one).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub multimodal_tensor_transport: Option<String>,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: This field accepts any string without validation. A typo like "smh" instead of "shm" silently falls back to inline (the other arm in resolve_mm_shm_enabled logs a warning, but only once via OnceLock — so the second misconfigured worker is entirely silent). Consider either a serde deserialize validation or an enum type to catch invalid values at worker registration time rather than at request time. The CLI flag already validates with value_parser = ["inline", "shm", "auto"]; the API path doesn't get the same protection.

continue;
};
let Some(rest) = name.strip_prefix("smg-tokenspeed-") else {
let Some(rest) = name.strip_prefix(MM_SHM_NAME_PREFIX) else {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: The prefix changed from "smg-tokenspeed-" to "smg-mm-", so the orphan sweep will no longer clean up SHM files left behind by a previous SMG version that crashed before its consumer could unlink them. During a rolling upgrade, any smg-tokenspeed-* crash orphans in /dev/shm will persist until the next reboot. Consider also sweeping the legacy prefix here (same dead-pid logic applies).

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-structured PR. The engine-agnostic refactor of the SHM transport is consistent, the vLLM video wiring is correct, and the SHM lifecycle management (build-path + send-path cleanup) properly mirrors the existing TokenSpeed pattern. Three nits posted — no blocking issues.

Summary: 0 🔴 Important · 3 🟡 Nit · 0 🟣 Pre-existing

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 11e5b227f2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

kv_role=kv_role,
kv_engine_id=kv_engine_id,
data_parallel_size=parallel.data_parallel_size,
shm_namespace_id=mm_shm.shm_namespace_id(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pin vLLM servicer to the new proto package

When the vLLM servicer is installed or published without also installing the regenerated smg-grpc-proto, grpc_servicer/pyproject.toml still allows the existing smg-grpc-proto>=0.4.11, whose GetServerInfoResponse has no shm_namespace_id field and whose TensorData has no payload oneof. In that environment this constructor rejects the unknown keyword during worker discovery, so vLLM gRPC workers fail to start/register before any request can run; the repo’s development guide says proto+servicer field changes need a matching proto release/bump or split PR.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
grpc_servicer/smg_grpc_servicer/vllm/servicer.py (1)

697-718: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject invalid placeholder ranges before creating PlaceholderRange.

Lines 706-718 accept placeholder ranges without bounds validation. Out-of-range or zero-length entries can silently create invalid mask/range state and misalign multimodal embedding placement.

Suggested fix
             placeholders = []
             for p in mm_proto.mm_placeholders:
+                if p.length <= 0:
+                    raise ValueError("Multimodal placeholder length must be > 0")
+                if p.offset + p.length > len(prompt_token_ids):
+                    raise ValueError(
+                        f"Multimodal placeholder out of bounds: "
+                        f"offset={p.offset}, length={p.length}, prompt_len={len(prompt_token_ids)}"
+                    )
                 is_embed = None
                 if prompt_ids_tensor is not None:
                     mask = prompt_ids_tensor[p.offset : p.offset + p.length] == pad_token_id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 697 - 718, Add
bounds validation for each placeholder entry in the mm_proto.mm_placeholders
loop before creating the PlaceholderRange object. Validate that p.offset is
non-negative, p.length is positive, and that the range (p.offset + p.length)
does not exceed the length of prompt_token_ids. Skip invalid entries or raise an
appropriate error to prevent out-of-bounds placeholder ranges from being
processed, ensuring correct multimodal embedding alignment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 88-94: The _tensor_from_proto function does not validate that the
payload byte length matches the expected tensor size before calling reshape,
which can cause a cryptic runtime error when malformed inputs are provided. Add
validation logic before the reshape operation that calculates the expected byte
size based on the tensor dtype and shape (using torch.tensor with the dtype and
computing the total element count), compares it with the actual payload byte
length, and raises a clear ValueError if they do not match.

---

Outside diff comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 697-718: Add bounds validation for each placeholder entry in the
mm_proto.mm_placeholders loop before creating the PlaceholderRange object.
Validate that p.offset is non-negative, p.length is positive, and that the range
(p.offset + p.length) does not exceed the length of prompt_token_ids. Skip
invalid entries or raise an appropriate error to prevent out-of-bounds
placeholder ranges from being processed, ensuring correct multimodal embedding
alignment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0e95bac5-caa2-49d9-8362-d2329f28c5af

📥 Commits

Reviewing files that changed from the base of the PR and between 6cf37bd and 11e5b22.

⛔ Files ignored due to path filters (1)
  • e2e_test/fixtures/videos/dog.mp4 is excluded by !**/*.mp4
📒 Files selected for processing (20)
  • crates/grpc_client/proto/common.proto
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/protocols/src/worker.rs
  • docs/reference/configuration.md
  • e2e_test/chat_completions/test_multimodal_shm.py
  • grpc_servicer/smg_grpc_servicer/mm_shm.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py
  • grpc_servicer/tests/test_mm_shm.py
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/router.rs

Comment thread grpc_servicer/smg_grpc_servicer/vllm/servicer.py
@slin1237 slin1237 force-pushed the feat/multimodal-tensor-transport branch from 11e5b22 to 63e88d5 Compare June 23, 2026 06:35
@slin1237 slin1237 requested a review from gongwei-130 as a code owner June 23, 2026 06:35
@github-actions github-actions Bot added python-bindings Python bindings changes dependencies Dependency updates labels Jun 23, 2026
@slin1237

Copy link
Copy Markdown
Collaborator Author

Addressed all review feedback and rebased on main.

Root cause of the vLLM e2e failure (launch_router.py: unrecognized arguments: --multimodal-tensor-transport): the new flags weren't wired through the Python bindings. Added multimodal_tensor_transport / multimodal_shm_min_bytes to RouterArgs (argparse + dataclass) and the PyO3 Router (to_router_config), so the flags work via the Python launcher — and the e2e gateway can set them.

Review comments

  • 🔴 SHM filename security (gemini): mm_shm.validated_shm_name now restricts names to the smg-mm- / smg-tokenspeed- prefixes, preventing arbitrary /dev/shm read/unlink.
  • 🟠 flat sizes_key KeyError (gemini): the vLLM servicer raises a clear ValueError when a flat sizes key is absent from model_specific_tensors.
  • 🟠 byte-length validation (CodeRabbit): _tensor_from_proto checks payload length against dtype×shape before reshape.
  • 🟠 placeholder bounds (CodeRabbit): out-of-bounds / zero-length placeholder ranges are rejected before mask building.
  • 🟠 proto release (Codex P1): bumped smg-grpc-proto 0.4.11 → 0.4.12 and the servicer dependency floor, so the new fields can't resolve against an old proto package.
  • 🟡 orphan sweep (Claude): also reclaims legacy smg-tokenspeed-* crash files (rolling-upgrade safety).
  • 🟡 log message (Claude): dropped "TokenSpeed" from the now-engine-agnostic transport-config log.
  • 🟡 unknown-transport warning (Claude): dedupes per distinct value instead of once-ever, so a second worker misconfigured with a different typo still warns. (Kept WorkerSpec as a plain serde type rather than introducing an enum; the CLI still validates via value_parser.)

pre-commit: ran ruff format.

Rebased onto main (b2af3740). Local gate green: cargo +nightly fmt, clippy -D warnings (default features), cargo test -p smg (1100 lib + integration), smg-python/smg-golang builds, mm_shm unit tests, ruff.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63e88d5b54

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread bindings/python/src/lib.rs Outdated
Comment on lines +805 to +806
multimodal_tensor_transport = None,
multimodal_shm_min_bytes = None,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Append Python constructor args to preserve positions

Because these two new parameters are inserted before dp_aware in the PyO3 signature, existing positional calls to smg.smg_rs.Router/_Router that were valid before now bind the old dp_aware argument to multimodal_tensor_transport and shift every following argument; PyO3 will either reject the type or configure unrelated fields incorrectly. The same signature already appends health_check_port last to avoid this exact break, so these options should also be appended or made keyword-only.

Useful? React with 👍 / 👎.

*WRITABLE.get_or_init(|| {
let name = format!("smg-tokenspeed-probe-{}", process::id());
let path = tokenspeed_shm_path(&name);
let name = format!("smg-mm-probe-{}", process::id());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make the SHM writability probe name unique

In deployments where multiple gateway processes share /dev/shm but run in separate PID namespaces, they can all have the same process::id() (commonly PID 1), so this fixed probe name collides. Since the probe uses create_new and caches the result in OnceLock, a concurrent or stale smg-mm-probe-1 file makes mm_shm_dev_writable() permanently return false for that process, causing shm/auto transport to fall back to inline even though the real per-payload names would be writable.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

1184-1214: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

TokenSpeed SHM validation still allows non-SMG namespaces.

This path still accepts any top-level /dev/shm name that passes traversal checks, then reads/unlinks it. Please enforce the same prefix allowlist (smg-mm- / smg-tokenspeed-) used by the shared mm_shm helper to avoid arbitrary /dev/shm targeting.

🔒 Suggested fix
     `@staticmethod`
     def _validated_shm_name(name: str) -> str:
         name = name.lstrip("/")
         if not name or "/" in name or name in (".", "..") or "\x00" in name:
             raise ValueError(f"Invalid TensorData.shm name: {name!r}")
+        if not name.startswith(("smg-mm-", "smg-tokenspeed-")):
+            raise ValueError(f"TensorData.shm name outside allowed namespace: {name!r}")
         return name
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 1184 -
1214, The _validated_shm_name method only checks for path traversal
vulnerabilities but does not enforce the required namespace prefix allowlist,
allowing access to arbitrary shared memory files in /dev/shm. Modify the
_validated_shm_name method to add a prefix validation check that ensures the
name starts with one of the allowed prefixes (such as smg-mm- or
smg-tokenspeed-) before allowing the name. If the name does not match the
allowed prefixes, raise a ValueError with a descriptive error message indicating
the invalid namespace prefix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bindings/python/src/smg/router_args.py`:
- Around line 501-505: The add_argument call for --multimodal-shm-min-bytes uses
type=int without validation, allowing negative values which are invalid for a
minimum bytes threshold. Replace type=int with a custom type validation function
that converts the input to an integer and rejects negative values by raising an
ArgumentTypeError, ensuring only non-negative integers are accepted for this
argument.

---

Outside diff comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 1184-1214: The _validated_shm_name method only checks for path
traversal vulnerabilities but does not enforce the required namespace prefix
allowlist, allowing access to arbitrary shared memory files in /dev/shm. Modify
the _validated_shm_name method to add a prefix validation check that ensures the
name starts with one of the allowed prefixes (such as smg-mm- or
smg-tokenspeed-) before allowing the name. If the name does not match the
allowed prefixes, raise a ValueError with a descriptive error message indicating
the invalid namespace prefix.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fa4dc311-16d3-4f9a-9d20-cbcee6fbb6b7

📥 Commits

Reviewing files that changed from the base of the PR and between 11e5b22 and 63e88d5.

⛔ Files ignored due to path filters (1)
  • e2e_test/fixtures/videos/dog.mp4 is excluded by !**/*.mp4
📒 Files selected for processing (24)
  • bindings/python/src/lib.rs
  • bindings/python/src/smg/router_args.py
  • crates/grpc_client/proto/common.proto
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/grpc_client/python/pyproject.toml
  • crates/protocols/src/worker.rs
  • docs/reference/configuration.md
  • e2e_test/chat_completions/test_multimodal_shm.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/mm_shm.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py
  • grpc_servicer/tests/test_mm_shm.py
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/router.rs

Comment thread bindings/python/src/smg/router_args.py
@slin1237 slin1237 force-pushed the feat/multimodal-tensor-transport branch from 63e88d5 to 5623b42 Compare June 23, 2026 17:57
}
}

fn log_tokenspeed_mm_timing_enabled() -> bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: This function was missed in the tokenspeed_mm_ rename sweep. It's now the sole caller from the engine-agnostic mm_tensor_payload (line 380), so the tokenspeed in its name is misleading — renaming to log_mm_timing_enabled would keep it consistent with the rest of the refactor.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5623b42270

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +56 to +58
if UNLINK_MM_SHM_AFTER_READ:
try:
os.unlink(path)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep SHM segments alive when PD decode reuses them

When vLLM sequential PD handles a multimodal request with SHM enabled, execute_sequential_pd clones the request for prefill and then reuses the original proto_request for decode without clearing mm_inputs (model_gateway/src/routers/grpc/common/stages/request_execution.rs:571-573), so both legs carry the same SHM handle. This unlink runs after the prefill servicer reads the tensor, causing the decode servicer to open a missing /dev/shm file and fail; either avoid sending the multimodal payload to decode on that path or defer cleanup until all consumers are done.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

1186-1214: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Harden TokenSpeed SHM name validation to the same namespace allowlist.

Line 1210’s local validator accepts any /dev/shm basename, so a crafted shm_handle.name can still read/unlink non-transport entries. This bypasses the new prefix-boundary hardening introduced in mm_shm.validated_shm_name. Reuse the shared validator (or shared read helper) here to keep both engines under the same security contract.

Suggested minimal fix
@@
     `@staticmethod`
     def _validated_shm_name(name: str) -> str:
-        name = name.lstrip("/")
-        if not name or "/" in name or name in (".", "..") or "\x00" in name:
-            raise ValueError(f"Invalid TensorData.shm name: {name!r}")
-        return name
+        return mm_shm.validated_shm_name(name)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 1186 -
1214, The local _validated_shm_name method uses permissive validation that only
checks for basic invalid characters and patterns, but it allows reading or
unlinking non-transport SHM entries that should be protected. Replace the call
to the local _validated_shm_name validator in the function that reads shared
memory (around line 1186) with a call to the shared validator
mm_shm.validated_shm_name that includes proper prefix-boundary hardening. This
ensures both the TokenSpeed servicer and other engines enforce the same strict
security contract for SHM access, preventing crafted shm_handle names from
bypassing namespace restrictions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bindings/python/src/lib.rs`:
- Around line 805-806: The new constructor parameters
multimodal_tensor_transport and multimodal_shm_min_bytes are being inserted in
the middle of the _Router PyO3 constructor parameter list at lines 805-806 and
927-928, which breaks backward compatibility for existing Python callers using
positional arguments. Move these new parameters to the end of the constructor
parameter list instead of inserting them in the middle, ensuring all existing
positional argument indices remain unchanged.

In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 481-484: The mm_shm_min_bytes_env function returns an untrimmed
string from env_first which causes integer parsing to fail silently when
whitespace is present in the environment variable value. To fix this, call
trim() on the value returned by env_first before attempting to parse it as a
usize, ensuring that values with leading or trailing whitespace like " 65536 "
are correctly parsed and honored instead of being ignored.

---

Outside diff comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 1186-1214: The local _validated_shm_name method uses permissive
validation that only checks for basic invalid characters and patterns, but it
allows reading or unlinking non-transport SHM entries that should be protected.
Replace the call to the local _validated_shm_name validator in the function that
reads shared memory (around line 1186) with a call to the shared validator
mm_shm.validated_shm_name that includes proper prefix-boundary hardening. This
ensures both the TokenSpeed servicer and other engines enforce the same strict
security contract for SHM access, preventing crafted shm_handle names from
bypassing namespace restrictions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e5e23567-d60d-4af3-855f-874d563a2876

📥 Commits

Reviewing files that changed from the base of the PR and between 63e88d5 and 5623b42.

⛔ Files ignored due to path filters (1)
  • e2e_test/fixtures/videos/dog.mp4 is excluded by !**/*.mp4
📒 Files selected for processing (25)
  • bindings/python/src/lib.rs
  • bindings/python/src/smg/router_args.py
  • crates/grpc_client/proto/common.proto
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/grpc_client/python/pyproject.toml
  • crates/protocols/src/worker.rs
  • docs/reference/configuration.md
  • e2e_test/chat_completions/test_multimodal_shm.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/mm_shm.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py
  • grpc_servicer/tests/test_mm_shm.py
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/router.rs
  • scripts/ci_install_e2e_deps.sh

Comment thread bindings/python/src/lib.rs Outdated
Comment thread model_gateway/src/routers/grpc/proto_wrapper.rs
Make the multimodal tensor transport engine-agnostic and configurable, extend
the shared-memory transport (previously TokenSpeed-only) to vLLM, and add vLLM
video inputs.

Transport config:
- New --multimodal-tensor-transport (inline|shm|auto) and
  --multimodal-shm-min-bytes CLI flags + RouterConfig fields, plus per-worker
  WorkerSpec overrides. Precedence: worker spec -> router config -> SMG_MM_*
  env (legacy SMG_TOKENSPEED_MM_* honored as aliases) -> built-in default. The
  multimodal subsystem receives resolved values and never depends on
  RouterConfig.

vLLM shared memory:
- Hoist ShmHandle/RemoteTensorHandle to common.proto and migrate TokenSpeed to
  them; add the transport oneof to vLLM TensorData. The gateway writes large
  tensors to /dev/shm (shared mm_shm I/O, smg-mm- prefix) with build- and
  send-path cleanup; the vLLM servicer reads them via a shared mm_shm helper and
  advertises shm_namespace_id so `auto` can verify co-location.

vLLM video:
- Assemble video inputs for vLLM (is_video) instead of rejecting them; the
  servicer routes the encoder tensor to pixel_values_videos with `video` field
  configs. Adds an e2e test running Qwen3-VL video over SHM with a committed
  fixture, asserting the SHM transport is exercised.

Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
@slin1237 slin1237 force-pushed the feat/multimodal-tensor-transport branch from 5623b42 to a2d7a08 Compare June 24, 2026 14:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

1183-1214: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Use the shared SHM validator here too.

This TokenSpeed path still accepts any basename under /dev/shm, while grpc_servicer/smg_grpc_servicer/mm_shm.py now restricts handles to the smg-mm- / smg-tokenspeed- namespaces. That leaves TokenSpeed able to read or unlink arbitrary /dev/shm entries from a crafted request, so the new boundary is only enforced on the vLLM side.

Suggested fix
     def _tensor_payload_bytes_from_shm(
         shm_handle: common_pb2.ShmHandle,
     ) -> bytes:
-        name = TokenSpeedSchedulerServicer._validated_shm_name(shm_handle.name)
-
-        path = os.path.join("/dev/shm", name)
-        fd = None
-        try:
-            fd = os.open(path, os.O_RDONLY)
-            raw = os.pread(fd, int(shm_handle.nbytes), int(shm_handle.offset))
-        finally:
-            if fd is not None:
-                os.close(fd)
-            if fd is not None and UNLINK_MM_SHM_AFTER_READ:
-                try:
-                    os.unlink(path)
-                except FileNotFoundError:
-                    pass
-
-        if len(raw) != int(shm_handle.nbytes):
-            raise ValueError(
-                f"TensorData.shm byte length mismatch for name={shm_handle.name!r}: "
-                f"expected {int(shm_handle.nbytes)}, got {len(raw)}"
-            )
-        return raw
+        return mm_shm.tensor_payload_bytes_from_shm(shm_handle)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 1183 -
1214, The TokenSpeed shared-memory read path bypasses the centralized SHM
namespace validation, so `_tensor_payload_bytes_from_shm` can still open and
unlink arbitrary `/dev/shm` entries. Update this flow to reuse the shared
validator from `grpc_servicer/smg_grpc_servicer/mm_shm.py` (or the same
namespace-checking logic) before `os.open` and `os.unlink`, and keep the
existing `TokenSpeedSchedulerServicer._validated_shm_name` call only if it
enforces the same `smg-mm-` / `smg-tokenspeed-` restrictions.
♻️ Duplicate comments (1)
bindings/python/src/smg/router_args.py (1)

501-505: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject negative --multimodal-shm-min-bytes values.

Line 503 still accepts negative integers, but this value is later consumed by the Rust config path as Option<usize>. That means an invalid CLI value is accepted here and only fails later at the Python→Rust boundary instead of being rejected immediately.

Suggested fix
+        def _non_negative_int(value: str) -> int:
+            parsed = int(value)
+            if parsed < 0:
+                raise argparse.ArgumentTypeError(
+                    "--multimodal-shm-min-bytes must be >= 0"
+                )
+            return parsed
+
         multimodal_group.add_argument(
             f"--{prefix}multimodal-shm-min-bytes",
-            type=int,
+            type=_non_negative_int,
             default=None,
             help="Minimum multimodal tensor size (bytes) before the SHM transport is used",
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bindings/python/src/smg/router_args.py` around lines 501 - 505, Reject
invalid negative values for the multimodal SHM threshold in the router argument
parser. Update the argument definition in the multimodal group inside the
router_args.py parser setup so the --multimodal-shm-min-bytes option validates
non-negative integers before they are passed onward, instead of allowing them to
reach the Python→Rust boundary as an Option[usize]. Use the existing multimodal
argument registration path and its type/validation handling to enforce this
immediately at parse time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/mm_shm.py`:
- Around line 45-58: The SHM read path in tensor_payload_bytes_from_shm still
follows symlinks, so a crafted shared-memory name can escape the intended
namespace. Update tensor_payload_bytes_from_shm to open the resolved path with
symlink-safe flags and/or validate the opened file descriptor refers to a
regular file under the expected shm directory before reading; keep the fix
localized to tensor_payload_bytes_from_shm and the validated_shm_name/path
handling around os.open, os.pread, and os.unlink.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 1753-1765: The worker override handling in the multimodal
transport resolver should not let an invalid
WorkerSpec.multimodal_tensor_transport value suppress a valid global setting.
Update the mode selection around worker_transport_mode_override() and the match
on mode in the multimodal transport path so unknown worker values are logged via
log_unknown_mm_transport_once() and then ignored, allowing the code to fall back
to the resolved global transport from transport / DEFAULT_MM_TENSOR_TRANSPORT
instead of immediately returning false.

In `@scripts/ci_install_e2e_deps.sh`:
- Around line 20-27: The ffmpeg dependency check in ci_install_e2e_deps.sh is
swallowing install failures, which lets the E2E job continue and fail later in
the test phase. Update the ffmpeg installation path so it remains fatal when
apt-get install or the apt-get check fails, or explicitly set a flag that the
video test flow can read to skip those tests. Keep the change centered around
the ffmpeg/apt-get branch in the CI install script so the job behavior is clear
and actionable.

---

Outside diff comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 1183-1214: The TokenSpeed shared-memory read path bypasses the
centralized SHM namespace validation, so `_tensor_payload_bytes_from_shm` can
still open and unlink arbitrary `/dev/shm` entries. Update this flow to reuse
the shared validator from `grpc_servicer/smg_grpc_servicer/mm_shm.py` (or the
same namespace-checking logic) before `os.open` and `os.unlink`, and keep the
existing `TokenSpeedSchedulerServicer._validated_shm_name` call only if it
enforces the same `smg-mm-` / `smg-tokenspeed-` restrictions.

---

Duplicate comments:
In `@bindings/python/src/smg/router_args.py`:
- Around line 501-505: Reject invalid negative values for the multimodal SHM
threshold in the router argument parser. Update the argument definition in the
multimodal group inside the router_args.py parser setup so the
--multimodal-shm-min-bytes option validates non-negative integers before they
are passed onward, instead of allowing them to reach the Python→Rust boundary as
an Option[usize]. Use the existing multimodal argument registration path and its
type/validation handling to enforce this immediately at parse time.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dedf9c64-b5b8-4113-a8a7-212353691222

📥 Commits

Reviewing files that changed from the base of the PR and between 5623b42 and a2d7a08.

⛔ Files ignored due to path filters (1)
  • e2e_test/fixtures/videos/dog.mp4 is excluded by !**/*.mp4
📒 Files selected for processing (25)
  • bindings/python/src/lib.rs
  • bindings/python/src/smg/router_args.py
  • crates/grpc_client/proto/common.proto
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/grpc_client/python/pyproject.toml
  • crates/protocols/src/worker.rs
  • docs/reference/configuration.md
  • e2e_test/chat_completions/test_multimodal_shm.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/mm_shm.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py
  • grpc_servicer/tests/test_mm_shm.py
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/router.rs
  • scripts/ci_install_e2e_deps.sh

Comment thread grpc_servicer/smg_grpc_servicer/mm_shm.py
Comment thread model_gateway/src/routers/grpc/multimodal.rs
Comment on lines +20 to +27
if ! command -v ffmpeg >/dev/null 2>&1; then
if command -v apt-get >/dev/null 2>&1; then
echo "Installing ffmpeg (video decode backend)..."
apt-get update -qq && apt-get install -y --no-install-recommends ffmpeg \
|| echo "WARNING: ffmpeg install failed; multimodal video e2e tests will fail" >&2
else
echo "WARNING: ffmpeg + apt-get unavailable; multimodal video e2e tests will fail" >&2
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t hide ffmpeg install failures from the E2E job.

The new video test does not skip when ffmpeg is missing, so continuing here only pushes the failure into the test phase with a less actionable error. If this script is the CI dependency gate for that suite, the install needs to stay fatal there or set a flag that the video tests can skip on.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ci_install_e2e_deps.sh` around lines 20 - 27, The ffmpeg dependency
check in ci_install_e2e_deps.sh is swallowing install failures, which lets the
E2E job continue and fail later in the test phase. Update the ffmpeg
installation path so it remains fatal when apt-get install or the apt-get check
fails, or explicitly set a flag that the video test flow can read to skip those
tests. Keep the change centered around the ffmpeg/apt-get branch in the CI
install script so the job behavior is clear and actionable.

- e2e: skip the vLLM video SHM test when ffmpeg is unavailable instead of
  failing (an apt install flake was the only red CI check)
- servicer: open SHM payloads with O_NOFOLLOW so a symlink planted at a
  validated name in world-writable /dev/shm can't redirect the read/unlink
- gateway: clear mm_inputs on the sequential-PD decode leg so it no longer
  re-reads an SHM segment the prefill servicer already unlinked
  (mirrors execute_dual_dispatch)
- gateway: make the /dev/shm writability probe name unique (pid+nanos) to
  avoid spurious failure across PID namespaces that share /dev/shm
- gateway: an invalid per-worker transport override now falls through to the
  global mode instead of silently forcing inline
- bindings: append multimodal_* constructor args last to preserve PyO3
  positional compatibility; reject negative --multimodal-shm-min-bytes
- proto_wrapper: finish the rename sweep (log_mm_timing_enabled)

Signed-off-by: Simo Lin <linsimo.mark@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
grpc_servicer/smg_grpc_servicer/mm_shm.py (1)

45-68: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Reject oversized SHM handles before calling os.pread.

This helper still trusts shm_handle.nbytes for the read size. In grpc_servicer/smg_grpc_servicer/vllm/servicer.py:87-104, _tensor_from_proto() only computes the dtype/shape-based expected byte count after calling into this helper, so a malformed handle can force an arbitrarily large read/allocation and take down the servicer before the later length check runs.

Suggested fix
-def tensor_payload_bytes_from_shm(shm_handle, shm_dir: str = DEFAULT_SHM_DIR) -> bytes:
+def tensor_payload_bytes_from_shm(
+    shm_handle, expected_nbytes: int, shm_dir: str = DEFAULT_SHM_DIR
+) -> bytes:
     """Read a tensor payload the gateway wrote to ``shm_dir`` for ``shm_handle``."""
     name = validated_shm_name(shm_handle.name)
     path = os.path.join(shm_dir, name)
+    actual_nbytes = int(shm_handle.nbytes)
+    if actual_nbytes != expected_nbytes:
+        raise ValueError(
+            f"shm tensor byte length mismatch for name={shm_handle.name!r}: "
+            f"expected {expected_nbytes}, got {actual_nbytes}"
+        )
     fd = None
     try:
         # O_NOFOLLOW: /dev/shm is world-writable, so a same-host attacker could
         # plant a symlink at the (validated) name pointing at an arbitrary file;
         # refuse to follow it so a crafted handle can't read/unlink outside SHM.
         fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
-        raw = os.pread(fd, int(shm_handle.nbytes), int(shm_handle.offset))
+        raw = os.pread(fd, expected_nbytes, int(shm_handle.offset))
@@
-    if len(raw) != int(shm_handle.nbytes):
+    if len(raw) != expected_nbytes:
         raise ValueError(
             f"shm tensor byte length mismatch for name={shm_handle.name!r}: "
-            f"expected {int(shm_handle.nbytes)}, got {len(raw)}"
+            f"expected {expected_nbytes}, got {len(raw)}"
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/mm_shm.py` around lines 45 - 68, The shm read
helper in tensor_payload_bytes_from_shm still trusts shm_handle.nbytes before
os.pread, so reject malformed or oversized handles up front instead of reading
first. Add a pre-read size validation in tensor_payload_bytes_from_shm using the
expected tensor byte count derived from the handle metadata, and fail fast
before os.pread when the requested size is invalid; then keep the existing
post-read length check as a safety net. Use the existing
tensor_payload_bytes_from_shm and _tensor_from_proto flow to locate the change,
and make sure the validation happens before any allocation or read is attempted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/request_execution.rs`:
- Around line 571-577: The decode request in request_execution.rs is clearing
multimodal inputs unconditionally, which breaks cases where relay_kv_params is
false and decode must recompute the prompt locally. Update the logic around
decode_request.clear_mm_inputs() so mm_inputs are only stripped on the remote-KV
path, and keep them intact when local recompute is required. Use the existing
decode path in execute_dual_dispatch and the relay_kv_params gating to either
preserve multimodal tensors or route multimodal n > 1 requests through a safe
inline/reject path before SHM is unlinked.

---

Outside diff comments:
In `@grpc_servicer/smg_grpc_servicer/mm_shm.py`:
- Around line 45-68: The shm read helper in tensor_payload_bytes_from_shm still
trusts shm_handle.nbytes before os.pread, so reject malformed or oversized
handles up front instead of reading first. Add a pre-read size validation in
tensor_payload_bytes_from_shm using the expected tensor byte count derived from
the handle metadata, and fail fast before os.pread when the requested size is
invalid; then keep the existing post-read length check as a safety net. Use the
existing tensor_payload_bytes_from_shm and _tensor_from_proto flow to locate the
change, and make sure the validation happens before any allocation or read is
attempted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 66d9fa8b-3d92-4daa-b61d-40b900bd364c

📥 Commits

Reviewing files that changed from the base of the PR and between a2d7a08 and 210f58b.

📒 Files selected for processing (7)
  • bindings/python/src/lib.rs
  • bindings/python/src/smg/router_args.py
  • e2e_test/chat_completions/test_multimodal_shm.py
  • grpc_servicer/smg_grpc_servicer/mm_shm.py
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment on lines +571 to +577
// Decode keeps the prefill request_id (load-bearing for NIXL P/D
// correlation on vLLM < 0.13). Strip multimodal data: the decode worker
// only needs the transferred KV cache, not the pixel tensors, and reusing
// an SHM handle would re-read a segment the prefill servicer already
// unlinked. Mirrors execute_dual_dispatch.
let mut decode_request = proto_request;
decode_request.clear_mm_inputs();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Don’t strip multimodal inputs when decode must recompute the prompt.

When relay_kv_params is false, the NIXL path skips KV relay and expects decode to recompute the prompt locally. Clearing mm_inputs unconditionally here leaves multimodal decode requests without the image/video tensors needed for that recompute. Gate this to the remote-KV path, or force a safe inline/reject path for multimodal n > 1 before prefill unlinks SHM handles.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/common/stages/request_execution.rs` around
lines 571 - 577, The decode request in request_execution.rs is clearing
multimodal inputs unconditionally, which breaks cases where relay_kv_params is
false and decode must recompute the prompt locally. Update the logic around
decode_request.clear_mm_inputs() so mm_inputs are only stripped on the remote-KV
path, and keep them intact when local recompute is required. Use the existing
decode path in execute_dual_dispatch and the relay_kv_params gating to either
preserve multimodal tensors or route multimodal n > 1 requests through a safe
inline/reject path before SHM is unlinked.

…e mm

Review follow-up: the previous commit cleared mm_inputs on the sequential-PD
decode leg, which breaks the case where decode recomputes the prompt locally
(e.g. n>1 NIXL, or when KV relay is skipped) — it would lose the pixel tensors.

Instead, force inline multimodal transport for disaggregated (Dual) requests in
resolve_mm_shm_enabled and stop stripping mm_inputs on decode. SHM is unsafe for
PD regardless of the strip: it is single-consumer (the servicer unlinks each
segment after reading) while a PD request has two legs. Inline keeps both the
resume-from-KV and recompute paths correct; SHM stays available for single-worker
requests. Addresses coderabbit (recompute) and codex P2 (PD SHM lifecycle).

Signed-off-by: Simo Lin <linsimo.mark@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c72d3da8d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

GrpcClient::Vllm(_) => {
ensure_image_only(&precomputed, "vLLM")?;
Ok(MultimodalData::Vllm(assemble_vllm(precomputed)))
ensure_image_or_video(&precomputed, "vLLM")?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use video flat sizes before accepting vLLM video

When this starts routing video payloads to vLLM, Qwen3-VL video requests still carry the image flat-layout key: qwen3_vl.rs declares pixel_values as FieldLayout::flat("patches_per_image"), while the video preprocessor emits patches_per_video (qwen_vl_base.rs). The vLLM servicer then rewrites pixel_values to pixel_values_videos but keeps the stale patches_per_image sizes key and aborts with Flat sizes key 'patches_per_image' ... not found, so the newly added vLLM video path fails for the advertised Qwen3-VL video case unless the layout is made modality-specific before enabling video here.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency updates documentation Improvements or additions to documentation grpc gRPC client and router changes model-gateway Model gateway crate changes protocols Protocols crate changes python-bindings Python bindings changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant