perf(multimodal): reduce video decode, Qwen preprocess, and TokenSpeed handoff overhead by yechank-nvidia · Pull Request #1820 · lightseekorg/smg

yechank-nvidia · 2026-06-23T12:51:15Z

Description

Problem

Multimodal video requests still spend too much time outside model execution:

- Video decode overhead. OpenCV was seeking for every sampled frame, which flushes/re-seeks the decoder repeatedly.
- Qwen video preprocessing overhead. Decoded RGB video frames were often materialized through DynamicImage and extra tensor copies before patchification.
- TokenSpeed handoff overhead. Large video encoder inputs were still expensive to serialize and pass inline.

Solution

- Decode sampled video frames by sequentially grab()-ing intervening frames and read()-ing only sampled frames.
- Add borrowed RGB video preprocessing paths for Qwen processors, including direct RGB patchification and parallel work for larger inputs.
- Pack TokenSpeed multimodal encoder inputs into offset SHM segments for video tensors, with inline fallback.
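
To make the first solution item concrete, here is a minimal sketch of sequential sampling. It does not use the real OpenCV bindings: `FrameSource`, `MockSource`, and `sample_sequential` are hypothetical names, and the trait only assumes OpenCV-like semantics where `grab` advances one frame without retrieving pixels and `read` advances one frame and returns it.

```rust
/// Hypothetical stand-in for a decoder with OpenCV-like semantics.
pub trait FrameSource {
    /// Advance one frame without retrieving pixels. Returns false at end of stream.
    fn grab(&mut self) -> bool;
    /// Advance one frame and return its pixels. Returns None at end of stream.
    fn read(&mut self) -> Option<Vec<u8>>;
}

/// Decode only the frames listed in `sampled` (strictly increasing indices),
/// stepping over the frames in between with `grab` instead of seeking.
pub fn sample_sequential<S: FrameSource>(
    src: &mut S,
    sampled: &[usize],
) -> Result<Vec<Vec<u8>>, String> {
    let mut pos = 0usize; // index of the next frame the decoder will produce
    let mut frames = Vec::with_capacity(sampled.len());
    for &target in sampled {
        if target < pos {
            return Err(format!("indices must be strictly increasing; got {target}"));
        }
        while pos < target {
            // Surface grab failures instead of silently drifting out of sync.
            if !src.grab() {
                return Err(format!("grab failed at frame {pos}"));
            }
            pos += 1;
        }
        let frame = src
            .read()
            .ok_or_else(|| format!("read failed at frame {target}"))?;
        pos += 1;
        frames.push(frame);
    }
    Ok(frames)
}

/// In-memory source used only to exercise the sketch; each frame is one byte
/// holding its own index.
pub struct MockSource {
    pub total: usize,
    pub next: usize,
    pub grabs: usize,
    pub reads: usize,
}

impl FrameSource for MockSource {
    fn grab(&mut self) -> bool {
        if self.next >= self.total {
            return false;
        }
        self.next += 1;
        self.grabs += 1;
        true
    }

    fn read(&mut self) -> Option<Vec<u8>> {
        if self.next >= self.total {
            return None;
        }
        let frame = vec![self.next as u8];
        self.next += 1;
        self.reads += 1;
        Some(frame)
    }
}
```

Sampling frames 0, 3, and 6 from this mock costs three reads and four grabs, with no seek at all. Whether that beats seeking on a real decoder depends on how sparse the sampling is, which the sketch does not model.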

Changes

Video decode

- crates/multimodal/src/media.rs: sequential OpenCV frame sampling, timing logs, safer decoder-position tracking.

Qwen preprocessing

- crates/multimodal/src/vision/processors/qwen_vl_base.rs: borrowed RGB video fast path, direct patchify path, parallel preprocessing.
- crates/multimodal/src/vision/transforms.rs: raw-RGB bicubic resize helper.

TokenSpeed handoff

- model_gateway/.../grpc/multimodal.rs: packed per-item encoder tensor serialization and offset SHM transport.
- model_gateway/.../grpc/proto_wrapper.rs: tensor offset metadata.
- grpc_servicer/.../tokenspeed/servicer.py: offset SHM tensor reads and validation.
- grpc_servicer/tests/test_tokenspeed_multimodal_shm.py: SHM offset coverage.
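
The packed-offset idea in the TokenSpeed handoff items can be sketched as follows. This is an illustration under assumptions, not the PR's wire format: `Span`, `pack`, `read_span`, and the 64-byte `ALIGN` are all hypothetical, and a plain `Vec<u8>` stands in for the shared-memory segment.

```rust
/// Location of one packed tensor inside the shared buffer (hypothetical metadata).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: usize,
    pub len: usize,
}

/// Assumed alignment for each packed item; the real value may differ.
const ALIGN: usize = 64;

/// Pack several byte payloads into one buffer and record where each one starts.
pub fn pack(items: &[&[u8]]) -> (Vec<u8>, Vec<Span>) {
    let mut buf = Vec::new();
    let mut spans = Vec::with_capacity(items.len());
    for item in items {
        // Pad with zeros so every item starts on an ALIGN boundary.
        let padded = (buf.len() + ALIGN - 1) / ALIGN * ALIGN;
        buf.resize(padded, 0);
        spans.push(Span { offset: buf.len(), len: item.len() });
        buf.extend_from_slice(item);
    }
    (buf, spans)
}

/// Reader side: validate offset and size before slicing, so bad metadata
/// becomes an error rather than a panic or an out-of-bounds read.
pub fn read_span(buf: &[u8], span: Span) -> Result<&[u8], String> {
    let end = span
        .offset
        .checked_add(span.len)
        .ok_or_else(|| "offset + len overflows".to_string())?;
    buf.get(span.offset..end).ok_or_else(|| {
        format!(
            "span {}..{} exceeds segment of {} bytes",
            span.offset,
            end,
            buf.len()
        )
    })
}
```

Carrying an explicit byte size next to each offset lets the reader reject truncated or mismatched segments up front, which is the kind of validation the servicer-side bullet describes.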

Test Plan

cargo build --release -j 4 -p smg --bin smg
pytest -q grpc_servicer/tests/test_tokenspeed_multimodal_shm.py
git diff --check

Checklist

cargo +nightly fmt passes
cargo clippy --all-targets --all-features -- -D warnings passes
(Optional) Documentation updated
(Optional) Please join us on Slack #sig-smg to discuss, review, and
merge PRs

Summary by CodeRabbit

New Features
- Multimodal image/video processing now supports faster reference-based preprocessing and improved handling of local video inputs.
- Added new configuration options to tune multimodal preprocessing parallelism and helper thread usage.
Bug Fixes
- Improved accuracy and reliability of multimodal tensor size, offset, and shared-memory handling.
- Fixed several video decoding and resize edge cases for more consistent results.
Performance
- Reduced unnecessary copying during multimodal preparation and token assembly.
- Optimized shared-memory output writing and concurrency in multimodal request handling.

gemini-code-assist

Code Review

This pull request optimizes video decoding with OpenCV by sequentially grabbing intervening frames instead of seeking directly to each sampled frame. A review comment identified a critical bug where decoded_pos is not updated if capture.read() fails, which would cause subsequent frame decoding to go out of sync. The reviewer provided a code suggestion to correctly update the decoder position and handle grab failures explicitly.


coderabbitai · 2026-06-23T12:53:55Z

📝 Walkthrough

Walkthrough

Multimodal media decoding now uses async blocking helpers, path-aware video decoding, and duration-aware FFmpeg/OpenCV selection. Qwen vision preprocessing shifts to borrowed image refs and raw RGB patchification. The gateway, TokenSpeed servicer, and SHM writer update their preprocessing, serialization, and validation paths, with matching tests and docs.

Changes

Multimodal Pipeline and TokenSpeed Transport

Layer / File(s)	Summary
Async media decode entrypoints `crates/multimodal/Cargo.toml`, `crates/multimodal/src/media.rs`	The Tokio dependency declaration is reformatted, and base64 payload decoding plus image/video decode wrappers now route through async blocking helpers with size-checked path-based inputs and backend fallback dispatch.
Video backend execution `crates/multimodal/src/media.rs`	OpenCV sampling, ffmpeg subprocess helpers, duration-aware filter selection, MP4 duration parsing, decode timing, and temp-file timing logging now use the new optional input-byte flow; tests cover duration parsing and stdout preallocation.
Borrowed-image preprocess API `crates/multimodal/src/vision/processor.rs`, `crates/multimodal/src/vision/processors/qwen2_vl.rs`, `crates/multimodal/src/vision/processors/qwen3_vl.rs`, `crates/multimodal/src/vision/transforms.rs`, `docs/reference/configuration.md`	`VisionPreProcessor` gains `preprocess_image_refs`, Qwen2/Qwen3 delegate to the borrowed-image path, and `par_threads` now uses cached env-configured limits that are documented alongside the new preprocessing knobs.
Raw RGB patchification `crates/multimodal/src/vision/processors/qwen_vl_base.rs`	The shared Qwen VL base now resizes into raw RGB bytes, patchifies from LUT-normalized buffers, and rewires image/video preprocessing around the direct RGB path; parity tests compare resized image and video outputs.
Concurrent preprocessing and expansion `model_gateway/src/routers/grpc/multimodal.rs`	The gateway overlaps model-config lookup with preprocessing, stores encoder inputs in `Arc`, resolves placeholder token IDs during preprocessing, and refactors token expansion, placeholder mapping, and timing bookkeeping with updated tests.
TokenSpeed serialization and spans `model_gateway/src/routers/grpc/multimodal.rs`	TokenSpeed assembly now slices flat tensors with precomputed spans, serializes encoder inputs from `ArrayViewD`, borrows model-specific tensors, and selects SHM transport by modality; tests cover transport defaulting, flat-span helpers, and SHM packing.
TokenSpeed servicer parsing `grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`, `grpc_servicer/tests/test_tokenspeed_multimodal_shm.py`	Placeholder offsets, tensor sizing, and SHM feature reconstruction now use explicit byte-size and offset metadata, with tests covering the updated parsing and validation paths.
TokenSpeed SHM writer `model_gateway/src/routers/grpc/proto_wrapper.rs`	Environment-backed timing and transport settings are cached once, SHM payloads are counted during writing, and the encoder SHM cleanup helper is removed.
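
Several rows above mention settings that are read from the environment once and then cached. A minimal sketch of that pattern with `std::sync::OnceLock` is shown below. The helper names are hypothetical, and the default of 32 is an assumption taken from the old `MIN_ROWS_PER_THREAD` constant visible in the `par_threads` diff further down, not a confirmed default of the new code.

```rust
use std::sync::OnceLock;

/// Parse an optional raw value as usize, falling back to `default` when the
/// variable is unset, empty, or not a valid non-negative integer.
pub fn parse_usize_or(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(default)
}

/// Read the knob once per process; later calls return the cached value even
/// if the environment changes afterwards.
pub fn min_rows_per_thread() -> usize {
    static CACHED: OnceLock<usize> = OnceLock::new();
    *CACHED.get_or_init(|| {
        let raw = std::env::var("SMG_MM_PREPROCESS_PAR_MIN_ROWS").ok();
        // Assumed default, see the note above.
        parse_usize_or(raw.as_deref(), 32)
    })
}
```

Caching keeps environment lookups and parsing off the hot path, at the cost that changing the variable after the first call has no effect.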

Sequence Diagram(s)

sequenceDiagram
  participant process_multimodal_parts
  participant assemble_tokenspeed
  participant write_tokenspeed_shm_with
  participant CountingWriter
  process_multimodal_parts->>assemble_tokenspeed: pass preprocessed encoder inputs
  assemble_tokenspeed->>write_tokenspeed_shm_with: serialize TokenSpeed SHM payload
  write_tokenspeed_shm_with->>CountingWriter: write packed bytes
  CountingWriter-->>write_tokenspeed_shm_with: bytes_written
  write_tokenspeed_shm_with-->>assemble_tokenspeed: return ShmHandle
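
The `CountingWriter` participant in the diagram suggests a writer wrapper that tracks how many bytes pass through it. The name comes from the diagram, but the implementation below is only an assumed sketch built on `std::io::Write`.

```rust
use std::io::{self, Write};

/// Wraps any writer and counts the bytes it actually accepts.
pub struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: u64,
}

impl<W: Write> CountingWriter<W> {
    pub fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Total bytes accepted by the inner writer so far.
    pub fn bytes_written(&self) -> u64 {
        self.bytes_written
    }

    /// Give back the wrapped writer.
    pub fn into_inner(self) -> W {
        self.inner
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only what the inner writer reports as written, so short
        // writes are not over-counted.
        let n = self.inner.write(buf)?;
        self.bytes_written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

Counting during the write avoids a second pass over the payload just to learn its size.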

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related issues

lightseekorg/smg issue 566: The PR touches the same multimodal preprocessing and backend assembly path as the issue.

Possibly related PRs

lightseekorg/smg#1603: Both PRs overhaul the same multimodal video/Qwen VL decode and RGB-bytes preprocessing paths.
lightseekorg/smg#1604: Both PRs modify TokenSpeed multimodal SHM transport and serialization in model_gateway/src/routers/grpc/*.
lightseekorg/smg#1515: Both PRs change the TokenSpeed multimodal gRPC servicer and gateway assembly for shared placeholder/offset handling.

Suggested reviewers

slin1237
key4ng
CatherineSue


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main performance-focused multimodal changes across video decode, Qwen preprocessing, and TokenSpeed handoff.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


…rame seek

decode_video_with_opencv_file called capture.set(CAP_PROP_POS_FRAMES, idx) for every sampled frame. OpenCV flushes/re-seeks the decoder on each POS_FRAMES set (~10 ms/frame even for adjacent frames), so a 20-frame 2 fps clip spent ~195 ms in decode -- the largest single component of multimodal video TTFT.

Advance to each sampled frame by sequentially grabbing the intervening frames (grab() decodes without retrieving, ~1-2 ms/frame) and reading only the sampled ones. This avoids a decoder flush/seek for every sampled frame.

Measured (Qwen3.5-4B, 20-frame 512x512 clip): decode ~197 ms -> ~69 ms (2.8x), end-to-end video request ~357 ms -> ~228 ms (~36%).

Verified bit-exact against the previous per-frame seek: identical decoded pixels on both dense and sparse (non-keyframe) sampling, and identical model output on a numbered-frame video, so accuracy is unchanged.

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Move more media decode work off the async runtime and add borrowed/parallel Qwen image and video preprocessing paths.

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Pack TokenSpeed encoder inputs into offset SHM segments, preserve placeholder spans for faster worker handoff, and default video tensor transport to auto.

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60d6d7b68f


chatgpt-codex-connector · 2026-06-26T02:48:10Z

+    "fs",
+    "rt-multi-thread",
+    "process",
+    "time",


Enable Tokio io-util for pipe reads

This crate now imports tokio::io::AsyncReadExt and calls read_to_end on the ffmpeg stdout/stderr pipes, but the llm-multimodal Tokio dependency still enables only sync, fs, rt-multi-thread, process, and time. When this crate is checked or consumed without another workspace target enabling Tokio's full/io-util feature, the new extension trait is not available and video decoding builds fail; add io-util to this feature list.


coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 323-332: The `read_and_hash` and `decode_video_frames_from_path`
paths in `media.rs` are operating on different snapshots of the same video file,
which can make `VideoClip.raw_bytes` and `VideoClip.hash` disagree with the
decoded frames. Change the `decode` flow so both hashing and frame decoding use
one immutable file snapshot or a stable descriptor/snapshot created once, rather
than reopening `canonical` separately. Keep the fix localized around
`read_and_hash`, `decode_video_frames_from_path`, and the `tokio::try_join!`
call so the returned `VideoClip` always reflects a single consistent file
version.
- Around line 1039-1043: The missing-binary branch in the `command.spawn()`
error handling still hardcodes a `video_url inputs` message, but
`MediaConnectorError::VideoDecode` is now used by `File`, `DataUrl`, and
`InlineBytes` paths too. Update the error text in this `map_err` block to use a
source-agnostic decode message tied to the `program` being spawned, so missing
`ffmpeg`/`ffprobe` guidance applies to all decode inputs.

In `@crates/multimodal/src/vision/processors/qwen_vl_base.rs`:
- Around line 1361-1398: The new test only covers the serial resize/preprocess
path and does not verify the image-parallel branch. Add a parity test around
QwenVLProcessorBase::patchify_image_rgb_block_band that exercises the same
resize/normalize inputs and compares its output against the existing serial
patchify/preprocess path, so the parallel image fast path gets the same
regression coverage as the video path.

In `@crates/multimodal/src/vision/transforms.rs`:
- Around line 446-456: The row-threshold check in par_threads can overflow when
computing 2 * cfg.min_rows_per_thread from
par_config/SMG_MM_PREPROCESS_PAR_MIN_ROWS. Update the early-return condition to
use saturating arithmetic for that comparison so it cannot panic in debug/tests
or wrap in release, while keeping the existing behavior of returning 1 for small
workloads.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 1192-1193: The cast fallback in
TokenSpeedSchedulerServicer._feature_from_proto can trigger _tensor_from_proto
to unlink a shared packed SHM segment too early when multiple packed inputs
share the same backing file. Update the fallback path so it does not destroy
shared SHM before all packed items are read—either prevent packing when cast_to
is required on the servicer side, or change _tensor_payload_bytes_from_shm and
related SHM handling to defer unlinking until every offset/reference in the
shared segment has been materialized.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 1161-1186: Reject placeholder/item count mismatches before
constructing TokenSpeed items in the multimodal path: in the loop that builds
`pending_items`, stop relying on
`mm_placeholders_by_item.next().unwrap_or_default()` and instead validate that
`placeholders_for_items(&intermediate.placeholders, patch_offsets)` yields
exactly `item_count` groups before iteration. If the counts differ, return an
error early from the surrounding multimodal serialization flow so
`encoder_input_for_item` and `serialize_model_specific_for_item` are not used to
build partially invalid `PendingTokenSpeedItem` values.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f1d34efa-d749-452c-9e07-649de652678d

📥 Commits

Reviewing files that changed from the base of the PR and between 37fd1ef and 60d6d7b.

📒 Files selected for processing (12)

crates/multimodal/Cargo.toml
crates/multimodal/src/media.rs
crates/multimodal/src/vision/processor.rs
crates/multimodal/src/vision/processors/qwen2_vl.rs
crates/multimodal/src/vision/processors/qwen3_vl.rs
crates/multimodal/src/vision/processors/qwen_vl_base.rs
crates/multimodal/src/vision/transforms.rs
docs/reference/configuration.md
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
grpc_servicer/tests/test_tokenspeed_multimodal_shm.py
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs

coderabbitai · 2026-06-26T02:51:41Z

+        let read_and_hash = async {
+            let bytes = Bytes::from(fs::read(&canonical).await?);
+            let bytes_for_hash = bytes.clone();
+            let hash = task::spawn_blocking(move || crate::hasher::hash_video(&bytes_for_hash))
+                .await
+                .map_err(MediaConnectorError::Blocking)?;
+            Ok::<_, MediaConnectorError>((bytes, hash))
+        };
+        let decode = decode_video_frames_from_path(&canonical, input_bytes, None, cfg);
+        let ((bytes, hash), decoded) = tokio::try_join!(read_and_hash, decode)?;


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Decode and hash the same file snapshot.

Lines 323-332 read/hash canonical in one task while decoding the path in another. If that file is modified between those operations, the returned VideoClip can contain frames from one version and raw_bytes/hash from another. Use one immutable snapshot for both operations, or decode through a stable descriptor/snapshot instead of reopening the path twice.

Based on crates/multimodal/src/types.rs, VideoClip stores decoded frames alongside the original raw_bytes and hash, so those fields need to describe the same content.
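
One way to read this suggestion is sketched below: take a single immutable byte snapshot, then hash and decode from that same buffer, still concurrently. Everything here is a placeholder. `Clip`, `hash_bytes`, and `decode_frames` stand in for `VideoClip`, `hash_video`, and the real decoder, and the sketch uses scoped threads rather than Tokio tasks. A decoder that needs a file path rather than bytes would need a different snapshot mechanism, which is not shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Placeholder for the real clip type: frames plus the bytes and hash they came from.
pub struct Clip {
    pub raw_bytes: Arc<[u8]>,
    pub hash: u64,
    pub frames: Vec<Vec<u8>>,
}

/// Placeholder hash, not the project's `hash_video`.
pub fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Placeholder "decoder": splits the snapshot into fixed-size chunks.
pub fn decode_frames(bytes: &[u8], frame_len: usize) -> Vec<Vec<u8>> {
    bytes.chunks(frame_len.max(1)).map(|c| c.to_vec()).collect()
}

/// Hash and decode run concurrently, but both see the same immutable snapshot,
/// so the returned fields always describe one version of the content.
pub fn clip_from_snapshot(snapshot: Arc<[u8]>, frame_len: usize) -> Clip {
    let for_hash = Arc::clone(&snapshot);
    let for_decode = Arc::clone(&snapshot);
    let (hash, frames) = std::thread::scope(|s| {
        let h = s.spawn(move || hash_bytes(&for_hash));
        let d = s.spawn(move || decode_frames(&for_decode, frame_len));
        (
            h.join().expect("hash thread panicked"),
            d.join().expect("decode thread panicked"),
        )
    });
    Clip { raw_bytes: snapshot, hash, frames }
}
```

The trade-off is that the decode can no longer start before the file read finishes, so some of the overlap that the `tokio::try_join!` in the diff was after is lost.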


coderabbitai · 2026-06-26T02:51:41Z

+    command.spawn().map_err(|e| {
        if e.kind() == std::io::ErrorKind::NotFound {
            MediaConnectorError::VideoDecode(format!(
                "{program} executable not found; install {program} to decode video_url inputs"
            ))


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Generalize the missing-binary error message.

Line 1042 still says video_url inputs, but this helper now backs File, DataUrl, and InlineBytes decode paths too. When ffmpeg/ffprobe is missing, that error points operators at the wrong source type.

Suggested fix

-                "{program} executable not found; install {program} to decode video_url inputs"
+                "{program} executable not found; install {program} to decode video inputs"
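
For context, a self-contained sketch of the same mapping using only `std::process` is shown here. `spawn_decoder` is a hypothetical helper that returns a plain `String` error instead of `MediaConnectorError`, so it illustrates the message shape rather than the crate's actual error type.

```rust
use std::io::ErrorKind;
use std::process::{Child, Command, Stdio};

/// Spawn an external decoder and turn a missing binary into a message that
/// does not assume any particular input source.
pub fn spawn_decoder(program: &str, args: &[&str]) -> Result<Child, String> {
    Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .map_err(|e| {
            if e.kind() == ErrorKind::NotFound {
                format!("{program} executable not found; install {program} to decode video inputs")
            } else {
                format!("failed to spawn {program}: {e}")
            }
        })
}
```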


coderabbitai · 2026-06-26T02:51:41Z

+    #[test]
+    fn test_preprocess_image_matches_tensor_patchify_with_resize() {
+        let processor = QwenVLProcessorBase::new(create_video_test_config());
+        let config = PreProcessorConfig {
+            image_mean: Some(processor.default_mean().to_vec()),
+            image_std: Some(processor.default_std().to_vec()),
+            ..Default::default()
+        };
+        let image = create_sized_pattern_frame(7, 9, 3);
+        let (target_h, target_w) = processor.smart_resize(9, 7).unwrap();
+        assert!(
+            (target_w as u32, target_h as u32) != (7u32, 9u32),
+            "test must force a resize; target {target_w}x{target_h} should differ from 7x9"
+        );
+
+        let result = processor
+            .preprocess(std::slice::from_ref(&image), &config)
+            .unwrap();
+        let actual = result.encoder_input.as_slice_memory_order().unwrap();
+
+        let resized = resize_bicubic_pil(&image, target_w as u32, target_h as u32);
+        let tensor =
+            to_tensor_and_normalize(&resized, &processor.default_mean(), &processor.default_std());
+        let (grid_t, grid_h, grid_w) = processor.calculate_grid_thw(target_h, target_w, 1);
+        let mut expected = Vec::new();
+        processor
+            .patchify_into(&tensor, grid_t, grid_h, grid_w, &mut expected)
+            .unwrap();
+
+        assert_eq!(actual.len(), expected.len());
+        for (idx, (&got, &want)) in actual.iter().zip(expected.iter()).enumerate() {
+            assert_eq!(
+                got.to_bits(),
+                want.to_bits(),
+                "image patch value differs at index {idx}: got {got}, want {want}"
+            );
+        }
+    }


📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an image-parallel parity test.

This new test validates the resize path, but it stays on the serial branch. The video fast path already has a parallel-block regression guard; patchify_image_rgb_block_band should get the same coverage.
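
The shape of such a parity check can be sketched independently of the real processor. The example below is not `patchify_image_rgb_block_band`. It uses a hypothetical LUT-based normalization with a serial path and a row-banded parallel path, and compares the two bit for bit, as the existing test does with `to_bits()`.

```rust
/// Build a 256-entry table mapping a u8 channel value to a normalized f32.
pub fn build_lut(mean: f32, std: f32) -> [f32; 256] {
    let mut lut = [0.0f32; 256];
    for (value, slot) in lut.iter_mut().enumerate() {
        *slot = (value as f32 / 255.0 - mean) / std;
    }
    lut
}

/// Reference path: one thread, one pass.
pub fn normalize_serial(rgb: &[u8], lut: &[f32; 256]) -> Vec<f32> {
    rgb.iter().map(|&v| lut[v as usize]).collect()
}

/// Parallel path: split the image into bands of whole rows and normalize
/// each band on its own scoped thread.
pub fn normalize_banded(
    rgb: &[u8],
    row_bytes: usize,
    threads: usize,
    lut: &[f32; 256],
) -> Vec<f32> {
    let threads = threads.max(1);
    let row_bytes = row_bytes.max(1);
    let rows = (rgb.len() + row_bytes - 1) / row_bytes;
    let band_len = ((rows + threads - 1) / threads).max(1) * row_bytes;
    let mut out = vec![0.0f32; rgb.len()];
    std::thread::scope(|s| {
        for (src, dst) in rgb.chunks(band_len).zip(out.chunks_mut(band_len)) {
            s.spawn(move || {
                for (d, &v) in dst.iter_mut().zip(src) {
                    *d = lut[v as usize];
                }
            });
        }
    });
    out
}
```

A parity test then asserts that every element of the banded output has the same bit pattern as the serial output, ideally with a row count that does not divide evenly across bands so the ragged last band is exercised.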


coderabbitai · 2026-06-26T02:51:41Z

 pub(crate) fn par_threads(out_bytes: usize, out_rows: usize) -> usize {
-    const PAR_MIN_BYTES: usize = 1 << 19; // ~512 KiB output; below this, serial
-    const MIN_ROWS_PER_THREAD: usize = 32; // keep enough work per thread
-    const MAX_THREADS: usize = 32; // spawning hundreds of threads costs more than it saves
-    if out_bytes < PAR_MIN_BYTES || out_rows < 2 * MIN_ROWS_PER_THREAD {
+    let cfg = par_config();
+    if out_bytes < cfg.min_bytes || out_rows < 2 * cfg.min_rows_per_thread {
        return 1;
    }
-    let avail = std::thread::available_parallelism()
-        .map(|n| n.get())
-        .unwrap_or(1);
-    (out_rows / MIN_ROWS_PER_THREAD)
+    static AVAILABLE_PARALLELISM: OnceLock<usize> = OnceLock::new();
+    let avail = *AVAILABLE_PARALLELISM
+        .get_or_init(|| std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1));
+    (out_rows / cfg.min_rows_per_thread)
        .min(avail)
-        .clamp(1, MAX_THREADS)
+        .clamp(1, cfg.max_threads)


🩺 Stability & Availability | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

sed -n '1,260p' crates/multimodal/src/vision/transforms.rs && printf '\n--- CUT ---\n' && sed -n '260,520p' crates/multimodal/src/vision/transforms.rs

Repository: lightseekorg/smg

Length of output: 18396

🏁 Script executed:

python3 - <<'PY'
from pathlib import Path
p = Path('crates/multimodal/src/vision/transforms.rs')
text = p.read_text()
for needle in ['par_config', 'env_usize', 'par_threads', 'SMG_MM_PREPROCESS_PAR_MIN_ROWS']:
    print(f'\n### {needle}')
    idx = text.find(needle)
    if idx == -1:
        print('not found')
        continue
    start = max(0, text.rfind('\n', 0, idx-400))
    end = min(len(text), text.find('\n', idx+1200) if text.find('\n', idx+1200)!=-1 else len(text))
    print(text[start:end])
PY

Repository: lightseekorg/smg

Length of output: 6854

🏁 Script executed:

git grep -n "pub(crate) fn par_threads\|fn par_config\|env_usize\|SMG_MM_PREPROCESS_PAR_MIN_ROWS" -- crates/multimodal/src/vision/transforms.rs

Repository: lightseekorg/smg

Length of output: 838

Use saturating arithmetic for the row threshold.

cfg.min_rows_per_thread comes from SMG_MM_PREPROCESS_PAR_MIN_ROWS, so 2 * cfg.min_rows_per_thread can overflow here and panic in debug/tests or wrap in release.

Suggested fix

-    if out_bytes < cfg.min_bytes || out_rows < 2 * cfg.min_rows_per_thread {
+    if out_bytes < cfg.min_bytes
+        || out_rows < cfg.min_rows_per_thread.saturating_mul(2)
+    {
         return 1;
     }
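
A standalone version of the check makes the overflow case easy to exercise. `ParConfig` and `par_threads_with` are hypothetical names, and the available parallelism is passed in so the result is deterministic. The two `.max(1)` guards go beyond what the review asks for: they are an extra assumption that a configured value of zero should neither divide by zero nor make `clamp` panic.

```rust
/// Hypothetical mirror of the env-configured limits used by `par_threads`.
#[derive(Debug, Clone, Copy)]
pub struct ParConfig {
    pub min_bytes: usize,
    pub min_rows_per_thread: usize,
    pub max_threads: usize,
}

/// Pick a thread count for `out_rows` rows of output, given `avail` cores.
pub fn par_threads_with(
    cfg: ParConfig,
    avail: usize,
    out_bytes: usize,
    out_rows: usize,
) -> usize {
    // saturating_mul cannot panic in debug builds or wrap in release builds.
    if out_bytes < cfg.min_bytes || out_rows < cfg.min_rows_per_thread.saturating_mul(2) {
        return 1;
    }
    // Extra guards (assumption, not part of the review suggestion).
    let rows_per_thread = cfg.min_rows_per_thread.max(1);
    (out_rows / rows_per_thread)
        .min(avail)
        .clamp(1, cfg.max_threads.max(1))
}
```

With the old constants (512 KiB, 32 rows, 32 threads), a 1 MiB output of 1024 rows on 8 cores yields 8 threads, while `min_rows_per_thread = usize::MAX` simply falls back to 1.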


coderabbitai · 2026-06-26T02:51:41Z

+        if cast_to is not None and dtype != cast_to:
            return TokenSpeedSchedulerServicer._tensor_from_proto(tensor_data, cast_to=cast_to)


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Do not unlink shared packed SHM segments during cast fallback.

When packed encoder inputs share one SHM file, a dtype mismatch makes _feature_from_proto materialize via _tensor_from_proto; _tensor_payload_bytes_from_shm can then unlink the shared segment after the first item, causing later offsets in the same segment to fail. Either avoid packing when a servicer-side cast is required, or make the fallback read every item that shares the segment before unlinking it, for example by refcounting the segment.

Also applies to: 1239-1246
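
The refcounting option can be sketched in a language-neutral way. The servicer itself is Python, so the Rust below only illustrates the bookkeeping: `SegmentTracker` and its methods are hypothetical, and no real shared memory is touched.

```rust
use std::collections::HashMap;

/// Tracks how many packed items in each shared segment are still unread.
#[derive(Default)]
pub struct SegmentTracker {
    remaining: HashMap<String, usize>,
}

impl SegmentTracker {
    /// Record that one more packed item lives in `segment`.
    pub fn register(&mut self, segment: &str) {
        *self.remaining.entry(segment.to_string()).or_insert(0) += 1;
    }

    /// Mark one item as materialized. Returns true only when the caller should
    /// unlink, that is, when the last outstanding item was just read.
    pub fn item_done(&mut self, segment: &str) -> bool {
        match self.remaining.get_mut(segment) {
            None => false,
            Some(count) => {
                // `register` never stores zero, so this cannot underflow.
                *count -= 1;
                if *count == 0 {
                    self.remaining.remove(segment);
                    true
                } else {
                    false
                }
            }
        }
    }
}
```

Every read path, including the cast fallback, would call `item_done` and unlink only on `true`, so an early unlink after the first item cannot happen.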


coderabbitai · 2026-06-26T02:51:41Z

+    let mut mm_placeholders_by_item =
+        placeholders_for_items(&intermediate.placeholders, patch_offsets).into_iter();
+    let mut pending_items: Vec<PendingTokenSpeedItem<'_>> = Vec::with_capacity(item_count);
    for item_index in 0..item_count {
        let item_encoder_input = match encoder_input_for_item(
            &intermediate.preprocessed,
            &intermediate.field_layouts,
+            &flat_spans,
            item_index,
        ) {
            Ok(value) => value,
-            Err(error) => {
-                cleanup_tokenspeed_items_encoder_shm(&items, None);
-                return Err(error);
-            }
+            Err(error) => return Err(error),
        };
-        let encoder_input_started = Instant::now();
-        let encoder_input = serialize_array_as_tokenspeed_tensor(
-            &item_encoder_input,
-            &encoder_input_dtype,
-            shm_enabled,
-        );
-        let encoder_input_serialize_ms = encoder_input_started.elapsed().as_secs_f64() * 1000.0;
-        let model_specific_started = Instant::now();
+        let model_specific_started = log_timing.then(Instant::now);
        let model_specific_tensors = match serialize_model_specific_for_item(
            &intermediate.preprocessed.model_specific,
            &intermediate.field_layouts,
+            &flat_spans,
            item_index,
        ) {
            Ok(value) => value,
-            Err(error) => {
-                // `encoder_input` (possibly SHM) was created for this item but the
-                // item isn't built; clean it plus all prior items.
-                cleanup_tokenspeed_items_encoder_shm(&items, Some(&encoder_input));
-                return Err(error);
-            }
+            Err(error) => return Err(error),
        };
-        let model_specific_serialize_ms = model_specific_started.elapsed().as_secs_f64() * 1000.0;
-        let mm_placeholders =
-            placeholders_for_item(item_index, &intermediate.placeholders, &patch_offsets);
-        let content_hash = content_hash_for_item(intermediate.modality, &intermediate, item_index);
+        let model_specific_serialize_ms =
+            model_specific_started.map(|started| started.elapsed().as_secs_f64() * 1000.0);
+        let mm_placeholders = mm_placeholders_by_item.next().unwrap_or_default();


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject placeholder/item count mismatches before building TokenSpeed items.

unwrap_or_default() silently emits an item with empty placeholders when placeholders_for_items(...) returns fewer groups than item_count; the TokenSpeed servicer rejects empty placeholder lists, and extra groups would be dropped. Validate the group count before iterating.

Suggested fix

- let mut mm_placeholders_by_item =
-     placeholders_for_items(&intermediate.placeholders, patch_offsets).into_iter();
+ let mm_placeholders_by_item_vec =
+     placeholders_for_items(&intermediate.placeholders, patch_offsets);
+ anyhow::ensure!(
+     mm_placeholders_by_item_vec.len() == item_count,
+     "TokenSpeed placeholder item count mismatch: placeholders={}, items={item_count}",
+     mm_placeholders_by_item_vec.len(),
+ );
+ let mut mm_placeholders_by_item = mm_placeholders_by_item_vec.into_iter();
@@
- let mm_placeholders = mm_placeholders_by_item.next().unwrap_or_default();
+ let mm_placeholders = mm_placeholders_by_item
+     .next()
+     .expect("placeholder item count validated above");

📝 Committable suggestion


Suggested change

let mm_placeholders_by_item_vec =
    placeholders_for_items(&intermediate.placeholders, patch_offsets);
anyhow::ensure!(
    mm_placeholders_by_item_vec.len() == item_count,
    "TokenSpeed placeholder item count mismatch: placeholders={}, items={item_count}",
    mm_placeholders_by_item_vec.len(),
);
let mut mm_placeholders_by_item = mm_placeholders_by_item_vec.into_iter();
let mut pending_items: Vec<PendingTokenSpeedItem<'_>> = Vec::with_capacity(item_count);
for item_index in 0..item_count {
    let item_encoder_input = match encoder_input_for_item(
        &intermediate.preprocessed,
        &intermediate.field_layouts,
        &flat_spans,
        item_index,
    ) {
        Ok(value) => value,
        Err(error) => return Err(error),
    };
    let model_specific_started = log_timing.then(Instant::now);
    let model_specific_tensors = match serialize_model_specific_for_item(
        &intermediate.preprocessed.model_specific,
        &intermediate.field_layouts,
        &flat_spans,
        item_index,
    ) {
        Ok(value) => value,
        Err(error) => return Err(error),
    };
    let model_specific_serialize_ms =
        model_specific_started.map(|started| started.elapsed().as_secs_f64() * 1000.0);
    let mm_placeholders = mm_placeholders_by_item
        .next()
        .expect("placeholder item count validated above");

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 1161 - 1186, Reject placeholder/item count mismatches before constructing TokenSpeed items in the multimodal path: in the loop that builds `pending_items`, stop relying on `mm_placeholders_by_item.next().unwrap_or_default()` and instead validate that `placeholders_for_items(&intermediate.placeholders, patch_offsets)` yields exactly `item_count` groups before iteration. If the counts differ, return an error early from the surrounding multimodal serialization flow so `encoder_input_for_item` and `serialize_model_specific_for_item` are not used to build partially invalid `PendingTokenSpeedItem` values.
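The check this review asks for can be sketched in isolation. The snippet below is a minimal illustration and not the PR's code: `PlaceholderGroup` and `pair_items_with_placeholders` are hypothetical names, errors are plain `String`s instead of `anyhow`, and the empty-group check is an added assumption based on the note that the TokenSpeed servicer rejects empty placeholder lists.

```rust
/// Hypothetical stand-in for one item's placeholder ranges: (offset, length) pairs.
type PlaceholderGroup = Vec<(usize, usize)>;

/// Pairs each item index with its placeholder group.
/// Fails up front if the group count differs from `item_count`
/// or if any group is empty, so no partially valid item is ever built.
fn pair_items_with_placeholders(
    item_count: usize,
    groups: Vec<PlaceholderGroup>,
) -> Result<Vec<(usize, PlaceholderGroup)>, String> {
    // Count check: too few groups would otherwise yield empty defaults,
    // and too many would be silently dropped.
    if groups.len() != item_count {
        return Err(format!(
            "placeholder item count mismatch: placeholders={}, items={item_count}",
            groups.len()
        ));
    }
    // Assumed extra guard: reject empty groups before they reach the servicer.
    if let Some(index) = groups.iter().position(|group| group.is_empty()) {
        return Err(format!("item {index} has no placeholders"));
    }
    // Lengths match, so enumerate() covers every item exactly once.
    Ok(groups.into_iter().enumerate().collect())
}
```

Validating once before the loop keeps the per-item body free of fallbacks: after the check, every item index is known to have exactly one non-empty group.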

github-actions Bot added the multimodal Multimodal crate changes label Jun 23, 2026

yechank-nvidia changed the title ~~perf(multimodal): decode sampled video frames sequentially, not per-f…~~ perf(multimodal): decode sampled video frames sequentially, not per-frame Jun 23, 2026

yechank-nvidia changed the title ~~perf(multimodal): decode sampled video frames sequentially, not per-frame~~ perf(multimodal): decode sampled video frames sequentially, not per-frame seek Jun 23, 2026

gemini-code-assist Bot reviewed Jun 23, 2026


Comment thread crates/multimodal/src/media.rs Outdated

lightseek-bot assigned chenht2022 Jun 25, 2026

github-actions Bot added documentation Improvements or additions to documentation dependencies Dependency updates grpc gRPC client and router changes tests Test changes model-gateway Model gateway crate changes labels Jun 25, 2026

yechank-nvidia force-pushed the yechan/mm-video-decode-seq branch from 6215045 to e99ae9e Compare June 26, 2026 02:22

yechank-nvidia changed the title ~~perf(multimodal): decode sampled video frames sequentially, not per-frame seek~~ perf(multimodal): reduce video decode, Qwen preprocess, and TokenSpeed handoff overhead Jun 26, 2026

yechank-nvidia added 3 commits June 25, 2026 19:34

perf(multimodal): reduce vision preprocessing overhead

e3c70b3

Move more media decode work off the async runtime and add borrowed/parallel Qwen image and video preprocessing paths. Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia force-pushed the yechan/mm-video-decode-seq branch from e99ae9e to 60d6d7b Compare June 26, 2026 02:39

yechank-nvidia marked this pull request as ready for review June 26, 2026 02:39

yechank-nvidia requested review from CatherineSue, key4ng and slin1237 as code owners June 26, 2026 02:39

chatgpt-codex-connector Bot reviewed Jun 26, 2026


coderabbitai Bot requested changes Jun 26, 2026


		if cast_to is not None and dtype != cast_to:
		return TokenSpeedSchedulerServicer._tensor_from_proto(tensor_data, cast_to=cast_to)


yechank-nvidia commented Jun 23, 2026 • edited by coderabbitai Bot


coderabbitai Bot commented Jun 23, 2026 • edited

❌ Failed checks (1 warning)
