perf(multimodal): reduce video decode, Qwen preprocess, and TokenSpeed handoff overhead#1820

Open
yechank-nvidia wants to merge 3 commits into
lightseekorg:mainfrom
yechank-nvidia:yechan/mm-video-decode-seq

Conversation

@yechank-nvidia

@yechank-nvidia yechank-nvidia commented Jun 23, 2026

Collaborator

Description

Problem

Multimodal video requests still spend too much time outside model execution:

  1. Video decode overhead. OpenCV was seeking for every sampled frame, which flushes/re-seeks the
    decoder repeatedly.
  2. Qwen video preprocessing overhead. Decoded RGB video frames were often materialized through
    DynamicImage and extra tensor copies before patchification.
  3. TokenSpeed handoff overhead. Large video encoder inputs were still expensive to serialize and
    pass inline.

Solution

  • Decode sampled video frames by sequentially grab()-ing intervening frames and read()-ing only
    sampled frames.
  • Add borrowed RGB video preprocessing paths for Qwen processors, including direct RGB patchification
    and parallel work for larger inputs.
  • Pack TokenSpeed multimodal encoder inputs into offset SHM segments for video tensors, with inline
    fallback.

Changes

Video decode

  • crates/multimodal/src/media.rs: sequential OpenCV frame sampling, timing logs, safer decoder-
    position tracking.

Qwen preprocessing

  • crates/multimodal/src/vision/processors/qwen_vl_base.rs: borrowed RGB video fast path, direct
    patchify path, parallel preprocessing.
  • crates/multimodal/src/vision/transforms.rs: raw-RGB bicubic resize helper.

TokenSpeed handoff

  • model_gateway/.../grpc/multimodal.rs: packed per-item encoder tensor serialization and offset SHM
    transport.
  • model_gateway/.../grpc/proto_wrapper.rs: tensor offset metadata.
  • grpc_servicer/.../tokenspeed/servicer.py: offset SHM tensor reads and validation.
  • grpc_servicer/tests/test_tokenspeed_multimodal_shm.py: SHM offset coverage.
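
The offset SHM transport listed above can be pictured as packing several tensors into one buffer and describing each with an offset and a byte size. The following is only a minimal sketch under stated assumptions: `Span`, `pack`, `read_span`, and the 64-byte `ALIGN` are hypothetical names and values chosen for illustration, and a plain `Vec<u8>` stands in for the shared-memory segment. It is not the code in `multimodal.rs` or `servicer.py`.

```rust
/// Byte range of one tensor inside a shared buffer (hypothetical metadata shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    offset: usize,
    nbytes: usize,
}

/// Assumed alignment for each tensor start; not taken from the PR.
const ALIGN: usize = 64;

/// Pack payloads into one buffer, padding so each payload starts on an ALIGN boundary.
fn pack(payloads: &[&[u8]]) -> (Vec<u8>, Vec<Span>) {
    let mut buf = Vec::new();
    let mut spans = Vec::with_capacity(payloads.len());
    for p in payloads {
        let offset = (buf.len() + ALIGN - 1) / ALIGN * ALIGN;
        buf.resize(offset, 0); // zero padding up to the aligned offset
        buf.extend_from_slice(p);
        spans.push(Span { offset, nbytes: p.len() });
    }
    (buf, spans)
}

/// Validate a span against the segment length before slicing it out.
fn read_span(segment: &[u8], span: Span) -> Result<&[u8], String> {
    let end = span
        .offset
        .checked_add(span.nbytes)
        .ok_or_else(|| "offset + nbytes overflows".to_string())?;
    if end > segment.len() {
        return Err(format!(
            "span {}..{} exceeds segment of {} bytes",
            span.offset,
            end,
            segment.len()
        ));
    }
    Ok(&segment[span.offset..end])
}
```

The reader side checks every span before slicing, which mirrors the "offset SHM tensor reads and validation" item in spirit: a bad offset becomes an error instead of an out-of-bounds read.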

Test Plan

  • cargo build --release -j 4 -p smg --bin smg
  • pytest -q grpc_servicer/tests/test_tokenspeed_multimodal_shm.py
  • git diff --check

Checklist

  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and
    merge PRs

Summary by CodeRabbit

  • New Features

    • Multimodal image/video processing now supports faster reference-based preprocessing and improved handling of local video inputs.
    • Added new configuration options to tune multimodal preprocessing parallelism and helper thread usage.
  • Bug Fixes

    • Improved accuracy and reliability of multimodal tensor size, offset, and shared-memory handling.
    • Fixed several video decoding and resize edge cases for more consistent results.
  • Performance

    • Reduced unnecessary copying during multimodal preparation and token assembly.
    • Optimized shared-memory output writing and concurrency in multimodal request handling.

@github-actions github-actions Bot added the multimodal Multimodal crate changes label Jun 23, 2026
@yechank-nvidia yechank-nvidia changed the title perf(multimodal): decode sampled video frames sequentially, not per-f… perf(multimodal): decode sampled video frames sequentially, not per-frame Jun 23, 2026
@yechank-nvidia yechank-nvidia changed the title perf(multimodal): decode sampled video frames sequentially, not per-frame perf(multimodal): decode sampled video frames sequentially, not per-frame seek Jun 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request optimizes video decoding with OpenCV by sequentially grabbing intervening frames instead of seeking directly to each sampled frame. A review comment identified a critical bug where decoded_pos is not updated if capture.read() fails, which would cause subsequent frame decoding to go out of sync. The reviewer provided a code suggestion to correctly update the decoder position and handle grab failures explicitly.

Comment thread crates/multimodal/src/media.rs Outdated
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

📝 Walkthrough

Multimodal media decoding now uses async blocking helpers, path-aware video decoding, and duration-aware FFmpeg/OpenCV selection. Qwen vision preprocessing shifts to borrowed image refs and raw RGB patchification. The gateway, TokenSpeed servicer, and SHM writer update their preprocessing, serialization, and validation paths, with matching tests and docs.

Changes

Multimodal Pipeline and TokenSpeed Transport

| Layer | File(s) | Summary |
|---|---|---|
| Async media decode entrypoints | crates/multimodal/Cargo.toml, crates/multimodal/src/media.rs | The Tokio dependency declaration is reformatted, and base64 payload decoding plus image/video decode wrappers now route through async blocking helpers with size-checked path-based inputs and backend fallback dispatch. |
| Video backend execution | crates/multimodal/src/media.rs | OpenCV sampling, ffmpeg subprocess helpers, duration-aware filter selection, MP4 duration parsing, decode timing, and temp-file timing logging now use the new optional input-byte flow; tests cover duration parsing and stdout preallocation. |
| Borrowed-image preprocess API | crates/multimodal/src/vision/processor.rs, crates/multimodal/src/vision/processors/qwen2_vl.rs, crates/multimodal/src/vision/processors/qwen3_vl.rs, crates/multimodal/src/vision/transforms.rs, docs/reference/configuration.md | VisionPreProcessor gains preprocess_image_refs, Qwen2/Qwen3 delegate to the borrowed-image path, and par_threads now uses cached env-configured limits that are documented alongside the new preprocessing knobs. |
| Raw RGB patchification | crates/multimodal/src/vision/processors/qwen_vl_base.rs | The shared Qwen VL base now resizes into raw RGB bytes, patchifies from LUT-normalized buffers, and rewires image/video preprocessing around the direct RGB path; parity tests compare resized image and video outputs. |
| Concurrent preprocessing and expansion | model_gateway/src/routers/grpc/multimodal.rs | The gateway overlaps model-config lookup with preprocessing, stores encoder inputs in Arc, resolves placeholder token IDs during preprocessing, and refactors token expansion, placeholder mapping, and timing bookkeeping with updated tests. |
| TokenSpeed serialization and spans | model_gateway/src/routers/grpc/multimodal.rs | TokenSpeed assembly now slices flat tensors with precomputed spans, serializes encoder inputs from ArrayViewD, borrows model-specific tensors, and selects SHM transport by modality; tests cover transport defaulting, flat-span helpers, and SHM packing. |
| TokenSpeed servicer parsing | grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py, grpc_servicer/tests/test_tokenspeed_multimodal_shm.py | Placeholder offsets, tensor sizing, and SHM feature reconstruction now use explicit byte-size and offset metadata, with tests covering the updated parsing and validation paths. |
| TokenSpeed SHM writer | model_gateway/src/routers/grpc/proto_wrapper.rs | Environment-backed timing and transport settings are cached once, SHM payloads are counted during writing, and the encoder SHM cleanup helper is removed. |

Sequence Diagram(s)

sequenceDiagram
  participant process_multimodal_parts
  participant assemble_tokenspeed
  participant write_tokenspeed_shm_with
  participant CountingWriter
  process_multimodal_parts->>assemble_tokenspeed: pass preprocessed encoder inputs
  assemble_tokenspeed->>write_tokenspeed_shm_with: serialize TokenSpeed SHM payload
  write_tokenspeed_shm_with->>CountingWriter: write packed bytes
  CountingWriter-->>write_tokenspeed_shm_with: bytes_written
  write_tokenspeed_shm_with-->>assemble_tokenspeed: return ShmHandle

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related issues

  • lightseekorg/smg issue 566: The PR touches the same multimodal preprocessing and backend assembly path as the issue.

Possibly related PRs

  • lightseekorg/smg#1603: Both PRs overhaul the same multimodal video/Qwen VL decode and RGB-bytes preprocessing paths.
  • lightseekorg/smg#1604: Both PRs modify TokenSpeed multimodal SHM transport and serialization in model_gateway/src/routers/grpc/*.
  • lightseekorg/smg#1515: Both PRs change the TokenSpeed multimodal gRPC servicer and gateway assembly for shared placeholder/offset handling.

Suggested reviewers

  • slin1237
  • key4ng
  • CatherineSue

Poem

I nibbled bytes beneath the moon,
and patchlets danced in RGB tune.
The SHM carrots lined up neat,
while ffmpeg beat a little beat.
Hop hop—multimodal bliss! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main performance-focused multimodal changes across video decode, Qwen preprocessing, and TokenSpeed handoff. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions github-actions Bot added documentation Improvements or additions to documentation dependencies Dependency updates grpc gRPC client and router changes tests Test changes model-gateway Model gateway crate changes labels Jun 25, 2026
@yechank-nvidia yechank-nvidia force-pushed the yechan/mm-video-decode-seq branch from 6215045 to e99ae9e Compare June 26, 2026 02:22
@yechank-nvidia yechank-nvidia changed the title perf(multimodal): decode sampled video frames sequentially, not per-frame seek perf(multimodal): reduce video decode, Qwen preprocess, and TokenSpeed handoff overhead Jun 26, 2026
…rame seek

decode_video_with_opencv_file called capture.set(CAP_PROP_POS_FRAMES, idx)
for every sampled frame. OpenCV flushes/re-seeks the decoder on each
POS_FRAMES set (~10 ms/frame even for adjacent frames), so a 20-frame 2 fps
clip spent ~195 ms in decode -- the largest single component of multimodal
video TTFT.

Advance to each sampled frame by sequentially grabbing the intervening
frames (grab() decodes without retrieving, ~1-2 ms/frame) and reading only
the sampled ones. This avoids a decoder flush/seek for every sampled frame.

Measured (Qwen3.5-4B, 20-frame 512x512 clip): decode ~197 ms -> ~69 ms
(2.8x), end-to-end video request ~357 ms -> ~228 ms (~36%).

Verified bit-exact against the previous per-frame seek: identical decoded
pixels on both dense and sparse (non-keyframe) sampling, and identical model
output on a numbered-frame video, so accuracy is unchanged.
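
The forward-only sampling described in this commit message can be sketched without OpenCV. The block below is an illustration under stated assumptions: `FrameSource`, `FakeDecoder`, and `sample_sequential` are hypothetical names, the toy decoder produces one byte per frame, and real `grab()`/`read()` costs are not modelled. It only shows the control flow: skip intervening frames cheaply, retrieve pixels for sampled frames only, and never move backwards.

```rust
/// Hypothetical stand-in for a video decoder; this is not the OpenCV API.
trait FrameSource {
    /// Advance one frame without retrieving pixels. Returns false at end of stream.
    fn grab(&mut self) -> bool;
    /// Decode and return the next frame, or None at end of stream.
    fn read(&mut self) -> Option<Vec<u8>>;
}

/// Toy decoder: frame `i` is a single byte with value `i`.
struct FakeDecoder {
    pos: usize,
    total: usize,
    grabs: usize,
    reads: usize,
}

impl FrameSource for FakeDecoder {
    fn grab(&mut self) -> bool {
        if self.pos >= self.total {
            return false;
        }
        self.pos += 1;
        self.grabs += 1;
        true
    }

    fn read(&mut self) -> Option<Vec<u8>> {
        if self.pos >= self.total {
            return None;
        }
        let frame = vec![self.pos as u8];
        self.pos += 1;
        self.reads += 1;
        Some(frame)
    }
}

/// Decode the frames at `indices` (strictly ascending) by walking forward only.
fn sample_sequential<S: FrameSource>(
    src: &mut S,
    indices: &[usize],
) -> Result<Vec<Vec<u8>>, String> {
    // Index of the next frame the decoder will produce.
    let mut decoded_pos = 0usize;
    let mut out = Vec::with_capacity(indices.len());
    for &target in indices {
        if target < decoded_pos {
            return Err(format!(
                "index {target} is behind decoder position {decoded_pos}"
            ));
        }
        // Skip intervening frames without retrieving their pixels.
        while decoded_pos < target {
            if !src.grab() {
                return Err(format!("grab failed at frame {decoded_pos}"));
            }
            decoded_pos += 1;
        }
        // Retrieve only the sampled frame.
        let frame = src
            .read()
            .ok_or_else(|| format!("read failed at frame {target}"))?;
        decoded_pos += 1;
        out.push(frame);
    }
    Ok(out)
}
```

For indices `[0, 3, 6, 9]` on a 10-frame source, this sketch performs 6 grabs and 4 reads and no seeks. Treating a failed `grab()` or `read()` as a hard error is a choice made for the sketch; it keeps `decoded_pos` trivially in sync with the decoder.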

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Move more media decode work off the async runtime and add borrowed/parallel Qwen image and video preprocessing paths.

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Pack TokenSpeed encoder inputs into offset SHM segments, preserve placeholder spans for faster worker handoff, and default video tensor transport to auto.

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia yechank-nvidia force-pushed the yechan/mm-video-decode-seq branch from e99ae9e to 60d6d7b Compare June 26, 2026 02:39
@yechank-nvidia yechank-nvidia marked this pull request as ready for review June 26, 2026 02:39

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60d6d7b68f

"fs",
"rt-multi-thread",
"process",
"time",

[P2] Enable Tokio io-util for pipe reads

This crate now imports tokio::io::AsyncReadExt and calls read_to_end on the ffmpeg stdout/stderr pipes, but the llm-multimodal Tokio dependency still enables only sync, fs, rt-multi-thread, process, and time. When this crate is checked or consumed without another workspace target enabling Tokio's full/io-util feature, the new extension trait is not available and video decoding builds fail; add io-util to this feature list.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 323-332: The `read_and_hash` and `decode_video_frames_from_path`
paths in `media.rs` are operating on different snapshots of the same video file,
which can make `VideoClip.raw_bytes` and `VideoClip.hash` disagree with the
decoded frames. Change the `decode` flow so both hashing and frame decoding use
one immutable file snapshot or a stable descriptor/snapshot created once, rather
than reopening `canonical` separately. Keep the fix localized around
`read_and_hash`, `decode_video_frames_from_path`, and the `tokio::try_join!`
call so the returned `VideoClip` always reflects a single consistent file
version.
- Around line 1039-1043: The missing-binary branch in the `command.spawn()`
error handling still hardcodes a `video_url inputs` message, but
`MediaConnectorError::VideoDecode` is now used by `File`, `DataUrl`, and
`InlineBytes` paths too. Update the error text in this `map_err` block to use a
source-agnostic decode message tied to the `program` being spawned, so missing
`ffmpeg`/`ffprobe` guidance applies to all decode inputs.

In `@crates/multimodal/src/vision/processors/qwen_vl_base.rs`:
- Around line 1361-1398: The new test only covers the serial resize/preprocess
path and does not verify the image-parallel branch. Add a parity test around
QwenVLProcessorBase::patchify_image_rgb_block_band that exercises the same
resize/normalize inputs and compares its output against the existing serial
patchify/preprocess path, so the parallel image fast path gets the same
regression coverage as the video path.

In `@crates/multimodal/src/vision/transforms.rs`:
- Around line 446-456: The row-threshold check in par_threads can overflow when
computing 2 * cfg.min_rows_per_thread from
par_config/SMG_MM_PREPROCESS_PAR_MIN_ROWS. Update the early-return condition to
use saturating arithmetic for that comparison so it cannot panic in debug/tests
or wrap in release, while keeping the existing behavior of returning 1 for small
workloads.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 1192-1193: The cast fallback in
TokenSpeedSchedulerServicer._feature_from_proto can trigger _tensor_from_proto
to unlink a shared packed SHM segment too early when multiple packed inputs
share the same backing file. Update the fallback path so it does not destroy
shared SHM before all packed items are read—either prevent packing when cast_to
is required on the servicer side, or change _tensor_payload_bytes_from_shm and
related SHM handling to defer unlinking until every offset/reference in the
shared segment has been materialized.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 1161-1186: Reject placeholder/item count mismatches before
constructing TokenSpeed items in the multimodal path: in the loop that builds
`pending_items`, stop relying on
`mm_placeholders_by_item.next().unwrap_or_default()` and instead validate that
`placeholders_for_items(&intermediate.placeholders, patch_offsets)` yields
exactly `item_count` groups before iteration. If the counts differ, return an
error early from the surrounding multimodal serialization flow so
`encoder_input_for_item` and `serialize_model_specific_for_item` are not used to
build partially invalid `PendingTokenSpeedItem` values.
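
The last finding above (placeholder/item count mismatch) amounts to validating the group count before pairing groups with items. A minimal sketch, assuming a hypothetical helper `pair_items_with_placeholders` and a plain `String` error in place of the gateway's real error type:

```rust
/// Pair each item index with its placeholder group, rejecting count mismatches
/// up front instead of padding missing groups with defaults.
/// The function name and error type are hypothetical, for illustration only.
fn pair_items_with_placeholders<T>(
    item_count: usize,
    groups: Vec<Vec<T>>,
) -> Result<Vec<(usize, Vec<T>)>, String> {
    if groups.len() != item_count {
        return Err(format!(
            "placeholder groups ({}) do not match item count ({item_count})",
            groups.len()
        ));
    }
    Ok(groups.into_iter().enumerate().collect())
}
```

Compared with `next().unwrap_or_default()`, a mismatch surfaces as one early error rather than as items that silently carry empty placeholder lists.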
📥 Commits

Reviewing files that changed from the base of the PR and between 37fd1ef and 60d6d7b.

📒 Files selected for processing (12)
  • crates/multimodal/Cargo.toml
  • crates/multimodal/src/media.rs
  • crates/multimodal/src/vision/processor.rs
  • crates/multimodal/src/vision/processors/qwen2_vl.rs
  • crates/multimodal/src/vision/processors/qwen3_vl.rs
  • crates/multimodal/src/vision/processors/qwen_vl_base.rs
  • crates/multimodal/src/vision/transforms.rs
  • docs/reference/configuration.md
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/test_tokenspeed_multimodal_shm.py
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment on lines +323 to +332
let read_and_hash = async {
    let bytes = Bytes::from(fs::read(&canonical).await?);
    let bytes_for_hash = bytes.clone();
    let hash = task::spawn_blocking(move || crate::hasher::hash_video(&bytes_for_hash))
        .await
        .map_err(MediaConnectorError::Blocking)?;
    Ok::<_, MediaConnectorError>((bytes, hash))
};
let decode = decode_video_frames_from_path(&canonical, input_bytes, None, cfg);
let ((bytes, hash), decoded) = tokio::try_join!(read_and_hash, decode)?;

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Decode and hash the same file snapshot.

Lines 323-332 read/hash canonical in one task while decoding the path in another. If that file is modified between those operations, the returned VideoClip can contain frames from one version and raw_bytes/hash from another. Use one immutable snapshot for both operations, or decode through a stable descriptor/snapshot instead of reopening the path twice.

Based on crates/multimodal/src/types.rs, VideoClip stores decoded frames alongside the original raw_bytes and hash, so those fields need to describe the same content.
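
One way to read the "single immutable snapshot" option is to read the file once and derive both the hash and the decoded frames from those same bytes. The sketch below assumes this reading; `Clip`, `hash_bytes`, `decode_from_bytes`, and `load_clip` are hypothetical names, `DefaultHasher` stands in for the crate's video hasher, and the "decoder" is a placeholder that counts 4-byte chunks.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

/// Hypothetical, simplified analogue of a clip: raw bytes, their hash, decoded result.
struct Clip {
    raw_bytes: Vec<u8>,
    hash: u64,
    frame_count: usize,
}

/// Stand-in hash; the real crate uses its own video hasher.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Placeholder decoder: treats every 4 bytes as one frame.
fn decode_from_bytes(bytes: &[u8]) -> usize {
    bytes.len() / 4
}

/// Read the path exactly once, then hash and decode the same in-memory snapshot.
fn load_clip(path: &Path) -> io::Result<Clip> {
    let raw_bytes = std::fs::read(path)?; // the only read of the path
    let hash = hash_bytes(&raw_bytes);
    let frame_count = decode_from_bytes(&raw_bytes);
    Ok(Clip { raw_bytes, hash, frame_count })
}
```

The trade-off is that decoding waits for the full read instead of overlapping with it, which may give back some of the latency the concurrent `try_join!` was meant to save. Decoding through an already-open descriptor is the other option the comment names.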

Comment on lines +1039 to 1043
command.spawn().map_err(|e| {
    if e.kind() == std::io::ErrorKind::NotFound {
        MediaConnectorError::VideoDecode(format!(
            "{program} executable not found; install {program} to decode video_url inputs"
        ))

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Generalize the missing-binary error message.

Line 1042 still says video_url inputs, but this helper now backs File, DataUrl, and InlineBytes decode paths too. When ffmpeg/ffprobe is missing, that error points operators at the wrong source type.

Suggested fix
-                "{program} executable not found; install {program} to decode video_url inputs"
+                "{program} executable not found; install {program} to decode video inputs"
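
For context, a standalone version of this error mapping might look like the sketch below. `DecodeError` and `spawn_tool` are hypothetical stand-ins for `MediaConnectorError` and the crate's helper, and the wording of the non-`NotFound` branch is an assumption, since the review excerpt does not show it.

```rust
use std::io;
use std::process::{Child, Command, Stdio};

/// Hypothetical stand-in for the crate's error type.
#[derive(Debug)]
enum DecodeError {
    VideoDecode(String),
}

/// Spawn an external tool, mapping a missing binary to a source-agnostic message.
fn spawn_tool(program: &str, args: &[&str]) -> Result<Child, DecodeError> {
    Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .map_err(|e| {
            if e.kind() == io::ErrorKind::NotFound {
                DecodeError::VideoDecode(format!(
                    "{program} executable not found; install {program} to decode video inputs"
                ))
            } else {
                // Assumed wording for other spawn failures; not shown in the PR excerpt.
                DecodeError::VideoDecode(format!("failed to start {program}: {e}"))
            }
        })
}
```

Because the message names only the program, it reads correctly whichever input type reached the decoder.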

Comment on lines +1361 to +1398
#[test]
fn test_preprocess_image_matches_tensor_patchify_with_resize() {
    let processor = QwenVLProcessorBase::new(create_video_test_config());
    let config = PreProcessorConfig {
        image_mean: Some(processor.default_mean().to_vec()),
        image_std: Some(processor.default_std().to_vec()),
        ..Default::default()
    };
    let image = create_sized_pattern_frame(7, 9, 3);
    let (target_h, target_w) = processor.smart_resize(9, 7).unwrap();
    assert!(
        (target_w as u32, target_h as u32) != (7u32, 9u32),
        "test must force a resize; target {target_w}x{target_h} should differ from 7x9"
    );

    let result = processor
        .preprocess(std::slice::from_ref(&image), &config)
        .unwrap();
    let actual = result.encoder_input.as_slice_memory_order().unwrap();

    let resized = resize_bicubic_pil(&image, target_w as u32, target_h as u32);
    let tensor =
        to_tensor_and_normalize(&resized, &processor.default_mean(), &processor.default_std());
    let (grid_t, grid_h, grid_w) = processor.calculate_grid_thw(target_h, target_w, 1);
    let mut expected = Vec::new();
    processor
        .patchify_into(&tensor, grid_t, grid_h, grid_w, &mut expected)
        .unwrap();

    assert_eq!(actual.len(), expected.len());
    for (idx, (&got, &want)) in actual.iter().zip(expected.iter()).enumerate() {
        assert_eq!(
            got.to_bits(),
            want.to_bits(),
            "image patch value differs at index {idx}: got {got}, want {want}"
        );
    }
}

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an image-parallel parity test.

This new test validates the resize path, but it stays on the serial branch. The video fast path already has a parallel-block regression guard; patchify_image_rgb_block_band should get the same coverage.
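
The shape of such a parity test can be sketched independently of the processor: run the same LUT normalization serially and in row bands across threads, then compare the outputs bit for bit. Everything here is a simplified assumption. `build_lut`, `normalize_serial`, and `normalize_parallel` are hypothetical helpers with single-channel statistics, not `patchify_image_rgb_block_band` itself.

```rust
use std::thread;

/// Build a 256-entry table mapping a u8 value to its normalized f32 value.
fn build_lut(mean: f32, std: f32) -> [f32; 256] {
    let mut lut = [0.0f32; 256];
    for (i, v) in lut.iter_mut().enumerate() {
        *v = (i as f32 / 255.0 - mean) / std;
    }
    lut
}

/// Serial reference path.
fn normalize_serial(src: &[u8], lut: &[f32; 256]) -> Vec<f32> {
    src.iter().map(|&b| lut[b as usize]).collect()
}

/// Band-parallel path: split the buffer into bands of whole rows, one scoped thread each.
fn normalize_parallel(src: &[u8], lut: &[f32; 256], row_bytes: usize, threads: usize) -> Vec<f32> {
    assert!(row_bytes > 0 && threads > 0);
    let mut out = vec![0.0f32; src.len()];
    let rows = src.len() / row_bytes;
    let rows_per_band = ((rows + threads - 1) / threads).max(1);
    let band = rows_per_band * row_bytes;
    thread::scope(|s| {
        for (dst, chunk) in out.chunks_mut(band).zip(src.chunks(band)) {
            s.spawn(move || {
                for (d, &b) in dst.iter_mut().zip(chunk) {
                    *d = lut[b as usize];
                }
            });
        }
    });
    out
}

/// Bit-exact comparison, mirroring the `to_bits` check used in the test above.
fn bit_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```

A real test would also need an input large enough to cross the parallel thresholds, otherwise the processor would still take the serial branch.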

Comment on lines 446 to +456
 pub(crate) fn par_threads(out_bytes: usize, out_rows: usize) -> usize {
-    const PAR_MIN_BYTES: usize = 1 << 19; // ~512 KiB output; below this, serial
-    const MIN_ROWS_PER_THREAD: usize = 32; // keep enough work per thread
-    const MAX_THREADS: usize = 32; // spawning hundreds of threads costs more than it saves
-    if out_bytes < PAR_MIN_BYTES || out_rows < 2 * MIN_ROWS_PER_THREAD {
+    let cfg = par_config();
+    if out_bytes < cfg.min_bytes || out_rows < 2 * cfg.min_rows_per_thread {
         return 1;
     }
-    let avail = std::thread::available_parallelism()
-        .map(|n| n.get())
-        .unwrap_or(1);
-    (out_rows / MIN_ROWS_PER_THREAD)
+    static AVAILABLE_PARALLELISM: OnceLock<usize> = OnceLock::new();
+    let avail = *AVAILABLE_PARALLELISM
+        .get_or_init(|| std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1));
+    (out_rows / cfg.min_rows_per_thread)
         .min(avail)
-        .clamp(1, MAX_THREADS)
+        .clamp(1, cfg.max_threads)

🩺 Stability & Availability | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

sed -n '1,260p' crates/multimodal/src/vision/transforms.rs && printf '\n--- CUT ---\n' && sed -n '260,520p' crates/multimodal/src/vision/transforms.rs

Repository: lightseekorg/smg

Length of output: 18396


🏁 Script executed:

python3 - <<'PY'
from pathlib import Path
p = Path('crates/multimodal/src/vision/transforms.rs')
text = p.read_text()
for needle in ['par_config', 'env_usize', 'par_threads', 'SMG_MM_PREPROCESS_PAR_MIN_ROWS']:
    print(f'\n### {needle}')
    idx = text.find(needle)
    if idx == -1:
        print('not found')
        continue
    start = max(0, text.rfind('\n', 0, idx-400))
    end = min(len(text), text.find('\n', idx+1200) if text.find('\n', idx+1200)!=-1 else len(text))
    print(text[start:end])
PY

Repository: lightseekorg/smg

Length of output: 6854


🏁 Script executed:

git grep -n "pub(crate) fn par_threads\|fn par_config\|env_usize\|SMG_MM_PREPROCESS_PAR_MIN_ROWS" -- crates/multimodal/src/vision/transforms.rs

Repository: lightseekorg/smg

Length of output: 838


Use saturating arithmetic for the row threshold.

cfg.min_rows_per_thread comes from SMG_MM_PREPROCESS_PAR_MIN_ROWS, so 2 * cfg.min_rows_per_thread can overflow here and panic in debug/tests or wrap in release.

Suggested fix
-    if out_bytes < cfg.min_bytes || out_rows < 2 * cfg.min_rows_per_thread {
+    if out_bytes < cfg.min_bytes
+        || out_rows < cfg.min_rows_per_thread.saturating_mul(2)
+    {
         return 1;
     }
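
A self-contained variant of the suggested check is sketched below so the overflow case can be exercised directly. `ParConfig` and the explicit `cfg`/`avail` parameters are assumptions made to keep the sketch free of globals and environment reads. The two `.max(1)` guards against a zero divisor and a zero thread cap are additions of this sketch, not part of the suggestion.

```rust
/// Hypothetical mirror of the env-configured limits.
#[derive(Clone, Copy)]
struct ParConfig {
    min_bytes: usize,
    min_rows_per_thread: usize,
    max_threads: usize,
}

/// Thread count for a resize/patchify job; returns 1 for small workloads.
fn par_threads(cfg: ParConfig, avail: usize, out_bytes: usize, out_rows: usize) -> usize {
    // saturating_mul caps at usize::MAX instead of panicking (debug) or wrapping (release).
    if out_bytes < cfg.min_bytes || out_rows < cfg.min_rows_per_thread.saturating_mul(2) {
        return 1;
    }
    // Extra guards (assumptions of this sketch): avoid division by zero and clamp(1, 0).
    let rows_per_thread = cfg.min_rows_per_thread.max(1);
    (out_rows / rows_per_thread)
        .min(avail)
        .clamp(1, cfg.max_threads.max(1))
}
```

With `min_rows_per_thread` set to `usize::MAX / 2 + 1`, the plain `2 * ...` form overflows, while this version saturates and takes the serial path.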

Comment on lines +1192 to 1193
if cast_to is not None and dtype != cast_to:
    return TokenSpeedSchedulerServicer._tensor_from_proto(tensor_data, cast_to=cast_to)

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Do not unlink shared packed SHM segments during cast fallback.

When packed encoder inputs share one SHM file, a dtype mismatch makes _feature_from_proto materialize via _tensor_from_proto; _tensor_payload_bytes_from_shm can then unlink the shared segment after the first item, causing later offsets in the same segment to fail. Either avoid packing when a servicer-side cast is required, or make the fallback read all shared handles before unlinking/refcount the segment.

Also applies to: 1239-1246
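
The "refcount the segment" option can be illustrated in Rust, the language of most of this PR, although the servicer itself is Python. In this sketch a regular temp file stands in for the SHM segment, and `Segment`, `PackedItem`, and `open_items` are hypothetical names. The backing file is removed only when the last item that references it has been materialized and dropped.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;
use std::sync::Arc;

/// Owns the backing file; removes it when the last reference goes away.
struct Segment {
    path: PathBuf,
}

impl Drop for Segment {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path); // best-effort unlink
    }
}

/// One packed tensor: a shared segment plus its byte range.
struct PackedItem {
    segment: Arc<Segment>,
    offset: usize,
    nbytes: usize,
}

impl PackedItem {
    /// Read this item's bytes, then release its reference to the segment.
    fn materialize(self) -> io::Result<Vec<u8>> {
        let mut file = File::open(&self.segment.path)?;
        file.seek(SeekFrom::Start(self.offset as u64))?;
        let mut buf = vec![0u8; self.nbytes];
        file.read_exact(&mut buf)?;
        Ok(buf)
        // `self` is dropped here; the file is unlinked only if this was the last item.
    }
}

/// Create one item per (offset, nbytes) span, all sharing the same segment.
fn open_items(path: PathBuf, spans: &[(usize, usize)]) -> Vec<PackedItem> {
    let segment = Arc::new(Segment { path });
    spans
        .iter()
        .map(|&(offset, nbytes)| PackedItem {
            segment: Arc::clone(&segment),
            offset,
            nbytes,
        })
        .collect()
}
```

The same idea in the servicer would mean tracking outstanding offsets per segment name and unlinking only after the last one is read. Whether that or disabling packing when a cast is required is the better fit is left to the PR author.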

Comment on lines +1161 to +1186
+    let mut mm_placeholders_by_item =
+        placeholders_for_items(&intermediate.placeholders, patch_offsets).into_iter();
+    let mut pending_items: Vec<PendingTokenSpeedItem<'_>> = Vec::with_capacity(item_count);
     for item_index in 0..item_count {
         let item_encoder_input = match encoder_input_for_item(
             &intermediate.preprocessed,
             &intermediate.field_layouts,
             &flat_spans,
             item_index,
         ) {
             Ok(value) => value,
-            Err(error) => {
-                cleanup_tokenspeed_items_encoder_shm(&items, None);
-                return Err(error);
-            }
+            Err(error) => return Err(error),
         };
-        let encoder_input_started = Instant::now();
-        let encoder_input = serialize_array_as_tokenspeed_tensor(
-            &item_encoder_input,
-            &encoder_input_dtype,
-            shm_enabled,
-        );
-        let encoder_input_serialize_ms = encoder_input_started.elapsed().as_secs_f64() * 1000.0;
-        let model_specific_started = Instant::now();
+        let model_specific_started = log_timing.then(Instant::now);
         let model_specific_tensors = match serialize_model_specific_for_item(
             &intermediate.preprocessed.model_specific,
             &intermediate.field_layouts,
             &flat_spans,
             item_index,
         ) {
             Ok(value) => value,
-            Err(error) => {
-                // `encoder_input` (possibly SHM) was created for this item but the
-                // item isn't built; clean it plus all prior items.
-                cleanup_tokenspeed_items_encoder_shm(&items, Some(&encoder_input));
-                return Err(error);
-            }
+            Err(error) => return Err(error),
         };
-        let model_specific_serialize_ms = model_specific_started.elapsed().as_secs_f64() * 1000.0;
-        let mm_placeholders =
-            placeholders_for_item(item_index, &intermediate.placeholders, &patch_offsets);
-        let content_hash = content_hash_for_item(intermediate.modality, &intermediate, item_index);
+        let model_specific_serialize_ms =
+            model_specific_started.map(|started| started.elapsed().as_secs_f64() * 1000.0);
+        let mm_placeholders = mm_placeholders_by_item.next().unwrap_or_default();

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject placeholder/item count mismatches before building TokenSpeed items.

unwrap_or_default() silently emits an item with empty placeholders when placeholders_for_items(...) returns fewer groups than item_count; the TokenSpeed servicer rejects empty placeholder lists, and extra groups would be dropped. Validate the group count before iterating.

Suggested fix
-    let mut mm_placeholders_by_item =
-        placeholders_for_items(&intermediate.placeholders, patch_offsets).into_iter();
+    let mm_placeholders_by_item_vec =
+        placeholders_for_items(&intermediate.placeholders, patch_offsets);
+    anyhow::ensure!(
+        mm_placeholders_by_item_vec.len() == item_count,
+        "TokenSpeed placeholder item count mismatch: placeholders={}, items={item_count}",
+        mm_placeholders_by_item_vec.len(),
+    );
+    let mut mm_placeholders_by_item = mm_placeholders_by_item_vec.into_iter();
@@
-        let mm_placeholders = mm_placeholders_by_item.next().unwrap_or_default();
+        let mm_placeholders = mm_placeholders_by_item
+            .next()
+            .expect("placeholder item count validated above");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    let mm_placeholders_by_item_vec =
        placeholders_for_items(&intermediate.placeholders, patch_offsets);
    anyhow::ensure!(
        mm_placeholders_by_item_vec.len() == item_count,
        "TokenSpeed placeholder item count mismatch: placeholders={}, items={item_count}",
        mm_placeholders_by_item_vec.len(),
    );
    let mut mm_placeholders_by_item = mm_placeholders_by_item_vec.into_iter();
    let mut pending_items: Vec<PendingTokenSpeedItem<'_>> = Vec::with_capacity(item_count);
    for item_index in 0..item_count {
        let item_encoder_input = match encoder_input_for_item(
            &intermediate.preprocessed,
            &intermediate.field_layouts,
            &flat_spans,
            item_index,
        ) {
            Ok(value) => value,
            Err(error) => return Err(error),
        };
        let model_specific_started = log_timing.then(Instant::now);
        let model_specific_tensors = match serialize_model_specific_for_item(
            &intermediate.preprocessed.model_specific,
            &intermediate.field_layouts,
            &flat_spans,
            item_index,
        ) {
            Ok(value) => value,
            Err(error) => return Err(error),
        };
        let model_specific_serialize_ms =
            model_specific_started.map(|started| started.elapsed().as_secs_f64() * 1000.0);
        let mm_placeholders = mm_placeholders_by_item
            .next()
            .expect("placeholder item count validated above");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 1161 - 1186,
Reject placeholder/item count mismatches before constructing TokenSpeed items in
the multimodal path: in the loop that builds `pending_items`, stop relying on
`mm_placeholders_by_item.next().unwrap_or_default()` and instead validate that
`placeholders_for_items(&intermediate.placeholders, patch_offsets)` yields
exactly `item_count` groups before iteration. If the counts differ, return an
error early from the surrounding multimodal serialization flow so
`encoder_input_for_item` and `serialize_model_specific_for_item` are not used to
build partially invalid `PendingTokenSpeedItem` values.
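
The up-front check can also be illustrated in isolation. The following is a self-contained sketch using only the standard library; `validated_groups` and `pair_items` are hypothetical helpers, and `String` errors stand in for the `anyhow` errors used in the suggestion.

```rust
/// Hypothetical helper: accept the per-item placeholder groups only when
/// there is exactly one group per item.
fn validated_groups<T>(groups: Vec<Vec<T>>, item_count: usize) -> Result<Vec<Vec<T>>, String> {
    if groups.len() != item_count {
        return Err(format!(
            "placeholder item count mismatch: placeholders={}, items={item_count}",
            groups.len()
        ));
    }
    Ok(groups)
}

/// Pair each item index with its placeholder group, failing before any
/// item is built if the counts disagree.
fn pair_items<T>(groups: Vec<Vec<T>>, item_count: usize) -> Result<Vec<(usize, Vec<T>)>, String> {
    let groups = validated_groups(groups, item_count)?;
    Ok(groups.into_iter().enumerate().collect())
}
```

Because the mismatch is rejected before the loop, neither a missing group (which `unwrap_or_default()` would turn into an empty placeholder list) nor a surplus group (which would be dropped) can reach the servicer.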


Labels

dependencies (Dependency updates), documentation (Improvements or additions to documentation), grpc (gRPC client and router changes), model-gateway (Model gateway crate changes), multimodal (Multimodal crate changes), tests (Test changes)
