
cp: fix: resolve VLM CI failures for PP recipes and collate_fn (1799) into r0.4.0 #1889

Open

svcnvidia-nemo-ci wants to merge 2 commits into r0.4.0 from cherry-pick-1799-r0.4.0

Conversation

@svcnvidia-nemo-ci
Contributor

beep boop [🤖]: Hi @HuiyingLi 👋,

we've cherry-picked #1799 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

* fix: resolve VLM CI failures for PP recipes and collate_fn

- Add drop_last, collate_fn, and num_workers to validation dataloaders for
  kimi25vl_medpix, qwen3_vl_moe_235b, and qwen3_5_moe_medpix PP recipes
- Fix qwen3_5_moe_medpix model ID placeholder to Qwen/Qwen3.5-397B-A17B
- Reduce qwen3_5_35b_neat_packing packed_sequence lengths from 4096 to 2048
  to fix OOM
- Make drop_overlong optional (default False) in default_collate_fn to prevent
  batch size reduction that crashes PP schedules
- Add ffmpeg to Docker image for torchcodec audio/video decoding
- Update max_length to 2048 across VLM PP recipes
- Add unit tests for drop_overlong behavior

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
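The drop_overlong change in the bullet list above can be sketched as follows. This is a minimal illustration of the described behavior, not the repo's default_collate_fn; the sample representation is made up:

```python
# Minimal sketch of the drop_overlong behavior (illustrative, not the repo's
# default_collate_fn). With drop_overlong=False (the new default), overlong
# samples are truncated rather than dropped, so the batch size stays constant
# and PP schedules see the microbatch count they expect.
def collate(samples, max_length, drop_overlong=False):
    if drop_overlong:
        # Old behavior: dropping shrinks the batch, which crashes PP
        # schedules that expect a fixed number of microbatches.
        return [s for s in samples if len(s) <= max_length]
    # New default: truncate to max_length, preserving len(samples).
    return [s[:max_length] for s in samples]

kept = collate([[1, 2, 3], [1, 2, 3, 4, 5]], max_length=4)
dropped = collate([[1, 2, 3], [1, 2, 3, 4, 5]], max_length=4, drop_overlong=True)
```

Note how the second call returns one sample instead of two, which is exactly the batch-size reduction the commit works around.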

* chore: remove Dockerfile change from this PR

The ffmpeg install belongs in a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: make kimi_k25_vl_collate_fn truncate instead of drop overlong samples

Align kimi_k25_vl_collate_fn with default_collate_fn by adding
drop_overlong parameter (default False). With PP, dropping samples
reduces batch size below what the schedule expects, causing
"Expecting N arg_mbs but got M" errors. Truncation preserves batch
size while respecting max_length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: reduce kimi25vl max_length from 2048 to 1024

Lower sequence length to fit within GPU memory budget for
pp_size=8, ep_size=32, local_batch_size=8 configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: delegate truncation to processor in kimi_k25_vl_collate_fn

Address Claude review: instead of slicing input_ids after image token
expansion (which breaks pixel_values/grid_thws consistency), pass
truncation=True and max_length to the processor so it truncates text
tokens before expansion. This keeps image data in sync with tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
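A toy illustration of why truncating before image-token expansion is safer than slicing afterwards (the expansion helper and token counts here are hypothetical, not the repo's processor):

```python
# Hypothetical illustration: slicing input_ids after image-token expansion can
# cut the expanded image region in half, leaving fewer image tokens than
# pixel_values provides; truncating the text tokens first keeps them in sync.
IMG = -1  # stands in for the media token id

def expand(tokens, tokens_per_image):
    # Each IMG placeholder expands to tokens_per_image image tokens.
    out = []
    for t in tokens:
        out.extend([IMG] * tokens_per_image if t == IMG else [t])
    return out

# Truncate after expansion: the image region gets cut mid-way.
after = expand([1, 2, IMG, 3], tokens_per_image=4)[:5]
# Truncate before expansion (what delegating to the processor achieves):
# the image region survives whole.
before = expand([1, 2, IMG, 3][:3], tokens_per_image=4)
```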

* fix: add post-expansion truncation in kimi_k25_vl_collate_fn

Address Claude review: image token expansion can push sequences past
max_length even after processor truncation. Add post-expansion
truncation for the default (non-drop) path to ensure consistent
sequence lengths for torch.stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
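A minimal sketch of the post-expansion length normalization, assuming pad-and-truncate semantics (not the repo's code): torch.stack requires every row to have the same length, so the non-drop path brings all sequences to a common max_length after expansion.

```python
# Illustrative only: after image-token expansion, sequences may differ in
# length; torch.stack needs equal-length rows, so overlong rows are truncated
# and short rows padded to max_length.
PAD = 0
max_length = 4
expanded = [[1, 1, 1, 1, 1, 1], [1, 1, 1]]
rows = [(s + [PAD] * max_length)[:max_length] for s in expanded]
```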

* fix: drop image data when truncation cuts into image token region

Address Claude review: after post-expansion truncation, verify that
all expected image tokens survived. If truncation cut into the image
region, skip pixel_values/grid_thws for that sample to avoid
mismatch in the model forward pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: replace orphaned image tokens and add truncation-into-image test

Address Claude review: when truncation cuts into the image token
region and image data is dropped, replace remaining media_token_id
tokens with pad_token_id to prevent the model from looking for
missing pixel data. Add test covering this path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
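The two commits above combine into one invariant, sketched here with hypothetical token ids and a made-up helper (not the repo's function names): if truncation removed any expected image tokens, drop the image tensors for that sample and replace the surviving, now-orphaned image tokens with pad tokens.

```python
# Hypothetical sketch: verify all expected image tokens survived truncation;
# if not, signal that pixel_values/grid_thws should be skipped for this
# sample and neutralize the orphaned image tokens with pad_token_id so the
# model does not look for missing pixel data.
IMG, PAD = -1, 0  # stand-ins for media_token_id and pad_token_id

def fix_sample(tokens, expected_image_tokens):
    keep_image = tokens.count(IMG) == expected_image_tokens
    if not keep_image:
        tokens = [PAD if t == IMG else t for t in tokens]
    return tokens, keep_image

tokens, keep = fix_sample([1, 2, IMG, IMG], expected_image_tokens=4)
```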

* test: assert truncation=True in default_collate_fn max_length test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: inline vision processing in Qwen3VLMoeModel to fix image token mismatch

Replace super().forward(input_ids=None) with inline vision feature
processing. When input_ids=None, HF's get_placeholder_mask falls back
to bfloat16 embedding comparison which produces false positives,
causing "Image features and image tokens do not match" on the 235B.
Inline processing passes the original input_ids for reliable integer
comparison.

Also fix video token ID (151652 -> 151656) and add PP attention mask
size guard to prevent MoE routing errors on non-Stage-0 ranks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
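The core of the fix is the mask construction, sketched here with an illustrative token id (not necessarily the real one): building the image-placeholder mask by integer comparison against the original input_ids is exact, whereas comparing bfloat16 embeddings can yield false positives when distinct tokens embed to nearly equal vectors.

```python
# Illustrative sketch of the reliable integer comparison the inline vision
# path uses; the token id is an assumption, not taken from the repo.
IMAGE_TOKEN_ID = 151655

def placeholder_mask(input_ids, image_token_id=IMAGE_TOKEN_ID):
    # Exact integer match on the original input_ids; no float comparison.
    return [t == image_token_id for t in input_ids]

mask = placeholder_mask([101, 151655, 151655, 102])
```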

* fix: pass mm_token_type_ids to get_rope_index in inline vision path

The inline vision processing was passing image_grid_thw as positional
arg where get_rope_index expects mm_token_type_ids, causing shape
mismatch on 235B: "shape of the mask [2047] does not match [3]".
Use keyword arguments to pass mm_token_type_ids from kwargs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: compute expected_image_tokens across all images in grid_thws

Address Claude review: for multi-image samples, sum expected tokens
across all images in grid_thws, not just the first one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
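The multi-image fix amounts to summing over every grid rather than reading only the first. A sketch under assumed (t, h, w) grid semantics and a (2, 2) merge kernel (both illustrative):

```python
# Hypothetical sketch: for multi-image samples, the expected image-token
# count must sum the contribution of every (t, h, w) grid in grid_thws,
# each reduced by the merge-kernel area, not just the first grid.
def expected_image_tokens(grid_thws, merge_kernel=(2, 2)):
    kh, kw = merge_kernel
    return sum((t * h * w) // (kh * kw) for t, h, w in grid_thws)

n = expected_image_tokens([(1, 4, 4), (1, 8, 4)])  # two images in one sample
```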

* fix: resolve linting errors in qwen3_vl_moe tests

Fix ruff I001 (import sort), F401 (unused import), F841 (unused variable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: auto-format collate_fns.py for ruff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add CPU-runnable coverage for qwen3_vl_moe inline vision and PP guard

Move the module-level CUDA skip marker onto the classes that actually
instantiate the full model, then add two new CPU-runnable test classes
(TestQwen3VLMoeModelInlineVisionCpu, TestQwen3VLMoeForConditionalGenerationPpGuardCpu)
that invoke forward() as an unbound method with a MagicMock self.

This lets codecov/patch see coverage of the new inline vision branches
(single-modality, merged images+videos, text-only, dict attention mask)
and the PP attention-mask guard / chunked pixel_values dispatch without
requiring GPU, which was blocking PR merge at 58% patch coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
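The testing technique described above can be sketched in miniature; the class here is illustrative, not the repo's model: calling forward through the class (unbound) with a MagicMock standing in for self exercises the method's branches without instantiating the full model on a GPU.

```python
# Sketch of the CPU-only test technique: invoke a method as a plain function
# with a MagicMock as self, so no real (GPU-sized) model is ever built.
from unittest.mock import MagicMock

class Model:  # illustrative stand-in, not Qwen3VLMoeModel
    def forward(self, x):
        # reads an attribute off self, which the mock provides
        return self.scale * x

fake_self = MagicMock()
fake_self.scale = 3
out = Model.forward(fake_self, 7)  # unbound call with a mocked self
```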

* refactor: extract _DEFAULT_MERGE_KERNEL to avoid silent drift

Replace the hardcoded (2, 2) merge kernel in kimi_k25_vl_collate_fn's
post-expansion truncation check with a module-level _DEFAULT_MERGE_KERNEL
constant, and reference the same constant from _expand_image_tokens' default.
This ensures both computations stay in sync if the vision-tower merge kernel
ever changes.

Addresses Claude review feedback on commit 86c5535.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test e313626

@svcnvidia-nemo-ci added the cherry-pick, Run CICD, and Trigger Testing CICD labels on Apr 17, 2026
@copy-pr-bot

copy-pr-bot bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa
Contributor

akoumpa commented Apr 19, 2026

/ok to test c60de72
