cp: fix: resolve VLM CI failures for PP recipes and collate_fn (1799) into r0.4.0 #1889
Open
svcnvidia-nemo-ci wants to merge 2 commits into r0.4.0 from
Conversation
* fix: resolve VLM CI failures for PP recipes and collate_fn
  - Add drop_last, collate_fn, and num_workers to validation dataloaders for kimi25vl_medpix, qwen3_vl_moe_235b, and qwen3_5_moe_medpix PP recipes
  - Fix qwen3_5_moe_medpix model ID placeholder to Qwen/Qwen3.5-397B-A17B
  - Reduce qwen3_5_35b_neat_packing packed_sequence lengths from 4096 to 2048 to fix OOM
  - Make drop_overlong optional (default False) in default_collate_fn to prevent the batch-size reduction that crashes PP schedules
  - Add ffmpeg to Docker image for torchcodec audio/video decoding
  - Update max_length to 2048 across VLM PP recipes
  - Add unit tests for drop_overlong behavior
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove Dockerfile change from this PR
  The ffmpeg install belongs in a separate PR.
* fix: make kimi_k25_vl_collate_fn truncate instead of drop overlong samples
  Align kimi_k25_vl_collate_fn with default_collate_fn by adding a drop_overlong parameter (default False). With PP, dropping samples reduces the batch size below what the schedule expects, causing "Expecting N arg_mbs but got M" errors. Truncation preserves batch size while respecting max_length.
* fix: reduce kimi25vl max_length from 2048 to 1024
  Lower the sequence length to fit within the GPU memory budget for the pp_size=8, ep_size=32, local_batch_size=8 configuration.
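The drop_overlong semantics described above can be sketched in a few lines. This is a hypothetical illustration (plain Python lists, not the actual NeMo collate API): with pipeline parallelism every microbatch must arrive at its expected size, so the default (`drop_overlong=False`) truncates overlong samples rather than dropping them.

```python
# Illustrative sketch of the drop_overlong switch, not the real collate_fn.
def collate(samples, max_length, drop_overlong=False):
    """samples: list of token-id lists."""
    if drop_overlong:
        # Dropping shrinks the batch; a PP schedule expecting N microbatch
        # args then sees M < N and crashes ("Expecting N arg_mbs but got M").
        return [s for s in samples if len(s) <= max_length]
    # Truncation keeps the batch size stable for the PP schedule.
    return [s[:max_length] for s in samples]

batch = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(len(collate(batch, max_length=4)))                       # 2
print(len(collate(batch, max_length=4, drop_overlong=True)))   # 1
```

With truncation the second sample loses a token but the batch size the PP schedule was configured for is preserved.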
* fix: delegate truncation to processor in kimi_k25_vl_collate_fn
  Address Claude review: instead of slicing input_ids after image token expansion (which breaks pixel_values/grid_thws consistency), pass truncation=True and max_length to the processor so it truncates text tokens before expansion. This keeps image data in sync with tokens.
* fix: add post-expansion truncation in kimi_k25_vl_collate_fn
  Address Claude review: image token expansion can push sequences past max_length even after processor truncation. Add post-expansion truncation for the default (non-drop) path to ensure consistent sequence lengths for torch.stack.
* fix: drop image data when truncation cuts into image token region
  Address Claude review: after post-expansion truncation, verify that all expected image tokens survived. If truncation cut into the image region, skip pixel_values/grid_thws for that sample to avoid a mismatch in the model forward pass.
* fix: replace orphaned image tokens and add truncation-into-image test
  Address Claude review: when truncation cuts into the image token region and image data is dropped, replace remaining media_token_id tokens with pad_token_id to prevent the model from looking for missing pixel data. Add a test covering this path.
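The post-expansion truncation logic above can be illustrated with a minimal sketch (the `IMG`/`PAD` ids and function name are made up, not the real kimi_k25_vl_collate_fn): after truncating to max_length, the sample keeps its pixel data only if every expected image token survived; otherwise the orphaned image tokens are padded out so the model never looks for missing pixels.

```python
# Illustrative sketch of truncation-into-image handling; ids are invented.
IMG, PAD = 9, 0

def truncate_expanded(ids, expected_img_tokens, max_length):
    ids = ids[:max_length]
    keep_image = ids.count(IMG) == expected_img_tokens
    if not keep_image:
        # Truncation cut into the image region: orphaned image tokens become
        # padding, and the caller skips pixel_values/grid_thws for the sample.
        ids = [PAD if t == IMG else t for t in ids]
    return ids, keep_image

# Image region survives intact:
print(truncate_expanded([1, IMG, IMG, 2, 3], 2, 4))   # ([1, 9, 9, 2], True)
# Truncation cuts the image region:
print(truncate_expanded([1, IMG, IMG, 2], 2, 2))      # ([1, 0], False)
```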
* test: assert truncation=True in default_collate_fn max_length test
* fix: inline vision processing in Qwen3VLMoeModel to fix image token mismatch
  Replace super().forward(input_ids=None) with inline vision feature processing. When input_ids=None, HF's get_placeholder_mask falls back to a bfloat16 embedding comparison which produces false positives, causing "Image features and image tokens do not match" on the 235B. Inline processing passes the original input_ids for reliable integer comparison. Also fix the video token ID (151652 -> 151656) and add a PP attention mask size guard to prevent MoE routing errors on non-Stage-0 ranks.
* fix: pass mm_token_type_ids to get_rope_index in inline vision path
  The inline vision processing was passing image_grid_thw as a positional arg where get_rope_index expects mm_token_type_ids, causing a shape mismatch on the 235B: "shape of the mask [2047] does not match [3]". Use keyword arguments to pass mm_token_type_ids from kwargs.
* fix: compute expected_image_tokens across all images in grid_thws
  Address Claude review: for multi-image samples, sum expected tokens across all images in grid_thws, not just the first one.
* fix: resolve linting errors in qwen3_vl_moe tests
  Fix ruff I001 (import sort), F401 (unused import), F841 (unused variable).
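The multi-image token count fix above amounts to summing over every (t, h, w) entry in grid_thws rather than reading only the first. A hedged sketch, assuming a spatial merge kernel of (2, 2) as the commits describe (the function name and formula are illustrative, not the repo's exact code):

```python
# Illustrative: expected placeholder count per sample, summed over ALL
# images in grid_thws, with each image's patch grid reduced by the
# (2, 2) spatial merge kernel of the vision tower.
def expected_image_tokens(grid_thws, merge_kernel=(2, 2)):
    kh, kw = merge_kernel
    return sum((t * h * w) // (kh * kw) for t, h, w in grid_thws)

# Two images: 1x4x4 -> 4 merged tokens, 1x2x2 -> 1 merged token.
print(expected_image_tokens([(1, 4, 4), (1, 2, 2)]))  # 5
```

Summing only the first image would report 4 here, so a valid two-image sample would be flagged as truncated-into-image.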
* style: auto-format collate_fns.py for ruff
* test: add CPU-runnable coverage for qwen3_vl_moe inline vision and PP guard
  Move the module-level CUDA skip marker onto the classes that actually instantiate the full model, then add two new CPU-runnable test classes (TestQwen3VLMoeModelInlineVisionCpu, TestQwen3VLMoeForConditionalGenerationPpGuardCpu) that invoke forward() as an unbound method with a MagicMock self. This lets codecov/patch see coverage of the new inline vision branches (single-modality, merged images+videos, text-only, dict attention mask) and the PP attention-mask guard / chunked pixel_values dispatch without requiring a GPU, which was blocking PR merge at 58% patch coverage.
* refactor: extract _DEFAULT_MERGE_KERNEL to avoid silent drift
  Replace the hardcoded (2, 2) merge kernel in kimi_k25_vl_collate_fn's post-expansion truncation check with a module-level _DEFAULT_MERGE_KERNEL constant, and reference the same constant from _expand_image_tokens' default. This ensures both computations stay in sync if the vision-tower merge kernel ever changes. Addresses Claude review feedback on commit 86c5535.
--------
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
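The unbound-forward testing technique described above can be sketched as follows. This is a minimal stand-in (DemoModel is invented, not the real Qwen3VLMoeModel): calling `forward()` off the class with a MagicMock in place of `self` exercises the branch logic on CPU without ever constructing the full model.

```python
# Hedged sketch of CPU-only branch coverage via an unbound forward() call.
from unittest.mock import MagicMock

class DemoModel:
    def forward(self, input_ids, pixel_values=None):
        # Branchy dispatch logic we want coverage of, without GPU weights.
        if pixel_values is None:
            return self.text_only(input_ids)
        return self.with_vision(input_ids, pixel_values)

mock_self = MagicMock()
mock_self.text_only.return_value = "text-branch"

# Unbound call: the class attribute is a plain function, so mock_self
# stands in for the (never-instantiated) model.
out = DemoModel.forward(mock_self, input_ids=[1, 2, 3])
print(out)  # text-branch
mock_self.text_only.assert_called_once_with([1, 2, 3])
```

The same pattern lets a coverage tool see every branch of a forward() method whose real class would need a GPU to instantiate.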
Contributor (Author): /ok to test e313626
Contributor: /ok to test c60de72