
[model, feature] qwen3-omni: add packed sequence support and shared sequence utilities#4304

Open
hbhflw2000 wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
hbhflw2000:pr4_omni3_packseq_sequence_utils

Conversation


@hbhflw2000 hbhflw2000 commented Jun 11, 2026

Contributor

What does this PR do?

Add Qwen3-Omni packed sequence training support and introduce shared raw sequence padding / packed-sequence metadata utilities for the Qwen3-Omni training path.

Changelog

  • Add Qwen3-Omni pack_sequences_in_batch=True forward-step support.
  • Preserve dense CP behavior by keeping raw input_ids available for model-internal mRoPE while slicing train tensors on CP ranks.
  • Add shared raw-batch sequence padding helpers in training/utils/padding_utils.py.
  • Add shared uniform PackedSeqParams construction in training/utils/packed_seq_utils.py.
  • Follow the existing Qwen3-VL packed-padding pattern without changing Qwen3-VL code in this PR.
  • Add unit coverage for Qwen3-Omni packed sequence / CP behavior and shared sequence utilities.

Design note / RFC

This implementation follows the existing Qwen3-VL packed-padding pattern: pad raw batch sequence tensors to an aligned dense length, build uniform THD PackedSeqParams, and keep model-specific multimodal / mRoPE handling inside the Qwen3-Omni step and model code.
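
The two generic steps named above (aligning the dense length and building uniform THD metadata) can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `aligned_length`, `UniformPackedParams`, and `build_uniform_params` are hypothetical names, the alignment multiple is an assumed input, and the dataclass only imitates the kind of fields a `PackedSeqParams` object carries.

```python
from dataclasses import dataclass
from typing import List


def aligned_length(seq_len: int, multiple: int) -> int:
    """Round seq_len up to the nearest multiple (assumed alignment rule)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((seq_len + multiple - 1) // multiple) * multiple


@dataclass
class UniformPackedParams:
    """Hypothetical stand-in for PackedSeqParams; field names are illustrative."""

    qkv_format: str
    cu_seqlens_q: List[int]
    cu_seqlens_kv: List[int]
    max_seqlen_q: int
    max_seqlen_kv: int


def build_uniform_params(batch_size: int, padded_len: int) -> UniformPackedParams:
    """Every sample has the same padded length, so boundaries are evenly spaced."""
    cu_seqlens = [i * padded_len for i in range(batch_size + 1)]
    return UniformPackedParams(
        qkv_format="thd",
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_kv=list(cu_seqlens),
        max_seqlen_q=padded_len,
        max_seqlen_kv=padded_len,
    )


padded = aligned_length(1001, 8)           # 1008
params = build_uniform_params(2, padded)   # cu_seqlens_q == [0, 1008, 2016]
```

Because every sample is padded to the same length, the cumulative boundaries are just multiples of that length, which is what makes the metadata "uniform".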

This PR intentionally does not reuse slice_batch_for_context_parallel for Qwen3-Omni raw-batch padding. That utility operates after embedding preparation and slices inputs_embeds, while Qwen3-Omni needs pre-forward raw sequence normalization so the full input_ids tensor remains available for multimodal placeholder handling and mRoPE.
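
To illustrate why the raw `input_ids` are kept whole, here is a rough sketch of slicing only the training tensors per CP rank. It assumes the load-balanced two-chunk split commonly used for context parallelism (rank `r` takes chunk `r` and its mirror chunk); whether this PR uses exactly that split is an assumption, and `cp_slice` and `shard_batch_for_cp` are hypothetical helpers operating on plain lists rather than tensors.

```python
from typing import Dict, List, Sequence


def cp_slice(seq: Sequence[int], cp_size: int, cp_rank: int) -> List[int]:
    """Return chunk `cp_rank` plus its mirror chunk out of 2 * cp_size chunks."""
    n_chunks = 2 * cp_size
    if len(seq) % n_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")
    chunk = len(seq) // n_chunks
    mirror = n_chunks - 1 - cp_rank
    first = list(seq[cp_rank * chunk:(cp_rank + 1) * chunk])
    second = list(seq[mirror * chunk:(mirror + 1) * chunk])
    return first + second


def shard_batch_for_cp(
    batch: Dict[str, List[List[int]]],
    cp_size: int,
    cp_rank: int,
    keep_full: Sequence[str] = ("input_ids",),
) -> Dict[str, List[List[int]]]:
    """Slice every key per CP rank except those listed in keep_full."""
    out: Dict[str, List[List[int]]] = {}
    for key, rows in batch.items():
        if key in keep_full:
            # Kept whole so placeholder positions and mRoPE see the full sequence.
            out[key] = [list(row) for row in rows]
        else:
            out[key] = [cp_slice(row, cp_size, cp_rank) for row in rows]
    return out


batch = {"input_ids": [list(range(8))], "labels": [list(range(8))]}
rank0 = shard_batch_for_cp(batch, cp_size=2, cp_rank=0)
# rank0["labels"] == [[0, 1, 6, 7]]; rank0["input_ids"] still has all 8 tokens
```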

The shared abstraction here is intentionally narrow: compute the padded target sequence length, pad/truncate common raw batch tensors, and construct uniform THD PackedSeqParams. Model-specific logic such as multimodal merge, CP rank slicing, and mRoPE handling remains in Qwen3-Omni code.
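
The pad/truncate step can be pictured as follows. This is a sketch under stated assumptions rather than the contents of `padding_utils.py`: the key names, the pad values in `PAD_VALUES`, and the helper names `pad_or_truncate` and `normalize_batch` are all illustrative.

```python
from typing import Dict, List, Sequence

# Assumed per-key pad values; the real helpers may use different keys or values.
PAD_VALUES = {"input_ids": 0, "labels": -100, "loss_mask": 0}


def pad_or_truncate(row: Sequence[int], target_len: int, pad_value: int) -> List[int]:
    """Right-pad a row to target_len, or cut it down if it is too long."""
    if len(row) >= target_len:
        return list(row[:target_len])
    return list(row) + [pad_value] * (target_len - len(row))


def normalize_batch(
    batch: Dict[str, List[List[int]]], target_len: int
) -> Dict[str, List[List[int]]]:
    """Normalize known sequence keys; pass any other key through unchanged."""
    out = dict(batch)
    for key, pad_value in PAD_VALUES.items():
        if key in batch:
            out[key] = [pad_or_truncate(row, target_len, pad_value) for row in batch[key]]
    return out


raw = {"input_ids": [[5, 6, 7]], "labels": [[6, 7, 8]], "loss_mask": [[1, 1, 1]]}
fixed = normalize_batch(raw, target_len=4)
# fixed["labels"] == [[6, 7, 8, -100]]
```

Keeping this helper unaware of multimodal placeholders or CP ranks is what keeps the abstraction narrow: it only needs a target length and a pad value per key.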

Note: Qwen3-VL code is intentionally left unchanged in this PR. Applying these helpers back to Qwen3-VL can be considered separately with Qwen3-VL-specific regression coverage.

Validation

Unit tests:

pytest tests/unit_tests/training/utils/test_padding_utils.py tests/unit_tests/training/utils/test_packed_seq_utils.py
# 16 passed

pytest tests/unit_tests/models/qwen_omni/test_qwen3_omni_step.py tests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.py
# 27 passed

E2E validation:
4-node / 32-GPU Qwen3-Omni packed sequence full-model training passed:
Parallel config: TP=2, PP=2, CP=2, EP=4, SP=True.
Training config: seq_length=16384, global_batch_size=16, micro_batch_size=2, train_iters=200.
Result: completed 200 steps with finite loss, stable grad norm, and stable throughput.
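
For readers checking the configuration above, the derived quantities can be worked out with a small helper. This is a back-of-the-envelope sketch that assumes the usual layout where the world size equals TP × PP × CP × DP; EP is left out on the assumption that expert parallelism reuses ranks from the data-parallel dimension rather than multiplying the world size. `derive_layout` is a hypothetical name.

```python
from typing import Dict


def derive_layout(
    world_size: int,
    tp: int,
    pp: int,
    cp: int,
    seq_length: int,
    global_batch_size: int,
    micro_batch_size: int,
) -> Dict[str, int]:
    """Derive DP size, gradient accumulation steps, and token counts."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    dp = world_size // model_parallel
    samples_per_pass = micro_batch_size * dp
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp")
    return {
        "data_parallel_size": dp,
        "grad_accum_steps": global_batch_size // samples_per_pass,
        "tokens_per_cp_rank": seq_length // cp,
        "tokens_per_global_batch": seq_length * global_batch_size,
    }


layout = derive_layout(32, tp=2, pp=2, cp=2, seq_length=16384,
                       global_batch_size=16, micro_batch_size=2)
# data_parallel_size=4, grad_accum_steps=2,
# tokens_per_cp_rank=8192, tokens_per_global_batch=262144
```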


copy-pr-bot Bot commented Jun 11, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
@hbhflw2000 hbhflw2000 force-pushed the pr4_omni3_packseq_sequence_utils branch from f0f95d8 to 042eed1 Compare June 11, 2026 11:27
@yaoyu-33 yaoyu-33 added area:model Model implementations and HF bridge logic feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer labels Jun 11, 2026

Labels

area:model Model implementations and HF bridge logic community-request feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer


3 participants