feat(data): add text chat collate for unified HF datasets by yaoyu-33 · Pull Request #4307 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-06-11T16:57:10Z

Summary

Stacked on #4169. Follow-up refactor #4315 is stacked on this PR.

- Add a text-only chat collate path under the existing HF conversation/VLM dataset utilities.
- Reuse the assistant-mask helpers from #4169 (fix(data): anchor VLM assistant loss masks to chat templates) for text conversations: build_assistant_loss_mask and build_shifted_labels_and_loss_mask.
- Add text_chat / chat maker aliases and a tokenizer fallback for plain LLM checkpoints that have no AutoProcessor.
- Leave the existing model-specific VLM collates untouched.
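
For readers unfamiliar with the shifted-label convention, here is a minimal sketch of next-token labels combined with an assistant-only loss mask. It is not the #4169 helper: the function name shift_labels_and_mask, its signature, and the ignore value -100 are assumptions made for illustration.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumption: conventional ignore value, not taken from the PR


def shift_labels_and_mask(
    input_ids: List[int], assistant_mask: List[int]
) -> Tuple[List[int], List[int]]:
    """Return next-token labels and a loss mask aligned to input positions.

    Position t predicts token t+1, so both the labels and the mask are the
    originals shifted left by one; the final position has no target.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have equal length")
    labels = input_ids[1:] + [IGNORE_INDEX]
    loss_mask = assistant_mask[1:] + [0]
    # Hide targets that do not belong to an assistant turn.
    labels = [tok if keep else IGNORE_INDEX for tok, keep in zip(labels, loss_mask)]
    return labels, loss_mask


# Toy sequence: tokens 10-12 are the prompt, 20-21 the assistant reply.
labels, loss_mask = shift_labels_and_mask([10, 11, 12, 20, 21], [0, 0, 0, 1, 1])
```

In this toy case only the two positions whose next token is an assistant token contribute to the loss.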

Stack

Current order: #4169 -> #4307 -> #4315.

Validation

- ruff check on changed files
- python3 -m py_compile on changed files/tests
- git diff --check
- standalone pre-commit run --all-files

The full uv run pre-commit run --all-files was blocked on this workstation before any hooks ran, by a platform wheel availability issue for nvidia-resiliency-ext==0.6.0.

Follow-up validation requested

GPU cluster validation is in progress to compare SQuAD loss curves against #4169 using the same config and seed.


yaoyu-33 · 2026-06-11T22:19:47Z

Conflict resolved in 24c4b153416df150516ecada1ce81f017413703b.

HF dataset creation workflow after this refactor:

1. A maker function loads the HF dataset split and normalizes rows into Bridge chat schema:
   - text-only rows use messages, conversation, or legacy conversations
   - multimodal rows use processor-ready conversation plus optional media metadata
2. HFDatasetConversationProvider selects the maker, builds split-specific normalized examples, loads the HF processor/tokenizer, and wraps examples with ConversationDataset.
3. ConversationDataset owns repeat-to-target-length, optional shuffle, and binding the chosen collate implementation.
4. The collate function renders the chat template, tokenizes, builds shifted labels/loss masks, and for VLM processors builds model-specific visual inputs.
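
A rough sketch of the first and third responsibilities above (row normalization and repeat-to-target-length), for text-only rows. The helper names normalize_text_row and repeat_to_length, the column priority order, and the single messages output field are assumptions for illustration; they are not the Bridge implementation.

```python
from typing import Any, Dict, List

# Column names listed in the workflow above, in an assumed priority order.
CHAT_COLUMNS = ("messages", "conversation", "conversations")


def normalize_text_row(row: Dict[str, Any]) -> Dict[str, List[Dict[str, str]]]:
    """Copy the first chat column found into a single `messages` field."""
    for column in CHAT_COLUMNS:
        turns = row.get(column)
        if turns:
            return {"messages": list(turns)}
    raise KeyError(f"row has none of the chat columns {CHAT_COLUMNS}")


def repeat_to_length(examples: List[Any], target_length: int) -> List[Any]:
    """Cycle through `examples` until `target_length` items are produced."""
    if not examples:
        raise ValueError("cannot repeat an empty example list")
    return [examples[i % len(examples)] for i in range(target_length)]


rows = [
    {"messages": [{"role": "user", "content": "hi"}]},
    {"conversations": [{"role": "user", "content": "legacy"}]},
]
normalized = [normalize_text_row(r) for r in rows]
repeated = repeat_to_length(normalized, 5)
```

Keeping normalization separate from the repeat wrapper mirrors the split of responsibilities described above: the maker only reshapes rows, and the dataset wrapper only decides how many times they are served.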

I traced the apparent wrappers/layers while resolving the conflict. The current layers still each own a distinct responsibility: maker normalization, provider split construction, repeat/shuffle dataset wrapper, and model-specific collate. I did not remove any of them in this pass.

Additional fixes included:

- make_text_chat_dataset now accepts the legacy conversations column so it matches text_chat_collate_fn.
- Processor-to-tokenizer fallback now emits a debug log with the original AutoProcessor failure context.
- Qwen/Kimi VLM collates keep the new hf_datasets.token_utils import while preserving the base branch's chat-template kwargs forwarding.
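
The processor-to-tokenizer fallback pattern can be sketched as below. This is an illustration only: load_processor_or_tokenizer and its injected loader arguments are hypothetical names, and the actual code in this PR may be structured differently.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def load_processor_or_tokenizer(
    path: str,
    load_processor: Callable[[str], Any],
    load_tokenizer: Callable[[str], Any],
) -> Any:
    """Try the processor loader first; fall back to the tokenizer loader."""
    try:
        return load_processor(path)
    except Exception as exc:  # broad on purpose: loaders can fail in many ways
        # Keep the original failure in the debug log so it is not lost.
        logger.debug(
            "Processor load failed for %s; falling back to tokenizer",
            path,
            exc_info=exc,
        )
        return load_tokenizer(path)
```

With Hugging Face Transformers, the two callables would presumably be AutoProcessor.from_pretrained and AutoTokenizer.from_pretrained; they are injected here so the sketch runs without downloading a checkpoint.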

Validation:

- Targeted static check on touched files: passed.
- Internal focused unit validation: uv run --no-sync python -m pytest tests/unit_tests/data/hf_datasets tests/unit_tests/data/vlm_datasets/test_collate.py -v -> 50 passed.


svcnvidia-nemo-ci · 2026-06-12T00:00:14Z

/nvskills-ci


svcnvidia-nemo-ci · 2026-06-12T00:28:31Z

/nvskills-ci

yaoyu-33 · 2026-06-12T01:15:59Z

            return False
        return "pack_sequences" in inspect.signature(selected_impl).parameters

    def _get_maker(self) -> Callable[..., List[Dict[str, Any]]]:


This is no longer needed, since we now call get_hf_dataset_maker directly.
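
For context, the signature check in the snippet above follows a common pattern: only pass an optional keyword to a callable that declares it. A standalone sketch, where accepts_keyword and the two collate stubs are hypothetical names:

```python
import inspect
from typing import Any, Callable


def accepts_keyword(fn: Callable[..., Any], name: str) -> bool:
    """Return True when `fn` declares a parameter called `name`."""
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some builtins and C extensions expose no introspectable signature.
        return False
    return name in params


def collate_with_packing(examples, processor, pack_sequences=False):  # hypothetical
    return examples


def collate_plain(examples, processor):  # hypothetical
    return examples
```

Note that a callable taking only **kwargs would report False here, because the check looks for an explicitly named parameter.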


svcnvidia-nemo-ci · 2026-06-12T01:21:17Z

/nvskills-ci


svcnvidia-nemo-ci · 2026-06-12T02:12:25Z

/nvskills-ci

yaoyu-33 · 2026-06-12T04:28:51Z

GPU cluster validation update for da973fe1:

- Job 12743499: completed 0:0.
  - Provider/text-collate regression: 12 passed.
  - LLM SQuAD smoke covered non-packed, online in-batch packing, and offline packing paths for 20 train iterations each.
- Job 12743560: completed targeted shared-helper regression, 1 passed.
- Local static checks: UV_NO_SYNC=1 uv run ruff check ... and UV_NO_SYNC=1 uv run pre-commit run --all-files passed.

Loss summary:

| path | iter 1 loss | iter 20 loss | delta |
| --- | --- | --- | --- |
| nonpack | 11.927900 | 11.785750 | -0.142150 |
| online in-batch packing | 11.850320 | 10.645260 | -1.205060 |
| offline packing | 11.981930 | 9.041459 | -2.940471 |
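
The delta column is iter 20 loss minus iter 1 loss. A small check that recomputes it from the table values (the variable names are illustrative only):

```python
import math

# (path, iter 1 loss, iter 20 loss) copied from the table above.
losses = [
    ("nonpack", 11.927900, 11.785750),
    ("online in-batch packing", 11.850320, 10.645260),
    ("offline packing", 11.981930, 9.041459),
]

# Round to six decimals to match the precision reported in the table.
deltas = {name: round(last - first, 6) for name, first, last in losses}
all_finite = all(math.isfinite(v) for _, first, last in losses for v in (first, last))
all_decreased = all(delta < 0 for delta in deltas.values())
```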

Notes:

- Online packing used the text chat collate path with hf_processor_path=None, so it exercised the training-context tokenizer path that previously exposed the Megatron tokenizer wrapper issue.
- Offline packing used enable_offline_packing with PackedSequenceSpecs(packed_sequence_size=512, pad_seq_to_mult=1); micro-batch size was 1 for offline packed THD, while nonpack/online used micro-batch size 2.
- All three loss series were finite and ended below their first logged step. No design migration blocker remains from this validation.
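
To make the packing terms concrete, here is a generic sketch of order-preserving greedy packing into fixed-size packs, with cumulative sequence boundaries per pack. It is not the Megatron-Bridge packing code: pack_greedy, cu_seqlens, and the example lengths are assumptions, and no per-sequence padding is applied (which is how pad_seq_to_mult=1 is read here).

```python
from typing import List


def pack_greedy(lengths: List[int], pack_size: int) -> List[List[int]]:
    """Group sequence indices, in order, into packs of at most `pack_size` tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, length in enumerate(lengths):
        if length > pack_size:
            raise ValueError(f"sequence {idx} ({length} tokens) exceeds pack size")
        if used + length > pack_size and current:
            # Current pack is full enough; start a new one.
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs


def cu_seqlens(lengths: List[int]) -> List[int]:
    """Cumulative boundaries [0, l0, l0+l1, ...] marking each sequence in a pack."""
    bounds = [0]
    for length in lengths:
        bounds.append(bounds[-1] + length)
    return bounds


lengths = [200, 250, 100, 512, 30]
packs = pack_greedy(lengths, pack_size=512)
boundaries = [cu_seqlens([lengths[i] for i in pack]) for pack in packs]
```

Because each pack already concatenates several sequences, it is natural for the packed path to run with a smaller micro-batch size than the unpacked one, as in the validation above.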


svcnvidia-nemo-ci · 2026-06-12T04:34:47Z

/nvskills-ci

yaoyu-33 · 2026-06-12T04:42:10Z

Resolved the stacked-base conflicts in ac9fee89 against mira/issue4041-vlm-assistant-mask.

Validation:

- PR merge state is now CLEAN / MERGEABLE.
- GPU validation job 12745628 completed 0:0.
- ruff check passed on the conflict-resolution files.
- Focused unit tests passed:
  - HF dataset provider/collate/text SFT + shared VLM processing + packed sequence dataset: 47 passed.
  - Packing-related config validation subset: 16 passed.
- Local UV_NO_SYNC=1 uv run pre-commit run --all-files passed before the merge commit.

yaoyu-33 and others added 30 commits June 1, 2026 20:34

docs(data): map VLM data pipeline refactor state

864dbf0

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): unify VLM processing helpers

145459a

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

chore(data): remove VLM phase report

cf0b3cf

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): add VLM source adapters

68bfd59

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): preserve Qwen Energon assistant masking

0d3acb2

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): remove VLM collate wrapper helpers

6584a2e

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): share VLM collate in Energon encoders

2d2ff97

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): rename HF Energon task encoder

b2dd093

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): clarify VLM collate steps

916a6a8

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): remove partial vlm collate wrappers

d358f01

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): restore vlm energon safety checks

886b6a1

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): unify VLM Energon collation path

a72a0cc

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): thread HF task encoder collate options

4a4a093

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): name Gemma3 VLM collator explicitly

8e21828

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

refactor(data): move VLM collators to model modules

e72d2c4

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

chore(data): remove stale Kimi VL debug comments

4f73bd8

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): anchor VLM assistant loss masks to chat templates

7176a1d

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

Merge origin/main into mira/issue4041-vlm-assistant-mask

f9b6b47

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): make assistant mask role ends explicit

9e87f67

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

Apply suggestions from code review

d9896d8

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

fix(data): require explicit VLM assistant mask boundaries

98db230

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): preserve Kimi assistant mask before padding

6fda95c

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): clarify VLM loss role naming

2b02235

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): separate VLM loss roles and parts

58cc94e

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): keep VLM boundary masks role based

69ce12d

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): align VLM assistant masks with padding

2a78fa8

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

Merge remote-tracking branch 'origin/main' into mira/issue4041-vlm-assistant-mask

b8a5496

fix(data): update Qwen VLM padding mask test

7693b4d

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

feat(data): add text chat collate for HF conversations

26f48a2

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

Merge branch 'main' into mira/issue4041-vlm-assistant-mask

06fdfb0

yaoyu-33 added the needs-review label (PR is ready for code review and waiting on a reviewer) Jun 11, 2026

fix(data): resolve HF dataset refactor conflicts

24c4b15

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 added 3 commits June 11, 2026 15:30

refactor(data): rename HF conversation dataset provider

7a419c3

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): let hf conversation dataloader control order

314060b

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): restore packed hf text sft path

222c19e

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 mentioned this pull request Jun 11, 2026

[bug] nemotron_omni SFT loss mask never supervises assistant <|im_end|> terminator, so free-form fine-tunes fail to learn when to stop #4265

Open

refactor(data): split sequence packing config switches

2353c8c

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): pack text HF batches in collate

7797f4d

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>


fix(data): align text in-batch packing metadata

d113c10

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 force-pushed the mira/issue4041-vlm-assistant-mask branch from 003f43a to da214a6 Compare June 12, 2026 01:23

yaoyu-33 requested review from a team, erhoo82 and malay-nagda as code owners June 12, 2026 01:23

yaoyu-33 force-pushed the mira/issue4041-vlm-assistant-mask branch from da214a6 to d01ca27 Compare June 12, 2026 01:25

fix(data): unwrap megatron tokenizers for text chat collate

da973fe

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix(data): resolve stacked base conflicts

ac9fee8

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 mentioned this pull request Jun 12, 2026

refactor(data): move VLM sequence batching to collate #4315

Draft

yaoyu-33 mentioned this pull request Jun 12, 2026

fix(data): anchor VLM assistant loss masks to chat templates #4169

Open

cuichenx approved these changes Jun 12, 2026


yaoyu-33 added the ready-to-merge label (PR is approved, current, and only waiting for CI to pass before merge) and removed the needs-review label (PR is ready for code review and waiting on a reviewer) Jun 13, 2026

yaoyu-33 wants to merge 41 commits into mira/issue4041-vlm-assistant-mask from kant/pr4169-unified-hf-dataset

3 participants
