Skip to content

feat(data): add text chat collate for unified HF datasets#4307

Open
yaoyu-33 wants to merge 41 commits into
mira/issue4041-vlm-assistant-maskfrom
kant/pr4169-unified-hf-dataset
Open

feat(data): add text chat collate for unified HF datasets#4307
yaoyu-33 wants to merge 41 commits into
mira/issue4041-vlm-assistant-maskfrom
kant/pr4169-unified-hf-dataset

Conversation

@yaoyu-33

@yaoyu-33 yaoyu-33 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Stacked on #4169. Follow-up refactor #4315 is stacked on this PR.

  • add a text-only chat collate path under the existing HF conversation/VLM dataset utilities
  • reuse the fix(data): anchor VLM assistant loss masks to chat templates #4169 assistant-mask helpers for text conversations (build_assistant_loss_mask and build_shifted_labels_and_loss_mask)
  • add text_chat / chat maker aliases and tokenizer fallback for plain LLM checkpoints without an AutoProcessor
  • leave existing model-specific VLM collates untouched

Stack

Current order: #4169 -> #4307 -> #4315.

Validation

  • ruff check on changed files
  • python3 -m py_compile on changed files/tests
  • git diff --check
  • standalone pre-commit run --all-files

Full uv run pre-commit run --all-files was blocked on this workstation before hooks by a platform wheel availability issue for nvidia-resiliency-ext==0.6.0.

Follow-up validation requested

GPU cluster validation is in progress to compare SQuAD loss curves against #4169 using the same config and seed.

yaoyu-33 and others added 30 commits June 1, 2026 20:34
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33 yaoyu-33 added the needs-review PR is ready for code review and waiting on a reviewer label Jun 11, 2026
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

Conflict resolved in 24c4b153416df150516ecada1ce81f017413703b.

HF dataset creation workflow after this refactor:

  1. A maker function loads the HF dataset split and normalizes rows into Bridge chat schema:
    • text-only rows use messages, conversation, or legacy conversations
    • multimodal rows use processor-ready conversation plus optional media metadata
  2. HFDatasetConversationProvider selects the maker, builds split-specific normalized examples, loads the HF processor/tokenizer, and wraps examples with ConversationDataset.
  3. ConversationDataset owns repeat-to-target-length, optional shuffle, and binding the chosen collate implementation.
  4. The collate function renders the chat template, tokenizes, builds shifted labels/loss masks, and for VLM processors builds model-specific visual inputs.

I traced the apparent wrappers/layers while resolving the conflict. The current layers still each own a distinct responsibility: maker normalization, provider split construction, repeat/shuffle dataset wrapper, and model-specific collate. I did not remove any of them in this pass.

Additional fixes included:

  • make_text_chat_dataset now accepts the legacy conversations column so it matches text_chat_collate_fn.
  • Processor-to-tokenizer fallback now emits a debug log with the original AutoProcessor failure context.
  • Qwen/Kimi VLM collates keep the new hf_datasets.token_utils import while preserving the base branch's chat-template kwargs forwarding.

Validation:

  • Targeted static check on touched files: passed.
  • Internal focused unit validation: uv run --no-sync python -m pytest tests/unit_tests/data/hf_datasets tests/unit_tests/data/vlm_datasets/test_collate.py -v -> 50 passed.

yaoyu-33 added 3 commits June 11, 2026 15:30
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Contributor

/nvskills-ci

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Contributor

/nvskills-ci

return False
return "pack_sequences" in inspect.signature(selected_impl).parameters

def _get_maker(self) -> Callable[..., List[Dict[str, Any]]]:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need this anymore, since directly call get_hf_dataset_maker

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Contributor

/nvskills-ci

@yaoyu-33 yaoyu-33 force-pushed the mira/issue4041-vlm-assistant-mask branch from 003f43a to da214a6 Compare June 12, 2026 01:23
@yaoyu-33 yaoyu-33 requested review from a team, erhoo82 and malay-nagda as code owners June 12, 2026 01:23
@yaoyu-33 yaoyu-33 force-pushed the mira/issue4041-vlm-assistant-mask branch from da214a6 to d01ca27 Compare June 12, 2026 01:25
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Contributor

/nvskills-ci

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

GPU cluster validation update for da973fe1:

  • Job 12743499: completed 0:0.
    • Provider/text-collate regression: 12 passed.
    • LLM SQuAD smoke covered non-packed, online in-batch packing, and offline packing paths for 20 train iterations each.
  • Job 12743560: completed targeted shared-helper regression, 1 passed.
  • Local static checks: UV_NO_SYNC=1 uv run ruff check ... and UV_NO_SYNC=1 uv run pre-commit run --all-files passed.

Loss summary:

path iter 1 loss iter 20 loss delta
nonpack 11.927900 11.785750 -0.142150
online in-batch packing 11.850320 10.645260 -1.205060
offline packing 11.981930 9.041459 -2.940471

Notes:

  • Online packing used the text chat collate path with hf_processor_path=None, so it exercised the training-context tokenizer path that previously exposed the Megatron tokenizer wrapper issue.
  • Offline packing used enable_offline_packing with PackedSequenceSpecs(packed_sequence_size=512, pad_seq_to_mult=1); micro-batch size was 1 for offline packed THD, while nonpack/online used micro-batch size 2.
  • All three loss series were finite and ended below their first logged step. No design migration blocker remains from this validation.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Contributor

/nvskills-ci

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

Resolved the stacked-base conflicts in ac9fee89 against mira/issue4041-vlm-assistant-mask.

Validation:

  • PR merge state is now CLEAN / MERGEABLE.
  • GPU validation job 12745628 completed 0:0.
  • ruff check passed on the conflict-resolution files.
  • Focused unit tests passed:
    • HF dataset provider/collate/text SFT + shared VLM processing + packed sequence dataset: 47 passed.
    • Packing-related config validation subset: 16 passed.
  • Local UV_NO_SYNC=1 uv run pre-commit run --all-files passed before the merge commit.

@yaoyu-33 yaoyu-33 added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:data Dataset builders, preprocessing, and samplers feature New capabilities, enhancements, or enablement work needs-more-tests Requires additional L0 and L1 test coverage before merge ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants