refactor(data): move VLM sequence batching to collate #4315

Draft
yaoyu-33 wants to merge 2 commits into kant/pr4169-unified-hf-dataset from franklin/pr4307-padding-collate

Conversation

yaoyu-33 (Contributor) commented Jun 12, 2026

Stacked on #4307. Current order: #4169 -> #4307 -> #4315.

Implements issue #4041 Step 2 by moving generic VLM padding, truncation, and in-batch packing preparation from vlm_step into collate-time helpers. The VLM training step now consumes already-collated sequence tensors and packed metadata.
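
For readers unfamiliar with the pattern, here is a minimal sketch of what collate-time padding, truncation, and in-batch packing preparation can look like. It is not the code in this PR: the helper names `pad_and_truncate` and `pack_in_batch`, the `cu_seqlens` / `max_seqlen` metadata keys, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
from typing import Dict, List


def pad_and_truncate(
    samples: List[List[int]], max_len: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: truncate each sample to max_len, then right-pad
    every sample to the longest remaining length in the batch."""
    truncated = [s[:max_len] for s in samples]
    width = max(len(s) for s in truncated)
    return {
        "input_ids": [s + [pad_id] * (width - len(s)) for s in truncated],
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [[1] * len(s) + [0] * (width - len(s)) for s in truncated],
    }


def pack_in_batch(samples: List[List[int]], max_len: int) -> Dict[str, object]:
    """Hypothetical helper: concatenate truncated samples into one sequence and
    record cumulative boundaries so a training step can separate them again."""
    truncated = [s[:max_len] for s in samples]
    cu_seqlens = [0]
    for s in truncated:
        cu_seqlens.append(cu_seqlens[-1] + len(s))
    return {
        "input_ids": [tok for s in truncated for tok in s],
        "cu_seqlens": cu_seqlens,  # assumed metadata name
        "max_seqlen": max(len(s) for s in truncated),  # assumed metadata name
    }


batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
padded = pad_and_truncate(batch, max_len=4)
packed = pack_in_batch(batch, max_len=4)
```

With both helpers running at collate time, the training step in this sketch only reads `padded` or `packed` and never has to reshape sequences itself, which is the division of labor the description above is aiming for.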

Validation status:

  • Targeted static checks passed locally: py_compile, ruff check, ruff format --check, and git diff --check.
  • Full pre-commit is currently blocked on this workstation by the known nvidia-resiliency-ext wheel/platform resolver issue.
  • Internal loss-parity validation against the head of #4307 (feat(data): add text chat collate for unified HF datasets) is pending; this PR remains a draft until that completes.

copy-pr-bot (Bot) commented Jun 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 force-pushed the franklin/pr4307-padding-collate branch 2 times, most recently from f4cd4fb to 5af7e07 (June 12, 2026 05:22)
yaoyu-33 force-pushed the franklin/pr4307-padding-collate branch 3 times, most recently from 530700e to 56e0f1f (June 12, 2026 17:00)
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
yaoyu-33 force-pushed the franklin/pr4307-padding-collate branch from 56e0f1f to 5a25523 (June 12, 2026 17:05)
Comment thread src/megatron/bridge/data/hf_datasets/conversation_dataset.py Outdated
Comment thread src/megatron/bridge/data/hf_datasets/conversation_dataset.py Outdated
Comment thread src/megatron/bridge/data/hf_datasets/conversation_dataset.py Outdated
Comment thread src/megatron/bridge/data/sequence_packing.py Outdated
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>