feat(data): add text chat collate for unified HF datasets #4307
yaoyu-33 wants to merge 41 commits into
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Conflict resolved in HF dataset creation workflow after this refactor:
I traced the apparent wrappers/layers while resolving the conflict. The current layers still each own a distinct responsibility: maker normalization, provider split construction, repeat/shuffle dataset wrapper, and model-specific collate. I did not remove any of them in this pass. Additional fixes included:
Validation:
/nvskills-ci
            return False
        return "pack_sequences" in inspect.signature(selected_impl).parameters

    def _get_maker(self) -> Callable[..., List[Dict[str, Any]]]:
This is no longer needed, since get_hf_dataset_maker is now called directly.
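For readers unfamiliar with the check in the snippet under review, here is a minimal sketch of signature-based capability detection using `inspect.signature`. The function names `accepts_keyword`, `collate_plain`, `collate_with_packing`, and `call_collate` are hypothetical and are not taken from this PR; only the `"pack_sequences" in inspect.signature(...).parameters` pattern comes from the diff.

```python
import inspect
from typing import Any, Callable, Dict, List


def accepts_keyword(func: Callable[..., Any], name: str) -> bool:
    """Return True if `func` explicitly declares a parameter called `name`."""
    try:
        params = inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some builtins and C extensions expose no introspectable signature.
        return False
    return name in params


# Hypothetical collate implementations, used only for illustration.
def collate_with_packing(examples: List[Any], pack_sequences: bool = False) -> Dict[str, Any]:
    return {"n": len(examples), "packed": pack_sequences}


def collate_plain(examples: List[Any]) -> Dict[str, Any]:
    return {"n": len(examples)}


def call_collate(impl: Callable[..., Any], examples: List[Any], pack_sequences: bool) -> Any:
    """Forward `pack_sequences` only to implementations that declare it."""
    kwargs = {}
    if accepts_keyword(impl, "pack_sequences"):
        kwargs["pack_sequences"] = pack_sequences
    return impl(examples, **kwargs)
```

One caveat with this pattern: a callable that only takes `**kwargs` does not list `pack_sequences` among its parameters, so the check reports False even though the keyword would be accepted.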
003f43a to da214a6
da214a6 to d01ca27
GPU cluster validation update for
Loss summary:
Notes:
Resolved the stacked-base conflicts in
Validation:
Summary
Stacked on #4169. Follow-up refactor #4315 is stacked on this PR.
- build_assistant_loss_mask and build_shifted_labels_and_loss_mask)
- text_chat/chat maker aliases and tokenizer fallback for plain LLM checkpoints without an AutoProcessor
Stack
Current order: #4169 -> #4307 -> #4315.
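The Summary names the helpers build_assistant_loss_mask and build_shifted_labels_and_loss_mask without showing their signatures. As a rough illustration of what shifted labels with an assistant-only loss mask usually look like in next-token training, here is a minimal sketch. The function name `sketch_shifted_labels_and_loss_mask`, its arguments, and the `-100` ignore value are assumptions for illustration, not the PR's actual API.

```python
from typing import List, Sequence, Tuple

# Assumption: -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def sketch_shifted_labels_and_loss_mask(
    input_ids: Sequence[int],
    assistant_mask: Sequence[int],
    ignore_index: int = IGNORE_INDEX,
) -> Tuple[List[int], List[int]]:
    """Build next-token labels, keeping loss only where the target is an assistant token.

    Position i predicts token i + 1, so the label at i is input_ids[i + 1]
    and the loss is kept only when assistant_mask[i + 1] is truthy.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")

    labels: List[int] = []
    loss_mask: List[int] = []
    for pos in range(len(input_ids)):
        nxt = pos + 1
        if nxt < len(input_ids) and assistant_mask[nxt]:
            labels.append(input_ids[nxt])
            loss_mask.append(1)
        else:
            # No next token, or the next token is not part of an assistant turn.
            labels.append(ignore_index)
            loss_mask.append(0)
    return labels, loss_mask


# Tokens 12 and 13 form the (hypothetical) assistant reply.
labels, loss_mask = sketch_shifted_labels_and_loss_mask(
    input_ids=[10, 11, 12, 13, 14],
    assistant_mask=[0, 0, 1, 1, 0],
)
# labels    == [-100, 12, 13, -100, -100]
# loss_mask == [0, 1, 1, 0, 0]
```

Shifting inside the collate step keeps labels and loss mask the same length as the inputs, which is one common convention; whether the PR shifts here or leaves it to the model is not stated in this description.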
Validation
- ruff check on changed files
- python3 -m py_compile on changed files/tests
- git diff --check
- pre-commit run --all-files
Full uv run pre-commit run --all-files was blocked on this workstation, before any hooks ran, by a platform wheel availability issue for nvidia-resiliency-ext==0.6.0.
Follow-up validation requested
GPU cluster validation is in progress to compare SQuAD loss curves against #4169 using the same config and seed.