Skip to content

Fix the dataset post initialization#936

Merged
copybara-service[bot] merged 1 commit intomainfrom
lance-dataset
Dec 23, 2025
Merged

Fix the dataset post initialization#936
copybara-service[bot] merged 1 commit intomainfrom
lance-dataset

Conversation

@wang2yn84
Copy link
Collaborator

The nightly regression script is broken due to the recent changes to create_dataset. In the script we split the train dataset and validation dataset. However, post_init_dataset creates the IterDataset which can't be random accessed and split like that.

  1. Refactored the post_init_dataset to include the validation dataset split and epoch repeat.
  2. Refactored the apply_template to include both custom template and transformer tokenizer provided template.
  3. Add extensive tests for apply_template and post_init_dataset
  4. The reason why the script was broken silently is because it's not included in the CI. Add the minimal run to CI to avoid future breakage.

It's a good idea to open an issue first for discussion.

Reference

Colab Notebook

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.

@copybara-service copybara-service bot merged commit 9c82596 into main Dec 23, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant