Skip to content

feat: GPTDatasetFolder for easier definition of dataset-based mixtures#58

Open
MaxiBoether wants to merge 6 commits into
mainfrom
feat/MaxiBoether/foldersampling
Open

feat: GPTDatasetFolder for easier definition of dataset-based mixtures#58
MaxiBoether wants to merge 6 commits into
mainfrom
feat/MaxiBoether/foldersampling

Conversation

@MaxiBoether
Copy link
Copy Markdown

@MaxiBoether MaxiBoether commented Mar 11, 2025

This PR adds a middle layer in after the GPTDataset (per file prefix) to accomodate the common setting where we have one directory per dataset with multiple files for this dataset in that directory. Then, the GPTDatasetFolder can be merged using the existing BlendedDataset.

It can be used with our existing runs sbatch files. The only thing that needs to change is instead of calling MEGATRON_LM_DIR/scripts/tools/create_data_config.py in the DATA_ARGS, you just need to pass a space separated list of directories (and optionally, a weight after each directory).

TODO: Implement looping on each directory (@TJ-Solergibert is on that) by moving looping to the BlendedDataset.

@MaxiBoether MaxiBoether changed the title feat: Introduce GPTDatasetFolder for easier definition of dataset-based mixtures feat: GPTDatasetFolder for easier definition of dataset-based mixtures Mar 11, 2025
@MaxiBoether MaxiBoether marked this pull request as ready for review March 11, 2025 16:52
Comment on lines +460 to +476
for i, _split in enumerate(Split):
if split[i] is None:
mid_level_datasets.append(None)
else:
mid_level_datasets.append(
self.build_generic_dataset(
self.cls,
self.is_built_on_rank,
synchronize_ranks,
None, # indexed_dataset (unused)
dataset_path, # folder_path
None, # indexed_indices (unused)
sizes[i],
_split,
self.config,
)
)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we're not really using splits in our runs but it was weird not respecting it. I am not 100% sure this would actually work if you used splits though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants