feat: `GPTDatasetFolder` for easier definition of dataset-based mixtures by MaxiBoether · Pull Request #58 · swiss-ai/Megatron-LM

MaxiBoether · 2025-03-11T14:30:04Z

This PR adds a middle layer on top of `GPTDataset` (one per file prefix) to accommodate the common setting where each dataset lives in its own directory containing multiple files. The resulting `GPTDatasetFolder` instances can then be blended using the existing `BlendedDataset`.
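To illustrate the "one directory per dataset" layout, here is a minimal sketch of how the file prefixes inside a folder could be collected. It assumes the usual Megatron indexed-dataset layout, where each prefix has a `.bin` and a matching `.idx` file. The helper name `list_file_prefixes` is hypothetical and is not taken from this PR.

```python
from pathlib import Path
from typing import List


def list_file_prefixes(folder: str) -> List[str]:
    """Return sorted file prefixes in `folder` that have both .bin and .idx files.

    Hypothetical helper for illustration; the PR's actual discovery logic may differ.
    """
    prefixes = []
    for bin_file in Path(folder).glob("*.bin"):
        # Strip only the trailing ".bin", so "part.0001.bin" -> "part.0001".
        prefix = bin_file.with_suffix("")
        # Skip prefixes whose index file is missing.
        if Path(str(prefix) + ".idx").exists():
            prefixes.append(str(prefix))
    return sorted(prefixes)
```

Each returned prefix would correspond to one low-level `GPTDataset`, and the folder-level dataset would sit on top of all of them.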

It works with our existing run sbatch files. The only required change is in DATA_ARGS: instead of calling MEGATRON_LM_DIR/scripts/tools/create_data_config.py, pass a space-separated list of directories (optionally with a weight after each directory).
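As a rough sketch of the argument format described above, the following parses a list of tokens into (directory, weight) pairs, where a weight is any token that parses as a float and directly follows a directory. The function name `parse_folder_blend` and the exact error handling are assumptions for illustration, not the PR's implementation.

```python
from typing import List, Optional, Tuple


def parse_folder_blend(tokens: List[str]) -> List[Tuple[str, Optional[float]]]:
    """Parse ["/dir/a", "0.7", "/dir/b"] into [("/dir/a", 0.7), ("/dir/b", None)].

    Hypothetical helper: a token that parses as a float is treated as the
    weight of the directory immediately before it.
    """
    result: List[Tuple[str, Optional[float]]] = []
    for token in tokens:
        try:
            weight = float(token)
        except ValueError:
            # Not a number, so treat the token as a directory without a weight (yet).
            result.append((token, None))
            continue
        if not result or result[-1][1] is not None:
            raise ValueError(f"Weight {token!r} does not follow a directory")
        # Attach the weight to the preceding directory.
        result[-1] = (result[-1][0], weight)
    return result
```

For example, `parse_folder_blend("/data/a 0.7 /data/b 0.3".split())` yields one pair per directory, and a list without weights yields `None` for every weight.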

TODO: Implement looping on each directory (@TJ-Solergibert is on that) by moving looping to the BlendedDataset.

MaxiBoether · 2025-03-11T16:59:25Z

+            for i, _split in enumerate(Split):
+                if split[i] is None:
+                    mid_level_datasets.append(None)
+                else:
+                    mid_level_datasets.append(
+                        self.build_generic_dataset(
+                            self.cls,
+                            self.is_built_on_rank,
+                            synchronize_ranks,
+                            None, # indexed_dataset (unused)
+                            dataset_path, # folder_path
+                            None, # indexed_indices (unused)
+                            sizes[i],
+                            _split,
+                            self.config,
+                        )
+                    )


I know we're not really using splits in our runs, but it felt wrong not to respect them. I'm not 100% sure this would actually work if you used splits, though.

MaxiBoether added 6 commits March 10, 2025 17:41

First untested version

64e22da

move condition to assertion

c574105

updates

1f186bf

remove unnecessary whitespace

c6720f3

leftover from previous iteration

5767206

leftovers

040b00b

MaxiBoether changed the title ~~feat: Introduce GPTDatasetFolder for easier definition of dataset-based mixtures~~ feat: GPTDatasetFolder for easier definition of dataset-based mixtures Mar 11, 2025

MaxiBoether marked this pull request as ready for review March 11, 2025 16:52

MaxiBoether requested a review from TJ-Solergibert March 11, 2025 16:52

MaxiBoether wants to merge 6 commits into main from feat/MaxiBoether/foldersampling
