@Xiaoming-AMD
Collaborator

Summary

This PR adds mock HuggingFace dataset support for the TorchTitan backend, allowing offline runs
and CI testing without requiring dataset downloads from the HuggingFace Hub.

🔧 Changes

  • primus/utils/mock_hf_dataset.py
    • Implements lightweight synthetic dataset generators for token/text samples.
    • Activated automatically when training.mock_data=True is set in the experiment config or on the CLI.
  • primus/modules/trainer/torchtitan/patch_utils.py
    • Provides a wrapper patch_mock_hf_dataset() for backend integration.
  • tests/modules/trainer/torchtitan/test_patch_mock_hf_dataset.py
    • Unit tests covering dataset creation, iteration, and shape validation.
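
To make the idea of a "lightweight synthetic dataset generator" concrete, here is a minimal sketch in plain Python. It is not the code from primus/utils/mock_hf_dataset.py, which this description does not show; the class name `MockTokenDataset`, the helper `check_shapes`, and all parameters are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, Iterator, List


class MockTokenDataset:
    """Hypothetical stand-in for a HuggingFace dataset that yields synthetic token samples."""

    def __init__(self, num_samples: int = 8, seq_len: int = 16,
                 vocab_size: int = 32000, seed: int = 0) -> None:
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size
        self.seed = seed

    def __len__(self) -> int:
        return self.num_samples

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        # Re-seed on every pass so repeated iteration yields identical samples.
        rng = random.Random(self.seed)
        for _ in range(self.num_samples):
            tokens = [rng.randrange(self.vocab_size) for _ in range(self.seq_len)]
            yield {"input_ids": tokens}


def check_shapes(dataset: MockTokenDataset) -> bool:
    """Return True if every sample has seq_len token ids, all inside the vocabulary."""
    return all(
        len(sample["input_ids"]) == dataset.seq_len
        and all(0 <= tok < dataset.vocab_size for tok in sample["input_ids"])
        for sample in dataset
    )
```

A test in the spirit of the ones listed above could create a small instance, iterate it, and assert on the sample count and sequence length. Seeding the generator keeps such tests deterministic without any network access.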

🧠 Usage

Toggle between the real and mock dataset from the command line:

# Use real dataset (default)
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml bash examples/run_pretrain.sh --training.mock_data=False

# Use mock dataset (offline)
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml bash examples/run_pretrain.sh --training.mock_data=True
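
One way a boolean flag like training.mock_data can gate a loader patch is sketched below. This is an assumption about the general pattern, not the actual patch_mock_hf_dataset() implementation, whose signature is not shown here; `real_load_dataset`, `mock_load_dataset`, `patch_loader`, and the `loaders` namespace are all hypothetical names.

```python
import types
from typing import Any, Dict, List


def real_load_dataset(path: str, **kwargs: Any) -> List[Dict[str, str]]:
    """Placeholder for a loader that would need network access (hypothetical)."""
    raise RuntimeError(f"network access required to load {path!r}")


def mock_load_dataset(path: str, **kwargs: Any) -> List[Dict[str, str]]:
    """Return a few synthetic text samples instead of downloading anything."""
    return [{"text": f"mock sample {i} for {path}"} for i in range(4)]


def patch_loader(namespace: types.SimpleNamespace, mock_data: bool) -> None:
    """Swap in the mock loader only when mock_data is True; otherwise leave it alone."""
    if mock_data:
        namespace.load_dataset = mock_load_dataset


# Stand-in for the module whose loader the training backend would call.
loaders = types.SimpleNamespace(load_dataset=real_load_dataset)
```

With the flag off, callers still reach the real loader; with it on, the same call site returns synthetic samples, which is what makes offline and CI runs possible without touching the calling code.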

@Xiaoming-AMD Xiaoming-AMD merged commit c57539f into main Oct 27, 2025
3 checks passed