76 changes: 55 additions & 21 deletions large_language_model_pretraining/nemo/README.md
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ To use this repository, please install a supported version of PyTorch with GPU s

We recommend using the latest NeMo FW container. The latest tested compatible version is `nvcr.io/nvidia/nemo:24.12-rc0`.

#### Container Setup
#### Container setup

All of the following commands are assumed to be run within a container. A [Dockerfile](./Dockerfile) is available for building containers on top of `nvcr.io/nvidia/nemo:24.12-rc0`.

@@ -33,7 +33,10 @@ Note: it's recommended to map your `.ssh` folder to inside the container, so tha

### Steps to download and verify data

The current codebase is still using GPT3's train/val datasets and SentencePieceModel tokenizer. Please refer to [GPT3 instructions](https://github.com/mlcommons/training/tree/master/large_language_model/megatron-lm#preprocessed-data-download) to download **the raw C4 dataset** that we can preprocess later.
The current codebase uses the C4 dataset for training and evaluation. Please refer to [Section 3](#preprocessed-data-download) for downloading the preprocessed dataset and [Section 6](#data-preprocessing) if you would like to perform manual tokenization.


### Steps to download the checkpoint

### Steps to run and time

@@ -100,20 +103,22 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`

### Training and test data separation

To be determined. For now, we are using the default split from the C4 dataset.
We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files for training and `c4-validation.<x>-of-00008.json.gz` files for evaluation.

### Training data order

To be determined. Current plan is to use the last 256 of 1024 files (shards 6 and 7) for the benchmarked area.
We randomly shuffle the **last 256 of 1024 shards** for the benchmarking area.
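As a minimal illustrative sketch (not part of the repository), the selection of the last 256 of 1024 training shards can be expressed as follows; the function name is an assumption for illustration, while the shard naming pattern comes from the C4 dataset itself:

```python
def benchmark_train_shards(total=1024, used=256):
    """Return the file names of the last `used` C4 training shards.

    Shard names follow the c4-train.<x>-of-01024.json.gz pattern;
    the benchmark uses only the last 256 of the 1024 training shards.
    """
    start = total - used  # 1024 - 256 = 768
    return [f"c4-train.{i:05d}-of-{total:05d}.json.gz" for i in range(start, total)]

shards = benchmark_train_shards()
print(len(shards), shards[0], shards[-1])
# → 256 c4-train.00768-of-01024.json.gz c4-train.01023-of-01024.json.gz
```

These 256 shards are the ones shuffled for the benchmarked area; the validation shards are used in order, unshuffled.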

### Test data order

To be determined.
We use the first 47M tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.

# 4. Model
### Publication/Attribution

The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.21783). The main difference is that the model parameters is *to be determined from experiments*.
The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.21783). Two notable differences are:
1. We replace the paper's TikTokenizer with the **Mixtral 8x22B tokenizer** in this benchmark. Please refer to the [Tokenizer](#tokenizer) section for more details.
2. We replace the paper's AdamW with the **Adam optimizer** in this benchmark. Please refer to the [Optimizer](#optimizer-spec) section for more details.

### Model details

@@ -128,24 +133,24 @@ The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.
| Hidden Dimension | 53248 |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Tokenizer | TikTokenizer |
| Vocab size | 128,000 |
| Tokenizer | Mixtral 8x22B tokenizer |
| Vocab size | 32,000 |
| Context Length | 8192 |


### Checkpoint download and conversion
### Checkpoint download

To be determined. For now, we are not using Llama 3.1 default checkpoint.

~~To experiment with a given checkpoint, we have added a `--ckpt` argument that loads the pretrained checkpoint from a **NeMo checkpoint path**, which requires some checkpoint format conversion if the original checkpoint is in LlamaStack or HuggingFace format.~~
MLCommons hosts the checkpoint for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama3-1.mlcommons.org) using your organizational email address; you will then receive a link to a directory containing Rclone download instructions. _If you cannot access the form but are part of an MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with that address. You should then be able to access the confidentiality form using that Google account._

#### Saving and restoring a checkpoint

Large runs may need to span multiple Slurm jobs, so checkpoints must be saved and loaded together with their contexts for training to resume between jobs. To support this, we have added several environment variables; please refer to `config.sh` for details.

### Optimizer
### Optimizer spec

Adam
1. Optimizer type: **Adam**
2. Warmup steps are computed as $8000 \times \lceil {1152 \over GBS} \rceil$.
3. The LR scheduler's maximum number of steps is computed as $1,200,000 \times \lceil {1152 \over GBS} \rceil$.
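A minimal sketch of how these schedule quantities scale with the global batch size (the function names are illustrative, not from the codebase):

```python
import math

def warmup_steps(gbs, base_gbs=1152):
    """Warmup steps scaled by global batch size: 8000 * ceil(1152 / GBS)."""
    return 8000 * math.ceil(base_gbs / gbs)

def max_lr_steps(gbs, base_gbs=1152):
    """LR scheduler's maximum steps: 1,200,000 * ceil(1152 / GBS)."""
    return 1_200_000 * math.ceil(base_gbs / gbs)

print(warmup_steps(1152), max_lr_steps(1152))  # → 8000 1200000
print(warmup_steps(576), max_lr_steps(576))    # → 16000 2400000
```

At the reference global batch size of 1152 the ceiling term is 1, recovering the paper's 8,000 warmup steps and 1,200,000-step decay horizon.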

# 5. Quality
### Quality metric
@@ -154,28 +159,28 @@ Log Perplexity

### Quality target

To be determined.
Validation log perplexity = 5.6

### Evaluation frequency

To be determined.
We perform evaluation every **377,487,360** tokens.

### Evaluation thoroughness

To be determined.
We evaluate using **47,185,920** tokens from the validation dataset.
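With the 8192-token sequence length, these token budgets translate into batch counts for a given global batch size. A sketch of the arithmetic (names are illustrative):

```python
import math

SEQ_LEN = 8192
EVAL_EVERY_TOKENS = 377_487_360  # evaluation frequency, in training tokens
EVAL_TOKENS = 47_185_920         # evaluation thoroughness, in validation tokens

def eval_intervals(gbs):
    """Convert token budgets into (evaluate-every-N-batches, N-eval-batches)."""
    tokens_per_batch = gbs * SEQ_LEN
    eval_every_n_batches = math.ceil(EVAL_EVERY_TOKENS / tokens_per_batch)
    eval_batches = math.ceil(EVAL_TOKENS / tokens_per_batch)
    return eval_every_n_batches, eval_batches

print(eval_intervals(1152))  # → (40, 5)
```

At the reference global batch size of 1152, this means evaluating every 40 training batches using 5 evaluation batches, matching the conversion the launcher script performs before passing these values to the trainer.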


# 6. Other

### Data Preprocessing
### Data preprocessing

Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done and the final dataset can be accessed by following instructions in the [Preprocessed data download]() section.
Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done and the final dataset can be accessed by following instructions in the [Preprocessed data download](#preprocessed-data-download) section.

#### Tokenizer
#### Prepare tokenizer

We use the Mixtral 8x22B tokenizer in this benchmark. Tokenizer files can be downloaded [here](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/tree/main). Only the five files containing tokenizer-related content (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer.model.v1`, `tokenizer_config.json`) are needed.

#### Run Data preprocessing
#### Run data preprocessing

Run the following commands to merge all 1024 training files into 8 `json.gz` files and all 8 validation files into a single `json.gz` file. Each `json.gz` file will be preprocessed into a pair of Megatron dataset files (`.bin` and `.idx`).

@@ -199,5 +204,34 @@ export MERGED_C4_PATH=""
# this path is used for storing the preprocessed .bin and .idx files
export PREPROCESSED_PATH=""

# Extra Slurm-related arguments can be provided here
sbatch preprocess.sh
```
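The shard-to-merged-file grouping performed by the merge step can be sketched as follows. This is an illustrative sketch only: the even split into 8 groups of 128 shards is an assumption, and the actual grouping is defined by the preprocessing script.

```python
def merge_groups(total=1024, merged=8):
    """Partition the C4 training shard names into equal groups for merging.

    Assumes an even split: 1024 shards / 8 merged files = 128 shards each.
    """
    per_group = total // merged
    names = [f"c4-train.{i:05d}-of-{total:05d}.json.gz" for i in range(total)]
    return [names[g * per_group:(g + 1) * per_group] for g in range(merged)]

groups = merge_groups()
print(len(groups), len(groups[0]))  # → 8 128
```

Each merged `json.gz` group is then tokenized into one `.bin`/`.idx` Megatron dataset pair by `preprocess.sh`.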

### HuggingFace checkpoint preprocessing

Here are the instructions to prepare the NeMo-formatted checkpoint from scratch. Checkpoint conversion is already done and the converted checkpoint can be accessed by following instructions in the [Checkpoint download](#checkpoint-download) section.

#### HuggingFace checkpoint downloading

We use the HuggingFace Llama 3.1 405B checkpoint as the initial checkpoint in this benchmark. The original HuggingFace checkpoint can be downloaded [here](https://huggingface.co/meta-llama/Llama-3.1-405B). **Note that we download the BF16 version of the model, not the FP8 version.**

#### Run model conversion

Assuming that we have downloaded the HuggingFace checkpoint to a `<SRC_PATH>` directory, we can run [this script](./utils/launch_nemo_convert.sh) (which calls [this Python script](./utils/nemo_convert.py)) to perform the checkpoint format conversion. After the conversion is done, you should find the converted checkpoint under the `<DST_PATH>` directory, with two subfolders inside: `context` and `weights`.

```bash
# fill in the built container path here
export CONT_IMAGE_URL=""
# fill in the folder that holds the HF checkpoint here
# under this folder, you should see a lot of safetensors
export SRC_PATH=""
# fill in the destination folder of your choice here
# after conversion is done, you can find context and weights under this path
export DST_PATH=""

# Extra Slurm-related arguments can be provided here
sbatch launch_nemo_convert.sh
```

After the model conversion is done, we can then set `MODEL_CKPT=$DST_PATH` together with `FROM_HF=1` when launching our job, so that we can resume training from the converted HF checkpoint.
3 changes: 2 additions & 1 deletion large_language_model_pretraining/nemo/callbacks.py
@@ -138,7 +138,7 @@ def version(self):

### MLPerf callbacks
def compute_consumed_mllog_tokens(trainer, init_global_step, global_batch_size, seq_length):
steps_since_resume = trainer.global_step - init_global_step
steps_since_resume = trainer.global_step + 1 - init_global_step # global steps are 0-indexed
consumed_samples = (
steps_since_resume * global_batch_size
)
@@ -205,6 +205,7 @@ def on_validation_end(self, trainer, pl_module):
if isinstance(logger, MetricsLogger):
if logger.is_target_reached:
trainer.should_stop = True
self.set_success_status()

if not trainer.should_stop:
mllogger.start(key=constants.BLOCK_START, metadata={"epoch_num": self.consumed_tokens(trainer)})
13 changes: 4 additions & 9 deletions large_language_model_pretraining/nemo/config.sh
@@ -60,6 +60,9 @@ export MODEL_CKPT=""
export CONTINUAL_CKPT=""
# Model: Whether we want to restore from MODEL_CKPT path. If 0, then we are not restoring.
export USE_CKPT=0
# Model: Whether we are resuming from a NeMo-formatted HuggingFace checkpoint (weights only).
# If set to 1, then checkpoint resuming code will not try to load the optimizer states.
export FROM_HF=1
# Model: Whether we want to save a checkpoint. Must be 1 if NPAR > 1. If 1, then we save a checkpoint at the end.
export SAVE_CKPT=0

@@ -71,18 +74,10 @@ export SIZE="405b"
export GBS=1152
# Dataloader: Micro batch size
export MBS=1
# Dataloader: Evaluate every N batches, optional
# defaults to evaluate every 20 batches, or 188_743_680 tokens
export EVAL_EVERY="20"
# Dataloader: Evaluate using N batches, optional
# defaults to use 10 batches for evaluation, or 94_371_840 tokens
# If an empty string is provided (""), then we use full validation dataset for evaluation
export EVAL_BATCHES="10"
# Dataloader: Max run N batches, optional
# defaults to train 425 steps, or 4_010_803_200 tokens
# If an empty string is provided (""), then the training will continue until time limit
# If we want to save a checkpoint, then this value must be set
export MAX_STEPS="425"
export MAX_STEPS=""

# Experiment: starting steps
# This is the starting "offset" step from the checkpoint.
68 changes: 43 additions & 25 deletions large_language_model_pretraining/nemo/pretrain_llama31.py
@@ -14,6 +14,7 @@

import argparse
from typing import Optional
import math
from nemo.collections import llm
from nemo.collections.common.tokenizers import AutoTokenizer
from nemo import lightning as nl
@@ -70,14 +71,16 @@ def slurm_executor(
),
nodes=nodes,
ntasks_per_node=devices,
gpus_per_node=devices,
mem="0",
exclusive=True,
gres="gpu:8",
packager=run.GitArchivePackager(),
dependencies=dependencies,
)

if devices != 0:
executor.gpus_per_node=devices
executor.gres = "gpu:8"

executor.launcher = None
executor.container_image = container_image
executor.container_mounts = mounts
@@ -97,6 +100,8 @@ def get_pretrain(
) -> run.Partial:

exp_name = size
base_gbs = 1152
gbs = data_module.global_batch_size

# Providing 8B and 70B here for debugging purpose
# Actual benchmark should use 405B
@@ -142,10 +147,15 @@

pretrain.trainer.strategy.virtual_pipeline_model_parallel_size = 7

base_lr = 8e-5
warmup_tokens = 8000 * base_gbs * 8192

max_lr = (gbs / base_gbs) * base_lr

pretrain.optim = distributed_fused_adam_with_cosine_annealing(
max_lr=8e-5,
warmup_steps=8000,
min_lr=8e-7
max_lr = max_lr,
warmup_steps = math.ceil(warmup_tokens / 8192 / gbs),
min_lr = 8e-7
)

from nemo.collections.llm.recipes.tp_overlap_configs.userbuffers import (
@@ -166,7 +176,8 @@ )
)

# sets up everything else
pretrain.trainer.max_steps = 1_200_000 # Llama 3.1 paper section 3.4.1 - decays LR to 8e10-7 over 1,200,000 steps
max_tokens = 1_200_000 * 8192 * base_gbs # Llama 3.1 paper section 3.4.1 - decays LR to 8e-7 over 1,200,000 steps
pretrain.trainer.max_steps = math.ceil(max_tokens / 8192 / gbs)

pretrain.data = data_module
pretrain.trainer.val_check_interval = eval_every
@@ -279,27 +290,28 @@ def get_parser() -> argparse.ArgumentParser:
])

model_group.add_argument("--initial_ckpt_path", type=str, default=None)
model_group.add_argument("--use_ckpt", action="store_true")
model_group.add_argument("--ckpt_start_step", type=int, default=0)
model_group.add_argument("--continual_ckpt_path", type=str, default=None)
model_group.add_argument("--save_ckpt", action="store_true")
model_group.add_argument("--use_ckpt", action="store_true", help="If set, then resume from the initial checkpoint path")
model_group.add_argument("--resume_from_hf", action="store_true", help="Setting this knob indicates that we are resuming from a weight-only checkpoint")
model_group.add_argument("--ckpt_start_step", type=int, default=0, help="Sets this value to how many steps the resumed checkpoint is already trained on")
model_group.add_argument("--continual_ckpt_path", type=str, default=None, help="Sets this to the path that saves the checkpoint")
model_group.add_argument("--save_ckpt", action="store_true", help="If set, then we save the checkpoint at the end of the experiment")

data_group = parser.add_argument_group("Dataset arguments")

data_group.add_argument("--gbs", type=int, default=288, help="Global batch size, should be divisible by PP")
data_group.add_argument("--gbs", type=int, default=1152, help="Global batch size, should be divisible by PP")
data_group.add_argument("--mbs", type=int, default=1, help="Micro batch size")
data_group.add_argument("--eval_every", type=int, default=20)
data_group.add_argument("--eval_batches", type=int, default=None)
data_group.add_argument('--max_steps', type=int, default=None)
data_group.add_argument("--use_full_dataset", action="store_true", help="Whether we use the full dataset or use the last 256/1024 dataset")
data_group.add_argument("--eval_every", type=int, default=377_487_360, help="Evaluate at least every N training tokens")
data_group.add_argument("--eval_tokens", type=int, default=47_185_920, help="Evaluate using at least N evaluation tokens")
data_group.add_argument('--max_steps', type=int, default=None, help="Maximum number of steps that each experiment partition will train on. None means no restriction on max steps. ")
data_group.add_argument("--use_full_dataset", action="store_true", help="If set, then we use the full dataset, instead of the last 256/1024 shards")
data_group.add_argument("--tokenizer_path", type=str, help="Tokenizer path that's used to tokenize the dataset")

experiment_group = parser.add_argument_group("Experiment management arguments")
experiment_group.add_argument("--dryrun", action="store_true", help="Whether we are launching dryrun or actual runs")
experiment_group.add_argument("--seeds", type=int, nargs="*", default=[], help="random seeds")
experiment_group.add_argument("--num_exps", type=int, default=1)
experiment_group.add_argument("--num_pars", type=int, default=1)
experiment_group.add_argument("--target_log_ppl", type=float, default=1)
experiment_group.add_argument("--target_log_ppl", type=float, default=5.6)

return parser

@@ -339,13 +351,16 @@ def get_parser() -> argparse.ArgumentParser:
use_full_dataset=args.use_full_dataset,
)

eval_every_n_batches = math.ceil(args.eval_every / (args.gbs * 8192))
eval_batches = math.ceil(args.eval_tokens / (args.gbs * 8192))

exp_prefix, pretrain = get_pretrain(
size=args.size,
nnodes=args.nodes,
ngpus_per_node=args.gpus_per_node,
data_module=data,
eval_every=args.eval_every,
eval_batches=args.eval_batches,
eval_every=eval_every_n_batches,
eval_batches=eval_batches,
)

assert args.gbs % pretrain.trainer.strategy.pipeline_model_parallel_size == 0, "GBS should be divisible by PP"
@@ -365,7 +380,7 @@ def get_parser() -> argparse.ArgumentParser:
constants.GLOBAL_BATCH_SIZE: args.gbs,
constants.GRADIENT_ACCUMULATION_STEPS: grad_accumulation_steps,
constants.MAX_SEQUENCE_LENGTH: 8192,
constants.EVAL_SAMPLES: "to be determined",
constants.EVAL_SAMPLES: args.eval_tokens,

# Optimizers
constants.OPT_NAME: pretrain.optim.config.optimizer,
@@ -396,7 +411,7 @@ def get_parser() -> argparse.ArgumentParser:
# max steps
pretrain.data.num_train_samples = pretrain.trainer.max_steps * pretrain.data.global_batch_size
datamodule = pretrain.data.clone()
datamodule.num_dataset_builder_threads = 8
datamodule.num_dataset_builder_threads = 32
build_data_index = run.Partial(
build_pretraining_datamodule,
datamodule=datamodule,
@@ -410,7 +425,7 @@ def get_parser() -> argparse.ArgumentParser:
data_index_executor.nodes = 1
data_index_executor.ntasks_per_node = 1
data_index_executor.retries = 1
data_index_executor.time = "02:00:00"
data_index_executor.time = "01:00:00"

static_read_from_path = args.initial_ckpt_path if args.use_ckpt else None
static_write_to_path = args.continual_ckpt_path
@@ -441,19 +456,22 @@ def get_parser() -> argparse.ArgumentParser:
experiment_max_steps = args.ckpt_start_step

with run.Experiment(exp_name) as exp:
exp.add(build_data_index, executor=data_index_executor, name="build_data_index")
exp.add(build_data_index, executor=data_index_executor, name=f"build_data_index")

for j in range(args.num_pars):
ending_steps = ""
starting_steps = f"{experiment_max_steps}"
if static_max_steps is not None:
ending_steps = f"-{experiment_max_steps + static_max_steps}-steps"

checkpoint_name = "checkpoint" + f"-par-{j}{ending_steps}"
checkpoint_name = "checkpoint" + f"-seed-{seed}-par-{j}{ending_steps}"
experiment_write_to_path = static_write_to_path + "/" + checkpoint_name

pretrain.resume.resume_from_directory = experiment_read_from_path
pretrain.resume.resume_from_path = experiment_read_from_path
if not args.resume_from_hf:
pretrain.resume.resume_from_directory = experiment_read_from_path
pretrain.resume.resume_from_path = experiment_read_from_path
else:
pretrain.resume = run.Config(nl.AutoResume, restore_config = run.Config(nl.RestoreConfig, path=experiment_read_from_path))
pretrain.log.ckpt.train_time_interval = None

if args.save_ckpt: