76 changes: 55 additions & 21 deletions large_language_model_pretraining/nemo/README.md
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ To use this repository, please install a supported version of PyTorch with GPU s

We recommend using the latest NeMo FW container. The latest tested compatible version is `nvcr.io/nvidia/nemo:24.12-rc0`.

#### Container Setup
#### Container setup

All of the following commands are assumed to be run within a container. A [Dockerfile](./Dockerfile) is available for building containers on top of `nvcr.io/nvidia/nemo:24.12-rc0`.

@@ -33,7 +33,10 @@ Note: it's recommended to map your `.ssh` folder to inside the container, so tha

### Steps to download and verify data

The current codebase is still using GPT3's train/val datasets and SentencePieceModel tokenizer. Please refer to [GPT3 instructions](https://github.com/mlcommons/training/tree/master/large_language_model/megatron-lm#preprocessed-data-download) to download **the raw C4 dataset** that we can preprocess later.
The current codebase uses the C4 dataset for training and evaluation. Please refer to [Section 3](#preprocessed-data-download) for downloading the preprocessed dataset and [Section 6](#data-preprocessing) if you would like to perform manual tokenization.


### Steps to download the checkpoint

### Steps to run and time

@@ -100,20 +103,22 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`

### Training and test data separation

To be determined. For now, we are using the default split from the C4 dataset.
We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files for training and `c4-validation.<x>-of-00008.json.gz` files for evaluation.

### Training data order

To be determined. Current plan is to use the last 256 of 1024 files (shards 6 and 7) for the benchmarked area.
We randomly shuffle the **last 256 of 1024 shards** for the benchmarking area.
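As a minimal illustrative sketch (not part of the repository), the selection of the last 256 of 1024 training shards can be expressed as follows; the function name is an assumption for illustration, while the shard naming pattern comes from the C4 dataset itself:

```python
def benchmark_train_shards(total=1024, used=256):
    """Return the file names of the last `used` C4 training shards.

    Shard names follow the c4-train.<x>-of-01024.json.gz pattern;
    the benchmark uses only the last 256 of the 1024 training shards.
    """
    start = total - used  # 1024 - 256 = 768
    return [f"c4-train.{i:05d}-of-{total:05d}.json.gz" for i in range(start, total)]

shards = benchmark_train_shards()
print(len(shards), shards[0], shards[-1])
# → 256 c4-train.00768-of-01024.json.gz c4-train.01023-of-01024.json.gz
```

These 256 shards are the ones shuffled for the benchmarked area; the validation shards are used in order, unshuffled.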

### Test data order

To be determined.
We use the first 47M tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.

# 4. Model
### Publication/Attribution

The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.21783). The main difference is that the model parameters is *to be determined from experiments*.
The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.21783). Two notable differences are:
1. We replace the paper's TikTokenizer with the **Mixtral 8x22B tokenizer** in this benchmark. Please refer to the [Tokenizer](#tokenizer) section for more details.
2. We replace the paper's AdamW with the **Adam optimizer** in this benchmark. Please refer to the [Optimizer](#optimizer-spec) section for more details.

### Model details

@@ -128,24 +133,24 @@ The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.
| Hidden Dimension | 53248 |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Tokenizer | TikTokenizer |
| Vocab size | 128,000 |
| Tokenizer | Mixtral 8x22B tokenizer |
| Vocab size | 32,000 |
| Context Length | 8192 |


### Checkpoint download and conversion
### Checkpoint download

To be determined. For now, we are not using Llama 3.1 default checkpoint.

~~To experiment with a given checkpoint, we have added a `--ckpt` argument that loads the pretrained checkpoint from a **NeMo checkpoint path**, which requires some checkpoint format conversion if the original checkpoint is in LlamaStack or HuggingFace format.~~
MLCommons hosts the checkpoint for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama3-1.mlcommons.org) using your organizational email address; you will then receive a link to a directory containing Rclone download instructions. _If you cannot access the form but are part of an MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with that address. You should then be able to access the confidentiality form using that Google account._

#### Saving and restoring a checkpoint

Large runs may need to span multiple Slurm jobs, so checkpoints must be saved and loaded together with their contexts for training to resume between jobs. To support this, we have added several environment variables; please refer to `config.sh` for details.

### Optimizer
### Optimizer spec

Adam
1. Optimizer type: **Adam**
2. Warmup steps are computed as $8000 \times \lceil {1152 \over GBS} \rceil$.
3. The LR scheduler's maximum number of steps is computed as $1,200,000 \times \lceil {1152 \over GBS} \rceil$.
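A minimal sketch of how these schedule quantities scale with the global batch size (the function names are illustrative, not from the codebase):

```python
import math

def warmup_steps(gbs, base_gbs=1152):
    """Warmup steps scaled by global batch size: 8000 * ceil(1152 / GBS)."""
    return 8000 * math.ceil(base_gbs / gbs)

def max_lr_steps(gbs, base_gbs=1152):
    """LR scheduler's maximum steps: 1,200,000 * ceil(1152 / GBS)."""
    return 1_200_000 * math.ceil(base_gbs / gbs)

print(warmup_steps(1152), max_lr_steps(1152))  # → 8000 1200000
print(warmup_steps(576), max_lr_steps(576))    # → 16000 2400000
```

At the reference global batch size of 1152 the ceiling term is 1, recovering the paper's 8,000 warmup steps and 1,200,000-step decay horizon.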

# 5. Quality
### Quality metric
@@ -154,28 +159,28 @@ Log Perplexity

### Quality target

To be determined.
Validation log perplexity = 5.6

### Evaluation frequency

To be determined.
We perform evaluation every **377,487,360** tokens.

### Evaluation thoroughness

To be determined.
We evaluate using **47,185,920** tokens from the validation dataset.
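With the 8192-token sequence length, these token budgets translate into batch counts for a given global batch size. A sketch of the arithmetic (names are illustrative):

```python
import math

SEQ_LEN = 8192
EVAL_EVERY_TOKENS = 377_487_360  # evaluation frequency, in training tokens
EVAL_TOKENS = 47_185_920         # evaluation thoroughness, in validation tokens

def eval_intervals(gbs):
    """Convert token budgets into (evaluate-every-N-batches, N-eval-batches)."""
    tokens_per_batch = gbs * SEQ_LEN
    eval_every_n_batches = math.ceil(EVAL_EVERY_TOKENS / tokens_per_batch)
    eval_batches = math.ceil(EVAL_TOKENS / tokens_per_batch)
    return eval_every_n_batches, eval_batches

print(eval_intervals(1152))  # → (40, 5)
```

At the reference global batch size of 1152, this means evaluating every 40 training batches using 5 evaluation batches, matching the conversion the launcher script performs before passing these values to the trainer.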


# 6. Other

### Data Preprocessing
### Data preprocessing

Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done and the final dataset can be accessed by following instructions in the [Preprocessed data download]() section.
Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done and the final dataset can be accessed by following instructions in the [Preprocessed data download](#preprocessed-data-download) section.

#### Tokenizer
#### Prepare tokenizer

We use the Mixtral 8x22B tokenizer in this benchmark. Tokenizer files can be downloaded [here](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/tree/main). Only the five files containing tokenizer-related content (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer.model.v1`, `tokenizer_config.json`) are needed.

#### Run Data preprocessing
#### Run data preprocessing

Run the following commands to merge all 1024 training files into 8 `json.gz` files and all 8 validation files into a single `json.gz` file. Each `json.gz` file will be preprocessed into a pair of Megatron dataset files (`.bin` and `.idx`).

@@ -199,5 +204,34 @@ export MERGED_C4_PATH=""
# this path is used for storing the preprocessed .bin and .idx files
export PREPROCESSED_PATH=""

# Extra Slurm-related arguments can be provided here
sbatch preprocess.sh
```
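The shard-to-merged-file grouping performed by the merge step can be sketched as follows. This is an illustrative sketch only: the even split into 8 groups of 128 shards is an assumption, and the actual grouping is defined by the preprocessing script.

```python
def merge_groups(total=1024, merged=8):
    """Partition the C4 training shard names into equal groups for merging.

    Assumes an even split: 1024 shards / 8 merged files = 128 shards each.
    """
    per_group = total // merged
    names = [f"c4-train.{i:05d}-of-{total:05d}.json.gz" for i in range(total)]
    return [names[g * per_group:(g + 1) * per_group] for g in range(merged)]

groups = merge_groups()
print(len(groups), len(groups[0]))  # → 8 128
```

Each merged `json.gz` group is then tokenized into one `.bin`/`.idx` Megatron dataset pair by `preprocess.sh`.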

### HuggingFace checkpoint preprocessing

Here are the instructions to prepare the NeMo-formatted checkpoint from scratch. Checkpoint conversion is already done and the converted checkpoint can be accessed by following instructions in the [Checkpoint download](#checkpoint-download) section.

#### HuggingFace checkpoint downloading

We use the HuggingFace Llama 3.1 405B checkpoint as the initial checkpoint in this benchmark. The original HuggingFace checkpoint can be downloaded [here](https://huggingface.co/meta-llama/Llama-3.1-405B). **Note that we download the BF16 version of the model, not the FP8 version.**

#### Run model conversion

Assuming that we have downloaded the HuggingFace checkpoint to a `<SRC_PATH>` directory, we can run [this script](./utils/launch_nemo_convert.sh) (which calls [this Python script](./utils/nemo_convert.py)) to perform the checkpoint format conversion. After the conversion is done, you should find the converted checkpoint under the `<DST_PATH>` directory, with two subfolders inside: `context` and `weights`.

```bash
# fill in the built container path here
export CONT_IMAGE_URL=""
# fill in the folder that holds the HF checkpoint here
# under this folder, you should see a lot of safetensors
export SRC_PATH=""
# fill in the destination folder of your choice here
# after conversion is done, you can find context and weights under this path
export DST_PATH=""

# Extra Slurm-related arguments can be provided here
sbatch launch_nemo_convert.sh
```

After the model conversion is done, we can then set `MODEL_CKPT=$DST_PATH` together with `FROM_HF=1` when launching our job, so that we can resume training from the converted HF checkpoint.
3 changes: 2 additions & 1 deletion large_language_model_pretraining/nemo/callbacks.py
@@ -138,7 +138,7 @@ def version(self):

### MLPerf callbacks
def compute_consumed_mllog_tokens(trainer, init_global_step, global_batch_size, seq_length):
steps_since_resume = trainer.global_step - init_global_step
steps_since_resume = trainer.global_step + 1 - init_global_step # global steps are 0-indexed
consumed_samples = (
steps_since_resume * global_batch_size
)
@@ -205,6 +205,7 @@ def on_validation_end(self, trainer, pl_module):
if isinstance(logger, MetricsLogger):
if logger.is_target_reached:
trainer.should_stop = True
self.set_success_status()

if not trainer.should_stop:
mllogger.start(key=constants.BLOCK_START, metadata={"epoch_num": self.consumed_tokens(trainer)})
13 changes: 4 additions & 9 deletions large_language_model_pretraining/nemo/config.sh
@@ -60,6 +60,9 @@ export MODEL_CKPT=""
export CONTINUAL_CKPT=""
# Model: Whether we want to restore from MODEL_CKPT path. If 0, then we are not restoring.
export USE_CKPT=0
# Model: Whether we are resuming from a NeMo-formatted HuggingFace checkpoint (weights only).
# If set to 1, then checkpoint resuming code will not try to load the optimizer states.
export FROM_HF=1
# Model: Whether we want to save a checkpoint. Must be 1 if NPAR > 1. If 1, then we save a checkpoint at the end.
export SAVE_CKPT=0

@@ -71,18 +74,10 @@ export SIZE="405b"
export GBS=1152
# Dataloader: Micro batch size
export MBS=1
# Dataloader: Evaluate every N batches, optional
# defaults to evaluate every 20 batches, or 188_743_680 tokens
export EVAL_EVERY="20"
# Dataloader: Evaluate using N batches, optional
# defaults to use 10 batches for evaluation, or 94_371_840 tokens
# If an empty string is provided (""), then we use full validation dataset for evaluation
export EVAL_BATCHES="10"
# Dataloader: Max run N batches, optional
# defaults to train 425 steps, or 4_010_803_200 tokens
# If an empty string is provided (""), then the training will continue until time limit
# If we want to save a checkpoint, then this value must be set
export MAX_STEPS="425"
export MAX_STEPS=""

# Experiment: starting steps
# This is the starting "offset" step from the checkpoint.
68 changes: 43 additions & 25 deletions large_language_model_pretraining/nemo/pretrain_llama31.py
@@ -14,6 +14,7 @@

import argparse
from typing import Optional
import math
from nemo.collections import llm
from nemo.collections.common.tokenizers import AutoTokenizer
from nemo import lightning as nl
@@ -70,14 +71,16 @@ def slurm_executor(
),
nodes=nodes,
ntasks_per_node=devices,
gpus_per_node=devices,
mem="0",
exclusive=True,
gres="gpu:8",
packager=run.GitArchivePackager(),
dependencies=dependencies,
)

if devices != 0:
executor.gpus_per_node=devices
executor.gres = "gpu:8"

executor.launcher = None
executor.container_image = container_image
executor.container_mounts = mounts
@@ -97,6 +100,8 @@ def get_pretrain(
) -> run.Partial:

exp_name = size
base_gbs = 1152
gbs = data_module.global_batch_size

# Providing 8B and 70B here for debugging purpose
# Actual benchmark should use 405B
@@ -142,10 +147,15 @@

pretrain.trainer.strategy.virtual_pipeline_model_parallel_size = 7

base_lr = 8e-5
warmup_tokens = 8000 * base_gbs * 8192

max_lr = (gbs / base_gbs) * base_lr

pretrain.optim = distributed_fused_adam_with_cosine_annealing(
max_lr=8e-5,
warmup_steps=8000,
min_lr=8e-7
max_lr = max_lr,
warmup_steps = math.ceil(warmup_tokens / 8192 / gbs),
min_lr = 8e-7
)

from nemo.collections.llm.recipes.tp_overlap_configs.userbuffers import (
@@ -166,7 +176,8 @@ )
)

# sets up everything else
pretrain.trainer.max_steps = 1_200_000 # Llama 3.1 paper section 3.4.1 - decays LR to 8e10-7 over 1,200,000 steps
max_tokens = 1_200_000 * 8192 * base_gbs # Llama 3.1 paper section 3.4.1 - decays LR to 8e-7 over 1,200,000 steps
pretrain.trainer.max_steps = math.ceil(max_tokens / 8192 / gbs)

pretrain.data = data_module
pretrain.trainer.val_check_interval = eval_every
@@ -279,27 +290,28 @@ def get_parser() -> argparse.ArgumentParser:
])

model_group.add_argument("--initial_ckpt_path", type=str, default=None)
model_group.add_argument("--use_ckpt", action="store_true")
model_group.add_argument("--ckpt_start_step", type=int, default=0)
model_group.add_argument("--continual_ckpt_path", type=str, default=None)
model_group.add_argument("--save_ckpt", action="store_true")
model_group.add_argument("--use_ckpt", action="store_true", help="If set, then resume from the initial checkpoint path")
model_group.add_argument("--resume_from_hf", action="store_true", help="Setting this knob indicates that we are resuming from a weight-only checkpoint")
model_group.add_argument("--ckpt_start_step", type=int, default=0, help="Sets this value to how many steps the resumed checkpoint is already trained on")
model_group.add_argument("--continual_ckpt_path", type=str, default=None, help="Sets this to the path that saves the checkpoint")
model_group.add_argument("--save_ckpt", action="store_true", help="If set, then we save the checkpoint at the end of the experiment")

data_group = parser.add_argument_group("Dataset arguments")

data_group.add_argument("--gbs", type=int, default=288, help="Global batch size, should be divisible by PP")
data_group.add_argument("--gbs", type=int, default=1152, help="Global batch size, should be divisible by PP")
data_group.add_argument("--mbs", type=int, default=1, help="Micro batch size")
data_group.add_argument("--eval_every", type=int, default=20)
data_group.add_argument("--eval_batches", type=int, default=None)
data_group.add_argument('--max_steps', type=int, default=None)
data_group.add_argument("--use_full_dataset", action="store_true", help="Whether we use the full dataset or use the last 256/1024 dataset")
data_group.add_argument("--eval_every", type=int, default=377_487_360, help="Evaluate at least every N training tokens")
data_group.add_argument("--eval_tokens", type=int, default=47_185_920, help="Evaluate using at least N evaluation tokens")
data_group.add_argument('--max_steps', type=int, default=None, help="Maximum number of steps that each experiment partition will train on. None means no restriction on max steps. ")
data_group.add_argument("--use_full_dataset", action="store_true", help="If set, then we use the full dataset, instead of the last 256/1024 shards")
data_group.add_argument("--tokenizer_path", type=str, help="Tokenizer path that's used to tokenize the dataset")

experiment_group = parser.add_argument_group("Experiment management arguments")
experiment_group.add_argument("--dryrun", action="store_true", help="Whether we are launching dryrun or actual runs")
experiment_group.add_argument("--seeds", type=int, nargs="*", default=[], help="random seeds")
experiment_group.add_argument("--num_exps", type=int, default=1)
experiment_group.add_argument("--num_pars", type=int, default=1)
experiment_group.add_argument("--target_log_ppl", type=float, default=1)
experiment_group.add_argument("--target_log_ppl", type=float, default=5.6)

return parser

@@ -339,13 +351,16 @@ def get_parser() -> argparse.ArgumentParser:
use_full_dataset=args.use_full_dataset,
)

eval_every_n_batches = math.ceil(args.eval_every / (args.gbs * 8192))
eval_batches = math.ceil(args.eval_tokens / (args.gbs * 8192))

exp_prefix, pretrain = get_pretrain(
size=args.size,
nnodes=args.nodes,
ngpus_per_node=args.gpus_per_node,
data_module=data,
eval_every=args.eval_every,
eval_batches=args.eval_batches,
eval_every=eval_every_n_batches,
eval_batches=eval_batches,
)

assert args.gbs % pretrain.trainer.strategy.pipeline_model_parallel_size == 0, "GBS should be divisible by PP"
@@ -365,7 +380,7 @@ def get_parser() -> argparse.ArgumentParser:
constants.GLOBAL_BATCH_SIZE: args.gbs,
constants.GRADIENT_ACCUMULATION_STEPS: grad_accumulation_steps,
constants.MAX_SEQUENCE_LENGTH: 8192,
constants.EVAL_SAMPLES: "to be determined",
constants.EVAL_SAMPLES: args.eval_tokens,

# Optimizers
constants.OPT_NAME: pretrain.optim.config.optimizer,
@@ -396,7 +411,7 @@ def get_parser() -> argparse.ArgumentParser:
# max steps
pretrain.data.num_train_samples = pretrain.trainer.max_steps * pretrain.data.global_batch_size
datamodule = pretrain.data.clone()
datamodule.num_dataset_builder_threads = 8
datamodule.num_dataset_builder_threads = 32
build_data_index = run.Partial(
build_pretraining_datamodule,
datamodule=datamodule,
@@ -410,7 +425,7 @@ def get_parser() -> argparse.ArgumentParser:
data_index_executor.nodes = 1
data_index_executor.ntasks_per_node = 1
data_index_executor.retries = 1
data_index_executor.time = "02:00:00"
data_index_executor.time = "01:00:00"

static_read_from_path = args.initial_ckpt_path if args.use_ckpt else None
static_write_to_path = args.continual_ckpt_path
@@ -441,19 +456,22 @@ def get_parser() -> argparse.ArgumentParser:
experiment_max_steps = args.ckpt_start_step

with run.Experiment(exp_name) as exp:
exp.add(build_data_index, executor=data_index_executor, name="build_data_index")
exp.add(build_data_index, executor=data_index_executor, name=f"build_data_index")

for j in range(args.num_pars):
ending_steps = ""
starting_steps = f"{experiment_max_steps}"
if static_max_steps is not None:
ending_steps = f"-{experiment_max_steps + static_max_steps}-steps"

checkpoint_name = "checkpoint" + f"-par-{j}{ending_steps}"
checkpoint_name = "checkpoint" + f"-seed-{seed}-par-{j}{ending_steps}"
experiment_write_to_path = static_write_to_path + "/" + checkpoint_name

pretrain.resume.resume_from_directory = experiment_read_from_path
pretrain.resume.resume_from_path = experiment_read_from_path
if not args.resume_from_hf:
pretrain.resume.resume_from_directory = experiment_read_from_path
pretrain.resume.resume_from_path = experiment_read_from_path
else:
pretrain.resume = run.Config(nl.AutoResume, restore_config = run.Config(nl.RestoreConfig, path=experiment_read_from_path))
pretrain.log.ckpt.train_time_interval = None

if args.save_ckpt: