Commit cdd928d

Remove gs://mlperf-llm-public2/ dependency and make reproducibility instructions clear (#761)
1 parent d3bf70b commit cdd928d

File tree

1 file changed: +3 −6 lines changed


large_language_model/megatron-lm/README.md

@@ -193,9 +193,6 @@ rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoin
 ### Model conversion from Paxml checkpoints
 Alternatively to downloading the checkpoint in Megatron ready format, it can be obtained by converting a Paxml checkpoint.
 
-Paxml Checkpoint is available at: `gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000`
-To resume training from the above checkpoint on Megatron, it should be converted into a format suitable for Megatron (this step only needs to be done once).
-
 
 To convert Paxml checkpoint to the Megatron's format, a [script](scripts/convert_paxml_to_megatron_distributed.py) has been provided:
 ```bash
 # Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
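The two-step conversion sequence described in this hunk can be sketched as a small shell driver. This is a minimal sketch, not the repo's actual invocation: the checkpoint directory is a placeholder, and the two conversion commands are only echoed because their full argument lists are not shown in the diff (consult each script for the real flags).

```shell
#!/usr/bin/env sh
# Hedged sketch of the Paxml -> Megatron conversion flow; paths are placeholders.
set -eu
EXTERNAL_MODEL_CHECKPOINT_DIR=${EXTERNAL_MODEL_CHECKPOINT_DIR:-/tmp/megatron_ckpt}
mkdir -p "$EXTERNAL_MODEL_CHECKPOINT_DIR"
# Step 1: distributed parameter conversion (~40 min on a DGXA100, ~1 TB CPU memory);
# the real arguments are documented in the script itself, so only echo the step here.
echo "step 1: python scripts/convert_paxml_to_megatron_distributed.py"
# Step 2: convert the small fp32 common state with json_to_torch.py (same caveat).
echo "step 2: python json_to_torch.py -i common_fp32.json"
echo "target: $EXTERNAL_MODEL_CHECKPOINT_DIR"
```

Since this step only needs to run once per checkpoint, wrapping it in a script like this makes the one-off nature explicit.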
@@ -206,7 +203,7 @@ python json_to_torch.py -i common_fp32.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/co
 This should result in the same checkpoint as described in the "Checkpoint download" section above.
 
 ### Dataset preprocessing
-Here are the instructions to prepare the preprocessed dataset from scratch.
+Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done and the final dataset can be accessed by following instructions in [S3 artifacts download](#s3-artifacts-download) section.
 
 #### Data Download
 Training dataset -
@@ -220,7 +217,7 @@ git lfs pull --include "en/c4-train.009*.json.gz"
 git lfs pull --include "en/c4-train.01*.json.gz"
 ```
 
-Validation dataset needs to be downloaded from `gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json` to ${C4_PATH}.
+Validation data subset can be downloaded from `gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json` to ${C4_PATH}.
 
 #### Data Preprocessing for Megatron-LM
 
@@ -247,7 +244,7 @@ for shard in {6..7}; do
 done
 ```
 
-After preparing the data folder, download tokenizer model. The tokenizer model should be downloaded from `gs://mlperf-llm-public2/vocab/c4_en_301_5Mexp2_spm.model` and renamed as `${C4_PATH}/tokenizers/c4_spm/sentencepiece.model`. Make sure an output directory `${C4_PATH}/preprocessed_c4_spm` exists before the next step.
+After preparing the data folder, download tokenizer model. The tokenizer model `c4_en_301_5Mexp2_spm.model` can be downloaded by following instructions in [S3 artifacts download](#s3-artifacts-download) and renamed as `${C4_PATH}/tokenizers/c4_spm/sentencepiece.model`. Make sure an output directory `${C4_PATH}/preprocessed_c4_spm` exists before the next step.
 
 Modify `C4_PATH` in `preprocess.sh` and `preprocess_val.sh` to specify
 the correct input/output paths and run preprocessing as follows
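The tokenizer placement and output-directory steps in this hunk can be sketched as follows. `C4_PATH` here is a throwaway demo directory, and the `touch` stands in for the actual download of `c4_en_301_5Mexp2_spm.model`:

```shell
# Sketch: rename the downloaded tokenizer and create the preprocessing output dir.
set -eu
C4_PATH=${C4_PATH:-/tmp/c4_demo}
mkdir -p "$C4_PATH/tokenizers/c4_spm" "$C4_PATH/preprocessed_c4_spm"
touch "$C4_PATH/c4_en_301_5Mexp2_spm.model"   # stand-in for the real download
mv "$C4_PATH/c4_en_301_5Mexp2_spm.model" \
   "$C4_PATH/tokenizers/c4_spm/sentencepiece.model"
echo "tokenizer at: $C4_PATH/tokenizers/c4_spm/sentencepiece.model"
```

Creating `${C4_PATH}/preprocessed_c4_spm` up front matters because the preprocessing scripts expect the output directory to exist before they run.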
