You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Alternatively to downloading the checkpoint in Megatron ready format, it can be obtained by converting a Paxml checkpoint.
195
195
196
-
Paxml Checkpoint is available at: `gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000`
197
-
To resume training from the above checkpoint on Megatron, it should be converted into a format suitable for Megatron (this step only needs to be done once).
198
-
199
196
To convert Paxml checkpoint to the Megatron's format, a [script](scripts/convert_paxml_to_megatron_distributed.py) has been provided:
200
197
```bash
201
198
# Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
This should result in the same checkpoint as described in the "Checkpoint download" section above.
207
204
208
205
### Dataset preprocessing
209
-
Here are the instructions to prepare the preprocessed dataset from scratch.
206
+
Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done and the final dataset can be accessed by following instructions in [S3 artifacts download](#s3-artifacts-download) section.
Validation dataset needs to be downloaded from `gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json` to ${C4_PATH}.
220
+
Validation data subset can be downloaded from `gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json` to ${C4_PATH}.
224
221
225
222
#### Data Preprocessing for Megatron-LM
226
223
@@ -247,7 +244,7 @@ for shard in {6..7}; do
247
244
done
248
245
```
249
246
250
-
After preparing the data folder, download tokenizer model. The tokenizer model should be downloaded from `gs://mlperf-llm-public2/vocab/c4_en_301_5Mexp2_spm.model` and renamed as `${C4_PATH}/tokenizers/c4_spm/sentencepiece.model`. Make sure an output directory `${C4_PATH}/preprocessed_c4_spm` exists before the next step.
247
+
After preparing the data folder, download tokenizer model. The tokenizer model `c4_en_301_5Mexp2_spm.model` can be downloaded by following instructions in [S3 artifacts download](#s3-artifacts-download) and renamed as `${C4_PATH}/tokenizers/c4_spm/sentencepiece.model`. Make sure an output directory `${C4_PATH}/preprocessed_c4_spm` exists before the next step.
251
248
252
249
Modify `C4_PATH` in `preprocess.sh` and `preprocess_val.sh` to specify
253
250
the correct input/output paths and run preprocessing as follows
0 commit comments