2 changes: 1 addition & 1 deletion large_language_model_pretraining/nemo/README.md
@@ -126,7 +126,7 @@ The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.

### Checkpoint download

MLCommons hosts the checkpoint for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama3-1.mlcommons.org) using your organizational email address, then you will receive a link to a download directions page with MLCommons R2 Downloader commands. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address. You should then be able to access the confidentiality form using that Google account._
MLCommons hosts the checkpoint for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama3-1.mlcommons.org) using your organizational email address, then you will receive a link to a download instructions page with [MLCommons R2 Downloader](https://github.com/mlcommons/r2_downloader) commands. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address. You should then be able to access the confidentiality form using that Google account._
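
The concrete metadata URI appears only on the members-only instructions page; as a minimal sketch, the commands there follow the same MLCommons R2 Downloader pattern used elsewhere in this repository (`<CHECKPOINT_URI>` below is a hypothetical placeholder and the target directory is only an example):

```
# Sketch only: <CHECKPOINT_URI> stands in for the metadata .uri link from the
# members-only download instructions page; the -d target directory is an example.
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d checkpoints/llama31-405b <CHECKPOINT_URI>
```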

#### Saving and restoring a checkpoint

15 changes: 3 additions & 12 deletions llama2_70b_lora/README.md
@@ -41,18 +41,9 @@ git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging
```
## Download Data and Model
MLCommons hosts the model and preprocessed dataset for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama2.mlcommons.org) using your organizational email address, then you will receive a link to a directory containing Rclone download instructions. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address._ Once you have access to the Rclone download instructions, follow steps 1-3 to install and set up and authenticate Rclone. Finally, download the model to the desired download directory (default ./models):
```
mkdir models
cd models
rclone copy mlc-llama2:Llama2-70b-fused-qkv-mlperf ./Llama2-70b-fused-qkv-mlperf -P
```
Similarly download the data to the desired download directory (default ./dataset):
```
mkdir dataset
cd dataset
rclone copy mlc-llama2:training/scrolls_gov_report_8k ./scrolls_gov_report_8k -P
```
MLCommons hosts the model and preprocessed dataset for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama2.mlcommons.org) using your organizational email address, then you will receive a link to a download instructions page with [MLCommons R2 Downloader](https://github.com/mlcommons/r2_downloader) commands. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address._

Once you have access to the download instructions, download the Training model to the desired download directory (default ./models). Similarly, download the Training data to the desired download directory (default ./dataset).
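
The concrete metadata URIs are provided only on the members-only instructions page; a minimal sketch of the expected command shape, assuming hypothetical `<MODEL_URI>` and `<DATASET_URI>` placeholders:

```
# Sketch only: <MODEL_URI> and <DATASET_URI> stand in for the metadata .uri
# links provided on the members-only download instructions page.
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d ./models <MODEL_URI>
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d ./dataset <DATASET_URI>
```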

## Llama2-70B on 8 devices

34 changes: 6 additions & 28 deletions retired_benchmarks/bert/README.md
@@ -18,23 +18,12 @@ The following files are available for download in a Cloudflare R2 bucket.
* License.txt
* vocab.txt: Contains WordPiece to id mapping

### Download from bucket
### Download with MLC R2 Downloader

You can access the bucket and download the files with Rclone
Navigate in the terminal to your desired download directory and run the following commands to download the input files. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the input files:

```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/input_files ./input_files -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/bert-input-files.uri
```

### Alternatively generate the checkpoints
@@ -57,23 +46,12 @@ The dataset was prepared using Python 3.7.6, nltk 3.4.5 and the [tensorflow/tens

Files after the download, uncompress, extract, clean-up, and dataset separation steps are available for download in a Cloudflare R2 bucket. The main reason is that WikiExtractor.py replaces some of the tags present in XML such as {CURRENTDAY}, {CURRENTMONTHNAMEGEN} with the current values obtained from time.strftime ([code](https://github.com/attardi/wikiextractor/blob/e4abb4cbd019b0257824ee47c23dd163919b731b/WikiExtractor.py#L632)). Hence, one might see slightly different preprocessed files after WikiExtractor.py is invoked. This means the md5sum hashes of these files will also be different each time WikiExtractor is called.
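
This date dependence is easy to reproduce in isolation; a minimal shell sketch (illustrative only, not WikiExtractor's actual code) of how a date-dependent substitution changes the text, and therefore its md5sum, from one run date to another:

```
# Illustration only, not WikiExtractor code: substituting a date-dependent tag
# yields different text (and a different md5sum) depending on when it runs.
echo "Retrieved on {CURRENTDAY}" | sed "s/{CURRENTDAY}/$(date +%d)/" | md5sum
```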

### Download from bucket
### Download with MLC R2 Downloader

You can access the bucket and download the files with Rclone

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
Navigate in the terminal to your desired download directory and run the following commands to download the dataset. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/processed_dataset ./processed_dataset -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/bert-preprocessed-wikipedia-dataset.uri
```

### Files in ./results directory:
19 changes: 5 additions & 14 deletions retired_benchmarks/gpt3/megatron-lm/README.md
@@ -164,30 +164,21 @@ Evaluation on the validation subset that consists of 24567 examples.
# 6. Other

### S3 artifacts download
The dataset and the checkpoints are available to download from an S3 bucket. You can download this data from the bucket using Rclone as follows:
The dataset and the checkpoints are available to download from an S3-compatible bucket. You can download this data from the bucket using the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:
Navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:

**`dataset_c4_spm.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/dataset_c4_spm.tar ./ -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-preprocessed-dataset.uri
```
**`checkpoint_megatron_fp32.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_megatron_fp32.tar ./ -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-fp32-checkpoint.uri
```
**`checkpoint_nemo_bf16.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_nemo_bf16.tar ./ -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-bf16-checkpoint.uri
```
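
The three downloads above fetch tar archives; a minimal sketch of unpacking them, assuming they were saved to the current directory under the file names listed above:

```
# Assumes the archives were downloaded into the current working directory
# under the names shown in the headings above.
tar -xf dataset_c4_spm.tar
tar -xf checkpoint_megatron_fp32.tar
tar -xf checkpoint_nemo_bf16.tar
```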

### Model conversion from Paxml checkpoints
36 changes: 27 additions & 9 deletions retired_benchmarks/stable_diffusion/README.md
@@ -111,10 +111,21 @@ The benchmark employs two datasets:
### Laion 400m
The benchmark uses a CC-BY licensed subset of the Laion400 dataset.

The LAION datasets comprise lists of URLs for original images, paired with the ALT text linked to those images. As downloading millions of images from the internet is not a deterministic process and to ensure the replicability of the benchmark results, submitters are asked to download the subset from the MLCommons storage. The dataset is provided in two formats:
The LAION datasets comprise lists of URLs for original images, paired with the ALT text linked to those images. Because downloading millions of images from the internet is not a deterministic process, and to ensure the replicability of the benchmark results, submitters are asked to download the subset from MLCommons storage using the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

1. Preprocessed moments (recommended):`scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered`
2. Raw images: `scripts/datasets/laion400m-filtered-download-images.sh --output-dir /datasets/laion-400m/webdataset-filtered`
The dataset is provided in two formats, which can be downloaded with the following commands.

1. Preprocessed moments (recommended):

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/laion-400m/webdataset-moments-filtered https://training.mlcommons-storage.org/metadata/stable-diffusion-laion-400m-filtered-moments-dataset.uri
```

2. Raw images:

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/laion-400m/webdataset-filtered https://training.mlcommons-storage.org/metadata/table-diffusion-laion-400m-filtered-images-dataset.uri
```

While the benchmark code is compatible with both formats, we recommend using the preprocessed moments to save on computational resources.

@@ -123,11 +134,16 @@ For additional information about Laion 400m, the CC-BY subset, and the scripts u
### COCO-2014
The COCO-2014-validation dataset consists of 40,504 images and 202,654 annotations. However, our benchmark uses only a subset of 30,000 images and annotations chosen at random with a preset seed. It's not necessary to download the entire COCO dataset as our focus is primarily on the labels (prompts) and the inception activation for the corresponding images (used for the FID score).

To ensure reproducibility, we ask the submitters to download the relevant files from the MLCommons storage:
To ensure reproducibility, we ask the submitters to download the relevant files from MLCommons storage using the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

The files can be downloaded with the following commands.

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/coco2014 https://training.mlcommons-storage.org/metadata/stable-diffusion-coco2014-validation-prompts-dataset.uri
```

```bash
scripts/datasets/coco2014-validation-download-prompts.sh --output-dir /datasets/coco2014
scripts/datasets/coco2014-validation-download-stats.sh --output-dir /datasets/coco2014
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/coco2014 https://training.mlcommons-storage.org/metadata/stable-diffusion-coco2014-validation-stats-dataset.uri
```

While the benchmark code can work with raw images, we recommend using the preprocessed inception weights to save on computational resources.
@@ -138,19 +154,21 @@ For additional information about the validation process and the used metrics, re

## Downloading the checkpoints

You can download the checkpoints with the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

The benchmark utilizes several network architectures for both the training and validation processes:

1. **Stable Diffusion**: This component leverages StabilityAI's 512-base-ema.ckpt checkpoint from HuggingFace. While the checkpoint includes weights for the UNet, VAE, and OpenCLIP text embedder, the UNet weights are not used and are discarded when loading the weights. The checkpoint can be downloaded with the following command:
```bash
scripts/checkpoints/download_sd.sh --output-dir /checkpoints/sd
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/sd https://training.mlcommons-storage.org/metadata/stable-diffusion-sd-checkpoint.uri
```
2. **Inception**: The Inception network is employed during validation to compute the Fréchet Inception Distance (FID) score. The necessary weights can be downloaded with the following command:
```bash
scripts/checkpoints/download_inception.sh --output-dir /checkpoints/inception
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/inception https://training.mlcommons-storage.org/metadata/stable-diffusion-inception-checkpoint.uri
```
3. **OpenCLIP ViT-H-14 Model**: This model is utilized for the computation of the CLIP score. The required weights can be downloaded using the command:
```bash
scripts/checkpoints/download_clip.sh --output-dir /checkpoints/clip
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/clip https://training.mlcommons-storage.org/metadata/stable-diffusion-clip-checkpoint.uri
```

The aforementioned scripts will handle both the download and integrity verification of the checkpoints.