2 changes: 1 addition & 1 deletion large_language_model_pretraining/nemo/README.md
@@ -126,7 +126,7 @@ The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.

### Checkpoint download

MLCommons hosts the checkpoint for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama3-1.mlcommons.org) using your organizational email address, then you will receive a link to a download directions page with MLCommons R2 Downloader commands. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address. You should then be able to access the confidentiality form using that Google account._
MLCommons hosts the checkpoint for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama3-1.mlcommons.org) using your organizational email address, then you will receive a link to a download instructions page with [MLCommons R2 Downloader](https://github.com/mlcommons/r2_downloader) commands. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address. You should then be able to access the confidentiality form using that Google account._
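
The concrete metadata URI appears only on the members-only instructions page; as a minimal sketch, the commands there follow the same MLCommons R2 Downloader pattern used elsewhere in this repository (`<CHECKPOINT_URI>` below is a hypothetical placeholder and the target directory is only an example):

```
# Sketch only: <CHECKPOINT_URI> stands in for the metadata .uri link from the
# members-only download instructions page; the -d target directory is an example.
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d checkpoints/llama31-405b <CHECKPOINT_URI>
```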

#### Saving and restoring a checkpoint

15 changes: 3 additions & 12 deletions llama2_70b_lora/README.md
@@ -41,18 +41,9 @@ git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging
```
## Download Data and Model
MLCommons hosts the model and preprocessed dataset for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama2.mlcommons.org) using your organizational email address, then you will receive a link to a directory containing Rclone download instructions. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address._ Once you have access to the Rclone download instructions, follow steps 1-3 to install and set up and authenticate Rclone. Finally, download the model to the desired download directory (default ./models):
```
mkdir models
cd models
rclone copy mlc-llama2:Llama2-70b-fused-qkv-mlperf ./Llama2-70b-fused-qkv-mlperf -P
```
Similarly download the data to the desired download directory (default ./dataset):
```
mkdir dataset
cd dataset
rclone copy mlc-llama2:training/scrolls_gov_report_8k ./scrolls_gov_report_8k -P
```
MLCommons hosts the model and preprocessed dataset for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama2.mlcommons.org) using your organizational email address, then you will receive a link to a download instructions page with [MLCommons R2 Downloader](https://github.com/mlcommons/r2_downloader) commands. _If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address._

Once you have access to the download instructions, download the Training model to the desired download directory (default ./models). Similarly, download the Training data to the desired download directory (default ./dataset).
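
The concrete metadata URIs are provided only on the members-only instructions page; a minimal sketch of the expected command shape, assuming hypothetical `<MODEL_URI>` and `<DATASET_URI>` placeholders:

```
# Sketch only: <MODEL_URI> and <DATASET_URI> stand in for the metadata .uri
# links provided on the members-only download instructions page.
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d ./models <MODEL_URI>
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d ./dataset <DATASET_URI>
```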

## Llama2-70B on 8 devices

34 changes: 6 additions & 28 deletions retired_benchmarks/bert/README.md
@@ -18,23 +18,12 @@ The following files are available for download in a Cloudflare R2 bucket.
* License.txt
* vocab.txt: Contains WordPiece to id mapping

### Download from bucket
### Download with MLC R2 Downloader

You can access the bucket and download the files with Rclone
Navigate in the terminal to your desired download directory and run the following commands to download the input files. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the input files:

```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/input_files ./input_files -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/bert-input-files.uri
```

### Alternatively generate the checkpoints
@@ -57,23 +46,12 @@ The dataset was prepared using Python 3.7.6, nltk 3.4.5 and the [tensorflow/tens

Files after the download, uncompress, extract, clean-up, and dataset separation steps are available for download in a Cloudflare R2 bucket. The main reason is that WikiExtractor.py replaces some of the tags present in XML such as {CURRENTDAY}, {CURRENTMONTHNAMEGEN} with the current values obtained from time.strftime ([code](https://github.com/attardi/wikiextractor/blob/e4abb4cbd019b0257824ee47c23dd163919b731b/WikiExtractor.py#L632)). Hence, one might see slightly different preprocessed files after WikiExtractor.py is invoked. This means the md5sum hashes of these files will also be different each time WikiExtractor is called.
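
This date dependence is easy to reproduce in isolation; a minimal shell sketch (illustrative only, not WikiExtractor's actual code) of how a date-dependent substitution changes the text, and therefore its md5sum, from one run date to another:

```
# Illustration only, not WikiExtractor code: substituting a date-dependent tag
# yields different text (and a different md5sum) depending on when it runs.
echo "Retrieved on {CURRENTDAY}" | sed "s/{CURRENTDAY}/$(date +%d)/" | md5sum
```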

### Download from bucket
### Download with MLC R2 Downloader

You can access the bucket and download the files with Rclone

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
Navigate in the terminal to your desired download directory and run the following commands to download the dataset. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/processed_dataset ./processed_dataset -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/bert-preprocessed-wikipedia-dataset.uri
```

### Files in ./results directory:
19 changes: 5 additions & 14 deletions retired_benchmarks/gpt3/megatron-lm/README.md
@@ -164,30 +164,21 @@ Evaluation on the validation subset that consists of 24567 examples.
# 6. Other

### S3 artifacts download
The dataset and the checkpoints are available to download from an S3 bucket. You can download this data from the bucket using Rclone as follows:
The dataset and the checkpoints are available to download from an S3-compatible bucket. You can download this data from the bucket using the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:
Navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:

**`dataset_c4_spm.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/dataset_c4_spm.tar ./ -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-preprocessed-dataset.uri
```
**`checkpoint_megatron_fp32.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_megatron_fp32.tar ./ -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-fp32-checkpoint.uri
```
**`checkpoint_nemo_bf16.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_nemo_bf16.tar ./ -P
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-bf16-checkpoint.uri
```
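
The three downloads above fetch tar archives; a minimal sketch of unpacking them, assuming they were saved to the current directory under the file names listed above:

```
# Assumes the archives were downloaded into the current working directory
# under the names shown in the headings above.
tar -xf dataset_c4_spm.tar
tar -xf checkpoint_megatron_fp32.tar
tar -xf checkpoint_nemo_bf16.tar
```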

### Model conversion from Paxml checkpoints
36 changes: 27 additions & 9 deletions retired_benchmarks/stable_diffusion/README.md
@@ -111,10 +111,21 @@ The benchmark employs two datasets:
### Laion 400m
The benchmark uses a CC-BY licensed subset of the Laion400 dataset.

The LAION datasets comprise lists of URLs for original images, paired with the ALT text linked to those images. As downloading millions of images from the internet is not a deterministic process and to ensure the replicability of the benchmark results, submitters are asked to download the subset from the MLCommons storage. The dataset is provided in two formats:
The LAION datasets comprise lists of URLs for original images, paired with the ALT text linked to those images. Because downloading millions of images from the internet is not a deterministic process, and to ensure the replicability of the benchmark results, submitters are asked to download the subset from MLCommons storage using the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

1. Preprocessed moments (recommended):`scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered`
2. Raw images: `scripts/datasets/laion400m-filtered-download-images.sh --output-dir /datasets/laion-400m/webdataset-filtered`
The dataset is provided in two formats, which can be downloaded with the following commands.

1. Preprocessed moments (recommended):

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/laion-400m/webdataset-moments-filtered https://training.mlcommons-storage.org/metadata/stable-diffusion-laion-400m-filtered-moments-dataset.uri
```

2. Raw images:

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/laion-400m/webdataset-filtered https://training.mlcommons-storage.org/metadata/table-diffusion-laion-400m-filtered-images-dataset.uri
```

While the benchmark code is compatible with both formats, we recommend using the preprocessed moments to save on computational resources.

@@ -123,11 +134,16 @@ For additional information about Laion 400m, the CC-BY subset, and the scripts u
### COCO-2014
The COCO-2014-validation dataset consists of 40,504 images and 202,654 annotations. However, our benchmark uses only a subset of 30,000 images and annotations chosen at random with a preset seed. It's not necessary to download the entire COCO dataset as our focus is primarily on the labels (prompts) and the inception activation for the corresponding images (used for the FID score).

To ensure reproducibility, we ask the submitters to download the relevant files from the MLCommons storage:
To ensure reproducibility, we ask the submitters to download the relevant files from MLCommons storage using the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

The files can be downloaded with the following commands.

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/coco2014 https://training.mlcommons-storage.org/metadata/stable-diffusion-coco2014-validation-prompts-dataset.uri
```

```bash
scripts/datasets/coco2014-validation-download-prompts.sh --output-dir /datasets/coco2014
scripts/datasets/coco2014-validation-download-stats.sh --output-dir /datasets/coco2014
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/coco2014 https://training.mlcommons-storage.org/metadata/stable-diffusion-coco2014-validation-stats-dataset.uri
```

While the benchmark code can work with raw images, we recommend using the preprocessed inception weights to save on computational resources.
@@ -138,19 +154,21 @@ For additional information about the validation process and the used metrics, re

## Downloading the checkpoints

You can download the checkpoints with the MLCommons R2 Downloader. More information about the MLCommons R2 Downloader, including how to run it on Windows and in the dedicated container image, can be found [here](https://training.mlcommons-storage.org).

The benchmark utilizes several network architectures for both the training and validation processes:

1. **Stable Diffusion**: This component leverages StabilityAI's 512-base-ema.ckpt checkpoint from HuggingFace. While the checkpoint includes weights for the UNet, VAE, and OpenCLIP text embedder, the UNet weights are not used and are discarded when loading the weights. The checkpoint can be downloaded with the following command:
```bash
scripts/checkpoints/download_sd.sh --output-dir /checkpoints/sd
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/sd https://training.mlcommons-storage.org/metadata/stable-diffusion-sd-checkpoint.uri
```
2. **Inception**: The Inception network is employed during validation to compute the Fréchet Inception Distance (FID) score. The necessary weights can be downloaded with the following command:
```bash
scripts/checkpoints/download_inception.sh --output-dir /checkpoints/inception
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/inception https://training.mlcommons-storage.org/metadata/stable-diffusion-inception-checkpoint.uri
```
3. **OpenCLIP ViT-H-14 Model**: This model is utilized for the computation of the CLIP score. The required weights can be downloaded using the command:
```bash
scripts/checkpoints/download_clip.sh --output-dir /checkpoints/clip
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/clip https://training.mlcommons-storage.org/metadata/stable-diffusion-clip-checkpoint.uri
```

The aforementioned scripts will handle both the download and integrity verification of the checkpoints.