- MosaicML's ResNet-50 Recipes Docker Image
- Tag: `mosaicml/pytorch_vision:resnet50_recipes`
- The image comes pre-configured with the following dependencies:
  - Mosaic ResNet training recipes
  - Training entrypoint: `train.py`
  - Composer Version: 0.7.1
  - PyTorch Version: 1.11.0
  - CUDA Version: 11.3
  - Python Version: 3.9
  - Ubuntu Version: 20.04
Prerequisites:
- Docker or your container orchestration framework of choice
- ImageNet dataset
- System with NVIDIA GPUs
As described in our blog post:
> We actually cooked up three Mosaic ResNet recipes – which we call Mild, Medium, and Hot – to suit a range of requirements. The Mild recipe is for shorter training runs, the Medium recipe is for longer training runs, and the Hot recipe is for the very longest training runs that maximize accuracy.
To reproduce a specific run, two pieces of information are required:

- `recipe_yaml_path`: Path to the configuration file specifying the model and training parameters unique to each recipe.
- `scale_schedule_ratio`: Factor which scales the duration of a particular run.

Note: The `scale_schedule_ratio` is a scaling factor for `max_duration`; each recipe sets a default of `max_duration = 90ep` (90 epochs). Thus, a run with `scale_schedule_ratio = 0.3` will train for 90 * 0.3 = 27 epochs.
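As a quick sanity check, the scaling arithmetic can be sketched in a few lines of Python (the 90-epoch default comes from the recipe configurations described above):

```python
# Minimal sketch of how scale_schedule_ratio scales a run's length.
# The 90-epoch default max_duration comes from the recipes described above.
MAX_DURATION_EPOCHS = 90

def scaled_epochs(scale_schedule_ratio: float) -> float:
    """Number of training epochs after schedule scaling."""
    return MAX_DURATION_EPOCHS * scale_schedule_ratio

print(scaled_epochs(0.3))   # 27.0 epochs, matching the note above
```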
First, choose the recipe you would like to work with: `Mild`, `Medium`, or `Hot`. This will determine which configuration file, `recipe_yaml_path`, you will need to specify.
Next, determine the proper `scale_schedule_ratio` to specify to reproduce the desired run by using MosaicML's Explorer. Explorer enables users to identify the most cost-effective way to run training workloads across clouds and on different types of hardware backends for a variety of models and datasets. For this tutorial, we will focus on the Mosaic ResNet run data.
The table below provides the `recipe_yaml_path` for the selected recipe and a link to the corresponding Explorer page, which can be used to select a specific run and obtain the corresponding value for `scale_schedule_ratio`:

| Recipe | `recipe_yaml_path` | Explorer link |
|---|---|---|
| Mild | `recipes/resnet50_mild.yaml` | Mosaic ResNet Mild |
| Medium | `recipes/resnet50_medium.yaml` | Mosaic ResNet Medium |
| Hot | `recipes/resnet50_hot.yaml` | Mosaic ResNet Hot |
You can also compare all three recipes here.
In this tutorial we will use the `Mild` recipe and reproduce this run, which results in a Top-1 accuracy of 76.19%. Thus, from the table above, `recipe_yaml_path = recipes/resnet50_mild.yaml`, and from Explorer, `scale_schedule_ratio = 0.32` for the desired run.
Now that we've selected a recipe and determined the recipe_yaml_path and scale_schedule_ratio to specify, let's kick off a training run.
- Launch a Docker container using the `mosaicml/pytorch_vision:resnet50_recipes` image on your training system:

  ```
  docker pull mosaicml/pytorch_vision:resnet50_recipes
  docker run -it mosaicml/pytorch_vision:resnet50_recipes
  ```

  Note: The `mosaicml/pytorch_vision:resnet50_recipes` Docker image can also be used with your container orchestration framework of choice.
- Download the ImageNet dataset from http://www.image-net.org/.

- Create the dataset folder and extract training and validation images to the appropriate subfolders. The following script can be used to facilitate this process. Be sure to note the directory path where you extracted the dataset.

  Note: This tutorial assumes that the dataset is installed to the `/tmp/ImageNet` path.
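Before moving on, it can help to confirm the extracted dataset matches the layout the training scripts expect. The sketch below is an assumption based on the standard `train/` and `val/` class-subfolder convention (torchvision `ImageFolder` style), with `/tmp/ImageNet` mirroring this tutorial's default path; it is not part of the official recipes.

```python
# Hypothetical helper: sanity-check the extracted ImageNet directory layout.
# Assumes the standard ImageFolder convention: <root>/train/<class>/*.JPEG
# and <root>/val/<class>/*.JPEG, with 1000 class folders per split.
import os

def check_layout(root: str) -> list:
    """Return a list of problems found with the dataset layout (empty if OK)."""
    problems = []
    for split in ("train", "val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            problems.append("missing directory: " + split_dir)
            continue
        # Each split should contain one subfolder per ImageNet class.
        classes = [d for d in os.listdir(split_dir)
                   if os.path.isdir(os.path.join(split_dir, d))]
        if len(classes) != 1000:
            problems.append("%s: expected 1000 class folders, found %d"
                            % (split_dir, len(classes)))
    return problems

if __name__ == "__main__":
    for problem in check_layout("/tmp/ImageNet"):
        print(problem)
```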
- The `Mild` and `Medium` recipes require converting the ImageNet dataset to FFCV format. This conversion only needs to be performed once; the converted files can be stashed away for reuse in subsequent runs. The `Hot` recipe uses the original ImageNet data.

  - Download the helper conversion script:

    ```
    wget -P /tmp https://raw.githubusercontent.com/mosaicml/composer/v0.7.1/scripts/ffcv/create_ffcv_datasets.py
    ```

  - Convert the training and validation datasets:

    ```
    python /tmp/create_ffcv_datasets.py --dataset imagenet --split train --datadir /tmp/ImageNet/
    python /tmp/create_ffcv_datasets.py --dataset imagenet --split val --datadir /tmp/ImageNet/
    ```

    Note: The helper script outputs the FFCV-formatted dataset files to `/tmp/imagenet_train.ffcv` and `/tmp/imagenet_val.ffcv` for the training and validation data, respectively.
- Launch the training run:

  ```
  composer -n {num_gpus} train.py -f {recipe_yaml_path} --scale_schedule_ratio {scale_schedule_ratio}
  ```

  Replace `num_gpus`, `recipe_yaml_path`, and `scale_schedule_ratio` with the total number of GPUs, the recipe configuration file, and the scale schedule ratio determined in the previous section, respectively.

  Note: The `Mild` and `Medium` recipes assume the training and validation data are stored at the `/tmp/imagenet_train.ffcv` and `/tmp/imagenet_val.ffcv` paths, while the `Hot` recipe assumes the original ImageNet dataset is stored at the `/tmp/ImageNet` path. The default dataset paths can be overridden; please run `composer -n {num_gpus} train.py -f {recipe_yaml_path} --help` for more detailed, recipe-specific configuration information.

  Example:

  ```
  composer -n 8 train.py -f recipes/resnet50_mild.yaml --scale_schedule_ratio 0.32
  ```

  The example above will train on 8 GPUs using the `Mild` recipe with a scale schedule ratio of 0.32. You can compare your run's final Top-1 accuracy and time to train to our result.