nileshsawant/mlperf-deepcam
NOTE: This implementation is outdated. The most recent published version of the DeepCAM benchmark can be found in the MLCommons repository, which also hosts an unoptimized reference version. This implementation additionally contains parameter sets which show good training convergence for a variety of batch sizes.

Deep Learning Climate Segmentation Benchmark

PyTorch implementation of the climate segmentation benchmark, based on the Exascale Deep Learning for Climate Analytics codebase (https://github.com/azrael417/ClimDeepLearn) and the paper https://arxiv.org/abs/1810.01993. This is a fork of https://github.com/azrael417/mlperf-deepcam with changes for running the benchmark on Kestrel.

Dataset

The dataset for this benchmark comes from CAM5 [1] simulations and is hosted at NERSC. The samples are stored in HDF5 files with input images of shape (768, 1152, 16) and pixel-level labels of shape (768, 1152). The labels have three target classes (background, atmospheric river, tropical cyclone) and were produced with TECA [2].

The current recommended way to get the data is to use Globus and the following Globus endpoint:

https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F

The dataset folder contains a README with some technical description of the dataset and an All-Hist folder containing all of the data files.

Preprocessing

The dataset is already pre-split into train/val/test, and a summary statistics file is shipped as well. Note that the test split is never used by the benchmark (i.e. it is not required for submission), but it can be used to perform additional cross-checks of generalization and training stability.

The split was generated with the scripts under src/utils. If you want to manually split and summarize the dataset, feel free to use these scripts, but note that for a valid benchmark submission the dataset split and preprocessing must be the same as in the reference implementation.

For splitting the dataset, change lines 5 and 6 (inputdir and outputdir) in split_data.py accordingly. The first variable should specify the absolute path to the full dataset; the second specifies the parent directory where the train/validation/test splits end up. Instead of copying the files, symbolic links are created. Therefore, if you plan to run the code from a container or a system with different mount points than those used for the splitting, the links might be invalid and files not found. In that case, perform the splitting in the same environment used for the runs later.
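The symlink behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual code in split_data.py: the 80/10/10 fractions, the file pattern, and the seed are assumptions.

```python
# Hypothetical sketch of symlink-based dataset splitting; the authoritative
# logic lives in src/utils/split_data.py. Fractions and seed are assumptions.
import os
import random


def split_dataset(inputdir, outputdir, fractions=(0.8, 0.1, 0.1), seed=333):
    """Symlink HDF5 files from inputdir into train/validation/test subdirs."""
    files = sorted(f for f in os.listdir(inputdir) if f.endswith(".h5"))
    random.Random(seed).shuffle(files)
    n_train = int(fractions[0] * len(files))
    n_val = int(fractions[1] * len(files))
    splits = {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
    for split, names in splits.items():
        os.makedirs(os.path.join(outputdir, split), exist_ok=True)
        for name in names:
            # Symlinks (not copies) are created, so the absolute input path
            # must remain valid in the environment where training later runs.
            os.symlink(os.path.join(inputdir, name),
                       os.path.join(outputdir, split, name))
    return {k: len(v) for k, v in splits.items()}
```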

For summarizing the dataset (i.e. computing summary statistics for input normalization), use the script summarize_data.py in the same directory. Modify line 85 (data_path_prefix) accordingly: it should point to the parent directory which hosts all the splits, i.e. it is equal to the output directory from the splitting script above. Note that the summary script uses mpi4py for distributed computing, as the whole summarization on a single CPU can take a few hours. Once the stats.h5 file is created, place it inside the training, test and validation directories.
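The kind of per-channel statistics the summary step produces can be sketched serially. The real summarize_data.py distributes this work with mpi4py; the accumulation scheme and array shapes below are illustrative assumptions, not its exact code.

```python
# Serial sketch of per-channel mean/std computation for input normalization.
# The benchmark's own script is src/utils/summarize_data.py (mpi4py-based);
# this sum-of-squares accumulation is an illustrative assumption.
import numpy as np


def channel_stats(samples):
    """Accumulate per-channel mean/std over samples of shape (H, W, C)."""
    total = None
    total_sq = None
    count = 0
    for sample in samples:
        x = sample.astype(np.float64)
        s = x.sum(axis=(0, 1))          # per-channel sum
        sq = np.square(x).sum(axis=(0, 1))  # per-channel sum of squares
        if total is None:
            total, total_sq = s, sq
        else:
            total += s
            total_sq += sq
        count += x.shape[0] * x.shape[1]
    mean = total / count
    std = np.sqrt(total_sq / count - np.square(mean))
    return mean, std
```

Accumulating only sums and sums of squares keeps memory constant, which is why this pattern parallelizes well across MPI ranks.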

Before you run

For Kestrel users, do the following to get the python environment:

cd /projects/<projectname>/<username>/
cp /nopt/nrel/apps/examples/python_envs/deepcamKestrel.tar.gz . 
mkdir -p deepcam_env
tar -xzf deepcamKestrel.tar.gz -C deepcam_env
source deepcam_env/bin/activate
conda-unpack
echo "import numpy; numpy.version.version" > ${CONDA_PREFIX}/lib/python3.13/site-packages/00-preload-numpy.pth

The deepcamKestrel environment has all the dependencies needed to run this benchmark.

How to run the benchmark on Kestrel

A tip about the file paths: for this benchmark, a folder called /scratch/<username>/deepcam was created. The dataset was then downloaded through Globus into that directory, and this repository was cloned there as well. This is what the resulting file structure looks like:

/scratch/<username>/deepcam/                # Main project directory in HPC scratch space
├── All-Hist/                              # Dataset directory (downloaded via Globus)
│   ├── train/                              # Training data files
│   ├── validation/                         # Validation data files  
│   ├── test/                               # Test data files
│   └── stats.h5                           # Summary statistics file
├── allhist_file_summary.txt               # Dataset file summary
├── cam5_runs/                             # CAM5 simulation runs
├── deepcam-data-mini/                     # Mini dataset for testing
├── make_summary.sh                        # Script to create data summaries
├── mlperf-deepcam/                        # This repository (cloned here)
│   ├── LICENSE
│   ├── README.md
│   ├── analysis/
│   │   ├── process_nsight_deepcam.ipynb
│   │   ├── roofline_plot.ipynb
│   │   ├── training_analysis.ipynb
│   │   └── utils.py
│   ├── docker/
│   │   ├── build_docker.sh
│   │   ├── Dockerfile.profile.public
│   │   ├── Dockerfile.train
│   │   ├── pull_image_cori-gpu.sh
│   │   ├── run_docker_circe.sh
│   │   ├── run_docker_coccobello.sh
│   │   └── run_docker_dgx2.sh
│   └── src/
│       ├── deepCam/
│       │   ├── profile_hdf5_ddp.py
│       │   ├── train_hdf5_ddp.py
│       │   ├── architecture/
│       │   ├── data/
│       │   ├── run_scripts/
│       │   │   ├── run_training_kestrel.sh  # Main execution script
│       │   │   └── ... (other run scripts)
│       │   └── utils/
│       └── utils/
│           ├── run_summarize_circe.sh
│           ├── split_data.py
│           └── summarize_data.py
└── README                                 # Additional project documentation

The Python environment is located separately in the projects directory:

/projects/<projectname>/<username>/
└── deepcam_env/                           # Python environment directory
    ├── bin/
    │   └── activate                       # Environment activation script
    ├── lib/
    └── share/

Following a similar structure can greatly simplify the process of editing paths in the job submission scripts.

Job submission scripts are in mlperf-deepcam/src/deepCam/run_scripts. The script run_training_kestrel.sh is meant to run the benchmark with 4 nodes and 4 GPUs per node.

Please check all paths carefully and modify them according to the locations of your training data and your Python environment. At a minimum, you will need to replace the username nsawant and the project directory name hpcapps with your own. Note that both appear in multiple places in the job script.
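One way to make those substitutions in a single pass is a small helper like the following. The placeholder names nsawant and hpcapps come from the text above; the function name and replacement values are illustrative.

```python
# Illustrative helper: substitute the example username/project in a job
# script copy. "nsawant" and "hpcapps" are the placeholders this README
# names; adapt the replacements to your own account and project.
from pathlib import Path


def personalize_script(path, username, project):
    """Replace placeholder username/project strings in a job script."""
    script = Path(path)
    text = script.read_text()
    text = text.replace("nsawant", username).replace("hpcapps", project)
    script.write_text(text)
```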

You are now ready to run the job:

sbatch run_training_kestrel.sh

The job should reach the target accuracy of 82% on both training and validation data within 4 hours. You should see output like the following at the end of the run:

:::MLLOG {"namespace": "", "time_ms": 1761791169839, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.8238630855784697, "metadata": {"file": "/kfs3/scratch/nsawant/deepcam/mlperf-deepcam/src/deepCam/run_scripts/../train_hdf5_ddp.py", "lineno": 497, "epoch_num": 25, "step_num": 23500}}
:::MLLOG {"namespace": "", "time_ms": 1761791169840, "event_type": "POINT_IN_TIME", "key": "eval_loss", "value": 0.014751269536859849, "metadata": {"file": "/kfs3/scratch/nsawant/deepcam/mlperf-deepcam/src/deepCam/run_scripts/../train_hdf5_ddp.py", "lineno": 498, "epoch_num": 25, "step_num": 23500}}
:::MLLOG {"namespace": "", "time_ms": 1761791169841, "event_type": "POINT_IN_TIME", "key": "target_accuracy_reached", "value": 0.82, "metadata": {"file": "/kfs3/scratch/nsawant/deepcam/mlperf-deepcam/src/deepCam/run_scripts/../train_hdf5_ddp.py", "lineno": 506, "epoch_num": 25, "step_num": 23500}}
:::MLLOG {"namespace": "", "time_ms": 1761791169842, "event_type": "INTERVAL_END", "key": "eval_stop", "value": null, "metadata": {"file": "/kfs3/scratch/nsawant/deepcam/mlperf-deepcam/src/deepCam/run_scripts/../train_hdf5_ddp.py", "lineno": 512, "epoch_num": 25}}
:::MLLOG {"namespace": "", "time_ms": 1761791169897, "event_type": "INTERVAL_END", "key": "epoch_stop", "value": null, "metadata": {"file": "/kfs3/scratch/nsawant/deepcam/mlperf-deepcam/src/deepCam/run_scripts/../train_hdf5_ddp.py", "lineno": 534, "epoch_num": 25, "step_num": 23500}}
:::MLLOG {"namespace": "", "time_ms": 1761791169898, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"file": "/kfs3/scratch/nsawant/deepcam/mlperf-deepcam/src/deepCam/run_scripts/../train_hdf5_ddp.py", "lineno": 542, "status": "success"}}
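Each MLLOG line is the literal prefix `:::MLLOG ` followed by a JSON payload, so metrics such as eval_accuracy can be extracted with a few lines of standard-library Python. The helper below is an illustrative sketch, not part of the benchmark.

```python
# Illustrative MLLOG parser: each record is ":::MLLOG " plus a JSON payload.
# Useful for pulling convergence metrics out of the run's log file.
import json

PREFIX = ":::MLLOG "


def parse_mllog(lines, key):
    """Yield (epoch_num, value) pairs for a given MLLOG key."""
    for line in lines:
        if not line.startswith(PREFIX):
            continue
        record = json.loads(line[len(PREFIX):])
        if record.get("key") == key:
            yield record["metadata"].get("epoch_num"), record["value"]


log = [
    ':::MLLOG {"namespace": "", "time_ms": 1761791169839, '
    '"event_type": "POINT_IN_TIME", "key": "eval_accuracy", '
    '"value": 0.8238630855784697, "metadata": {"epoch_num": 25}}',
]
for epoch, acc in parse_mllog(log, "eval_accuracy"):
    print(f"epoch {epoch}: eval_accuracy = {acc:.4f}")
```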

References

  1. Wehner, M. F., Reed, K. A., Li, F., Bacmeister, J., Chen, C.-T., Paciorek, C., Gleckler, P. J., Sperber, K. R., Collins, W. D., Gettelman, A., et al.: The effect of horizontal resolution on simulation quality in the Community Atmospheric Model, CAM5.1, Journal of Advances in Modeling Earth Systems, 6, 980-997, 2014.
  2. Prabhat, Byna, S., Vishwanath, V., Dart, E., Wehner, M., Collins, W. D., et al.: TECA: Petascale pattern recognition for climate science, in: International Conference on Computer Analysis of Images and Patterns, pp. 426-436, Springer, 2015.

About

This is the public repo for the MLPerf DeepCAM climate data segmentation proposal.