|
| 1 | +<div align="center"> |
| 2 | + <img src="https://raw.githubusercontent.com/PolymathicAI/the_well/master/docs/assets/images/the_well_color.svg" width="60%"/> |
| 3 | +</div> |
| 4 | + |
| 5 | +<br> |
| 6 | + |
| 7 | +# The Well: 15TB of Physics Simulations |
| 8 | + |
| 9 | +Welcome to the Well, a large-scale collection of machine learning datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain scientists and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite for accelerating research in machine learning and computational sciences. |
| 10 | + |
| 11 | +## Tap into the Well |
| 12 | + |
| 13 | +Once the Well package is installed and the data are downloaded, you can use them in your training pipeline. |
| 14 | + |
| 15 | +```python |
| 16 | +from the_well.data import WellDataset |
| 17 | +from torch.utils.data import DataLoader |
| 18 | + |
| 19 | +trainset = WellDataset( |
| 20 | + well_base_path="path/to/base", |
| 21 | + well_dataset_name="name_of_the_dataset", |
| 22 | + well_split_name="train" |
| 23 | +) |
| 24 | +train_loader = DataLoader(trainset) |
| 25 | + |
| 26 | +for batch in train_loader: |
| 27 | + ... |
| 28 | +``` |
| 29 | + |
| 30 | +For more information regarding the interface, please refer to the [API](https://github.com/PolymathicAI/the_well/tree/master/docs/api.md) and the [tutorials](https://github.com/PolymathicAI/the_well/blob/master/docs/tutorials/dataset.ipynb). |
| 31 | + |
| 32 | +### Installation |
| 33 | + |
| 34 | +If you plan to use The Well datasets to train or evaluate deep learning models, we recommend using a machine with sufficient computing resources. |
| 35 | +We also recommend creating a new Python (>=3.10) environment to install the Well. For instance, with [venv](https://docs.python.org/3/library/venv.html): |
| 36 | + |
| 37 | +``` |
| 38 | +python -m venv path/to/env |
| 39 | +source path/to/env/bin/activate |
| 40 | +``` |
| 41 | + |
| 42 | +#### From PyPI |
| 43 | + |
| 44 | +The Well package can be installed directly from PyPI. |
| 45 | + |
| 46 | +``` |
| 47 | +pip install the_well |
| 48 | +``` |
| 49 | + |
| 50 | +#### From Source |
| 51 | + |
| 52 | +It can also be installed from source. For this, clone the [repository](https://github.com/PolymathicAI/the_well) and install the package with its dependencies. |
| 53 | + |
| 54 | +``` |
| 55 | +git clone https://github.com/PolymathicAI/the_well |
| 56 | +cd the_well |
| 57 | +pip install . |
| 58 | +``` |
| 59 | + |
| 60 | +Depending on your acceleration hardware, you can specify `--extra-index-url` to install the relevant PyTorch version. For example, use |
| 61 | + |
| 62 | +``` |
| 63 | +pip install . --extra-index-url https://download.pytorch.org/whl/cu121 |
| 64 | +``` |
| 65 | + |
| 66 | +to install the dependencies built for CUDA 12.1. |
| 67 | + |
| 68 | +#### Benchmark Dependencies |
| 69 | + |
| 70 | +If you want to run the benchmarks, you should install additional dependencies. |
| 71 | + |
| 72 | +``` |
| 73 | +pip install "the_well[benchmark]" |
| 74 | +``` |
| 75 | + |
| 76 | +### Downloading the Data |
| 77 | + |
| 78 | +The Well datasets range from 6.9GB to 5.1TB of data each, for a total of 15TB for the full collection. Ensure that your system has enough free disk space to accommodate the datasets you wish to download. |
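Before starting a download, it can help to check the free space on the target filesystem. The snippet below is a minimal sketch using only the Python standard library; the helper name `has_free_space` and the example path are hypothetical, and the sizes are interpreted as decimal units (1GB = 10^9 bytes), which is an assumption.

```python
import shutil

GB = 1000**3  # decimal gigabyte (assumption about how the sizes are reported)
TB = 1000**4  # decimal terabyte


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


base_path = "."  # hypothetical; replace with the directory passed as --base-path

# Sizes taken from the text: smallest dataset and full collection.
for label, size in [("smallest dataset (6.9GB)", 6.9 * GB), ("full collection (15TB)", 15 * TB)]:
    status = "enough space" if has_free_space(base_path, size) else "NOT enough space"
    print(f"{label}: {status}")
```

The check only reports what is free at the moment it runs, so leave some headroom if other jobs write to the same disk.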
| 79 | + |
| 80 | +Once `the_well` is installed, you can use the `the-well-download` command to download any dataset of The Well. |
| 81 | + |
| 82 | +``` |
| 83 | +the-well-download --base-path path/to/base --dataset active_matter --split train |
| 84 | +``` |
| 85 | + |
| 86 | +If `--dataset` and `--split` are omitted, all datasets and splits will be downloaded. This could take a while! |
| 87 | + |
| 88 | +### Streaming from Hugging Face |
| 89 | + |
| 90 | +Most of the Well datasets are also hosted on [Hugging Face](https://huggingface.co/polymathic-ai). Data can be streamed directly from the hub using the following code. |
| 91 | + |
| 92 | +```python |
| 93 | +from the_well.data import WellDataset |
| 94 | +from torch.utils.data import DataLoader |
| 95 | + |
| 96 | +# The following line may take a couple of minutes to instantiate the datamodule |
| 97 | +trainset = WellDataset( |
| 98 | + well_base_path="hf://datasets/polymathic-ai/", # access from HF hub |
| 99 | + well_dataset_name="active_matter", |
| 100 | + well_split_name="train", |
| 101 | +) |
| 102 | +train_loader = DataLoader(trainset) |
| 103 | + |
| 104 | +for batch in train_loader: |
| 105 | + ... |
| 106 | +``` |
| 107 | + |
| 108 | +For better performance in large-scale training, we advise [downloading the data locally](#downloading-the-data) instead of streaming it over the network. |
| 109 | + |
| 110 | +## Benchmark |
| 111 | + |
| 112 | +The repository allows benchmarking surrogate models on the different datasets that compose the Well. Some state-of-the-art models are already implemented in [`models`](https://github.com/PolymathicAI/the_well/tree/master/the_well/benchmark/models), while [dataset classes](https://github.com/PolymathicAI/the_well/tree/master/the_well/data) handle the raw data of the Well. |
| 113 | +The benchmark relies on [a training script](https://github.com/PolymathicAI/the_well/blob/master/the_well/benchmark/train.py) that uses [hydra](https://hydra.cc/) to instantiate various classes (e.g. dataset, model, optimizer) from [configuration files](https://github.com/PolymathicAI/the_well/tree/master/the_well/benchmark/configs). |
| 114 | + |
| 115 | +For instance, to train the default FNO architecture on the active matter dataset, run the following commands: |
| 116 | + |
| 117 | +```bash |
| 118 | +cd the_well/benchmark |
| 119 | +python train.py experiment=fno server=local data=active_matter |
| 120 | +``` |
| 121 | + |
| 122 | +Each argument corresponds to a specific configuration file. In the command above, `server=local` tells the training script to use [`local.yaml`](https://github.com/PolymathicAI/the_well/tree/master/the_well/benchmark/configs/server/local.yaml), which simply declares the relative path to the data. The configuration can be overridden directly on the command line or edited with new YAML files. Please refer to the [Hydra documentation](https://hydra.cc/) for details on editing configurations. |
| 123 | + |
| 124 | +You can use this command within an sbatch script to launch the training with Slurm. |
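As a rough illustration of wrapping the training command for Slurm, the sketch below renders an sbatch script as a string. The helper `render_sbatch_script` is hypothetical, and the job name, time limit, and `--gres=gpu:1` request are placeholder assumptions; the directives your cluster requires (partition, account, memory) will likely differ.

```python
def render_sbatch_script(
    experiment,
    data,
    server="local",
    job_name="the_well_train",  # placeholder job name
    time_limit="24:00:00",      # placeholder wall-clock limit
    gpus=1,                     # placeholder GPU request
):
    """Return the text of an sbatch script that runs the benchmark training command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --gres=gpu:{gpus}",
        "",
        # Same commands as in the example above, run from the repository root.
        "cd the_well/benchmark",
        f"python train.py experiment={experiment} server={server} data={data}",
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch_script(experiment="fno", data="active_matter")
print(script)
```

Saving the printed text to a file and submitting it with `sbatch` would queue the same training run as the interactive command, assuming the environment with `the_well` installed is activated inside the job.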
| 125 | + |
| 126 | +## Citation |
| 127 | + |
| 128 | +This project has been led by the <a href="https://polymathic-ai.org/">Polymathic AI</a> organization, in collaboration with researchers from the Flatiron Institute, University of Colorado Boulder, University of Cambridge, New York University, Rutgers University, Cornell University, University of Tokyo, Los Alamos National Laboratory, University of California, Berkeley, Princeton University, CEA DAM, and University of Liège. |
| 129 | + |
| 130 | +If you find this project useful for your research, please consider citing |
| 131 | + |
| 132 | +``` |
| 133 | +@inproceedings{ohana2024thewell, |
| 134 | + title={The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning}, |
| 135 | + author={Ruben Ohana and Michael McCabe and Lucas Thibaut Meyer and Rudy Morel and Fruzsina Julia Agocs and Miguel Beneitez and Marsha Berger and Blakesley Burkhart and Stuart B. Dalziel and Drummond Buschman Fielding and Daniel Fortunato and Jared A. Goldberg and Keiya Hirashima and Yan-Fei Jiang and Rich Kerswell and Suryanarayana Maddu and Jonah M. Miller and Payel Mukhopadhyay and Stefan S. Nixon and Jeff Shen and Romain Watteaux and Bruno R{\'e}galdo-Saint Blancard and Fran{\c{c}}ois Rozet and Liam Holden Parker and Miles Cranmer and Shirley Ho}, |
| 136 | + booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, |
| 137 | + year={2024}, |
| 138 | + url={https://openreview.net/forum?id=00Sx577BT3} |
| 139 | +} |
| 140 | +``` |
| 141 | + |
| 142 | +## Contact |
| 143 | + |
| 144 | +For questions regarding this project, please contact [Ruben Ohana](https://rubenohana.github.io/) and [Michael McCabe](https://mikemccabe210.github.io/) at $\small\texttt{\{rohana,mmcabe\}@flatironinstitute.org}$. |
| 145 | + |
| 146 | + |
| 147 | +## Bug Reports and Feature Requests |
| 148 | + |
| 149 | +To report a bug (in the data or the code), request a feature or simply ask a question, you can [open an issue](https://github.com/PolymathicAI/the_well/issues) on the [repository](https://github.com/PolymathicAI/the_well). |