Commit 560d98c

Release of the Well
0 parents  commit 560d98c

264 files changed

Lines changed: 23045 additions & 0 deletions

.github/workflows/tests.yaml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
name: Tests

on: [push, pull_request]
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Run pre-commit hooks
        uses: pre-commit/action@v3.0.1
        with:
          extra_args: --all-files
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
          cache: "pip"
      - name: Install the_well
        run: pip install .[benchmark,dev] --extra-index-url https://download.pytorch.org/whl/cpu
      - name: Run tests
        env:
          PYTHONPATH: ${{ github.workspace }}
          PY_COLORS: "1"
        run: pytest tests

.gitignore

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Ignore files generated by the build process
build/
dist/
*.egg-info/

# Ignore system and IDE files
.DS_Store
Thumbs.db
.idea/

#Ignoring the data
datasets/active_matter/data/
datasets/active_matter_v2/

datasets/euler_multi_quadrants_openBC/data/
datasets/euler_multi_quadrants_periodicBC/data/

datasets/gray_scott_reaction_diffusion/data/
datasets/datasets/

output_slurm/
datasets/helmholtz_staircase/data/
datasets/viscoelastic_instability/data/

2D/neutron_star_disks/
2D/planetswe/data/

datasets/rayleigh_benard/data/
datasets/acoustic_scattering_inclusions/old_and_problematic_data/
datasets/shear_flow/data/
datasets/supernova_explosion_128/data/
datasets/supernova_explosion_64/data/
datasets/turbulence_gravity_cooling/data/
datasets/rayleigh_taylor_instability/data/
datasets/turbulent_radiative_layer_3D/data/
datasets/split_turbulent_radiative_layer_3D/
datasets/turbulent_radiative_layer_2D/data/
datasets/acoustic_scattering_discontinuous/data/
datasets/acoustic_scattering_inclusions/data/
datasets/acoustic_scattering_maze/data/
datasets/planetswe/data/
datasets/post_neutron_star_merger/data/
datasets/acoustic_scattering_discontinuous/gif/
datasets/acoustic_scattering_inclusions/gif/
datasets/acoustic_scattering_maze/gif/
the_well/benchmark/scripts_to_launch/
the_well/benchmark/write_bash_script.ipynb
the_well/benchmark/checkpoints/
datasets/convective_envelope_rsg/data/
datasets/MHD_64/data/
datasets/MHD_256/data/
datasets/convective_envelope_rsg/sim.mp4
testing_before_adding/
viz/
venv_benchmark_well/
wellbench/
benchmarking_results/

# Ignore logs and temporary files
*.log
*.tmp
*.pt
*.gif

#ignore HDF5 files
*.hdf5
*.h5

# Ignore compiled binaries and libraries
*.exe
*.dll
*.so

# Ignore package manager directories
node_modules/
vendor/

# Ignore environment-specific files
.env
.env.local
.env.*.local

# Ignore sensitive or private information
secrets.txt
credentials.json

# Ignore backup files
*.bak
*.swp

# Ignore generated files
*.min.js
*.min.css
__pycache__

# Ignore run generated output
outputs/
wandb/
datasets/rt_experimental
check_well_data_4059043.out

.pre-commit-config.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.4
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-merge-conflict
      - id: check-toml
      - id: check-yaml
        args: [--unsafe]
      - id: end-of-file-fixer
      - id: mixed-line-ending
        args: [--fix=lf]
      - id: trailing-whitespace

LICENSE

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
BSD 3-Clause License

Copyright (c) 2024 Polymathic AI.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of Polymathic AI nor the names of the Well contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
<div align="center">
<img src="https://raw.githubusercontent.com/PolymathicAI/the_well/master/docs/assets/images/the_well_color.svg" width="60%"/>
</div>

<br>

# The Well: 15TB of Physics Simulations

Welcome to the Well, a large-scale collection of machine learning datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws on contributions from domain scientists and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, and acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite for accelerating research in machine learning and computational sciences.

## Tap into the Well

Once the Well package is installed and the data are downloaded, you can use them in your training pipeline.

```python
from the_well.data import WellDataset
from torch.utils.data import DataLoader

trainset = WellDataset(
    well_base_path="path/to/base",
    well_dataset_name="name_of_the_dataset",
    well_split_name="train"
)
train_loader = DataLoader(trainset)

for batch in train_loader:
    ...
```

For more information regarding the interface, please refer to the [API](https://github.com/PolymathicAI/the_well/tree/master/docs/api.md) and the [tutorials](https://github.com/PolymathicAI/the_well/blob/master/docs/tutorials/dataset.ipynb).

### Installation

If you plan to use The Well datasets to train or evaluate deep learning models, we recommend using a machine with sufficient computing resources.
We also recommend creating a new Python (>=3.10) environment to install the Well. For instance, with [venv](https://docs.python.org/3/library/venv.html):

```
python -m venv path/to/env
source path/to/env/bin/activate
```

#### From PyPI

The Well package can be installed directly from PyPI.

```
pip install the_well
```

#### From Source

It can also be installed from source. For this, clone the [repository](https://github.com/PolymathicAI/the_well) and install the package with its dependencies.

```
git clone https://github.com/PolymathicAI/the_well
cd the_well
pip install .
```

Depending on your acceleration hardware, you can specify `--extra-index-url` to install the relevant PyTorch version. For example, use

```
pip install . --extra-index-url https://download.pytorch.org/whl/cu121
```

to install the dependencies built for CUDA 12.1.

#### Benchmark Dependencies

If you want to run the benchmarks, you should install additional dependencies.

```
pip install the_well[benchmark]
```

### Downloading the Data

The Well datasets range between 6.9GB and 5.1TB of data each, for a total of 15TB for the full collection. Ensure that your system has enough free disk space to accommodate the datasets you wish to download.

Once `the_well` is installed, you can use the `the-well-download` command to download any dataset of The Well.

```
the-well-download --base-path path/to/base --dataset active_matter --split train
```

If `--dataset` and `--split` are omitted, all datasets and splits will be downloaded. This could take a while!
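
Before starting a large download, it can help to check free disk space programmatically. The sketch below is not part of `the_well`: `has_enough_space` is a hypothetical helper, the 10% safety margin is an arbitrary illustrative choice, and the 6.9GB figure (read here as decimal gigabytes) is the smallest dataset size quoted above.

```python
import shutil


def has_enough_space(path: str, required_bytes: int, margin: float = 0.10) -> bool:
    """Return True if `path` has room for `required_bytes` plus a safety margin.

    Hypothetical helper, not part of the_well. `margin` is an arbitrary
    illustrative default (10% headroom).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes * (1.0 + margin)


# Smallest dataset size quoted in the README: 6.9GB (decimal gigabytes assumed).
smallest_dataset_bytes = 6_900_000_000
print(has_enough_space(".", smallest_dataset_bytes))
```

Pass the same directory you intend to give to `--base-path`, so that the check runs against the filesystem that will actually receive the data.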

### Streaming from Hugging Face

Most of the Well datasets are also hosted on [Hugging Face](https://huggingface.co/polymathic-ai). Data can be streamed directly from the hub using the following code.

```python
from the_well.data import WellDataset
from torch.utils.data import DataLoader

# The following line may take a couple of minutes to instantiate the datamodule
trainset = WellDataset(
    well_base_path="hf://datasets/polymathic-ai/",  # access from HF hub
    well_dataset_name="active_matter",
    well_split_name="train",
)
train_loader = DataLoader(trainset)

for batch in train_loader:
    ...
```

For better performance in large training runs, we advise [downloading the data locally](#downloading-the-data) instead of streaming it over the network.

## Benchmark

The repository allows benchmarking surrogate models on the different datasets that compose the Well. Some state-of-the-art models are already implemented in [`models`](https://github.com/PolymathicAI/the_well/tree/master/the_well/benchmark/models), while [dataset classes](https://github.com/PolymathicAI/the_well/tree/master/the_well/data) handle the raw data of the Well.
The benchmark relies on [a training script](https://github.com/PolymathicAI/the_well/blob/master/the_well/benchmark/train.py) that uses [hydra](https://hydra.cc/) to instantiate various classes (e.g. dataset, model, optimizer) from [configuration files](https://github.com/PolymathicAI/the_well/tree/master/the_well/benchmark/configs).

For instance, to train the default FNO architecture on the active matter dataset, run the following commands:

```bash
cd the_well/benchmark
python train.py experiment=fno server=local data=active_matter
```

Each argument corresponds to a specific configuration file. In the command above, `server=local` tells the training script to use [`local.yaml`](https://github.com/PolymathicAI/the_well/tree/master/the_well/benchmark/configs/server/local.yaml), which simply declares the relative path to the data. The configuration can be overridden directly on the command line or extended with new YAML files. Please refer to the [hydra documentation](https://hydra.cc/) for details on editing the configuration.

You can use this command within an sbatch script to launch the training with Slurm.
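
To queue several runs, for example one per dataset inside an sbatch script, the same command line can be generated programmatically. This is a minimal sketch: `build_train_command` is a hypothetical helper, and only `data=active_matter` is confirmed by the README; the existence of a `data=shear_flow` configuration is an assumption made for illustration.

```python
import shlex


def build_train_command(experiment: str, server: str, data: str) -> str:
    """Build the Hydra-style command line shown above as a shell-safe string.

    Hypothetical helper, not part of the_well.
    """
    args = [
        "python",
        "train.py",
        f"experiment={experiment}",
        f"server={server}",
        f"data={data}",
    ]
    return shlex.join(args)


# Only active_matter is confirmed by the README; shear_flow is an assumed config name.
for dataset in ["active_matter", "shear_flow"]:
    print(build_train_command("fno", "local", dataset))
```

Each printed line can be pasted into a job script after `cd the_well/benchmark`, as in the command above.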

## Citation

This project has been led by the <a href="https://polymathic-ai.org/">Polymathic AI</a> organization, in collaboration with researchers from the Flatiron Institute, University of Colorado Boulder, University of Cambridge, New York University, Rutgers University, Cornell University, University of Tokyo, Los Alamos National Laboratory, University of California, Berkeley, Princeton University, CEA DAM, and University of Liège.

If you find this project useful for your research, please consider citing

```
@inproceedings{ohana2024thewell,
  title={The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning},
  author={Ruben Ohana and Michael McCabe and Lucas Thibaut Meyer and Rudy Morel and Fruzsina Julia Agocs and Miguel Beneitez and Marsha Berger and Blakesley Burkhart and Stuart B. Dalziel and Drummond Buschman Fielding and Daniel Fortunato and Jared A. Goldberg and Keiya Hirashima and Yan-Fei Jiang and Rich Kerswell and Suryanarayana Maddu and Jonah M. Miller and Payel Mukhopadhyay and Stefan S. Nixon and Jeff Shen and Romain Watteaux and Bruno R{\'e}galdo-Saint Blancard and Fran{\c{c}}ois Rozet and Liam Holden Parker and Miles Cranmer and Shirley Ho},
  booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=00Sx577BT3}
}
```

## Contact

For questions regarding this project, please contact [Ruben Ohana](https://rubenohana.github.io/) and [Michael McCabe](https://mikemccabe210.github.io/) at $\small\texttt{\{rohana,mmcabe\}@flatironinstitute.org}$.

## Bug Reports and Feature Requests

To report a bug (in the data or the code), request a feature, or simply ask a question, you can [open an issue](https://github.com/PolymathicAI/the_well/issues) on the [repository](https://github.com/PolymathicAI/the_well).

assets/the_well_color_icon.svg

Lines changed: 199 additions & 0 deletions

assets/the_well_logo.png

22.7 KB

datasets/MHD_256/MHD_256.yaml

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
dataset_name: MHD_256
n_spatial_dims: 3
spatial_resolution:
- 256
- 256
- 256
scalar_names: []
constant_scalar_names:
- Ma
- Ms
field_names:
  0:
  - density
  1:
  - magnetic_field_x
  - magnetic_field_y
  - magnetic_field_z
  - velocity_x
  - velocity_y
  - velocity_z
  2: []
constant_field_names:
  0: []
  1: []
  2: []
boundary_condition_types:
- PERIODIC
n_files: 10
n_trajectories_per_file:
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
n_steps_per_trajectory:
- 100
- 100
- 100
- 100
- 100
- 100
- 100
- 100
- 100
- 100
grid_type: cartesian
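
The metadata above is enough for a rough size estimate of the dataset. In the sketch below the relevant values are copied into a plain dictionary so that no YAML parser is needed. It rests on several assumptions that the file itself does not state: that the keys `0`, `1`, `2` under `field_names` group fields by tensor order, that every listed name is one stored channel, that the i-th entry of `n_steps_per_trajectory` applies to the i-th file, and that values are stored as 4-byte floats.

```python
from math import prod

# Values copied from MHD_256.yaml above.
metadata = {
    "spatial_resolution": [256, 256, 256],
    "field_names": {
        0: ["density"],
        1: [
            "magnetic_field_x", "magnetic_field_y", "magnetic_field_z",
            "velocity_x", "velocity_y", "velocity_z",
        ],
        2: [],
    },
    "n_trajectories_per_file": [1] * 10,
    "n_steps_per_trajectory": [100] * 10,
}

BYTES_PER_VALUE = 4  # assumption: float32 storage

n_channels = sum(len(names) for names in metadata["field_names"].values())
n_grid_points = prod(metadata["spatial_resolution"])
# Snapshots per file = trajectories in the file * steps per trajectory.
n_snapshots = sum(
    n_traj * n_steps
    for n_traj, n_steps in zip(
        metadata["n_trajectories_per_file"], metadata["n_steps_per_trajectory"]
    )
)
total_bytes = n_channels * n_grid_points * n_snapshots * BYTES_PER_VALUE

print(f"{n_channels} channels, {n_snapshots} snapshots")
print(f"~{total_bytes / 2**30:.1f} GiB uncompressed")
```

Under these assumptions the estimate is 7 channels over 1000 snapshots, or about 437.5 GiB before any compression; the actual on-disk size may differ.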
