Commit eb70de6 (1 parent: afcc2b0) — "pre-training docs"

3 files changed: 82 additions & 366 deletions

docs/Pretraining.md (60 additions & 62 deletions)
@@ -19,11 +19,9 @@ This guide walks you through setting up and running pretraining jobs for OlmoEar
 1. [Environment Setup](#environment-setup) - External users start here
 2. [Launching Scripts](#launching-scripts) - All users
 3. [Dataset Setup](#dataset-setup) - Required for external users
-4. [Main Training Scripts](#main-training-scripts) - All users
+4. [Official Training Scripts](#official-training-scripts) - All users
 5. [Overrides and Experiments](#overrides-and-experiments) - All users
-6. [Gotchas and Troubleshooting](#gotchas-and-troubleshooting) - All users
-7. [Additional Resources](#additional-resources)
-8. [Quick Reference](#quick-reference)
+6. [Helpful Files for Understanding](#helpful-files-for-understanding) - All users

 ---

@@ -159,8 +157,6 @@ Alternatively, you can disable W&B logging in your configuration:
 --trainer.callbacks.wandb.enabled=False
 ```

-See the [Reference Guide](Reference.md#troubleshooting-guide) for more details.
-
 #### 2. Evaluation Dataset Paths

 If you want to run evaluations with custom dataset locations, you can override the default evaluation dataset paths using environment variables:
@@ -200,6 +196,46 @@ External users must specify the dataset path when launching training scripts:
 --dataset.h5py_dir=/your/path/to/h5data/num_samples
 ```

+### Dataset Directory and File Structure
+
+The H5 dataset follows a hierarchical directory structure (see [`set_h5py_dir` in convert_to_h5py.py](../olmoearth_pretrain/dataset/convert_to_h5py.py)):
+
+```
+<tile_path>/
+  h5py_data_w_missing_timesteps[_compression_settings][_tilesize_x_numsubtiles]/
+    <sorted_modality_names>[_required_<required_mods>]/
+      <num_samples>/
+        sample_0.h5
+        sample_1.h5
+        ...
+        sample_metadata.csv
+        latlon_distribution.npy
+        compression_settings.json
+```
+
+**Example path:**
+```
+/path/to/data/h5py_data_w_missing_timesteps_gzip_9_shuffle_256_x_16/era5_10_naip_sentinel2/4096/
+```
+
+#### Core Files in Each Dataset
+
+1. **`sample_{index}.h5`** - Individual sample files containing:
+   - **`latlon`**: Float32 array `[lat, lon]` - geographic coordinates
+   - **`timestamps`**: Integer array `[T, 3]` where T = time steps and the columns are `[day, month, year]`
+   - **Modality datasets**: Named by modality (e.g., `"sentinel2"`, `"era5_10"`, `"naip"`, `"landsat"`; see all available modalities in [`constants.py`](../olmoearth_pretrain/data/constants.py))
+     - Spatial modalities: shape `[H, W, T, C]` or `[H, W, C]` depending on temporal variation
+     - Non-spatial modalities: shape `[T, C]`
+   - **`missing_timesteps_masks/`** group: Boolean masks per modality (shape `[T]`) indicating which timestamps from the longest timestamp array are present for that specific modality (see [`_create_missing_timesteps_masks` in convert_to_h5py.py](../olmoearth_pretrain/dataset/convert_to_h5py.py))
+
+2. **`sample_metadata.csv`** - CSV with columns `sample_index, <modality1>, <modality2>, ...` where values are 1 (present) or 0 (absent), tracking which modalities exist in each sample (see [`save_sample_metadata` in convert_to_h5py.py](../olmoearth_pretrain/dataset/convert_to_h5py.py))
+
+3. **`latlon_distribution.npy`** - NumPy array `[N, 2]` of all sample lat/lons for dataset statistics (see [`save_latlon_distribution` in convert_to_h5py.py](../olmoearth_pretrain/dataset/convert_to_h5py.py))
+
+4. **`compression_settings.json`** - Stores the compression algorithm, compression level options, and shuffle filter settings used for all H5 files (see [`save_compression_settings` in convert_to_h5py.py](../olmoearth_pretrain/dataset/convert_to_h5py.py))
+
+**Key Invariant:** All H5 files follow the same schema with `latlon`, `timestamps`, modality datasets, and a `missing_timesteps_masks` group, ensuring consistency across the entire dataset.
+
 ### Evaluation Datasets

 Evaluation datasets have default paths set in [`olmoearth_pretrain/evals/datasets/paths.py`](../olmoearth_pretrain/evals/datasets/paths.py).
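The added schema above can be exercised with a short sketch. This is not code from the repository: it assumes `h5py` and `numpy` are installed, and the file name, shapes, and modality values are invented for illustration. It writes one sample file in the documented layout and reads it back:

```python
import h5py
import numpy as np

# Hypothetical sample file following the documented schema; all shapes,
# modality names, and values here are invented for the example.
T, H, W = 4, 8, 8
path = "sample_0.h5"

with h5py.File(path, "w") as f:
    # latlon: float32 [lat, lon]
    f.create_dataset("latlon", data=np.array([47.6, -122.3], dtype=np.float32))
    # timestamps: [T, 3] with columns [day, month, year]
    f.create_dataset("timestamps", data=np.array(
        [[1, 1, 2020], [1, 2, 2020], [1, 3, 2020], [1, 4, 2020]]))
    # A spatial, time-varying modality: [H, W, T, C]
    f.create_dataset("sentinel2", data=np.zeros((H, W, T, 10), dtype=np.float32))
    # A non-spatial modality: [T, C]
    f.create_dataset("era5_10", data=np.zeros((T, 2), dtype=np.float32))
    # Boolean per-modality masks of shape [T]: which steps of the longest
    # timestamp array are present for each modality.
    masks = f.create_group("missing_timesteps_masks")
    masks.create_dataset("sentinel2", data=np.ones(T, dtype=bool))
    masks.create_dataset("era5_10", data=np.array([True, True, False, True]))

with h5py.File(path, "r") as f:
    s2_shape = f["sentinel2"].shape
    era5_present = f["missing_timesteps_masks/era5_10"][...].tolist()

print(s2_shape, era5_present)  # (8, 8, 4, 10) [True, True, False, True]
```

A real sample would carry actual imagery values and every modality present in its row of `sample_metadata.csv`, but the group layout and array shapes follow the structure above.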
@@ -210,14 +246,16 @@ Evaluation datasets have default paths set in [`olmoearth_pretrain/evals/dataset
 2. Set environment variables (see [Environment Variables](#environment-variables))
 3. If not using all evaluations, enable only the ones you have set up by adding an override:

-   e.g to only run mados and pastis_sentinel2 evals add the following overide.
+   For example, to only run mados and pastis_sentinel2 evals, add the following override:
 ```bash
 --trainer.callbacks.downstream_evaluator.tasks_to_run=\[mados,pastis_sentinel2\]
 ```
 The task names correspond to the user-chosen names specified in the training configuration

 ---
-### Official Training Scripts
+
+## Official Training Scripts
+
 > **🏢 AI2 Researchers - Choose Your Launch Method:**
 >
 > **For Beaker Batch Jobs (Pre-emptible):**
@@ -277,15 +315,11 @@ torchrun --nproc_per_node=8 scripts/official/base.py train base_run local \
 > When using `local` as the cluster argument, checkpoints are automatically saved to `./local_output`. You can override this location with `--common.save_folder=path/to/savefolder`.

-
-
-
-
 ## Overrides and Experiments

 ### How Overrides Work

-The experiment framework uses a builder pattern with override capabilities. You can override any configuration parameter via CLI arguments using dotted notation.
+The experiment framework uses a builder pattern with override capabilities. Launch scripts can be edited to change the configuration, or you can override any configuration parameter via CLI arguments using dotted notation.

 ### Common Overrides

@@ -348,61 +382,25 @@ torchrun --nproc_per_node=8 scripts/official/base.py train custom_experiment loc
 --trainer.max_duration.epochs=100
 ```

-For more override patterns and examples, see the [Reference Guide](Reference.md#override-patterns).

 ---

-## Gotchas and Troubleshooting
-
-When adapting the training setup to your hardware, the following parameters commonly require adjustment:
-
-- **Batch size** (`--data_loader.global_batch_size` and `--train_module.rank_microbatch_size`): Reduce these if you encounter out-of-memory errors
-- **Number of workers** (`--data_loader.num_workers`): Adjust based on available CPU cores for data loading
-- **Number of GPUs** (`--nproc_per_node` in torchrun): Set to match your available GPU count
-
-For detailed troubleshooting guidance, consult the [Reference Guide](Reference.md#troubleshooting-guide).
-
-## Additional Resources
-
-### Documentation

-- **[Setup-Internal.md](Setup-Internal.md)** - AI2 researchers: Beaker, sessions, internal infrastructure
-- **[Reference.md](Reference.md)** - Deep configuration reference, troubleshooting, helpful files
-- **[README.md](../README.md)** - Project overview, datasets, evaluation suite
+## Helpful Files for Understanding

-### Key Files
+### Configuration Files
+- [`scripts/official/base.py`](../scripts/official/base.py) - Main entry point, model config
+- [`scripts/official/script.py`](../scripts/official/script.py) - All component builders (dataset, dataloader, trainer, callbacks)
+- [`olmoearth_pretrain/evals/datasets/paths.py`](../olmoearth_pretrain/evals/datasets/paths.py) - Evaluation dataset path configuration

-- `scripts/official/base.py` - Main entry point and model config
-- `scripts/official/script.py` - Component builders (dataset, dataloader, trainer)
-- `olmoearth_pretrain/evals/datasets/paths.py` - Evaluation dataset paths
-- `olmoearth_pretrain/data/dataset.py` - Dataset implementation
+### Dataset Files
+- [`olmoearth_pretrain/data/dataset.py`](../olmoearth_pretrain/data/dataset.py) - Dataset implementation and configuration
+- [`olmoearth_pretrain/data/dataloader.py`](../olmoearth_pretrain/data/dataloader.py) - Dataloader implementation and configuration
+- [`olmoearth_pretrain/data/constants.py`](../olmoearth_pretrain/data/constants.py) - Modality definitions and constants

-### External Resources
-
-- [olmo-core documentation](https://github.com/allenai/olmo-core) - Underlying training framework
-- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Multi-GPU/multi-node training
+### Training Files
+- [`olmoearth_pretrain/internal/experiment.py`](../olmoearth_pretrain/internal/experiment.py) - Core experiment orchestration
+- [`olmoearth_pretrain/train/train_module/`](../olmoearth_pretrain/train/train_module/) - Training module implementations
+- [`olmoearth_pretrain/train/masking.py`](../olmoearth_pretrain/train/masking.py) - Masking strategy implementations

 ---
-
-## Quick Reference
-
-### Minimal Working Example
-
-```bash
-# 1. Set up environment
-export WANDB_API_KEY="your_key"  # or use --trainer.callbacks.wandb.enabled=False
-
-# 2. Launch training
-torchrun \
-  --nproc_per_node=1 \
-  scripts/official/base.py train test_run local \
-  --dataset.h5py_dir=/your/path/to/h5data/num_samples \
-  --data_loader.global_batch_size=64 \
-  --train_module.rank_microbatch_size=16
-```
-
-### Getting Help
-
-- **Open an issue:** [GitHub Issues](https://github.com/allenai/olmoearth_pretrain/issues)
-- **Check documentation:** See [Reference.md](Reference.md) for detailed troubleshooting
-- **AI2 researchers:** See internal Slack channels or [Setup-Internal.md](Setup-Internal.md)
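As a companion to the `sample_metadata.csv` layout documented earlier in this diff (columns `sample_index, <modality1>, <modality2>, ...` with 1/0 presence flags), here is a small stdlib-only sketch of computing per-modality sample counts. The CSV content and modality names are invented for the example; this is not repository code:

```python
import csv
import io

# Hypothetical sample_metadata.csv content, inlined for a self-contained example.
csv_text = """sample_index,sentinel2,era5_10,naip
0,1,1,0
1,1,0,1
2,1,1,1
"""

# Count how many samples contain each modality.
counts: dict[str, int] = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    for modality, flag in row.items():
        if modality == "sample_index":
            continue
        counts[modality] = counts.get(modality, 0) + int(flag)

print(counts)  # {'sentinel2': 3, 'era5_10': 2, 'naip': 2}
```

For a real dataset you would open the on-disk `sample_metadata.csv` instead of the inlined string; the aggregation is the same.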