This guide explains the script structure and configuration system used in AutoCast.
Use the unified `autocast` workflow command as the primary interface.
Example usage:

```bash
# Train autoencoder locally
uv run autocast ae \
  datamodule=reaction_diffusion \
  --run-group rd

# Train EPD on SLURM
uv run autocast epd \
  --mode slurm \
  datamodule=reaction_diffusion \
  --run-group rd \
  trainer.max_epochs=10

# Re-run evaluation from an existing workdir
uv run autocast eval \
  datamodule=reaction_diffusion \
  --workdir outputs/rd/00
```

To restart training, pass:

```bash
uv run autocast epd \
  datamodule=reaction_diffusion \
  --workdir outputs/rd/00 \
  --resume-from outputs/rd/00/encoder_processor_decoder.ckpt
```

When running training and evaluation in a single command (`train-eval`), the provided arguments override training settings by default. Pass eval settings with `--eval-overrides`, e.g.:

```bash
uv run autocast train-eval \
  datamodule=reaction_diffusion \
  --run-group rd \
  trainer.max_epochs=1 \
  --eval-overrides eval.batch_indices=[0,1]
```

For SLURM train+eval submission:

```bash
uv run autocast train-eval \
  --mode slurm \
  datamodule=reaction_diffusion \
  --run-group rd
```

This submits one SLURM job via `sbatch`; the CLI exits immediately after submission.
- Hydra launcher config path: `src/autocast/configs/hydra/launcher/slurm.yaml`
- Cluster preset available (repo-level): `local_hydra/hydra/launcher/slurm_baskerville.yaml`
- Mapping rule: config key `X` maps to CLI override `hydra.launcher.X=<value>`:
  - `timeout_min` -> `hydra.launcher.timeout_min=...`
  - `cpus_per_task` -> `hydra.launcher.cpus_per_task=...`
  - `gpus_per_node` -> `hydra.launcher.gpus_per_node=...`
  - `tasks_per_node` -> `hydra.launcher.tasks_per_node=...`
  - `use_srun` -> `hydra.launcher.use_srun=<true|false>`
  - `additional_parameters.mem` -> `hydra.launcher.additional_parameters.mem=...`
- SLURM launch behavior:
  - Default is auto: the batch script uses `srun` when `tasks_per_node > 1` or `gpus_per_node > 1`.
  - Override explicitly with `hydra.launcher.use_srun=true` or `hydra.launcher.use_srun=false`.
- For `autocast train-eval` specifically:
  - Positional overrides apply to train; `--eval-overrides` applies to eval.
  - `--eval-overrides` acts as a separator: put train overrides before it and eval overrides after it.
  - If the same key appears in both, eval uses the eval value.
File permissions / group-write:
- Training/eval scripts read the config key `umask` (default `0002` in `src/autocast/configs/encoder_processor_decoder.yaml`).
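For illustration, a umask of `0002` clears only the other-write bit, so files created by the run are group-writable. A minimal Python sketch of the effect (illustrative only, not AutoCast's actual code path):

```python
import os

# The umask bits are masked out of the requested creation mode.
# With umask 0o002, a file requested with mode 0o666 is created
# as 0o666 & ~0o002 = 0o664: owner rw, group rw, others r.
os.umask(0o002)

with open("example.txt", "w") as f:  # created group-writable
    f.write("shared output\n")
```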
To avoid long CLI override lists, put experiment defaults in a preset config
under src/autocast/configs/experiment/ and enable it with experiment=<name>.
Example preset: src/autocast/configs/experiment/epd_flow_matching_64_fast.yaml
```bash
uv run autocast train-eval --mode slurm \
  datamodule=advection_diffusion_multichannel_64_64 \
  experiment=epd_flow_matching_64_fast \
  autoencoder_checkpoint=/path/to/autoencoder.ckpt \
  hydra.launcher.timeout_min=30 \
  --eval-overrides +model.n_members=10
```

To use Baskerville module setup + scheduler defaults:

```bash
uv run autocast epd --mode slurm datamodule=reaction_diffusion \
  hydra/launcher=slurm_baskerville
```

`--run-group` sets the label of the top-level output folder (defaults to the current date).
If `--run-id` is omitted, autocast auto-generates a legacy-style run id and uses it for both the output folder name and the default `logging.wandb.name`.
Backward-compatible aliases remain available: `--run-label` and `--run-name`.

W&B naming behavior:
- `--run-group` does not set W&B naming.
- `--run-id` sets the run folder name and the default `logging.wandb.name`.
- Set `logging.wandb.name=...` directly as a Hydra override to explicitly name the W&B run.
Private/local experiment presets can be placed under repo-level
local_hydra/local_experiment/ and enabled with local_experiment=<name>.
YAML files in this folder are git-ignored by default.
If you keep configs outside this repository (or when running from an installed package), set:
```bash
export AUTOCAST_CONFIG_PATH=/absolute/path/to/configs
```

This directory should contain the same Hydra group layout (e.g. `datamodule/`, `model/`, `experiment/`) expected by AutoCast.
Use --dry-run with any command to print resolved commands/scripts without
executing them.
CLI equivalents of removed `slurm_scripts/*.sh` examples are provided in:

```bash
bash scripts/cli_equivalents.sh
```

For launching many prewritten runs from a manifest list:

```bash
bash scripts/launch_from_manifest.sh run_manifests/example_runs.txt
```

When using the `adamw_half` optimizer (half-period cosine LR schedule), the
learning rate decays from its initial value to zero over exactly
trainer.max_epochs epochs. If training is cut short by trainer.max_time
before all epochs complete, the schedule will not have reached zero.
The time-epochs subcommand solves this by running a short timing run (a few
epochs), measuring per-epoch wall-clock duration, and computing the
max_epochs that fits within a given budget:
```bash
# Time 3 EPD epochs (default) and compute max_epochs for a 24h budget
uv run autocast time-epochs datamodule=advection_diffusion_multichannel

# Time an autoencoder run
uv run autocast time-epochs --kind ae datamodule=reaction_diffusion

# Time a processor run
uv run autocast time-epochs --kind processor datamodule=reaction_diffusion

# Custom: 5 timing epochs, 12h budget, 2% safety margin
uv run autocast time-epochs -n 5 -b 12 -m 0.02 \
  datamodule=shallow_water2d

# With experiment overrides
uv run autocast time-epochs experiment=epd_crps_vit_large_ps4_64

# Dry-run to inspect the generated command
uv run autocast time-epochs --dry-run datamodule=reaction_diffusion
```

`--kind` selects the training type to time: `ae`, `epd` (default), or `processor`. Use the same kind you intend to train so that the per-epoch measurement reflects the actual model and data pipeline.
With --mode slurm the timing run is submitted as a SLURM job and the CLI
exits immediately, printing a follow-up command to retrieve results once the
job completes:
```bash
# Submit timing jobs for several configs at once
uv run autocast time-epochs --mode slurm --kind ae \
  datamodule=reaction_diffusion --run-group timing
uv run autocast time-epochs --mode slurm --kind epd \
  datamodule=shallow_water2d --run-group timing \
  experiment=epd_crps_vit_large_ps4_64

# Once the SLURM jobs finish, compute results from the checkpoints
uv run autocast time-epochs --from-checkpoint outputs/timing/ae_.../timing.ckpt
uv run autocast time-epochs --from-checkpoint outputs/timing/epd_.../timing.ckpt
```

`--from-checkpoint` reads an existing checkpoint, extracts the per-epoch times, and prints the recommendation; no training is run. You can also use it to recompute with a different budget or margin:

```bash
uv run autocast time-epochs --from-checkpoint outputs/timing/epd_.../timing.ckpt \
  -b 12 -m 0.05
```

The output includes recommended Hydra overrides ready to copy-paste:
```
============================================================
Seconds/epoch: 150.0s
Budget: 24.0h (margin: 2%)
max_epochs: 564
Expected time: 23.5h
Headroom: 0.5h
============================================================
Recommended overrides:
trainer.max_epochs=564 trainer.max_time=01:00:00:00 optimizer=adamw_half
```
The calculation is conservative (see the sketch below):
- A 2% safety margin (configurable with `-m`) is subtracted from the budget.
- The result is rounded down to a whole epoch (`floor`), so the cosine schedule always completes its full half-period.
- `trainer.max_time` is set to the full (un-margined) budget as a hard stop.
Per-epoch times are extracted from the TrainingTimerCallback saved in the
checkpoint, which excludes model setup and data loading overhead.
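The arithmetic is easy to reproduce. A minimal sketch of the recommendation, assuming the documented margin-then-floor behavior (the function name is hypothetical, not the internal API):

```python
import math

def recommend_max_epochs(sec_per_epoch: float, budget_h: float, margin: float = 0.02) -> int:
    """Largest whole epoch count that fits within the margined budget."""
    usable_s = budget_h * 3600 * (1 - margin)    # subtract the safety margin
    return math.floor(usable_s / sec_per_epoch)  # round down to a whole epoch

# Reproduces the example output above: 150 s/epoch, 24 h budget, 2% margin.
epochs = recommend_max_epochs(150.0, 24.0, 0.02)
print(epochs)                 # 564
print(epochs * 150.0 / 3600)  # expected time: 23.5 h
```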
Timing runs also emit a human-readable timing summary to logs at train end (epoch count, mean/min/max epoch seconds, and the per-epoch list for short runs), so you can quickly inspect timing quality without loading the checkpoint first.
The recommended overrides set two stopping conditions:
| Condition | Controlled by | What happens |
|---|---|---|
| Epoch limit | `trainer.max_epochs` | Training stops cleanly after completing this many epochs. |
| Wall-clock limit | `trainer.max_time` | Lightning hard-stops training when the clock runs out. |
Lightning stops at whichever fires first.
Faster than expected (each epoch takes less time than the timing run
measured): max_epochs fires first. All epochs complete, and the cosine LR
schedule reaches exactly zero. max_time is never triggered. This is the
ideal outcome.
Slower than expected (each epoch takes more time): max_time fires first,
cutting training short before all max_epochs have completed. The cosine
schedule has not reached zero — the final LR is positive.
The 2% default margin tolerates up to ~2% slower epochs before max_time
intervenes. The floor() rounding adds a small additional buffer (up to
one epoch's worth). For workloads where epoch duration is stable
(compute-bound, data in memory), 2% is sufficient. For I/O-bound workloads
that stream from a shared parallel filesystem, consider --margin 0.05 or
higher.
The cosine cannot overshoot and start increasing: `cosine_lambda(t) = 0.5 * (1 + cos(pi * t / max_epochs))` is monotonically decreasing over `[0, max_epochs]`. Training terminates at `max_epochs`, so the second half of the cosine period is never entered. If `max_time` intervenes earlier, the LR is still on the decreasing branch; it simply hasn't reached zero yet.
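A minimal sketch of how such a half-period schedule maps onto PyTorch's `LambdaLR` (an illustration of the formula above; the actual `adamw_half` config lives under `optimizer/` and may differ):

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in model
max_epochs = 564
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Multiplier falls monotonically from 1 at t=0 to 0 at t=max_epochs;
# training stops at max_epochs, so the rising half of the cosine is never entered.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda t: 0.5 * (1 + math.cos(math.pi * t / max_epochs)),
)

for epoch in range(max_epochs):
    # ... one training epoch ...
    scheduler.step()  # advance the schedule once per epoch
```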
| Scenario | Recommended --margin |
|---|---|
| Data in memory, single GPU (very stable epoch times) | 0.02 (default) |
| Local NVMe data loading | 0.02 – 0.03 |
| Streaming from Lustre / GPFS | 0.05 – 0.10 |
To empirically check variance, run time-epochs twice at different cluster
load levels. If the two per-epoch estimates agree within 3%, 2% margin is
safe. If they diverge more, match the margin to the observed variance.
AutoCast uses a set of Python scripts located in src/autocast/scripts/ as entry points for training and evaluation. These scripts are exposed as CLI commands via pyproject.toml.
- `train_autoencoder` (`src/autocast/scripts/train/autoencoder.py`)
  - Purpose: Trains an Autoencoder (Encoder + Decoder) on a given dataset.
  - Config Group: `autoencoder` (defaults to `src/autocast/configs/autoencoder.yaml`).
  - Key Output: `autoencoder.ckpt` (Lightning checkpoint).
- `train_encoder_processor_decoder` (`src/autocast/scripts/train/encoder_processor_decoder.py`)
  - Purpose: Trains a Processor model in the latent space of a pre-trained Autoencoder (or trains end-to-end).
  - Config Group: `encoder_processor_decoder` (defaults to `src/autocast/configs/encoder_processor_decoder.yaml`).
  - Key Dependencies: Takes a pre-trained Autoencoder checkpoint (optional, but recommended for latent training).
- `evaluate_encoder_processor_decoder` (`src/autocast/scripts/eval/encoder_processor_decoder.py`)
  - Purpose: Evaluates a trained Encoder-Processor-Decoder stack.
  - Config Group: `encoder_processor_decoder` (uses the `eval` sub-config).
  - Key Inputs: A checkpoint file (`.ckpt`) and a dataset.
  - Outputs: Metrics CSV, rollout videos.
AutoCast uses Hydra for configuration management. All configurations are YAML files located in src/autocast/configs/.
```
src/autocast/configs/
├── autoencoder.yaml               # default config for train_autoencoder
├── encoder_processor_decoder.yaml # default config for train_encoder_processor_decoder / autocast epd
├── processor.yaml                 # default config for autocast processor
├── backbone/                      # Architectures (UNet, ViT)
├── datamodule/                    # Datasets (ReactionDiffusion, The Well)
├── encoder/                       # Encoder components (DC, PermuteConcat)
├── decoder/                       # Decoder components (DC, ChannelsLast)
├── processor/                     # Latent processors (FlowMatching, Diffusion)
├── model/                         # Model assembly configs
├── optimizer/                     # Optimizer settings (Adam, AdamW)
├── trainer/                       # Lightning Trainer settings
├── logging/                       # Weights & Biases configuration
├── eval/                          # Evaluation-specific settings
├── experiment/                    # Reusable experiment presets
├── hydra/                         # Hydra launcher/runtime configs
├── input_noise_injector/          # Input perturbation configs
├── simulator/                     # Simulator-related configs
└── external/                      # External model/config integrations
```
Hydra allows you to compose configurations dynamically and override values from the command line.
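For example, you can compose and inspect a fully resolved config from Python with Hydra's compose API (a sketch; the `autocast.configs` module path is an assumption based on the layout above):

```python
from hydra import compose, initialize_config_module
from omegaconf import OmegaConf

# Compose the default EPD config with the same overrides the CLI accepts.
with initialize_config_module(config_module="autocast.configs", version_base=None):
    cfg = compose(
        config_name="encoder_processor_decoder",
        overrides=["datamodule=reaction_diffusion", "trainer.max_epochs=10"],
    )

print(OmegaConf.to_yaml(cfg))  # dump the resolved config tree
```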
You can change any value in the config tree using dot notation.
```bash
uv run train_autoencoder \
  optimizer.learning_rate=0.001 \
  trainer.max_epochs=50 \
  datamodule.batch_size=32
```

You can swap entire components (like the backbone or encoder) by selecting a different file from the config group.

Example: Use a Vision Transformer (ViT) processor instead of the default:

```bash
uv run train_encoder_processor_decoder \
  model.processor=vit
```

Example: Change the encoder architecture:

```bash
uv run train_encoder_processor_decoder \
  encoder@model.encoder=permute_concat
```

Note: The `@` syntax specifies where in the config tree to mount the selected config file. The format is `group@destination`.
`hydra.run.dir` controls where the output of the run is saved. By default, Hydra creates a hierarchy based on date/time; we recommend overriding this to a meaningful path.
```bash
uv run train_autoencoder hydra.run.dir=outputs/my_experiment/version_1
```

The `model` config group defines the neural network architecture:
- `_target_`: The Python class to instantiate.
- `encoder`, `decoder`, `processor`: Sub-configs for specific modules.
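`_target_` follows standard Hydra instantiation: `hydra.utils.instantiate` imports the named class and calls it with the remaining keys as keyword arguments. A minimal sketch with a hypothetical node (not AutoCast's actual model config):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

# _target_ names the class; sibling keys become constructor kwargs.
cfg = OmegaConf.create({
    "_target_": "torch.nn.Conv2d",
    "in_channels": 3,
    "out_channels": 16,
    "kernel_size": 3,
})

conv = instantiate(cfg)  # equivalent to torch.nn.Conv2d(3, 16, kernel_size=3)
print(type(conv))        # <class 'torch.nn.modules.conv.Conv2d'>
```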
The `datamodule` config group defines the data source:
- `data_path`: Path to the dataset on disk.
- `n_steps_input`: Number of context frames.
- `n_steps_output`: Number of frames to predict.
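To make the last two keys concrete, here is a hypothetical sketch of how `n_steps_input` context frames and `n_steps_output` target frames carve a trajectory into training pairs (not AutoCast's actual dataset code):

```python
import numpy as np

def windows(trajectory: np.ndarray, n_steps_input: int, n_steps_output: int):
    """Yield (context, target) pairs from a (T, ...) trajectory."""
    window = n_steps_input + n_steps_output
    for t in range(len(trajectory) - window + 1):
        context = trajectory[t : t + n_steps_input]          # frames the model sees
        target = trajectory[t + n_steps_input : t + window]  # frames it must predict
        yield context, target

traj = np.zeros((100, 2, 64, 64))  # T=100 frames, 2 channels, 64x64 grid
x, y = next(windows(traj, n_steps_input=4, n_steps_output=1))
print(x.shape, y.shape)  # (4, 2, 64, 64) (1, 2, 64, 64)
```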
The `eval` config group controls evaluation:
- `checkpoint`: Path to the trained model checkpoint to load.
- `metrics`: List of metrics to compute (e.g., `["mse", "rmse", "spread", "skill"]` for ensemble evaluation). For ensemble predictions, `skill` is the RMSE of the ensemble mean, so it matches `rmse` numerically and is mainly kept as explicit spread/skill terminology.
- `video_dir`: Where to save rollout visualizations.
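A sketch of the spread/skill computation for an ensemble forecast. Computing `skill` as the RMSE of the ensemble mean follows the description above; defining `spread` as the mean ensemble standard deviation is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
ens = rng.normal(size=(10, 64, 64))  # hypothetical ensemble: (n_members, H, W)
truth = rng.normal(size=(64, 64))

# skill: RMSE of the ensemble mean against the ground truth.
skill = np.sqrt(np.mean((ens.mean(axis=0) - truth) ** 2))

# spread (assumed definition): mean per-pixel ensemble standard deviation.
spread = np.mean(ens.std(axis=0, ddof=1))

print(f"skill={skill:.3f} spread={spread:.3f}")  # a well-calibrated ensemble has spread ~ skill
```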
Step 1: train the autoencoder.

```bash
uv run train_autoencoder \
  hydra.run.dir=outputs/autoencoder_v1 \
  datamodule=reaction_diffusion \
  model.encoder=dc \
  model.decoder=dc
```

Step 2: train the processor, using the autoencoder from step 1.
```bash
uv run train_encoder_processor_decoder \
  hydra.run.dir=outputs/processor_v1 \
  datamodule=reaction_diffusion \
  autoencoder_checkpoint=outputs/autoencoder_v1/autoencoder.ckpt \
  model.processor=flow_matching
```

Use Hydra multi-run directly (or the manifest launcher) for sweeps, e.g. `uv run autocast epd --mode slurm datamodule=reaction_diffusion trainer.max_epochs=5,10`.