Scripts and config updates for comparisons #329
Merged
sgreenbury merged 11 commits into main on Apr 18, 2026
Conversation
Adds per-dataset local_experiment configs for the 4 target datasets:

- cache_latents/<dataset>/cache_latents.yaml bakes in the datamodule plus the encoder/decoder periodic + pixel_shuffle settings so they match the paired AE training config (ae/<dataset>/ae_dc_large.yaml) without bash overrides.
- processor/<dataset>/fm_vit_large.yaml captures the reference FM-in-latent setup: ddp_4gpu, cached_latents datamodule, flow_matching_vit processor (hid_channels=640, flow_ode_steps=50), adamw_half (lr=1e-4, warmup=0), bs=256/GPU, float32_matmul_precision=high, val_metrics disabled. A sketch of its shape follows below.
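A minimal sketch of how processor/<dataset>/fm_vit_large.yaml could be laid out, assuming typical Hydra conventions; the key names and config-group paths are illustrative, not copied from the repo, and only the values come from the description above:

```yaml
# Illustrative Hydra experiment config -- key names and group paths
# are assumptions; the values are the ones listed in this commit.
defaults:
  - /trainer: ddp_4gpu
  - /datamodule: cached_latents
  - /optimizer: adamw_half
  - _self_

processor:                     # flow_matching_vit
  hid_channels: 640
  flow_ode_steps: 50

optimizer:
  lr: 1.0e-4
  warmup: 0

datamodule:
  batch_size: 256              # per GPU

float32_matmul_precision: high
val_metrics: null              # disabled
```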
Top-level folders under local_hydra/local_experiment/ now match the autocast CLI subcommands (ae/, cache_latents/, epd/, processor/) rather than mixing experiment types (crps/) with CLI kinds. This keeps ambient-space EPD variants (CRPS and upcoming FM-in-ambient) under epd/<dataset>/ and latent-space processor variants (FM and upcoming CRPS-in-latent) under processor/<dataset>/. No content changes — pure rename. Nothing has been launched against these configs yet, so the paths can change safely.
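For reference, the resulting layout (folder names from the description above; the files per dataset are the ones introduced in this PR):

    local_hydra/local_experiment/
      ae/<dataset>/              paired AE training (ae_dc_large.yaml)
      cache_latents/<dataset>/   cache_latents.yaml
      epd/<dataset>/             ambient-space variants (CRPS, FM-in-ambient)
      processor/<dataset>/       latent-space variants (FM, CRPS-in-latent)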
Adds processor/<dataset>/crps_vit_azula_large.yaml for the 4 target datasets: AzulaViTProcessor (hidden_dim=632, n_noise_channels=1024) trained on cached latents, with n_members=8 and AlphaFairCRPSLoss / AlphaFairCRPS (matching the ambient-space CRPS head under epd/<dataset>/crps_vit_azula_large.yaml). Configs are self-contained: ddp_4gpu_slurm, cached_latents datamodule, adamw_half (lr=2e-4, warmup=0), bs=32/GPU (ProcessorModelEnsemble expands the batch by n_members internally, so 32 mirrors the ambient-space sizing), float32_matmul_precision=high, val_metrics disabled.
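Relative to the FM-in-latent sketch above, the CRPS-in-latent config changes roughly these fields (key names again illustrative):

```yaml
processor:                     # AzulaViTProcessor
  hidden_dim: 632
  n_noise_channels: 1024

n_members: 8                   # ProcessorModelEnsemble repeats the batch internally

optimizer:
  lr: 2.0e-4

datamodule:
  batch_size: 32               # per GPU; x8 members -> effective 256
```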
Adds epd/<dataset>/fm_vit_large.yaml for the 4 target datasets: the same permute_concat encoder and channels_last decoder as ambient CRPS (epd/<dataset>/crps_vit_azula_large.yaml), but with FlowMatchingProcessor plus a ViT backbone in place of AzulaViTProcessor. Configs are self-contained: ddp_4gpu_slurm, dataset datamodule, adamw_half (lr=1e-4, warmup=0), bs=32/GPU, flow_ode_steps=50, hid_channels=640, patch_size=4 (keeps 16x16=256 tokens, for architecture parity with vit_azula_large), train_in_latent_space=true (MSE on permute-concat features), val_metrics disabled. Model sizes across CRPS/FM and ambient/latent still need to be balanced; this commit just wires the variant up (see the sketch below).
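And the FM-in-ambient variant, sketched the same way (illustrative keys; values from the description):

```yaml
encoder: permute_concat        # same pairing as ambient CRPS
decoder: channels_last

processor:                     # FlowMatchingProcessor + ViT backbone
  hid_channels: 640
  flow_ode_steps: 50
  patch_size: 4                # 64x64 input -> 16x16 = 256 tokens

train_in_latent_space: true    # MSE on permute-concat features

datamodule:
  batch_size: 32               # per GPU (raised to 256 in the next commit)
```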
ProcessorModelEnsemble / EncoderProcessorDecoderEnsemble repeat the batch by n_members=8 internally, so the ambient CRPS baseline already runs at an effective 256 samples per GPU per step. FM/diffusion has no such multiplier and needs large raw batches to keep the velocity-field estimate low-variance, so matching at bs=32 understated the FM budget. This sets FM-in-ambient to bs=256/GPU so it matches the CRPS effective batch; FM-in-latent was already at 256 (no change).
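The parity rule, as arithmetic:

```math
\underbrace{8}_{n_\text{members}} \times \underbrace{32}_{\text{CRPS bs/GPU}} = 256 = \text{FM bs/GPU}
```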
Adds slurm_scripts/comparison/ containing the full SLURM submission tree for the CRPS/FM x ambient/latent x 4-datasets study:

    slurm_scripts/comparison/
      ae/              AE timing + 24h final runs
      cache_latents/   cache generation + FM-in-latent + CRPS-in-latent
      epd/             CRPS-in-ambient + FM-in-ambient (timing + large)

These live in-repo for provenance: which configs were submitted with which AE run dirs, budgets, and overrides. Scripts are launched directly from the repo (paths like $HOME/autocast/outputs/... are absolute, so the working directory doesn't matter).
Drops slurm_scripts from .gitignore so future comparison-study scripts
are tracked the same way. run_scripts stays ignored (used as a
/projects/ workspace symlink for outputs/artifacts).
…iants

Sets all 16 local_experiment configs to DiT-canonical proportions (depth=12, heads=8, head_dim≈64) targeting ~80M processor params:

- CRPS (ambient + latent): hidden_dim=568, n_layers=12, num_heads=8
- FM (ambient + latent): hid_channels=704, hid_blocks=12, attention_heads=8

Patch sizes:

- ambient variants: patch=4 -> 16x16=256 tokens on 64x64 input
- latent variants: patch=1 -> 8x8=64 tokens on 8x8 latent (CRPS-latent gets an explicit patch_size: 1 override; vit_azula_large defaults to 4, which gives only 4 tokens on an 8x8 latent)

Verified param counts by instantiation (AzulaViTProcessor / TemporalViTBackbone with dc_large latent dims: 8 channels, 8x8 spatial):

- CRPS ambient: 80.75M
- FM ambient: 80.04M
- CRPS latent: 80.72M
- FM latent: 79.91M

Adds slurm_scripts/comparison/README.md documenting the full study design: variant layout, submission order, model-size matrix, batch-parity rule, and the latent patch_size gotcha. Updates local_hydra/local_experiment/README.md with a short pointer to the comparison README.
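The resulting model-size matrix, restated as a table (all values from the commit above):

| Variant | Width | Depth | Heads | Patch | Tokens | Params |
|---|---|---|---|---|---|---|
| CRPS ambient | hidden_dim=568 | 12 | 8 | 4 | 16x16=256 | 80.75M |
| FM ambient | hid_channels=704 | 12 | 8 | 4 | 16x16=256 | 80.04M |
| CRPS latent | hidden_dim=568 | 12 | 8 | 1 | 8x8=64 | 80.72M |
| FM latent | hid_channels=704 | 12 | 8 | 1 | 8x8=64 | 79.91M |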
Updates all 8 timing + large script headers to match the DiT-aligned model matrix (hid_channels=704 / hid_blocks=12 for FM; hidden_dim=568 / n_layers=12 / num_heads=8 for CRPS; n_members=8) and points them at the authoritative yaml paths under local_hydra/local_experiment/<kind>/<dataset>/ so future drift is caught by reading the config directly.

submit_cache_latents.sh now falls back to the newest <ae_run_dir>/autocast/*/checkpoints/latest-*.ckpt when autoencoder.ckpt is not yet written. This unblocks the latent timing runs while AE training is still in progress: compute throughput is independent of AE quality, so a temporary checkpoint is fine for timing. Final large/ runs must wait for autoencoder.ckpt (warning added inline).
Default trainer callbacks now include:

- A rolling ModelCheckpoint with save_last: true (a real file, not a symlink), so last.ckpt always captures the final-epoch state even when every_n_epochs doesn't land on max_epochs.
- A best-val ModelCheckpoint (monitor=val_loss) with save_last: false, so it doesn't contend with the rolling callback for ownership of last.ckpt.
- An EMACallback with decay=0.999. The resulting half-life of ~700 steps (0.999^693 ≈ 0.5) is a sensible fraction of our shortest runs (FM-ambient, ~12k steps) without being wildly reactive on the longest (CRPS-latent, ~420k steps).

The four final-run submit scripts (FM/CRPS x ambient/latent large) now override the rolling callback to save at 25/50/75/100% of the cosine schedule (every_n_epochs = COSINE_EPOCHS / 4, save_top_k=-1). Combined with save_last: true on the rolling callback, this guarantees a final-state checkpoint plus three learning-curve snapshots per run.
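A minimal sketch of what such a callback block could look like in a Hydra trainer config. Lightning's ModelCheckpoint arguments (save_last, save_top_k, every_n_epochs, monitor, mode) are real; the surrounding key names and the EMACallback target path are assumptions:

```yaml
trainer:
  callbacks:
    rolling_checkpoint:
      _target_: lightning.pytorch.callbacks.ModelCheckpoint
      save_last: true            # real last.ckpt file, not a symlink
      save_top_k: -1             # keep every rolling snapshot
      every_n_epochs: 10         # final runs override: COSINE_EPOCHS / 4
    best_val_checkpoint:
      _target_: lightning.pytorch.callbacks.ModelCheckpoint
      monitor: val_loss
      mode: min
      save_last: false           # rolling callback owns last.ckpt
    ema:
      _target_: autocast.callbacks.EMACallback  # hypothetical path
      decay: 0.999               # half-life ~700 steps
```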
This pull request introduces scripts and standardized experiment configs for each dataset and model variant, ensuring reproducibility and alignment. The changes also update documentation to guide users on the new configuration structure.
New Experiment Configurations
Cache Latents (Latent Space Preparation)
Adds cache_latents.yaml configs for each dataset (advection_diffusion, conditioned_navier_stokes, gpe_laser_wake_only, gray_scott) to standardize latent caching, with architectures matching the corresponding autoencoder training configs.

Ambient-Space Experiments (epd/)

Adds CRPS (crps_vit_azula_large.yaml) and FM (fm_vit_large.yaml) configs for each dataset in the epd/ directory, specifying model architectures, batch sizes, optimizer settings, and metrics for both training paradigms.

Latent-Space Processor Experiments (processor/)

Adds CRPS and FM configs for each dataset in the processor/ directory, using cached latents as input and mirroring the ambient-space experiment structure for direct comparison.