
Scripts and config updates for comparisons#329

Merged
sgreenbury merged 11 commits into main from 2026-04-16/config-updates
Apr 18, 2026

Conversation

@sgreenbury
Contributor

This pull request introduces scripts and standardized experiment configs for each dataset and model variant, ensuring reproducibility and alignment across the CRPS/FM and ambient/latent comparisons. The changes also update documentation to guide users through the new configuration structure.

New Experiment Configurations

Cache Latents (Latent Space Preparation)

  • Added cache_latents.yaml configs for each dataset (advection_diffusion, conditioned_navier_stokes, gpe_laser_wake_only, gray_scott) to standardize latent caching, with architectures matching the corresponding autoencoder training configs (see the sketch below).
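
For concreteness, a config in this family plausibly looks like the sketch below. The Hydra group names and key paths are assumptions, not the repo's actual schema; only the datamodule/encoder/decoder pairing is from this PR.

```yaml
# local_hydra/local_experiment/cache_latents/gray_scott/cache_latents.yaml
# Hypothetical sketch -- group names and key paths are assumed.
# @package _global_
defaults:
  - override /datamodule: gray_scott    # datamodule baked in, no bash overrides
  - override /encoder: periodic         # must match ae/gray_scott/ae_dc_large.yaml
  - override /decoder: pixel_shuffle
```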

Ambient-Space Experiments (epd/)

  • Added CRPS (crps_vit_azula_large.yaml) and FM (fm_vit_large.yaml) configs for each dataset in the epd/ directory, specifying model architectures, batch sizes, optimizer settings, and metrics for both training paradigms.

Latent-Space Processor Experiments (processor/)

  • Added corresponding CRPS and FM configs for each dataset in the processor/ directory, using cached latents as input and mirroring the ambient-space experiment structure for direct comparison.

Adds per-dataset local_experiment configs for the 4 target datasets:

- cache_latents/<dataset>/cache_latents.yaml bakes in the datamodule plus
  encoder/decoder periodic + pixel_shuffle so they match the paired AE
  training config (ae/<dataset>/ae_dc_large.yaml) without bash overrides.

- processor/<dataset>/fm_vit_large.yaml captures the reference FM-in-latent
  setup: ddp_4gpu, cached_latents datamodule, flow_matching_vit processor
  (hid_channels=640, flow_ode_steps=50), adamw_half (lr=1e-4, warmup=0),
  bs=256/GPU, float32_matmul_precision=high, val_metrics disabled (see
  the sketch below).
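
As a rough illustration of that reference setup; the key paths are assumptions, only the values listed above come from the commit, and a later commit in this PR rebalances hid_channels to 704:

```yaml
# local_hydra/local_experiment/processor/<dataset>/fm_vit_large.yaml
# Hypothetical sketch -- keys assumed; values from the commit above.
defaults:
  - override /trainer: ddp_4gpu
  - override /datamodule: cached_latents
  - override /processor: flow_matching_vit
  - override /optimizer: adamw_half

processor:
  hid_channels: 640          # rebalanced to 704 later in this PR
  flow_ode_steps: 50
optimizer:
  lr: 1e-4
  warmup: 0
datamodule:
  batch_size: 256            # per GPU
trainer:
  float32_matmul_precision: high
val_metrics: null            # disabled
```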
Top-level folders under local_hydra/local_experiment/ now match the
autocast CLI subcommands (ae/, cache_latents/, epd/, processor/) rather
than mixing experiment types (crps/) with CLI kinds. This keeps
ambient-space EPD variants (CRPS and upcoming FM-in-ambient) under
epd/<dataset>/ and latent-space processor variants (FM and upcoming
CRPS-in-latent) under processor/<dataset>/.

No content changes: a pure rename. Nothing has been launched against
these configs yet, so the paths can change safely.

Adds processor/<dataset>/crps_vit_azula_large.yaml for the 4 target
datasets: AzulaViTProcessor (hidden_dim=632, n_noise_channels=1024)
trained on cached latents, with n_members=8 + AlphaFairCRPSLoss /
AlphaFairCRPS (matches the ambient-space CRPS head under
epd/<dataset>/crps_vit_azula_large.yaml).

Configs are self-contained: ddp_4gpu_slurm, cached_latents datamodule,
adamw_half (lr=2e-4, warmup=0), bs=32/GPU (ProcessorModelEnsemble
expands the batch by n_members internally, so 32 mirrors the
ambient-space sizing), float32_matmul_precision=high, val_metrics
disabled.
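
A hedged sketch of how those pieces might sit in the yaml; the _target_ names and key paths are assumptions, the values are from the commit, and a later commit in this PR rebalances hidden_dim to 568:

```yaml
# local_hydra/local_experiment/processor/<dataset>/crps_vit_azula_large.yaml
# Hypothetical sketch -- keys assumed; values from the commit above.
defaults:
  - override /trainer: ddp_4gpu_slurm
  - override /datamodule: cached_latents
  - override /optimizer: adamw_half

processor:
  _target_: AzulaViTProcessor   # class named above; full import path unknown
  hidden_dim: 632               # rebalanced to 568 later in this PR
  n_noise_channels: 1024
n_members: 8
loss:
  _target_: AlphaFairCRPSLoss   # AlphaFairCRPS matches the ambient CRPS head
optimizer:
  lr: 2e-4
  warmup: 0
datamodule:
  batch_size: 32                # ensemble expands x8 internally -> effective 256
val_metrics: null               # disabled
```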
Adds epd/<dataset>/fm_vit_large.yaml for the 4 target datasets: the
same permute_concat encoder + channels_last decoder as ambient CRPS
(epd/<dataset>/crps_vit_azula_large.yaml), but with
FlowMatchingProcessor + ViT backbone in place of AzulaViTProcessor.

Configs are self-contained: ddp_4gpu_slurm, dataset datamodule,
adamw_half (lr=1e-4, warmup=0), bs=32/GPU, flow_ode_steps=50,
hid_channels=640, patch_size=4 (keeps 16x16=256 tokens — architecture
parity with vit_azula_large), train_in_latent_space=true (MSE on
permute-concat features), val_metrics disabled.
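
Sketched against the latent FM config above, the ambient variant differs in roughly these keys; names are assumed, and later commits in this PR raise batch_size to 256 and hid_channels to 704:

```yaml
# local_hydra/local_experiment/epd/<dataset>/fm_vit_large.yaml
# Hypothetical sketch of the deltas vs. the latent FM config.
defaults:
  - override /trainer: ddp_4gpu_slurm
  - override /datamodule: dataset     # raw ambient data, not cached latents

processor:
  hid_channels: 640            # rebalanced to 704 later in this PR
  flow_ode_steps: 50
  patch_size: 4                # 16x16 = 256 tokens on 64x64 input
train_in_latent_space: true    # MSE on permute-concat features
datamodule:
  batch_size: 32               # raised to 256 by the batch-parity commit below
```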

Model sizes across CRPS/FM and ambient/latent still need to be
balanced; this commit just wires the variant up.

ProcessorModelEnsemble / EncoderProcessorDecoderEnsemble repeat the batch
by n_members=8 internally, so the ambient CRPS baseline already runs at
an effective 256 samples per GPU per step. FM/diffusion has no such
multiplier and needs large raw batches to keep the velocity-field
estimate low-variance — matching at bs=32 understated the FM budget.

Setting FM-in-ambient bs=256/GPU so it matches CRPS effective batch.
FM-in-latent was already at 256 (no change).
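
Spelled out, the parity rule is:

```math
B_\text{eff}^\text{CRPS} = B_\text{raw} \times n_\text{members} = 32 \times 8 = 256,
\qquad
B_\text{eff}^\text{FM} = B_\text{raw} = 256
```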
Adds slurm_scripts/comparison/ containing the full SLURM submission
tree for the CRPS/FM x ambient/latent x 4-datasets study:

  slurm_scripts/comparison/
    ae/                   AE timing + 24h final runs
    cache_latents/        cache generation + FM-in-latent + CRPS-in-latent
    epd/                  CRPS-in-ambient + FM-in-ambient (timing + large)

These live in-repo for provenance: they record which configs were
submitted with which AE run dirs, budgets, and overrides. Scripts are
launched directly from the repo (paths like $HOME/autocast/outputs/...
are absolute, so the working directory doesn't matter).

Drops slurm_scripts from .gitignore so future comparison-study scripts
are tracked the same way. run_scripts stays ignored (used as a
/projects/ workspace symlink for outputs/artifacts).

Sets all 16 local_experiment configs to DiT-canonical proportions
(depth=12, heads=8, head_dim≈64) targeting ~80M processor params:

  CRPS (ambient + latent):  hidden_dim=568, n_layers=12, num_heads=8
  FM   (ambient + latent):  hid_channels=704, hid_blocks=12, attention_heads=8

Patch sizes (token counts worked out below):
  ambient variants: patch=4 -> 16x16=256 tokens on 64x64 input
  latent  variants: patch=1 -> 8x8=64 tokens on 8x8 latent
  (CRPS-latent gets explicit patch_size: 1 override -- vit_azula_large
   defaults to 4, which gives only 4 tokens on an 8x8 latent)
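
The token counts follow directly from (spatial extent / patch size) squared per frame:

```math
N_\text{tokens} = \left(\tfrac{\text{spatial}}{\text{patch}}\right)^2:
\quad \left(\tfrac{64}{4}\right)^2 = 256 \ \text{(ambient)},
\quad \left(\tfrac{8}{1}\right)^2 = 64 \ \text{(latent)},
\quad \left(\tfrac{8}{4}\right)^2 = 4 \ \text{(latent at the default patch of 4)}
```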

Verified param counts by instantiation (AzulaViTProcessor / TemporalViTBackbone
with dc_large latent dims: 8 channels, 8x8 spatial):

  CRPS ambient  80.75M  FM ambient  80.04M
  CRPS latent   80.72M  FM latent   79.91M

Adds slurm_scripts/comparison/README.md documenting the full study design:
variant layout, submission order, model-size matrix, batch-parity rule, and
the latent patch_size gotcha.

Updates local_hydra/local_experiment/README.md with a short pointer to the
comparison README.

Updates all 8 timing + large script headers to match the DiT-aligned
model matrix (hid_channels=704/hid_blocks=12 for FM; hidden_dim=568/
n_layers=12/num_heads=8 for CRPS; n_members=8) and points at the
authoritative yaml paths under local_hydra/local_experiment/<kind>/
<dataset>/ so future drift is caught by reading the config directly.

submit_cache_latents.sh now falls back to the newest
<ae_run_dir>/autocast/*/checkpoints/latest-*.ckpt when autoencoder.ckpt
is not yet written. This unblocks the latent timing runs while AE
training is still in progress -- compute throughput is independent of
AE quality so a temp checkpoint is fine for timing. Final large/ runs
must wait for autoencoder.ckpt (warning added inline).

Default trainer callbacks now include:
- Rolling ModelCheckpoint with save_last: true (real file, not symlink),
  so last.ckpt always captures the final-epoch state even when
  every_n_epochs doesn't land on max_epochs.
- Best-val ModelCheckpoint (monitor=val_loss), with save_last: false so
  it doesn't contend with the rolling callback for last.ckpt ownership.
- EMACallback with decay=0.999. Half-life ~700 steps is a sensible
  fraction of our shortest runs (FM-ambient ~12k steps) without being
  wildly reactive on the longest (CRPS-latent ~420k steps). The full
  stack is sketched below.
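
A hedged sketch of what this default stack could look like in yaml. Lightning's ModelCheckpoint arguments are real; the callback keys, the EMACallback import path, and the every_n_epochs cadence shown here are assumptions:

```yaml
# Hypothetical default callback stack -- keys and paths assumed.
callbacks:
  rolling_checkpoint:
    _target_: lightning.pytorch.callbacks.ModelCheckpoint
    save_last: true          # real file, not a symlink
    every_n_epochs: 10       # illustrative cadence, not from this PR
    save_top_k: -1
  best_val_checkpoint:
    _target_: lightning.pytorch.callbacks.ModelCheckpoint
    monitor: val_loss
    mode: min
    save_top_k: 1
    save_last: false         # leaves last.ckpt to the rolling callback
  ema:
    _target_: autocast.callbacks.EMACallback   # project class; path assumed
    decay: 0.999             # half-life = ln(0.5)/ln(0.999) ~ 693 steps
```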

The four final-run submit scripts (FM/CRPS x ambient/latent large) now
override the rolling callback to save at 25/50/75/100% of the cosine
schedule (every_n_epochs = COSINE_EPOCHS / 4, save_top_k=-1). Combined
with save_last: true on the rolling callback, this guarantees a
final-state checkpoint plus three learning-curve snapshots per run.
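
For a hypothetical 100-epoch cosine schedule, the overridden rolling callback would resolve to something like:

```yaml
# Illustrative resolution with COSINE_EPOCHS=100 (the epoch count is made up).
callbacks:
  rolling_checkpoint:
    every_n_epochs: 25   # COSINE_EPOCHS / 4 -> snapshots at 25/50/75/100%
    save_top_k: -1       # keep all four snapshots instead of rotating
    save_last: true      # last.ckpt still captures the final state
```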
@sgreenbury merged commit b103316 into main on Apr 18, 2026
3 checks passed
@sgreenbury deleted the 2026-04-16/config-updates branch on April 18, 2026 at 07:38