Commit 1913815

Add Delta_p prediction
1 parent 84e20e5 commit 1913815

23 files changed

Lines changed: 2838 additions & 52 deletions

docs/user/alpha_d_surrogate.md

Lines changed: 5 additions & 0 deletions
@@ -318,6 +318,11 @@ cd src && python train.py --config-name alpha_d_mlp hpo=null
 See [Hyperparameter Optimization Guide](hyperparameter_optimization.md)
 for details on search-space format, study settings, and output artifacts.
 
+Before training, use the
+[Case Distribution Analysis](case_distribution_analysis.md) tool to
+preview how much data you have in each ``Dr`` / ``Re`` / ``Lr`` bin --
+especially after applying ``min_Dr`` or ``exclude_cases`` filters.
+
 After running multiple HPO versions, use the
 [Version Comparison](version_comparison.md) tool to review training
 progress and compare evaluation metrics across versions.
Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
# Case Distribution Analysis

This guide covers the ``analyze_case_distribution.py`` preprocessing tool,
which previews how simulation cases are distributed across the three
design parameters (``Dr``, ``Re``, ``Lr``) before training.

## Why it matters

The alpha-D and case-level pressure-drop surrogates generalise only as
well as their training support permits. Bins with few training samples
produce unreliable predictions regardless of model capacity. Running
this tool up-front answers:

- Do I have enough cases in each ``Dr`` / ``Re`` / ``Lr`` bin?
- Which bins will my ``min_Dr`` / ``exclude_cases`` filter remove?
- Does a recorded train / test split cover every bin?

It inspects the Zarr directory (and optionally a ``run_meta.json``) and
prints coloured tables so under-supported regions are obvious.

## How it works

```
data/flow_contraction_expansion/parametric_study/processed/
    Re_*__Dr_*__Lr_*.zarr        <-- discovered by the tool
                                     (parsed from the case name)

data/models/.../run_meta.json    <-- optional; when provided,
                                     Train/Test columns are populated
                                     from split.train_sims / test_sims
```
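
The diagram says the design parameters are parsed from the case name. A minimal sketch of that step follows. It is not the tool's actual parser: the function name ``parse_case_name`` is hypothetical, and the convention that ``p`` stands in for the decimal point (``0p05`` meaning 0.05) is an assumption inferred from the example case names in this guide.

```python
import re

# Assumed naming convention, inferred from the example case names in this
# guide: integer Re, and "p" standing in for the decimal point in Dr and Lr.
CASE_PATTERN = re.compile(
    r"^Re_(?P<Re>\d+)__Dr_(?P<Dr>[0-9p]+)__Lr_(?P<Lr>[0-9p]+)$"
)


def parse_case_name(name: str) -> dict[str, float]:
    """Return {'Re': ..., 'Dr': ..., 'Lr': ...} parsed from a case name."""
    stem = name.removesuffix(".zarr")  # accept either the stem or the store name
    match = CASE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unrecognised case name: {name!r}")
    return {
        "Re": float(match["Re"]),
        "Dr": float(match["Dr"].replace("p", ".")),
        "Lr": float(match["Lr"].replace("p", ".")),
    }


print(parse_case_name("Re_11927__Dr_0p05__Lr_0p052.zarr"))
# → {'Re': 11927.0, 'Dr': 0.05, 'Lr': 0.052}
```

Names that do not follow the pattern raise ``ValueError`` in this sketch; the real tool may handle them differently.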

Support thresholds (based on the **train** count when a split is
provided, or the total count otherwise). Read the rows top to bottom: a
bin takes the first marker whose condition it meets, so ``< 10`` here
means 3 to 9 cases.

| Marker | Train cases | Meaning |
|--------|-------------|---------|
| ``✗ none`` | 0 | Bin will not be learned at all |
| ``⚠ very low`` | < 3 | Extreme extrapolation risk |
| ``⚠ low`` | < 10 | Generalisation in this bin is unreliable |
| ``◦ ok`` | < 30 | Usable but watch for drift |
| ``✓ good`` | ≥ 30 | Adequate support |
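
The thresholds in the table can be expressed as a short first-match cascade. This is a sketch under the assumption that the rows are checked in order; ``classify_support`` is a hypothetical name, not necessarily the function used inside the tool.

```python
def classify_support(n_cases: int) -> str:
    """Map a per-bin case count to the marker listed in the table."""
    if n_cases == 0:
        return "✗ none"
    if n_cases < 3:
        return "⚠ very low"
    if n_cases < 10:
        return "⚠ low"
    if n_cases < 30:
        return "◦ ok"
    return "✓ good"


# Boundary values on either side of each threshold.
for count in (0, 2, 3, 9, 10, 29, 30):
    print(count, classify_support(count))
```

Pass the train count when a split is available and the total count otherwise, as described above the table.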

## Quick start

From inside the container:

```bash
cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json
```

From the host with Apptainer:

```bash
apptainer exec th-holo-gpu.sif bash -c \
    'cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json'
```

## Usage examples

### Inspect the raw Zarr directory (before training)

```bash
cd src && python analyze_case_distribution.py \
    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed
```

Reports the whole dataset with a single ``Total`` column; Support is
classified on the total count. Use this to check the raw simulation
inventory.

### Preview filters that will be applied during training

```bash
cd src && python analyze_case_distribution.py \
    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \
    --min-Dr 0.333
```

Mirrors the filtering logic in ``TabularPairDataset``. Useful when
deciding on the ``data.min_Dr`` value in ``alpha_d_mlp.yaml``: run it with
different thresholds and see which bins disappear.

### Exclude specific problematic cases

```bash
cd src && python analyze_case_distribution.py \
    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \
    --exclude Re_11927__Dr_0p05__Lr_0p052 \
    --exclude Re_7722__Dr_0p05__Lr_0p052
```

``--exclude`` can be repeated to drop any number of case names. The
filter is applied by exact ``{stem}`` match against the Zarr files.
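
The two filters described in this section (a ``Dr`` floor and exact-name exclusion) can be sketched as follows. This is an illustration only: ``filter_cases`` is a hypothetical helper, the ``0p05`` to 0.05 decoding is an assumption inferred from the case names above, and the third case name in the example is made up for the demonstration.

```python
def filter_cases(names, min_dr=None, exclude=()):
    """Keep case names that survive the exclusion list and the Dr floor."""
    excluded = set(exclude)
    kept = []
    for name in names:
        if name in excluded:  # exact match on the case name
            continue
        # Assumed convention: the second "__"-separated token is "Dr_<value>",
        # with "p" standing in for the decimal point.
        dr_token = name.split("__")[1].removeprefix("Dr_")
        dr = float(dr_token.replace("p", "."))
        if min_dr is not None and dr < min_dr:  # strictly below the floor
            continue
        kept.append(name)
    return kept


cases = [
    "Re_11927__Dr_0p05__Lr_0p052",
    "Re_7722__Dr_0p05__Lr_0p052",
    "Re_7722__Dr_0p5__Lr_0p052",  # hypothetical case, for illustration
]
after_floor = filter_cases(cases, min_dr=0.333)
after_exclude = filter_cases(cases, exclude=["Re_11927__Dr_0p05__Lr_0p052"])
print(after_floor)
print(after_exclude)
```

Cases exactly at the floor are kept in this sketch, matching the "below this value" wording of ``--min-Dr``.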

### Inspect a prior train / test split

```bash
cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json
```

Populates ``Train`` and ``Test`` columns from the recorded split and
classifies Support on the train count. If ``--zarr-dir`` is omitted,
the tool reads ``data.zarr_dir`` from ``run_meta.json``.
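
For reference, here is a minimal sketch of reading those fields from a ``run_meta.json``. It assumes the layout implied by this guide (``split.train_sims``, ``split.test_sims``, ``data.zarr_dir``); ``load_split`` is a hypothetical helper, and the example file contents are invented for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def load_split(run_meta_path):
    """Return (train_sims, test_sims, zarr_dir) from a run_meta.json file."""
    meta = json.loads(Path(run_meta_path).read_text(encoding="utf-8"))
    split = meta.get("split", {})
    train_sims = list(split.get("train_sims", []))
    test_sims = list(split.get("test_sims", []))
    zarr_dir = meta.get("data", {}).get("zarr_dir")  # None when not recorded
    return train_sims, test_sims, zarr_dir


# Invented example content, written to a temporary file for the demo.
example_meta = {
    "data": {"zarr_dir": "../data/processed"},
    "split": {
        "train_sims": ["Re_11927__Dr_0p05__Lr_0p052"],
        "test_sims": ["Re_7722__Dr_0p05__Lr_0p052"],
    },
}
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "run_meta.json"
    meta_path.write_text(json.dumps(example_meta), encoding="utf-8")
    train, test, zarr_dir = load_split(meta_path)

print(len(train), len(test), zarr_dir)
```

Missing keys fall back to empty lists or ``None`` here, so a partially filled file does not raise.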

### Restrict to a subset of axes

```bash
cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json \
    --axes Dr
```

Useful when you only care about one parameter (e.g. diagnosing poor
performance at large ``Dr``).

## CLI reference

| Flag | Default | Description |
|------|---------|-------------|
| ``--zarr-dir`` | from ``run_meta.json`` if provided | Directory of processed ``*.zarr`` case stores |
| ``--run-meta`` | ``None`` | ``run_meta.json`` to read a recorded train / test split |
| ``--min-Dr`` | ``None`` | Drop cases whose ``Dr`` is below this value |
| ``--exclude`` | ``[]`` | Case name to exclude (repeatable) |
| ``--axes`` | ``Dr Re Lr`` | Which parameter axes to report |

At least one of ``--zarr-dir`` or ``--run-meta`` must be provided.

## Output sections

### Header panel

Summarises the total case count, the Zarr directory in use, and the
train / test split (when a ``run_meta.json`` is supplied).

### Per-axis distribution tables

One table per axis (``Dr``, ``Re``, ``Lr``). Columns:

- **Axis value** (e.g. ``Dr = 0.900``) -- rounded to 3 decimals, except
  ``Re`` which is shown as an integer.
- **Train / Test** -- present only when a run-meta is provided.
- **Total** -- the union of train, test, and any other cases in the
  Zarr directory.
- **Support** -- coloured marker classifying the training support.

Bins flagged ``⚠ very low`` or ``✗ none`` are likely to show outsized
evaluation errors. Cross-reference them with the
[version comparison tool](version_comparison.md) to confirm.
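
The per-axis counting can be sketched with a ``Counter`` keyed on the formatted bin label. This is a hedged illustration of the rounding rules listed above, not the tool's code: ``axis_label`` and ``count_bins`` are hypothetical names, and the parameter dictionaries are example values.

```python
from collections import Counter


def axis_label(axis: str, value: float) -> str:
    """Format a bin label: Re as an integer, other axes to 3 decimals."""
    if axis == "Re":
        return f"Re = {round(value)}"
    return f"{axis} = {value:.3f}"


def count_bins(params, axis):
    """Count cases per bin along one axis; params is a list of dicts."""
    return Counter(axis_label(axis, p[axis]) for p in params)


# Example parameter sets (illustrative values).
params = [
    {"Re": 11927.0, "Dr": 0.05, "Lr": 0.052},
    {"Re": 7722.0, "Dr": 0.05, "Lr": 0.052},
]
dr_counts = count_bins(params, "Dr")
re_counts = count_bins(params, "Re")
print(dict(dr_counts))
print(dict(re_counts))
```

Grouping on the formatted label means values that agree to 3 decimals share a bin, which is one plausible reading of the rounding rule.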

## Typical workflow

1. **Before the first ETL→training pass**, run with ``--zarr-dir`` only
   to see the raw simulation inventory. Look for bins with fewer than
   3 cases and decide whether to gather more simulations or drop them.

2. **Before each HPO run**, run with ``--zarr-dir`` plus the
   ``--min-Dr`` and ``--exclude`` values from your config. Confirm
   you still have ``◦ ok`` or better support in every bin you care
   about.

3. **After a training run**, run with ``--run-meta`` to verify the
   stratified split gave each bin at least one train and one test case.

4. **When diagnosing a worst-case list** (see
   ``evaluate_case_pressure_drop.py`` output), look up the failing
   cases' ``Dr`` / ``Re`` / ``Lr`` in this table. If they land in a
   ``⚠ low``-support bin, the fix is data, not model.
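
The check in step 3 (each bin has at least one train and one test case) can also be scripted. The sketch below is one possible way to do it, assuming you already have the per-case axis values for each side of the split; ``uncovered_bins`` is a hypothetical helper and the ``Dr`` values are invented.

```python
from collections import Counter


def uncovered_bins(train_values, test_values):
    """Return (bin, n_train, n_test) for bins missing a train or test case."""
    train_counts = Counter(train_values)
    test_counts = Counter(test_values)
    all_bins = sorted(set(train_counts) | set(test_counts))
    return [
        (b, train_counts[b], test_counts[b])  # Counter returns 0 for absent bins
        for b in all_bins
        if train_counts[b] == 0 or test_counts[b] == 0
    ]


# Invented Dr values for the train and test cases of one split.
train_dr = [0.05, 0.05, 0.5]
test_dr = [0.05, 0.9]
gaps = uncovered_bins(train_dr, test_dr)
print(gaps)
# → [(0.5, 1, 0), (0.9, 0, 1)]
```

An empty list means every bin along that axis appears on both sides of the split.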

## Adding a new axis

The axis set is currently hard-coded to ``("Dr", "Re", "Lr")`` to match
the case-name convention (``Re_*__Dr_*__Lr_*``). If you add a new
design parameter to the simulation campaign:

1. Extend the case-name pattern in the ETL.
2. Update ``parse_case_params`` in ``src/case_pressure_drop/distribution.py``
   to extract the new key.
3. Add the key to the ``AXES`` tuple and to the ``axis`` index maps
   inside ``bin_by``.

## Related guides

- [Alpha-D Surrogate Tutorial](alpha_d_surrogate.md) -- end-to-end ETL,
  training, and evaluation workflow.
- [Hyperparameter Optimization](hyperparameter_optimization.md) --
  configuring ``data.min_Dr`` and ``data.exclude_cases`` filters that
  this tool previews.
- [Version Comparison](version_comparison.md) -- review evaluation
  metrics across versions; cross-reference worst-case lists with the
  distribution tables produced here.

docs/user/hyperparameter_optimization.md

Lines changed: 4 additions & 0 deletions
@@ -322,3 +322,7 @@ path is used.
 - After finishing an HPO run, use the
   [version comparison tool](version_comparison.md) to review progress
   and check for regressions across versions.
+- Before tuning ``data.min_Dr`` or ``data.exclude_cases``, preview the
+  resulting distribution with
+  [`analyze_case_distribution.py`](case_distribution_analysis.md) so
+  you don't accidentally drop bins below ``⚠ low`` support.

src/analyze_case_distribution.py

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
"""CLI to preview training-dataset distribution before running HPO / training.

Usage (inside the container):
    # Inspect the raw zarr directory
    python analyze_case_distribution.py \\
        --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed

    # Apply the same filters the training pipeline will use
    python analyze_case_distribution.py \\
        --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \\
        --min-Dr 0.333

    # Show the train / test split recorded in a previous run
    python analyze_case_distribution.py \\
        --run-meta ../data/models/case_pressure_drop/run_meta.json
"""

from __future__ import annotations

import argparse
import json
import os
import sys
from pathlib import Path

sys.path.insert(0, os.path.dirname(__file__))

from case_pressure_drop.distribution import (
    AXES,
    load_sim_names_from_zarr,
    load_split_from_run_meta,
    print_distribution_rich,
)


def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(
        description="Preview the distribution of cases by Dr, Re, Lr before training.",
    )
    parser.add_argument(
        "--zarr-dir",
        default=None,
        help=(
            "Directory of processed *.zarr case stores. If omitted but "
            "--run-meta is given, the zarr_dir recorded in run_meta.json is used."
        ),
    )
    parser.add_argument(
        "--run-meta",
        default=None,
        help=(
            "Optional path to a run_meta.json; when provided, Train/Test columns "
            "are populated from its recorded split."
        ),
    )
    parser.add_argument(
        "--min-Dr",
        type=float,
        default=None,
        help="Exclude cases whose Dr is below this value (matches TabularPairDataset).",
    )
    parser.add_argument(
        "--exclude",
        action="append",
        default=[],
        help="Case name to exclude (repeatable).",
    )
    parser.add_argument(
        "--axes",
        nargs="+",
        default=list(AXES),
        choices=list(AXES),
        help="Which parameter axes to report. Default: Dr Re Lr.",
    )
    args = parser.parse_args(argv)

    train_sims: list[str] = []
    test_sims: list[str] = []
    zarr_dir: str | None = None

    if args.run_meta:
        train_sims, test_sims = load_split_from_run_meta(args.run_meta)
        # If the user did not pass --zarr-dir, try to read it from run_meta.
        if not args.zarr_dir:
            meta = json.loads(Path(args.run_meta).expanduser().resolve().read_text())
            zarr_dir = meta.get("data", {}).get("zarr_dir")

    if args.zarr_dir:
        zarr_dir = args.zarr_dir

    if zarr_dir:
        all_sims = load_sim_names_from_zarr(
            zarr_dir,
            exclude_cases=args.exclude,
            min_Dr=args.min_Dr,
        )
    elif train_sims or test_sims:
        all_sims = sorted(set(train_sims) | set(test_sims))
    else:
        parser.error("Provide --zarr-dir and/or --run-meta.")

    print_distribution_rich(
        all_sims=all_sims,
        train_sims=train_sims or None,
        test_sims=test_sims or None,
        zarr_dir=zarr_dir,
        axes=tuple(args.axes),
    )


if __name__ == "__main__":
    main()

src/case_pressure_drop/__init__.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
"""Case-level pressure-drop regression workflow."""

from case_pressure_drop.data import CANDIDATE_FEATURES, CasePressureDropDataset
from case_pressure_drop.workflow import (
    evaluate_case_pressure_drop,
    train_case_pressure_drop,
)

__all__ = [
    "CANDIDATE_FEATURES",
    "CasePressureDropDataset",
    "evaluate_case_pressure_drop",
    "train_case_pressure_drop",
]
