MengnanLi91
diff --git a/docs/user/alpha_d_surrogate.md b/docs/user/alpha_d_surrogate.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs/user/case_distribution_analysis.md b/docs/user/case_distribution_analysis.md
Lines changed: 193 additions & 0 deletions
diff --git a/docs/user/hyperparameter_optimization.md b/docs/user/hyperparameter_optimization.md
Lines changed: 4 additions & 0 deletions
diff --git a/src/analyze_case_distribution.py b/src/analyze_case_distribution.py
Lines changed: 113 additions & 0 deletions
diff --git a/src/case_pressure_drop/__init__.py b/src/case_pressure_drop/__init__.py
Lines changed: 14 additions & 0 deletions
@@ -318,6 +318,11 @@ cd src && python train.py --config-name alpha_d_mlp hpo=null
 See [Hyperparameter Optimization Guide](hyperparameter_optimization.md)
 for details on search-space format, study settings, and output artifacts.
 
+Before training, use the
+[Case Distribution Analysis](case_distribution_analysis.md) tool to
+preview how much data you have in each ``Dr`` / ``Re`` / ``Lr`` bin --
+especially after applying ``min_Dr`` or ``exclude_cases`` filters.
+
 After running multiple HPO versions, use the
 [Version Comparison](version_comparison.md) tool to review training
 progress and compare evaluation metrics across versions.
 
@@ -0,0 +1,193 @@
+# Case Distribution Analysis
+
+This guide covers the ``analyze_case_distribution.py`` preprocessing tool,
+which previews how simulation cases are distributed across the three
+design parameters (``Dr``, ``Re``, ``Lr``) before training.
+
+## Why it matters
+
+The alpha-D and case-level pressure-drop surrogates generalise only as
+well as their training support permits.  Bins with few training samples
+produce unreliable predictions regardless of model capacity.  Running
+this tool up-front answers:
+
+- Do I have enough cases in each ``Dr`` / ``Re`` / ``Lr`` bin?
+- Which bins will my ``min_Dr`` / ``exclude_cases`` filter remove?
+- Does a recorded train / test split cover every bin?
+
+It inspects the Zarr directory (and optionally a ``run_meta.json``) and
+prints coloured tables so under-supported regions are obvious.
+
+## How it works
+
+```
+data/flow_contraction_expansion/parametric_study/processed/
+  Re_*__Dr_*__Lr_*.zarr        <-- discovered by the tool
+                                    (parsed from the case name)
+
+data/models/.../run_meta.json   <-- optional; when provided,
+                                    Train/Test columns are populated
+                                    from split.train_sims / test_sims
+```
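As a rough illustration of how the three parameters can be recovered from a case name, here is a minimal sketch. The helper `parse_case_name` and its regular expression are hypothetical (the real parser is `parse_case_params` in `distribution.py`, whose signature is not shown in this change), and it assumes that `p` stands in for the decimal point, as names like `Dr_0p05` suggest.

```python
from __future__ import annotations

import re

# Hypothetical pattern for names such as "Re_11927__Dr_0p05__Lr_0p052".
# Assumption: "p" replaces the decimal point in each numeric field.
_CASE_RE = re.compile(
    r"^Re_(?P<Re>[0-9p]+)__Dr_(?P<Dr>[0-9p]+)__Lr_(?P<Lr>[0-9p]+)$"
)


def parse_case_name(name: str) -> dict[str, float]:
    """Return {"Re": ..., "Dr": ..., "Lr": ...} parsed from a case name."""
    # Accept either the bare stem or the store name with its ".zarr" suffix.
    stem = name[: -len(".zarr")] if name.endswith(".zarr") else name
    match = _CASE_RE.match(stem)
    if match is None:
        raise ValueError(f"unrecognised case name: {name!r}")
    return {
        key: float(value.replace("p", "."))
        for key, value in match.groupdict().items()
    }
```

Names that do not follow the `Re_*__Dr_*__Lr_*` convention raise a `ValueError` in this sketch; the actual tool may handle them differently.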
+
+Support thresholds (based on the **train** count when a split is
+provided, or the total count otherwise):
+
+| Marker | Train cases | Meaning |
+|--------|-------------|---------|
+| ``✗ none`` | 0 | Bin will not be learned at all |
+| ``⚠ very low`` | 1–2 | Extreme extrapolation risk |
+| ``⚠ low`` | 3–9 | Generalisation in this bin is unreliable |
+| ``◦ ok`` | 10–29 | Usable but watch for drift |
+| ``✓ good`` | ≥ 30 | Adequate support |
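The table above can be read as a cascade of thresholds checked from top to bottom. The sketch below encodes that reading; `classify_support` is a hypothetical name, and the tool's own implementation may differ in structure or naming.

```python
def classify_support(n_cases: int) -> str:
    """Map a case count to the support marker from the thresholds table.

    Assumption: rows are evaluated top to bottom, so "< 3" means 1-2 cases,
    "< 10" means 3-9 cases, and "< 30" means 10-29 cases.
    """
    if n_cases <= 0:
        return "✗ none"
    if n_cases < 3:
        return "⚠ very low"
    if n_cases < 10:
        return "⚠ low"
    if n_cases < 30:
        return "◦ ok"
    return "✓ good"
```

Pass the train count when a split is available, or the total count otherwise, to match the rule stated before the table.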
+
+## Quick start
+
+From inside the container:
+
+```bash
+cd src && python analyze_case_distribution.py \
+    --run-meta ../data/models/case_pressure_drop/run_meta.json
+```
+
+From the host with Apptainer:
+
+```bash
+apptainer exec th-holo-gpu.sif bash -c \
+    'cd src && python analyze_case_distribution.py \
+        --run-meta ../data/models/case_pressure_drop/run_meta.json'
+```
+
+## Usage examples
+
+### Inspect the raw Zarr directory (before training)
+
+```bash
+cd src && python analyze_case_distribution.py \
+    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed
+```
+
+Reports every case in a single ``Total`` column and classifies Support
+on that total.  Use this to check the raw simulation inventory.
+
+### Preview filters that will be applied during training
+
+```bash
+cd src && python analyze_case_distribution.py \
+    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \
+    --min-Dr 0.333
+```
+
+Mirrors the filtering logic in ``TabularPairDataset``.  Useful when
+deciding the ``data.min_Dr`` value in ``alpha_d_mlp.yaml``: run it with
+different thresholds and see which bins disappear.
+
+### Exclude specific problematic cases
+
+```bash
+cd src && python analyze_case_distribution.py \
+    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \
+    --exclude Re_11927__Dr_0p05__Lr_0p052 \
+    --exclude Re_7722__Dr_0p05__Lr_0p052
+```
+
+``--exclude`` can be repeated to drop any number of cases.  Each value
+must exactly match a Zarr store's stem (its name without ``.zarr``).
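To make the combined effect of ``--min-Dr`` and ``--exclude`` concrete, here is a minimal sketch of the two filters applied to a list of case stems. Both `dr_of` and `filter_cases` are hypothetical helpers written for illustration; they assume the `p`-for-decimal-point naming and are not the code in `TabularPairDataset` or `load_sim_names_from_zarr`.

```python
from __future__ import annotations

import re


def dr_of(case_name: str) -> float:
    """Extract Dr from a stem such as "Re_7722__Dr_0p05__Lr_0p052"."""
    match = re.search(r"__Dr_([0-9p]+)", case_name)
    if match is None:
        raise ValueError(f"no Dr field in case name: {case_name!r}")
    return float(match.group(1).replace("p", "."))


def filter_cases(
    names: list[str],
    min_Dr: float | None = None,
    exclude: tuple[str, ...] | list[str] = (),
) -> list[str]:
    """Drop excluded stems (exact match) and cases whose Dr is below min_Dr."""
    excluded = set(exclude)
    kept = []
    for name in names:
        if name in excluded:
            continue
        if min_Dr is not None and dr_of(name) < min_Dr:
            continue
        kept.append(name)
    return kept
```

A case whose ``Dr`` equals the threshold is kept here, following the "below this value" wording in the CLI reference.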
+
+### Inspect a prior train / test split
+
+```bash
+cd src && python analyze_case_distribution.py \
+    --run-meta ../data/models/case_pressure_drop/run_meta.json
+```
+
+Populates ``Train`` and ``Test`` columns from the recorded split and
+classifies Support on the train count.  If ``--zarr-dir`` is omitted,
+the tool reads ``data.zarr_dir`` from ``run_meta.json``.
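For readers who want to inspect a ``run_meta.json`` by hand, the following sketch pulls out the same fields the text mentions: ``split.train_sims``, ``split.test_sims``, and ``data.zarr_dir``. The function names are hypothetical, and the real `load_split_from_run_meta` may validate or normalise the file differently.

```python
from __future__ import annotations

import json
from pathlib import Path


def split_from_meta(meta: dict) -> tuple[list[str], list[str], str | None]:
    """Return (train_sims, test_sims, zarr_dir) from a parsed run_meta dict.

    Missing keys fall back to empty lists / None rather than raising.
    """
    split = meta.get("split", {})
    train = list(split.get("train_sims", []))
    test = list(split.get("test_sims", []))
    zarr_dir = meta.get("data", {}).get("zarr_dir")
    return train, test, zarr_dir


def read_split(run_meta_path: str) -> tuple[list[str], list[str], str | None]:
    """Load run_meta.json from disk and extract the recorded split."""
    text = Path(run_meta_path).expanduser().read_text()
    return split_from_meta(json.loads(text))
```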
+
+### Restrict to a subset of axes
+
+```bash
+cd src && python analyze_case_distribution.py \
+    --run-meta ../data/models/case_pressure_drop/run_meta.json \
+    --axes Dr
+```
+
+Useful when you only care about one parameter (e.g. diagnosing poor
+performance at large ``Dr``).
+
+## CLI reference
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| ``--zarr-dir`` | from ``run_meta`` if provided | Directory of processed ``*.zarr`` case stores |
+| ``--run-meta`` | ``None`` | ``run_meta.json`` to read a recorded train / test split |
+| ``--min-Dr`` | ``None`` | Drop cases whose ``Dr`` is below this value |
+| ``--exclude`` | ``[]`` | Case name to exclude (repeatable) |
+| ``--axes`` | ``Dr Re Lr`` | Which parameter axes to report |
+
+At least one of ``--zarr-dir`` or ``--run-meta`` must be provided.
+
+## Output sections
+
+### Header panel
+
+Summarises the total case count, the Zarr directory in use, and the
+train / test split (when a ``run_meta.json`` is supplied).
+
+### Per-axis distribution tables
+
+One table per axis (``Dr``, ``Re``, ``Lr``).  Columns:
+
+- **Axis value** (e.g. ``Dr = 0.900``) -- rounded to 3 decimals, except
+  ``Re`` which is shown as an integer.
+- **Train / Test** -- present only when a run-meta is provided.
+- **Total** -- the union of train, test, and any other cases in the
+  Zarr directory.
+- **Support** -- coloured marker classifying the training support.
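The axis-value formatting described above (three decimals, with ``Re`` as an integer) could look roughly like this. `format_axis_label` is a hypothetical helper for illustration, not a function confirmed to exist in the tool.

```python
def format_axis_label(axis: str, value: float) -> str:
    """Format a bin label, e.g. "Dr = 0.900" or "Re = 11927"."""
    if axis == "Re":
        # Reynolds numbers are shown as integers.
        return f"{axis} = {int(round(value))}"
    # Dr and Lr are rounded to three decimals.
    return f"{axis} = {value:.3f}"
```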
+
+Bins flagged ``⚠ very low`` or ``✗ none`` are likely to show outsized
+evaluation errors.  Cross-reference them with the
+[version comparison tool](version_comparison.md) to confirm.
+
+## Typical workflow
+
+1. **Before the first ETL→training pass**, run with ``--zarr-dir`` only
+   to see the raw simulation inventory.  Look for bins with fewer than
+   3 cases and decide whether to gather more simulations or drop them.
+
+2. **Before each HPO run**, run with ``--zarr-dir`` plus the
+   ``--min-Dr`` and ``--exclude`` values from your config.  Confirm
+   you still have ``◦ ok`` or better support in every bin you care
+   about.
+
+3. **After a training run**, run with ``--run-meta`` to verify the
+   stratified split gave each bin at least one train and one test case.
+
+4. **When diagnosing a worst-case list** (see
+   ``evaluate_case_pressure_drop.py`` output), look up the failing
+   cases' ``Dr`` / ``Re`` / ``Lr`` in this table.  If they land in a
+   ``⚠ low``-support bin, the fix is data, not model.
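Step 3 of the workflow can also be checked programmatically. The sketch below is one possible way to list bins that lack a train or a test case along a single axis; `bins_missing_coverage` is a hypothetical helper, and it expects the per-case axis values (for example each case's ``Dr``) to have been extracted already.

```python
from __future__ import annotations

from collections import Counter


def bins_missing_coverage(
    train_values: list[float], test_values: list[float]
) -> list[float]:
    """Return axis values that have no train case or no test case."""
    train = Counter(train_values)
    test = Counter(test_values)
    # Counter returns 0 for values it has never seen.
    return sorted(
        value
        for value in set(train) | set(test)
        if train[value] == 0 or test[value] == 0
    )
```

An empty result means every bin on that axis has at least one train and one test case.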
+
+## Adding a new axis
+
+The axis set is currently hard-coded to ``("Dr", "Re", "Lr")`` to match
+the case-name convention (``Re_*__Dr_*__Lr_*``).  If you add a new
+design parameter to the simulation campaign:
+
+1. Extend the case-name pattern in the ETL.
+2. Update ``parse_case_params`` in ``src/case_pressure_drop/distribution.py``
+   to extract the new key.
+3. Add the key to the ``AXES`` tuple and to the ``axis`` index maps
+   inside ``bin_by``.
+
+## Related guides
+
+- [Alpha-D Surrogate Tutorial](alpha_d_surrogate.md) -- end-to-end ETL,
+  training, and evaluation workflow.
+- [Hyperparameter Optimization](hyperparameter_optimization.md) --
+  configuring ``data.min_Dr`` and ``data.exclude_cases`` filters that
+  this tool previews.
+- [Version Comparison](version_comparison.md) -- review evaluation
+  metrics across versions; cross-reference worst-case lists with the
+  distribution tables produced here.
@@ -322,3 +322,7 @@ path is used.
 - After finishing an HPO run, use the
   [version comparison tool](version_comparison.md) to review progress
   and check for regressions across versions.
+- Before tuning ``data.min_Dr`` or ``data.exclude_cases``, preview the
+  resulting distribution with
+  [`analyze_case_distribution.py`](case_distribution_analysis.md) so
+  you don't accidentally reduce any bin to ``⚠ very low`` or ``✗ none`` support.
@@ -0,0 +1,113 @@
+"""CLI to preview training-dataset distribution before running HPO / training.
+
+Usage (inside the container):
+    # Inspect the raw zarr directory
+    python analyze_case_distribution.py \\
+        --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed
+
+    # Apply the same filters the training pipeline will use
+    python analyze_case_distribution.py \\
+        --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \\
+        --min-Dr 0.333
+
+    # Show the train / test split recorded in a previous run
+    python analyze_case_distribution.py \\
+        --run-meta ../data/models/case_pressure_drop/run_meta.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from case_pressure_drop.distribution import (
+    AXES,
+    load_sim_names_from_zarr,
+    load_split_from_run_meta,
+    print_distribution_rich,
+)
+
+
+def main(argv: list[str] | None = None) -> None:
+    parser = argparse.ArgumentParser(
+        description="Preview the distribution of cases by Dr, Re, Lr before training.",
+    )
+    parser.add_argument(
+        "--zarr-dir",
+        default=None,
+        help=(
+            "Directory of processed *.zarr case stores.  If omitted but "
+            "--run-meta is given, the zarr_dir recorded in run_meta.json is used."
+        ),
+    )
+    parser.add_argument(
+        "--run-meta",
+        default=None,
+        help=(
+            "Optional path to a run_meta.json; when provided, Train/Test columns "
+            "are populated from its recorded split."
+        ),
+    )
+    parser.add_argument(
+        "--min-Dr",
+        type=float,
+        default=None,
+        help="Exclude cases whose Dr is below this value (matches TabularPairDataset).",
+    )
+    parser.add_argument(
+        "--exclude",
+        action="append",
+        default=[],
+        help="Case name to exclude (repeatable).",
+    )
+    parser.add_argument(
+        "--axes",
+        nargs="+",
+        default=list(AXES),
+        choices=list(AXES),
+        help="Which parameter axes to report.  Default: Dr Re Lr.",
+    )
+    args = parser.parse_args(argv)
+
+    train_sims: list[str] = []
+    test_sims: list[str] = []
+    zarr_dir: str | None = None
+
+    if args.run_meta:
+        train_sims, test_sims = load_split_from_run_meta(args.run_meta)
+        # If the user did not pass --zarr-dir, try to read it from run_meta.
+        if not args.zarr_dir:
+            import json
+
+            meta = json.loads(Path(args.run_meta).expanduser().resolve().read_text())
+            zarr_dir = meta.get("data", {}).get("zarr_dir")
+
+    if args.zarr_dir:
+        zarr_dir = args.zarr_dir
+
+    if zarr_dir:
+        all_sims = load_sim_names_from_zarr(
+            zarr_dir,
+            exclude_cases=args.exclude,
+            min_Dr=args.min_Dr,
+        )
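+    elif train_sims or test_sims:
+        # No Zarr directory available: fall back to the recorded split.
+        # Note that --min-Dr and --exclude are not applied on this path.
+        all_sims = sorted(set(train_sims) | set(test_sims))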
+    else:
+        parser.error("Provide --zarr-dir and/or --run-meta.")
+
+    print_distribution_rich(
+        all_sims=all_sims,
+        train_sims=train_sims or None,
+        test_sims=test_sims or None,
+        zarr_dir=zarr_dir,
+        axes=tuple(args.axes),
+    )
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,14 @@
+"""Case-level pressure-drop regression workflow."""
+
+from case_pressure_drop.data import CANDIDATE_FEATURES, CasePressureDropDataset
+from case_pressure_drop.workflow import (
+    evaluate_case_pressure_drop,
+    train_case_pressure_drop,
+)
+
+__all__ = [
+    "CANDIDATE_FEATURES",
+    "CasePressureDropDataset",
+    "evaluate_case_pressure_drop",
+    "train_case_pressure_drop",
+]