|
| 1 | +# Case Distribution Analysis |
| 2 | + |
| 3 | +This guide covers the ``analyze_case_distribution.py`` preprocessing tool, |
| 4 | +which previews how simulation cases are distributed across the three |
| 5 | +design parameters (``Dr``, ``Re``, ``Lr``) before training. |
| 6 | + |
| 7 | +## Why it matters |
| 8 | + |
| 9 | +The alpha-D and case-level pressure-drop surrogates generalise only as |
| 10 | +well as their training support permits. Bins with few training samples |
| 11 | +produce unreliable predictions regardless of model capacity. Running |
| 12 | +this tool up-front answers: |
| 13 | + |
| 14 | +- Do I have enough cases in each ``Dr`` / ``Re`` / ``Lr`` bin? |
| 15 | +- Which bins will my ``min_Dr`` / ``exclude_cases`` filter remove? |
| 16 | +- Does a recorded train / test split cover every bin? |
| 17 | + |
| 18 | +It inspects the Zarr directory (and optionally a ``run_meta.json``) and |
| 19 | +prints coloured tables so under-supported regions are obvious. |
| 20 | + |
| 21 | +## How it works |
| 22 | + |
| 23 | +``` |
| 24 | +data/flow_contraction_expansion/parametric_study/processed/ |
| 25 | + Re_*__Dr_*__Lr_*.zarr <-- discovered by the tool |
| 26 | + (parsed from the case name) |
| 27 | + |
| 28 | +data/models/.../run_meta.json <-- optional; when provided, |
| 29 | + Train/Test columns are populated |
| 30 | + from split.train_sims / test_sims |
| 31 | +``` |
| 32 | + |
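The name parsing sketched in the diagram can be illustrated as follows. This is a minimal sketch, not the tool's actual ``parse_case_params``: the helper name ``parse_case_name`` is hypothetical, and the convention that ``p`` stands in for the decimal point is an assumption inferred from case names such as ``Re_11927__Dr_0p05__Lr_0p052``.

```python
import re
from pathlib import Path

# Assumed convention: fields are joined by "__" and "p" replaces the
# decimal point (inferred from names like Re_11927__Dr_0p05__Lr_0p052).
_CASE_RE = re.compile(
    r"^Re_(?P<Re>[0-9p]+)__Dr_(?P<Dr>[0-9p]+)__Lr_(?P<Lr>[0-9p]+)$"
)


def parse_case_name(path_or_stem: str) -> dict[str, float]:
    """Return {"Re": ..., "Dr": ..., "Lr": ...} parsed from a case name.

    Accepts either a bare stem or a path ending in ``<stem>.zarr``.
    """
    stem = Path(path_or_stem).name.removesuffix(".zarr")
    match = _CASE_RE.match(stem)
    if match is None:
        raise ValueError(f"not a recognised case name: {stem!r}")
    return {key: float(val.replace("p", ".")) for key, val in match.groupdict().items()}


params = parse_case_name("processed/Re_11927__Dr_0p05__Lr_0p052.zarr")
```

Raising on an unrecognised name, rather than skipping it silently, keeps a misnamed store from quietly vanishing from the counts.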
| 33 | +Support thresholds (based on the **train** count when a split is |
| 34 | +provided, or the total count otherwise): |
| 35 | + |
| 36 | +| Marker | Train cases | Meaning | |
| 37 | +|--------|-------------|---------| |
| 38 | +| ``✗ none`` | 0 | Bin will not be learned at all | |
| 39 | +| ``⚠ very low`` | 1–2 | Extreme extrapolation risk | |
| 40 | +| ``⚠ low`` | 3–9 | Generalisation in this bin is unreliable | |
| 41 | +| ``◦ ok`` | 10–29 | Usable but watch for drift | |
| 42 | +| ``✓ good`` | ≥ 30 | Adequate support | |
| 43 | + |
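Read top to bottom, the table is a cascade of cut-offs. A minimal sketch of that classification, assuming the thresholds are checked in table order (the function name ``classify_support`` is hypothetical and may not match the tool's internals):

```python
def classify_support(n_cases: int) -> str:
    """Map a per-bin case count to the support marker from the table.

    ``n_cases`` is the train count when a split is provided, otherwise
    the total count.
    """
    if n_cases < 0:
        raise ValueError("case count cannot be negative")
    if n_cases == 0:
        return "✗ none"
    if n_cases < 3:
        return "⚠ very low"
    if n_cases < 10:
        return "⚠ low"
    if n_cases < 30:
        return "◦ ok"
    return "✓ good"
```

For example, a bin with 9 train cases is still ``⚠ low``, while a tenth case moves it to ``◦ ok``.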
| 44 | +## Quick start |
| 45 | + |
| 46 | +From inside the container: |
| 47 | + |
| 48 | +```bash |
| 49 | +cd src && python analyze_case_distribution.py \ |
| 50 | + --run-meta ../data/models/case_pressure_drop/run_meta.json |
| 51 | +``` |
| 52 | + |
| 53 | +From the host with Apptainer: |
| 54 | + |
| 55 | +```bash |
| 56 | +apptainer exec th-holo-gpu.sif bash -c \ |
| 57 | + 'cd src && python analyze_case_distribution.py \ |
| 58 | + --run-meta ../data/models/case_pressure_drop/run_meta.json' |
| 59 | +``` |
| 60 | + |
| 61 | +## Usage examples |
| 62 | + |
| 63 | +### Inspect the raw Zarr directory (before training) |
| 64 | + |
| 65 | +```bash |
| 66 | +cd src && python analyze_case_distribution.py \ |
| 67 | + --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed |
| 68 | +``` |
| 69 | + |
| 70 | +Reports the whole dataset in a single ``Total`` column, and classifies |
| 71 | +Support on that total. Use this to check the raw simulation inventory. |
| 72 | + |
| 73 | +### Preview filters that will be applied during training |
| 74 | + |
| 75 | +```bash |
| 76 | +cd src && python analyze_case_distribution.py \ |
| 77 | + --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \ |
| 78 | + --min-Dr 0.333 |
| 79 | +``` |
| 80 | + |
| 81 | +Mirrors the filtering logic in ``TabularPairDataset``. Useful when |
| 82 | +deciding the ``data.min_Dr`` value in ``alpha_d_mlp.yaml``: run it with |
| 83 | +different thresholds and see which bins disappear. |
| 84 | + |
| 85 | +### Exclude specific problematic cases |
| 86 | + |
| 87 | +```bash |
| 88 | +cd src && python analyze_case_distribution.py \ |
| 89 | + --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \ |
| 90 | + --exclude Re_11927__Dr_0p05__Lr_0p052 \ |
| 91 | + --exclude Re_7722__Dr_0p05__Lr_0p052 |
| 92 | +``` |
| 93 | + |
| 94 | +``--exclude`` can be repeated to drop any number of case names. Each |
| 95 | +name must exactly match the stem of a Zarr store (its name without ``.zarr``). |
| 96 | + |
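The combined effect of ``--min-Dr`` and ``--exclude`` can be sketched as below. This is an illustration under stated assumptions, not the logic in ``TabularPairDataset``: the function ``apply_filters`` and its input shape (a mapping from case stem to ``Dr``) are hypothetical, and the third case name is made up for the example.

```python
from typing import Iterable, Mapping, Optional


def apply_filters(
    case_dr: Mapping[str, float],
    min_dr: Optional[float] = None,
    exclude: Iterable[str] = (),
) -> list[str]:
    """Return the case stems that survive both filters, in input order."""
    excluded = set(exclude)
    kept = []
    for stem, dr in case_dr.items():
        if stem in excluded:  # exact stem match only
            continue
        if min_dr is not None and dr < min_dr:  # drop cases below the threshold
            continue
        kept.append(stem)
    return kept


cases = {
    "Re_11927__Dr_0p05__Lr_0p052": 0.05,
    "Re_7722__Dr_0p05__Lr_0p052": 0.05,
    "Re_7722__Dr_0p5__Lr_0p052": 0.5,  # hypothetical case for illustration
}
kept = apply_filters(cases, min_dr=0.333, exclude=["Re_7722__Dr_0p05__Lr_0p052"])
```

Because the exclusion is an exact match, a typo in a case name removes nothing, so it is worth checking that the ``Total`` count drops by the number of names passed.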
| 97 | +### Inspect a prior train / test split |
| 98 | + |
| 99 | +```bash |
| 100 | +cd src && python analyze_case_distribution.py \ |
| 101 | + --run-meta ../data/models/case_pressure_drop/run_meta.json |
| 102 | +``` |
| 103 | + |
| 104 | +Populates ``Train`` and ``Test`` columns from the recorded split and |
| 105 | +classifies Support on the train count. If ``--zarr-dir`` is omitted, |
| 106 | +the tool reads ``data.zarr_dir`` from ``run_meta.json``. |
| 107 | + |
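How the ``Train`` and ``Test`` columns could be derived from a recorded split is sketched below. It assumes ``run_meta.json`` nests two lists of case stems under ``split.train_sims`` and ``split.test_sims``; the helpers ``axis_value``, ``split_counts``, and ``uncovered_bins`` are hypothetical names, the ``p``-for-decimal-point convention is inferred from the case names, and the case names in the example are invented.

```python
from collections import Counter


def axis_value(stem: str, axis: str) -> float:
    """Extract one axis from a stem like Re_7722__Dr_0p05__Lr_0p052."""
    fields = dict(part.split("_", 1) for part in stem.split("__"))
    return float(fields[axis].replace("p", "."))


def split_counts(run_meta: dict, axis: str) -> dict[float, tuple[int, int]]:
    """Map each axis value to (train_count, test_count)."""
    train = Counter(axis_value(s, axis) for s in run_meta["split"]["train_sims"])
    test = Counter(axis_value(s, axis) for s in run_meta["split"]["test_sims"])
    return {v: (train[v], test[v]) for v in sorted(set(train) | set(test))}


def uncovered_bins(counts: dict[float, tuple[int, int]]) -> list[float]:
    """Bins that lack a train case or a test case."""
    return [v for v, (n_train, n_test) in counts.items() if n_train == 0 or n_test == 0]


# Invented split for illustration; a real one would come from json.load().
run_meta = {
    "split": {
        "train_sims": [
            "Re_100__Dr_0p5__Lr_0p1",
            "Re_200__Dr_0p5__Lr_0p1",
            "Re_100__Dr_0p9__Lr_0p1",
        ],
        "test_sims": ["Re_300__Dr_0p5__Lr_0p1"],
    }
}
counts = split_counts(run_meta, "Dr")
```

In this invented split the ``Dr = 0.9`` bin has one train case and no test case, so ``uncovered_bins(counts)`` reports it.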
| 108 | +### Restrict to a subset of axes |
| 109 | + |
| 110 | +```bash |
| 111 | +cd src && python analyze_case_distribution.py \ |
| 112 | + --run-meta ../data/models/case_pressure_drop/run_meta.json \ |
| 113 | + --axes Dr |
| 114 | +``` |
| 115 | + |
| 116 | +Useful when you only care about one parameter (e.g. diagnosing poor |
| 117 | +performance at large ``Dr``). |
| 118 | + |
| 119 | +## CLI reference |
| 120 | + |
| 121 | +| Flag | Default | Description | |
| 122 | +|------|---------|-------------| |
| 123 | +| ``--zarr-dir`` | from ``run_meta`` if provided | Directory of processed ``*.zarr`` case stores | |
| 124 | +| ``--run-meta`` | ``null`` | ``run_meta.json`` to read a recorded train / test split | |
| 125 | +| ``--min-Dr`` | ``null`` | Drop cases whose ``Dr`` is below this value | |
| 126 | +| ``--exclude`` | ``[]`` | Case name to exclude (repeatable) | |
| 127 | +| ``--axes`` | ``Dr Re Lr`` | Which parameter axes to report | |
| 128 | + |
| 129 | +At least one of ``--zarr-dir`` or ``--run-meta`` must be provided. |
| 130 | + |
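The flag table and the "at least one of" rule could be expressed with ``argparse`` roughly as follows. This is a sketch only: whether the tool actually uses ``argparse``, and how it names its internals, is an assumption; only the flag names and defaults come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="analyze_case_distribution.py")
    parser.add_argument("--zarr-dir", default=None,
                        help="directory of processed *.zarr case stores")
    parser.add_argument("--run-meta", default=None,
                        help="run_meta.json holding a recorded train / test split")
    parser.add_argument("--min-Dr", dest="min_Dr", type=float, default=None,
                        help="drop cases whose Dr is below this value")
    parser.add_argument("--exclude", action="append", default=[],
                        help="case name to exclude (repeatable)")
    parser.add_argument("--axes", nargs="+", choices=("Dr", "Re", "Lr"),
                        default=["Dr", "Re", "Lr"],
                        help="which parameter axes to report")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce the rule that at least one input source is given.
    if args.zarr_dir is None and args.run_meta is None:
        parser.error("at least one of --zarr-dir or --run-meta must be provided")
    return args


args = parse_args(["--zarr-dir", "processed", "--min-Dr", "0.333",
                   "--exclude", "a", "--exclude", "b", "--axes", "Dr"])
```

Using ``action="append"`` is what lets ``--exclude`` be repeated, and ``nargs="+"`` lets ``--axes`` take one or more names in a single flag.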
| 131 | +## Output sections |
| 132 | + |
| 133 | +### Header panel |
| 134 | + |
| 135 | +Summarises the total case count, the Zarr directory in use, and the |
| 136 | +train / test split (when a ``run_meta.json`` is supplied). |
| 137 | + |
| 138 | +### Per-axis distribution tables |
| 139 | + |
| 140 | +One table per axis (``Dr``, ``Re``, ``Lr``). Columns: |
| 141 | + |
| 142 | +- **Axis value** (e.g. ``Dr = 0.900``) -- rounded to 3 decimals, except |
| 143 | + ``Re`` which is shown as an integer. |
| 144 | +- **Train / Test** -- present only when a run-meta is provided. |
| 145 | +- **Total** -- the count of all cases in the Zarr directory: train, test, |
| 146 | + and any cases outside the recorded split. |
| 147 | +- **Support** -- coloured marker classifying the training support. |
| 148 | + |
| 149 | +Bins flagged ``⚠ very low`` or ``✗ none`` are likely to show outsized |
| 150 | +evaluation errors. Cross-reference them with the |
| 151 | +[version comparison tool](version_comparison.md) to confirm. |
| 152 | + |
| 153 | +## Typical workflow |
| 154 | + |
| 155 | +1. **Before the first ETL→training pass**, run with ``--zarr-dir`` only |
| 156 | + to see the raw simulation inventory. Look for bins with fewer than |
| 157 | + 3 cases and decide whether to gather more simulations or drop them. |
| 158 | + |
| 159 | +2. **Before each HPO run**, run with ``--zarr-dir`` plus the |
| 160 | + ``--min-Dr`` and ``--exclude`` values from your config. Confirm |
| 161 | + you still have ``◦ ok`` or better support in every bin you care |
| 162 | + about. |
| 163 | + |
| 164 | +3. **After a training run**, run with ``--run-meta`` to verify the |
| 165 | + stratified split gave each bin at least one train and one test case. |
| 166 | + |
| 167 | +4. **When diagnosing a worst-case list** (see |
| 168 | + ``evaluate_case_pressure_drop.py`` output), look up the failing |
| 169 | + cases' ``Dr`` / ``Re`` / ``Lr`` in this table. If they land in a |
| 170 | + bin flagged ``⚠ low`` or worse, the fix is more data, not a different model. |
| 171 | + |
| 172 | +## Adding a new axis |
| 173 | + |
| 174 | +The axis set is currently hard-coded to ``("Dr", "Re", "Lr")`` to match |
| 175 | +the case-name convention (``Re_*__Dr_*__Lr_*``). If you add a new |
| 176 | +design parameter to the simulation campaign: |
| 177 | + |
| 178 | +1. Extend the case-name pattern in the ETL. |
| 179 | +2. Update ``parse_case_params`` in ``src/case_pressure_drop/distribution.py`` |
| 180 | + to extract the new key. |
| 181 | +3. Add the key to the ``AXES`` tuple and to the ``axis`` index maps |
| 182 | + inside ``bin_by``. |
| 183 | + |
| 184 | +## Related guides |
| 185 | + |
| 186 | +- [Alpha-D Surrogate Tutorial](alpha_d_surrogate.md) -- end-to-end ETL, |
| 187 | + training, and evaluation workflow. |
| 188 | +- [Hyperparameter Optimization](hyperparameter_optimization.md) -- |
| 189 | + configuring ``data.min_Dr`` and ``data.exclude_cases`` filters that |
| 190 | + this tool previews. |
| 191 | +- [Version Comparison](version_comparison.md) -- review evaluation |
| 192 | + metrics across versions; cross-reference worst-case lists with the |
| 193 | + distribution tables produced here. |