| 1 | +# OpenQuake h5py Extractor |
| 2 | + |
| 3 | +We use our own extractor to obtain data from OpenQuake HDF5 files rather than `openquake.calculators.extract.Extractor`. |
| 4 | + |
| 5 | +The output of `openquake.calculators.extract.Extractor` drifts between OQ minor versions. The version matrix test showed that OQ 3.20–3.23 collapse the poe dimension of disaggregation results when there is only one poe (typical for our use), while OQ 3.24+ keeps all degenerate dimensions. Because the Extractor API is unstable, any new OQ minor release could silently break the extraction code. |
| 6 | + |
| 7 | +The OQ HDF5 **file** layout is more stable. We replaced the Extractor with direct h5py reads, dropped `openquake-engine` as a dependency entirely, and validated the new reader against a fixture matrix generated by docker-based OQ runs across seven minor releases. |
| 8 | + |
| 9 | +## `OqHdf5Reader` class |
| 10 | + |
| 11 | +`toshi_hazard_store/oq_import/h5py_reader.py` — an `OqHdf5Reader` class that wraps an HDF5 file and exposes exactly the data the extraction code needs: |
| 12 | + |
| 13 | +| Method | HDF5 path(s) | Notes | |
| 14 | +|---|---|---| |
| 15 | +| `oqparam()` | `oqparam[()]` | JSON blob decoded to dict | |
| 16 | +| `sitecol()` | `sitecol/{sids,lat,lon,...}` | Parallel 1-D arrays → DataFrame | |
| 17 | +| `hcurves_rlzs()` | `hcurves-rlzs` | Returns `{rlz-N: arr(sites, imts, levels)}` | |
| 18 | +| `gsim_branches()` | `full_lt/gsim_lt` | `{branch_id: uncertainty_str}` | |
| 19 | +| `source_branches()` | `full_lt/source_model_lt` | `{idx: sm_lt_path_str}` (values used as source_map keys) | |
| 20 | +| `realizations()` | `full_lt/sm_data` + `full_lt/gsim_lt` | List of `_RlzRecord(source_path, gsim_path, ordinal)` | |
| 21 | +| `disagg_rlzs(kind, ...)` | `disagg-rlzs/<kind>`, `disagg-bins/*`, `best_rlzs` | Returns a `_DisaggExtract` proxy | |
| 22 | + |
| 23 | +The class is tested against fixtures generated by OQ 3.19.1–3.25.1. It is also tested against the 3.25.1 `openquake.calculators.extract.Extractor` for structural and numerical equivalence. |
| 24 | + |
| 25 | +## HDF5 layout reference |
| 26 | + |
| 27 | +### `oqparam` |
| 28 | + |
| 29 | +Scalar dataset holding a UTF-8 JSON blob of the full `OqParam` dict. Read with: |
| 30 | +```python |
| 31 | +cfg = json.loads(f['oqparam'][()].decode()) |
| 32 | +``` |
| 33 | + |
| 34 | +Relevant keys: `calculation_mode`, `hazard_imtls`, `iml_disagg`, `disagg_outputs`. Cross-version alias: some older OQ versions may use `intensity_measure_types_and_levels` instead of `hazard_imtls`; the reader resolves this transparently. |
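The alias handling described above could look roughly like the following. This is a minimal sketch rather than the reader's actual code: the helper name `resolve_imtls` is hypothetical, the sample blobs are illustrative, and it assumes both keys hold the same IMT-to-levels mapping.

```python
import json

# Key names come from the text above; the lookup order is an assumption.
IMTL_KEYS = ("hazard_imtls", "intensity_measure_types_and_levels")


def resolve_imtls(cfg: dict) -> dict:
    """Return the IMT -> levels mapping under whichever alias is present."""
    for key in IMTL_KEYS:
        if key in cfg:
            return cfg[key]
    raise KeyError(f"none of {IMTL_KEYS} found in oqparam")


# Illustrative oqparam blobs, not taken from a real calculation.
new_style = json.loads('{"hazard_imtls": {"PGA": [0.1, 0.2]}}')
old_style = json.loads('{"intensity_measure_types_and_levels": {"PGA": [0.1, 0.2]}}')
```

Callers then see one mapping regardless of which OQ version wrote the file.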
| 35 | + |
| 36 | +### `sitecol/` |
| 37 | + |
| 38 | +Parallel 1-D datasets, one per field: `sids`, `lon`, `lat`, `depth`, `vs30`, `vs30measured`, `z1pt0`, `z2pt5`, `backarc`. Each dataset has one entry per site. |
| 39 | + |
| 40 | +### `full_lt/` |
| 41 | + |
| 42 | +| Dataset | dtype | Content | |
| 43 | +|---|---|---| |
| 44 | +| `gsim_lt` | compound `(trt, branch, uncertainty, weight)` | One row per GMM branch. `uncertainty` is the raw `[ClassName]\nparam=val` GSIM string as bytes. The raw and OQ-normalised forms produce identical nzshm_model hash digests. | |
| 45 | +| `source_model_lt` | compound `(branchset, branch, utype, uvalue, weight)` | `branch` column = sm_lt_path string (e.g. `[dmgeologic, ...]`). | |
| 46 | +| `sm_data` | compound `(name, weight, path, samples)` | `path` = sm_lt_path; `samples` = number of realizations for this source model. | |
| 47 | + |
| 48 | +Realizations are reconstructed by iterating `sm_data` and, for each source model, iterating over the next `samples` rows of `gsim_lt` in declaration order. This matches OQ's enumeration for `number_of_logic_tree_samples = 0`. |
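A literal reading of that reconstruction can be sketched as below. The `RlzRecord` tuple is a hypothetical stand-in for the reader's `_RlzRecord`, the sample rows are made up, and plain tuples replace the HDF5 compound rows, so the bytes decoding that a real file would need is omitted.

```python
from typing import NamedTuple


class RlzRecord(NamedTuple):
    """Hypothetical stand-in for the reader's `_RlzRecord`."""
    source_path: str
    gsim_path: str
    ordinal: int


def reconstruct_realizations(sm_data, gsim_lt):
    """Pair each source model with the next `samples` rows of gsim_lt.

    sm_data: iterable of (name, weight, path, samples) rows.
    gsim_lt: sequence of (trt, branch, uncertainty, weight) rows.
    """
    records = []
    cursor = 0  # index of the next unread gsim_lt row
    for _name, _weight, path, samples in sm_data:
        for _trt, branch, _uncertainty, _w in gsim_lt[cursor:cursor + samples]:
            # The ordinal is simply the running count of records so far.
            records.append(RlzRecord(path, branch, len(records)))
        cursor += samples
    return records


# Illustrative rows only; real values come from full_lt/sm_data and full_lt/gsim_lt.
sm_data = [("model_a", 1.0, "sm_path_0", 2)]
gsim_lt = [("trt_x", "gA", "[GmmA]", 0.6), ("trt_x", "gB", "[GmmB]", 0.4)]
rlzs = reconstruct_realizations(sm_data, gsim_lt)
```

Keeping a running cursor preserves declaration order, which is what makes the ordinals line up with OQ's own enumeration in the unsampled case.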
| 49 | + |
| 50 | +### `hcurves-rlzs` |
| 51 | + |
| 52 | +Shape `(n_sites, n_rlz, n_imts, n_levels)`. Carries a `json` attribute whose `shape_descr` lists axis names; the `imt` key gives ordered IMT names. |
| 53 | + |
| 54 | +### `disagg-rlzs/<kind>` |
| 55 | + |
| 56 | +Shape `(n_sites, <kind_axes>, n_imt, n_poe, n_rlz)` where `<kind_axes>` expands the underscore-separated kind name (e.g. `TRT_Mag_Dist_Eps` → `trt, mag, dist, eps`). No `json` attribute — axes are inferred from the kind string. |
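Inferring the axes from the kind string is a small amount of string handling. The sketch below assumes the fixed axis labels `site`, `imt`, `poe` and `rlz` (the names are illustrative) and uses a hypothetical function name `disagg_axes`.

```python
def disagg_axes(kind: str) -> tuple:
    """Expand a disagg kind such as 'TRT_Mag_Dist_Eps' into ordered axis names."""
    kind_axes = tuple(part.lower() for part in kind.split("_"))
    # Site comes first; imt, poe and rlz trail the kind-specific axes.
    return ("site",) + kind_axes + ("imt", "poe", "rlz")


axes = disagg_axes("TRT_Mag_Dist_Eps")
```

Comparing `len(axes)` with the dataset's `ndim` is a cheap way to notice a file whose layout differs from the one described here.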
| 57 | + |
| 58 | +`disagg-bins/{Axis}` contains bin edges (numeric axes) or labels (TRT). The reader computes bin centres as `(edges[:-1] + edges[1:]) / 2`. |
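The two cases (labels versus numeric edges) might be handled as follows. This is a dependency-free sketch: it mirrors the numpy expression above with plain lists, the function name `bin_values` is hypothetical, and it assumes TRT labels may arrive as bytes.

```python
def bin_values(axis: str, raw):
    """Return labels for the TRT axis, or bin centres for a numeric axis."""
    if axis == "TRT":
        # Assumption: labels may be stored as bytes and need decoding.
        return [v.decode() if isinstance(v, bytes) else v for v in raw]
    edges = [float(e) for e in raw]
    # Same arithmetic as (edges[:-1] + edges[1:]) / 2, one pair at a time.
    return [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]


mag_centres = bin_values("Mag", [5.0, 6.0, 7.0])
```

Note that `n` edges yield `n - 1` centres, so the result length should match the corresponding axis of the `disagg-rlzs/<kind>` array.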
| 59 | + |
| 60 | +The rlz axis ordering follows `best_rlzs[site_idx]` — an integer array giving the ordinal of each rlz in the disagg result in descending-weight order. |
| 61 | + |
| 62 | +## Cross-version fixture matrix |
| 63 | + |
| 64 | +### Generating fixtures |
| 65 | + |
| 66 | +A developer may want to generate new test fixtures in two cases: when `OqHdf5Reader` gains functionality that reads features absent from the existing fixtures, or when support for a new OpenQuake version is added. The tests in `tests/oq_import/test_cross_version_fixtures.py` and `tests/oq_import/test_extractor_snapshot_cross_version.py` then use these fixtures to check that `OqHdf5Reader` continues to behave as expected. |
| 67 | + |
| 68 | +Prerequisites: Docker installed and `openquake/engine:<ver>` images pullable. |
| 69 | + |
| 70 | +OQ job inputs live in `scripts/oq_input/` (committed): |
| 71 | + |
| 72 | +``` |
| 73 | +scripts/oq_input/ |
| 74 | + sources/ ← shared NSHM source model |
| 75 | + gsim_model.xml ← shared GSIM logic tree |
| 76 | + job_classical.ini |
| 77 | + job_disagg.ini |
| 78 | + sites_classical.csv |
| 79 | + sites_disagg.csv |
| 80 | +``` |
| 81 | + |
| 82 | +Both calc modes mount this directory as `/job` inside the container and run the appropriate ini file. `export_dir = /tmp` is set in both ini files so OQ can write CSV exports to `/tmp` without touching the read-only mount. |
| 83 | + |
| 84 | +```bash |
| 85 | +uv run python scripts/regen_oq_fixtures.py --mode both |
| 86 | +``` |
| 87 | + |
| 88 | +This will: |
| 89 | +1. Pull `openquake/engine:<ver>` for each version in `OQ_VERSIONS`. |
| 90 | +2. Detect the image entrypoint (older images use `/bin/bash -c`; newer use `./oq-start.sh`) and build the docker CMD accordingly. |
| 91 | +3. Run `oq engine --run /job/job_{classical,disagg}.ini` inside the container. |
| 92 | +4. Use `docker cp` (host-side) to pull the resulting `calc_*.hdf5` out of the stopped container — avoids all container-side write-permission issues. |
| 93 | +5. Write `tests/fixtures/oq_cross_version/{classical,disaggregation}/oq_<ver>/calc.hdf5` alongside a `manifest.json` recording the image digest, generation timestamp and file checksum. |
| 94 | +6. Skip any (version, mode) pair whose `manifest.json` already exists and whose `hdf5_sha256` still matches. |
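The skip check in step 6 can be sketched as below. The `hdf5_sha256` field name comes from the text above; the function names and the exact manifest handling are assumptions, not the script's actual implementation.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large HDF5 files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixture_is_current(fixture_dir: Path) -> bool:
    """True when manifest.json exists and its hdf5_sha256 matches calc.hdf5."""
    manifest_path = fixture_dir / "manifest.json"
    hdf5_path = fixture_dir / "calc.hdf5"
    if not (manifest_path.exists() and hdf5_path.exists()):
        return False
    manifest = json.loads(manifest_path.read_text())
    return manifest.get("hdf5_sha256") == sha256_of(hdf5_path)
```

A (version, mode) pair for which this returns `True` would be skipped unless `--force` is given.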
| 95 | + |
| 96 | +Flags: |
| 97 | +- `--version 3.25.1` — regenerate a single version |
| 98 | +- `--mode classical|disaggregation|both` |
| 99 | +- `--force` — overwrite existing fixtures |
| 100 | +- `--dry-run` — print docker commands without running them |
| 101 | + |
| 102 | +### Extractor snapshots |
| 103 | + |
| 104 | +Each fixture directory also contains two pre-baked snapshot files captured from the canonical OQ `Extractor` running inside the same Docker image that produced `calc.hdf5`: |
| 105 | + |
| 106 | +| File | Contents | |
| 107 | +|---|---| |
| 108 | +| `extractor_snapshot.npz` | Numpy arrays: `sitecol__lat/lon/vs30`, per-rlz `hcurves_rlzs__rlz_NNN` (classical), `disagg__array` (disagg). Load with `np.load(..., allow_pickle=False)`. | |
| 109 | +| `extractor_snapshot.json` | Non-array metadata: `oqparam_json`, `realizations`, `hcurves_rlzs_keys`, disagg `kind`/`imt`/`shape_descr`/`rlz_labels`/`disagg_bins`. | |
| 110 | + |
| 111 | +The snapshot is the within-version numerical ground truth consumed by `tests/oq_import/test_extractor_snapshot_cross_version.py` — no host-side OQ install needed at test time. `manifest.json` records `extractor_snapshot_npz_sha256` and `extractor_snapshot_json_sha256` for integrity checking. |
| 112 | + |
| 113 | +Snapshots are generated automatically by `regen_oq_fixtures.py` (a second `docker run` step after the OQ calculation). If snapshots are missing (e.g. for fixtures created before this feature), the corresponding tests skip with an actionable message. |
| 114 | + |
| 115 | +### Adding a new OQ version |
| 116 | + |
| 117 | +1. Append the version string to `OQ_VERSIONS` in `scripts/regen_oq_fixtures.py`. |
| 118 | +2. Run `uv run python scripts/regen_oq_fixtures.py --mode both --version <new_ver>`. |
| 119 | +3. Commit the new `calc.hdf5`, `extractor_snapshot.npz`, `extractor_snapshot.json`, and `manifest.json`. |
| 120 | +4. Run `uv run pytest tests/oq_import/test_cross_version_fixtures.py tests/oq_import/test_extractor_snapshot_cross_version.py -v` — new tests are auto-discovered from the fixture directory. |
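Auto-discovery of this kind usually amounts to globbing the fixture tree. The sketch below shows one way it could work; the function name is hypothetical and the real test modules may discover fixtures differently.

```python
from pathlib import Path


def discover_fixture_dirs(root: Path, mode: str) -> list:
    """Return the oq_<ver> directories under root/<mode> that contain calc.hdf5."""
    return sorted(
        p for p in (root / mode).glob("oq_*")
        if (p / "calc.hdf5").is_file()
    )
```

The resulting list could be passed to `pytest.mark.parametrize`, using each directory name as the test id. The sort is lexical, which is enough for a stable order but is not a true version ordering.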
| 121 | + |
| 122 | +### Inspecting a fixture by hand |
| 123 | + |
| 124 | +```bash |
| 125 | +uv run python -c " |
| 126 | +import h5py, json |
| 127 | +with h5py.File('tests/fixtures/oq_cross_version/disaggregation/oq_3.25.1/calc.hdf5') as f: |
| 128 | + f.visit(print) |
| 129 | + print(json.loads(f['oqparam'][()].decode())['calculation_mode']) |
| 130 | +" |
| 131 | +``` |
| 132 | + |
| 133 | +## Compatibility testing |
| 134 | + |
| 135 | +Two complementary suites compare `OqHdf5Reader` against the canonical OQ `Extractor`: |
| 136 | + |
| 137 | +**`tests/oq_import/test_extractor_compat.py`** runs `OqHdf5Reader` and `openquake.calculators.extract.Extractor` live, side by side, on the committed classical and disagg fixtures, asserting numerical and structural identity for every field, including `bins_digest` and end-to-end RecordBatch output. It is opt-in because it pulls `openquake-engine==3.25.1` (~200 MB); a normal `uv run pytest` run skips every test in this file via a `HAVE_OQ` guard. |
| 138 | + |
| 139 | +**`tests/oq_import/test_extractor_snapshot_cross_version.py`** — compares `OqHdf5Reader` against the pre-baked Extractor snapshots for all seven OQ versions. No host-side OQ install needed; runs in normal `uv run pytest`. Covers `oqparam`, `sitecol`, `realizations`, `hcurves_rlzs` (classical), and `disagg_rlzs` (disaggregation) for each version. Tests skip gracefully if a snapshot is absent. |
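An import guard of the `HAVE_OQ` kind can be sketched with the standard library alone. The actual guard in `test_extractor_compat.py` may be written differently; the class and test names here are hypothetical, and `unittest.skipUnless` is used only to keep the example dependency-free (in pytest the usual equivalent is `pytest.mark.skipif(not HAVE_OQ, reason=...)`).

```python
import importlib.util
import unittest

# True only when the top-level `openquake` package can be found.
HAVE_OQ = importlib.util.find_spec("openquake") is not None


@unittest.skipUnless(HAVE_OQ, "openquake-engine not installed; run `uv sync --group oq-compat`")
class ExtractorCompatSketch(unittest.TestCase):
    def test_extractor_importable(self):
        # Imported lazily so that collecting this module never requires OQ.
        from openquake.calculators.extract import Extractor  # noqa: F401
```

Checking for the package with `find_spec` avoids importing OQ at collection time, so the default test run stays fast when the optional group is not installed.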
| 140 | + |
| 141 | +### Running |
| 142 | + |
| 143 | +```bash |
| 144 | +uv run tox -e oq-compat |
| 145 | +``` |
| 146 | + |
| 147 | +Or without tox: |
| 148 | + |
| 149 | +```bash |
| 150 | +uv sync --group oq-compat |
| 151 | +uv run pytest tests/oq_import/test_extractor_compat.py -v |
| 152 | +``` |
| 153 | + |
| 154 | +### When to run |
| 155 | + |
| 156 | +- After any change to `toshi_hazard_store/oq_import/h5py_reader.py`. |
| 157 | +- After bumping the pinned OQ version in `[dependency-groups] oq-compat` (`pyproject.toml`) — confirms our reader still matches the new reference. |
| 158 | +- Before releasing changes that touch `extract_classical_hdf5.py` or `extract_disagg_hdf5.py`. |
| 159 | + |
| 160 | +A failure pinpoints the exact field that drifted; fix the reader (not the test) unless the OQ Extractor behaviour itself has changed. |