Skip to content

Commit 5f56287

Browse files
Merge pull request #169 from GNS-Science/fix/168-openquake-extractor
replace Extractor with h5py reader refactor disagg extraction to mirror hazard curve extraction
2 parents 89273f9 + becf502 commit 5f56287

91 files changed

Lines changed: 63032 additions & 754 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,16 @@
11
# Changelog
22

3+
## [2.1.0] 2026-05-15
4+
### Added
5+
- `OqHdf5Reader` to replace OpenQuake `Extractor`
6+
- Test against hdf5 fixtures generated by range of OpenQuake versions
7+
- `scripts/regen_oq_fixtures.py` to (re)generate test fixtures
8+
- `tests/oq_import/test_extractor_compat.py` to test against OpenQuake `Extractor`
9+
- Documentation
10+
11+
### Changed
12+
- `oq-compat` dependency group used for running `tests/oq_import/test_extractor_compat.py` (used to be `openquake` extra dependency)
13+
314
## [2.0.1] 2026-05-11
415
### Added
516
- hatch-vcs for dynamic versioning from git tag (v*)

docs/h5py_extractor_migration.md

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,160 @@
1+
# OpenQuake h5py Extractor
2+
3+
We use our own extractor to obtain data from OpenQuake hdf5 files, rather than using the `openquake.calculators.extract.Extractor`
4+
5+
`openquake.calculators.extract.Extractor` drifts between OQ minor versions. The version matrix test showed that OQ 3.20–3.23 collapses the poe dimension for disaggregations if there is only one poe (which is typical for our use) while OQ 3.24+ includes all degenerate dimensions. Because the Extractor API is unstable, any new OQ minor release could silently break the extraction code.
6+
7+
The OQ HDF5 **file** layout is more stable. We replaced the Extractor with direct h5py reads, dropped `openquake-engine` as a dependency entirely, and validated the new reader against a fixture matrix generated by docker-based OQ runs across seven minor releases.
8+
9+
## `OqHdf5Reader` class
10+
11+
`toshi_hazard_store/oq_import/h5py_reader.py` — an `OqHdf5Reader` class that wraps an HDF5 file and exposes exactly the data the extraction code needs:
12+
13+
| Method | HDF5 path(s) | Notes |
14+
|---|---|---|
15+
| `oqparam()` | `oqparam[()]` | JSON blob decoded to dict |
16+
| `sitecol()` | `sitecol/{sids,lat,lon,...}` | Parallel 1-D arrays → DataFrame |
17+
| `hcurves_rlzs()` | `hcurves-rlzs` | Returns `{rlz-N: arr(sites, imts, levels)}` |
18+
| `gsim_branches()` | `full_lt/gsim_lt` | `{branch_id: uncertainty_str}` |
19+
| `source_branches()` | `full_lt/source_model_lt` | `{idx: sm_lt_path_str}` (values used as source_map keys) |
20+
| `realizations()` | `full_lt/sm_data` + `full_lt/gsim_lt` | List of `_RlzRecord(source_path, gsim_path, ordinal)` |
21+
| `disagg_rlzs(kind, ...)` | `disagg-rlzs/<kind>`, `disagg-bins/*`, `best_rlzs` | Returns a `_DisaggExtract` proxy |
22+
23+
The class is tested against fixtures generated by OQ 3.19.1–3.25.1. It is also tested against the 3.25.1 `openquake.calculators.extract.Extractor` for structural and numerical equivalence.
24+
25+
## HDF5 layout reference
26+
27+
### `oqparam`
28+
29+
Scalar dataset holding a UTF-8 JSON blob of the full `OqParam` dict. Read with:
30+
```python
31+
cfg = json.loads(f['oqparam'][()].decode())
32+
```
33+
34+
Relevant keys: `calculation_mode`, `hazard_imtls`, `iml_disagg`, `disagg_outputs`. Cross-version alias: some older OQ versions may use `intensity_measure_types_and_levels` instead of `hazard_imtls`; the reader resolves this transparently.
35+
36+
### `sitecol/`
37+
38+
Parallel 1-D datasets per field: `sids`, `lon`, `lat`, `depth`, `vs30`, `vs30measured`, `z1pt0`, `z2pt5`, `backarc`. N rows = number of sites.
39+
40+
### `full_lt/`
41+
42+
| Dataset | dtype | Content |
43+
|---|---|---|
44+
| `gsim_lt` | compound `(trt, branch, uncertainty, weight)` | One row per GMM branch. `uncertainty` is the raw `[ClassName]\nparam=val` GSIM string as bytes. Both the raw and OQ-normalised form produce identical nzshm_model hash digests. |
45+
| `source_model_lt` | compound `(branchset, branch, utype, uvalue, weight)` | `branch` column = sm_lt_path string (e.g. `[dmgeologic, ...]`). |
46+
| `sm_data` | compound `(name, weight, path, samples)` | `path` = sm_lt_path; `samples` = number of realizations for this source model. |
47+
48+
Realizations are reconstructed by iterating `sm_data` and for each source model, iterating the next `samples` rows of `gsim_lt` in declaration order. This matches OQ's enumeration for `number_of_logic_tree_samples = 0`.
49+
50+
### `hcurves-rlzs`
51+
52+
Shape `(n_sites, n_rlz, n_imts, n_levels)`. Carries a `json` attribute whose `shape_descr` lists axis names; the `imt` key gives ordered IMT names.
53+
54+
### `disagg-rlzs/<kind>`
55+
56+
Shape `(n_sites, <kind_axes>, n_imt, n_poe, n_rlz)` where `<kind_axes>` expands the underscore-separated kind name (e.g. `TRT_Mag_Dist_Eps``trt, mag, dist, eps`). No `json` attribute — axes are inferred from the kind string.
57+
58+
`disagg-bins/{Axis}` contains bin edges (numeric axes) or labels (TRT). The reader computes bin centres as `(edges[:-1] + edges[1:]) / 2`.
59+
60+
The rlz axis ordering follows `best_rlzs[site_idx]` — an integer array giving the ordinal of each rlz in the disagg result in descending-weight order.
61+
62+
## Cross-version fixture matrix
63+
64+
### Generating fixtures
65+
66+
A developer may want to generate new test fixtures when either new functionality is added to `OqHdf5Reader` that reads features not present in the existing fixtures or they want to support new versions of OpenQuake. These fixtures are then used to make sure that `OqHdf5Reader` continues to behave as expected via tests in `tests/oq_import/test_cross_version_fixtures.py` and `tests/oq_import/test_extractor_snapshot_cross_version.py`.
67+
68+
Prerequisites: Docker installed and `openquake/engine:<ver>` images pullable.
69+
70+
OQ job inputs live in `scripts/oq_input/` (committed):
71+
72+
```
73+
scripts/oq_input/
74+
sources/ ← shared NSHM source model
75+
gsim_model.xml ← shared GSIM logic tree
76+
job_classical.ini
77+
job_disagg.ini
78+
sites_classical.csv
79+
sites_disagg.csv
80+
```
81+
82+
Both calc modes mount this directory as `/job` inside the container and run the appropriate ini file. `export_dir = /tmp` is set in both ini files so OQ can write CSV exports to `/tmp` without touching the read-only mount.
83+
84+
```bash
85+
uv run python scripts/regen_oq_fixtures.py --mode both
86+
```
87+
88+
This will:
89+
1. Pull `openquake/engine:<ver>` for each version in `OQ_VERSIONS`.
90+
2. Detect the image entrypoint (older images use `/bin/bash -c`; newer use `./oq-start.sh`) and build the docker CMD accordingly.
91+
3. Run `oq engine --run /job/job_{classical,disagg}.ini` inside the container.
92+
4. Use `docker cp` (host-side) to pull the resulting `calc_*.hdf5` out of the stopped container — avoids all container-side write-permission issues.
93+
5. Write `tests/fixtures/oq_cross_version/{classical,disaggregation}/oq_<ver>/calc.hdf5` alongside a `manifest.json` recording the image digest, generation timestamp and file checksum.
94+
6. Skip any (version, mode) pair whose `manifest.json` already exists and whose `hdf5_sha256` still matches.
95+
96+
Flags:
97+
- `--version 3.25.1` — regenerate a single version
98+
- `--mode classical|disaggregation|both`
99+
- `--force` — overwrite existing fixtures
100+
- `--dry-run` — print docker commands without running them
101+
102+
### Extractor snapshots
103+
104+
Each fixture directory also contains two pre-baked snapshot files captured from the canonical OQ `Extractor` running inside the same Docker image that produced `calc.hdf5`:
105+
106+
| File | Contents |
107+
|---|---|
108+
| `extractor_snapshot.npz` | Numpy arrays: `sitecol__lat/lon/vs30`, per-rlz `hcurves_rlzs__rlz_NNN` (classical), `disagg__array` (disagg). Load with `np.load(..., allow_pickle=False)`. |
109+
| `extractor_snapshot.json` | Non-array metadata: `oqparam_json`, `realizations`, `hcurves_rlzs_keys`, disagg `kind`/`imt`/`shape_descr`/`rlz_labels`/`disagg_bins`. |
110+
111+
The snapshot is the within-version numerical ground truth consumed by `tests/oq_import/test_extractor_snapshot_cross_version.py` — no host-side OQ install needed at test time. `manifest.json` records `extractor_snapshot_npz_sha256` and `extractor_snapshot_json_sha256` for integrity checking.
112+
113+
Snapshots are generated automatically by `regen_oq_fixtures.py` (a second `docker run` step after the OQ calculation). If snapshots are missing (e.g. for fixtures created before this feature), the corresponding tests skip with an actionable message.
114+
115+
### Adding a new OQ version
116+
117+
1. Append the version string to `OQ_VERSIONS` in `scripts/regen_oq_fixtures.py`.
118+
2. Run `uv run python scripts/regen_oq_fixtures.py --mode both --version <new_ver>`.
119+
3. Commit the new `calc.hdf5`, `extractor_snapshot.npz`, `extractor_snapshot.json`, and `manifest.json`.
120+
4. Run `uv run pytest tests/oq_import/test_cross_version_fixtures.py tests/oq_import/test_extractor_snapshot_cross_version.py -v` — new tests are auto-discovered from the fixture directory.
121+
122+
### Inspecting a fixture by hand
123+
124+
```bash
125+
uv run python -c "
126+
import h5py, json
127+
with h5py.File('tests/fixtures/oq_cross_version/disaggregation/oq_3.25.1/calc.hdf5') as f:
128+
f.visit(print)
129+
print(json.loads(f['oqparam'][()].decode())['calculation_mode'])
130+
"
131+
```
132+
133+
## Compatibility testing
134+
135+
Two complementary suites compare `OqHdf5Reader` against the canonical OQ `Extractor`:
136+
137+
**`tests/oq_import/test_extractor_compat.py`** — runs `OqHdf5Reader` and `openquake.calculators.extract.Extractor` live, side-by-side on the committed classical and disagg fixtures, asserting numerical and structural identity for every field including `bins_digest` and end-to-end RecordBatch output. Opt-in because it pulls `openquake-engine==3.25.1` (~200 MB); normal `uv run pytest` skips all tests via a `HAVE_OQ` guard.
138+
139+
**`tests/oq_import/test_extractor_snapshot_cross_version.py`** — compares `OqHdf5Reader` against the pre-baked Extractor snapshots for all seven OQ versions. No host-side OQ install needed; runs in normal `uv run pytest`. Covers `oqparam`, `sitecol`, `realizations`, `hcurves_rlzs` (classical), and `disagg_rlzs` (disaggregation) for each version. Tests skip gracefully if a snapshot is absent.
140+
141+
### Running
142+
143+
```bash
144+
uv run tox -e oq-compat
145+
```
146+
147+
Or without tox:
148+
149+
```bash
150+
uv sync --group oq-compat
151+
uv run pytest tests/oq_import/test_extractor_compat.py -v
152+
```
153+
154+
### When to run
155+
156+
- After any change to `toshi_hazard_store/oq_import/h5py_reader.py`.
157+
- After bumping the pinned OQ version in `[dependency-groups] oq-compat` (`pyproject.toml`) — confirms our reader still matches the new reference.
158+
- Before releasing changes that touch `extract_classical_hdf5.py` or `extract_disagg_hdf5.py`.
159+
160+
A failure pinpoints the exact field that drifted; fix the reader (not the test) unless the OQ Extractor behaviour itself has changed.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ nav:
1313
- Configuration: configuration.md
1414
- Usage:
1515
- Usage: usage.md
16+
- OpenQuake Compatibility: h5py_extractor_migration.md
1617
- Data Models:
1718
- Hazard Dataset overview: cli/hazard_dataset_overview.md
1819
- Hazard Metadata: domain_model/hazard_metadata.md

pyproject.toml

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ classifiers=[
2222
requires-python = ">=3.10,<4.0"
2323
dependencies = [
2424
"geopandas>=1.1.2",
25+
"h5py>=3.10",
2526
"lancedb>0.27.10",
2627
"matplotlib>=3.10.8",
2728
"nshm-toshi-client>=1.0.1",
@@ -49,20 +50,20 @@ ths_rlz_import = 'toshi_hazard_store.scripts.ths_rlz_import:main'
4950
ths_rlz_sanity = 'toshi_hazard_store.scripts.ths_rlz_sanity:main'
5051

5152
[project.optional-dependencies]
52-
openquake = [
53-
"fiona>=1.9.5",
54-
"networkx>=3.2.1",
55-
"numba",
56-
"pydantic-to-pyarrow>=0.1.6",
57-
"openquake-engine>=3.20.1",
58-
]
5953
scripts = [
6054
"boto3",
6155
"click>=8.1.3",
6256
"click-plugins>=1.1.1",
6357
]
6458

6559
[dependency-groups]
60+
oq-compat = [
61+
"fiona>=1.9.5",
62+
"networkx>=3.2.1",
63+
"numba",
64+
"openquake-engine==3.25.1",
65+
]
66+
6667
doc = [
6768
"boto3",
6869
"click>=8.1.3",

scripts/bench_sitecol_loop.py

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
"""Bench the sitecol → CodedLocation loop in extract_classical_hdf5.
2+
3+
Times the per-row pandas access pattern in isolation.
4+
5+
Usage:
6+
uv run python scripts/bench_sitecol_loop.py [hdf5_path]
7+
"""
8+
9+
import sys
10+
import timeit
11+
from pathlib import Path
12+
13+
from nzshm_common.location import coded_location
14+
15+
from toshi_hazard_store.oq_import.h5py_reader import OqHdf5Reader
16+
17+
# DEFAULT_HDF5 = (
18+
# Path(__file__).parent.parent
19+
# / 'tests/fixtures/oq_import/openquake_hdf5_archive-T3BlbnF1YWtlSGF6YXJkVGFzazo2OTMxODkz/calc_1.hdf5'
20+
# )
21+
DEFAULT_HDF5 = Path("/home/chrisdc/tmp/calc_1.hdf5")
22+
23+
24+
def under_test(df0):
25+
"""Mirrors extract_classical_hdf5.generate_rlz_record_batches site-indexing block."""
26+
lats = df0['lat'].tolist()
27+
lons = df0['lon'].tolist()
28+
site_vs30s = df0['vs30'].tolist()
29+
nloc_001_locations = [
30+
coded_location.CodedLocation(lat=lat, lon=lon, resolution=0.001) for lat, lon in zip(lats, lons)
31+
]
32+
return nloc_001_locations, site_vs30s
33+
34+
35+
def main():
36+
path = Path(sys.argv[1]) if len(sys.argv) > 1 else DEFAULT_HDF5
37+
df0 = OqHdf5Reader(str(path)).sitecol()
38+
n_sites = df0.shape[0]
39+
40+
timer = timeit.Timer('under_test(df0)', globals={'under_test': under_test, 'df0': df0})
41+
n_loops, _ = timer.autorange() # picks n_loops so each repeat takes ≥ 0.2s
42+
samples = timer.repeat(repeat=5, number=n_loops)
43+
best = min(samples) / n_loops # min is the timeit convention — least scheduler noise
44+
45+
print(f'fixture: {path}')
46+
print(f'n_sites={n_sites} n_loops={n_loops} repeats=5')
47+
print(f'best: {best * 1000:.3f} ms/call per-row: {best / n_sites * 1e6:.2f} µs')
48+
print(f'all repeats (ms/call): {[round(s / n_loops * 1000, 3) for s in samples]}')
49+
50+
51+
if __name__ == '__main__':
52+
main()
53+
54+
# Old implimentation
55+
# fixture: /home/chrisdc/tmp/calc_1.hdf5
56+
# n_sites=3991 n_loops=1 repeats=5
57+
# best: 528.994 ms/call per-row: 132.55 µs
58+
# all repeats (ms/call): [530.03, 528.994, 531.495, 529.58, 532.193]

0 commit comments

Comments
 (0)