HDF5 support for resampled opacity DB #399

Merged
natashabatalha merged 8 commits into natashabatalha:dev from Nicholaswogan:hdf5
Apr 23, 2026

@Nicholaswogan
Collaborator

Summary

This PR adds HDF5 support for resampled opacity databases as an alternative to SQLite. It overlaps with PR #360.

The HDF5 files can store opacities in two different formats: log10_uint16 and log10_float32.

The log10_uint16 format stores log10 opacities as unsigned 16-bit integers, as described in #397. With lzf + shuffle compression, this reduces resampled opacity file sizes by a factor of 10. In the core tests (reflected + transmission + thermal), spectra computed with log10_uint16 agree with those from a SQLite opacity DB to within 0.04%. The log10_uint16 format with compression is also as fast as or faster than SQLite (see the benchmark below).

The log10_float32 format stores log10 of the opacity as a float32, which preserves the opacity more accurately than log10_uint16.
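To make the log10_uint16 idea concrete, here is a minimal sketch of one way to quantize log10 opacities into 16-bit codes and recover them. It is an illustration only: the function names, the linear mapping, and the `log_floor` / `log_max` bounds are assumptions made for this example, not the encoding defined in #397 or implemented in this PR.

```python
import numpy as np

def encode_log10_uint16(opacity, log_floor=-50.0, log_max=5.0):
    """Map opacities to uint16 codes, linear in log10 space (hypothetical scheme)."""
    # Floor the opacity first so zeros do not produce -inf in log10.
    log_op = np.log10(np.maximum(opacity, 10.0 ** log_floor))
    log_op = np.clip(log_op, log_floor, log_max)
    scale = (log_max - log_floor) / 65535.0  # dex per code step
    codes = np.rint((log_op - log_floor) / scale).astype(np.uint16)
    return codes, scale

def decode_log10_uint16(codes, scale, log_floor=-50.0):
    """Invert encode_log10_uint16 and return linear opacities."""
    return 10.0 ** (log_floor + codes.astype(np.float64) * scale)

# Made-up opacity values, including a zero that falls onto the floor.
opacity = np.array([0.0, 1e-30, 1e-22, 3.3e-20, 1.0])
codes, scale = encode_log10_uint16(opacity)
decoded = decode_log10_uint16(codes, scale)

# Round-trip error for the values above the floor.
rel_err = np.abs(decoded[1:] - opacity[1:]) / opacity[1:]
```

With these assumed bounds, one code step is about 8.4e-4 dex, so the round-trip error in a single opacity value is at most roughly 0.1%. That number depends entirely on the bounds chosen here and is not the accuracy figure reported for the spectra.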

What changed

  • Added RetrieveOpacitiesHDF5 and wired opannection(...) to dispatch to it automatically for .h5 and .hdf5 files.
  • Added HDF5 resampled-opacity support for both nearest and linear query modes.
  • Added support for HDF5 opacity storage formats:
    • log10_uint16
    • log10_float32
  • Reworked the HDF5 reader to use dense row blocks, reusable output buffers, and compiled numeric kernels instead of the old string-keyed hot path.
  • Added convert_sqlite_to_hdf5(...) to picaso/opacity_factory.py so users can convert a resampled SQLite opacity database directly into the HDF5 layout expected by PICASO.
  • Added configurable molecular and continuum log floors, chunk sizing, and progress reporting to the conversion helper.
  • Tightened SQLite query-parameter handling in optics.py so NumPy scalar values are converted to plain Python types before parameter binding.
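The last item above can be illustrated with a short sketch of converting NumPy scalars to plain Python types before binding them as SQLite parameters. The helper name `to_python_scalar`, the table, and its columns are hypothetical and chosen for this example; the actual change in optics.py may look different.

```python
import sqlite3
import numpy as np

def to_python_scalar(value):
    """Return a plain Python value for NumPy scalars; pass anything else through."""
    if isinstance(value, np.generic):
        return value.item()
    return value

# Tiny in-memory table standing in for an opacity grid (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grid (pressure_id INTEGER, temperature REAL)")
conn.execute("INSERT INTO grid VALUES (?, ?)", (3, 500.0))

# NumPy scalars, as they might come out of an index or grid array.
p_id = np.int64(3)
temp = np.float32(500.0)

# Convert before binding so sqlite3 only ever sees int and float.
params = tuple(to_python_scalar(v) for v in (p_id, temp))
rows = conn.execute(
    "SELECT pressure_id, temperature FROM grid "
    "WHERE pressure_id = ? AND temperature = ?",
    params,
).fetchall()
conn.close()
```

Converting at the boundary keeps the query code independent of whichever NumPy dtype the caller happens to use.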

Validation

I have attached a test that I ran on my local machine. I ran thermal + transmission + reflected light calculations of an Earth-like atmosphere with both a SQLite opacity DB and an HDF5 opacity DB that uses log10_uint16 storage with lzf + shuffle compression. The results of the test are pasted below.

For query_method = nearest, the relative difference between spectra is typically ~1e-6 or smaller, with a maximum of ~4e-4. Steady-state runtimes for HDF5 are faster in all cases.
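For readers who want to reproduce statistics of this kind, the following is a minimal sketch of how relative-difference percentiles between two spectra could be computed. The function name and the exact definition of the relative difference (absolute difference divided by the reference) are assumptions; the attached test.py may define them differently.

```python
import numpy as np

def rel_diff_percentiles(reference, candidate, percentiles=(50, 75, 90, 99, 100)):
    """Percentiles of |candidate - reference| / |reference| (assumed definition)."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    rel = np.abs(candidate - reference) / np.abs(reference)
    values = np.percentile(rel, percentiles)
    return {p: float(v) for p, v in zip(percentiles, values)}

# Made-up spectra: the candidate deviates from the reference by small known factors.
reference = np.array([1.0, 2.0, 4.0])
candidate = reference * (1.0 + np.array([0.0, 1e-6, 1e-4]))

stats = rel_diff_percentiles(reference, candidate)
for p, v in stats.items():
    print(f"p{p} rel diff: {v:.1e}")
```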

For query_method = linear, the relative difference is large because of a bug in the SQLite indexing, as described in #398. A separate PR should fix this SQLite issue, because "fixing" the bug might break existing SQLite DBs, so choices must be made. When I fix the bug for my test opacity DB, I get good agreement between the HDF5 and SQLite paths for query_method = linear. HDF5 runtimes are faster because I implemented numba-based linear interpolation.
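As a rough illustration of what linear querying involves, here is a sketch that interpolates log10 opacity between the two grid temperatures that bracket a query temperature. It is a simplified, one-dimensional NumPy version written for this example. The function name, the choice of temperature as the interpolation axis, and the clamping at the grid edges are all assumptions, and the PR's numba kernels are not reproduced here.

```python
import numpy as np

def interp_log_opacity(t_grid, log_kappa_grid, t_query):
    """Linearly interpolate log10 opacity rows between bracketing temperatures.

    t_grid:         sorted 1D array of grid temperatures, shape (nT,)
    log_kappa_grid: log10 opacities, shape (nT, nwave)
    t_query:        temperature to interpolate to
    """
    t_grid = np.asarray(t_grid, dtype=float)
    log_kappa_grid = np.asarray(log_kappa_grid, dtype=float)
    # Index of the upper bracketing grid point, kept inside the grid.
    hi = int(np.clip(np.searchsorted(t_grid, t_query), 1, len(t_grid) - 1))
    lo = hi - 1
    w = (t_query - t_grid[lo]) / (t_grid[hi] - t_grid[lo])
    w = min(max(w, 0.0), 1.0)  # clamp so queries outside the grid do not extrapolate
    return (1.0 - w) * log_kappa_grid[lo] + w * log_kappa_grid[hi]

# Made-up grid: three temperatures, two wavelength points.
t_grid = np.array([100.0, 200.0, 400.0])
log_kappa_grid = np.array([[-20.0, -21.0],
                           [-22.0, -23.0],
                           [-26.0, -25.0]])

mid = interp_log_opacity(t_grid, log_kappa_grid, 300.0)    # halfway between rows 1 and 2
below = interp_log_opacity(t_grid, log_kappa_grid, 50.0)   # clamped to the first row
on_node = interp_log_opacity(t_grid, log_kappa_grid, 200.0)
```

Interpolating in log space rather than in linear opacity is a common choice for quantities spanning many orders of magnitude; whether the PR does exactly this is not stated in the description.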

test.py

query_method = nearest
calculation = transmission; wave_range = [1.0, 5.0]
Case SQLite warmup runtime: 0.33 s
Case SQLite steady-state runtime: 0.37 s
Case HDF5_I16 warmup runtime: 0.31 s
Case HDF5_I16 steady-state runtime: 0.28 s
HDF5 uint16 vs SQLite rprs2 relative difference stats
  p50 rel diff:  1.0e-08
  p75 rel diff:  1.9e-08
  p90 rel diff:  3.1e-08
  p99 rel diff:  7.4e-08
  p100 rel diff: 1.4e-07
calculation = reflected; wave_range = [0.2, 2.0]
Case SQLite warmup runtime: 2.19 s
Case SQLite steady-state runtime: 2.25 s
Case HDF5_I16 warmup runtime: 2.20 s
Case HDF5_I16 steady-state runtime: 2.11 s
HDF5 uint16 vs SQLite albedo relative difference stats
  p50 rel diff:  2.6e-06
  p75 rel diff:  9.4e-06
  p90 rel diff:  2.0e-05
  p99 rel diff:  1.2e-04
  p100 rel diff: 4.1e-04
calculation = thermal; wave_range = [3.0, 20.0]
Case SQLite warmup runtime: 0.86 s
Case SQLite steady-state runtime: 0.83 s
Case HDF5_I16 warmup runtime: 0.86 s
Case HDF5_I16 steady-state runtime: 0.80 s
HDF5 uint16 vs SQLite planet flux relative difference stats
  p50 rel diff:  2.1e-06
  p75 rel diff:  4.8e-06
  p90 rel diff:  8.7e-06
  p99 rel diff:  1.7e-05
  p100 rel diff: 2.6e-05

query_method = linear
calculation = transmission; wave_range = [1.0, 5.0]
Case SQLite warmup runtime: 0.73 s
Case SQLite steady-state runtime: 0.78 s
Case HDF5_I16 warmup runtime: 0.49 s
Case HDF5_I16 steady-state runtime: 0.45 s
HDF5 uint16 vs SQLite rprs2 relative difference stats
  p50 rel diff:  3.8e-04
  p75 rel diff:  6.8e-04
  p90 rel diff:  1.0e-03
  p99 rel diff:  1.3e-03
  p100 rel diff: 1.4e-03
calculation = reflected; wave_range = [0.2, 2.0]
Case SQLite warmup runtime: 2.80 s
Case SQLite steady-state runtime: 2.82 s
Case HDF5_I16 warmup runtime: 2.53 s
Case HDF5_I16 steady-state runtime: 2.36 s
HDF5 uint16 vs SQLite albedo relative difference stats
  p50 rel diff:  2.3e-04
  p75 rel diff:  6.4e-02
  p90 rel diff:  3.8e-01
  p99 rel diff:  1.8e+00
  p100 rel diff: 4.8e+00
calculation = thermal; wave_range = [3.0, 20.0]
Case SQLite warmup runtime: 1.41 s
Case SQLite steady-state runtime: 1.53 s
Case HDF5_I16 warmup runtime: 1.05 s
Case HDF5_I16 steady-state runtime: 1.01 s
HDF5 uint16 vs SQLite planet flux relative difference stats
  p50 rel diff:  1.0e-01
  p75 rel diff:  2.5e-01
  p90 rel diff:  3.3e-01
  p99 rel diff:  1.1e+00
  p100 rel diff: 1.8e+00

@natashabatalha natashabatalha merged commit 2399fae into natashabatalha:dev Apr 23, 2026