OpenQuake h5py Extractor

We use our own extractor to read data from OpenQuake (OQ) HDF5 files rather than openquake.calculators.extract.Extractor.

The output of openquake.calculators.extract.Extractor drifts between OQ minor versions. Our version matrix test showed that OQ 3.20–3.23 collapses the poe (probability of exceedance) dimension of disaggregation results when there is only one poe, which is typical for our use, whereas OQ 3.24+ keeps all degenerate dimensions. Because the Extractor API is unstable, any new OQ minor release could silently break the extraction code.

The OQ HDF5 file layout is more stable. We replaced the Extractor with direct h5py reads, dropped openquake-engine as a dependency entirely, and validated the new reader against a fixture matrix generated by docker-based OQ runs across seven minor releases.

OqHdf5Reader class

toshi_hazard_store/oq_import/h5py_reader.py defines OqHdf5Reader, a class that wraps an HDF5 file and exposes exactly the data the extraction code needs:

| Method | HDF5 path(s) | Notes |
|---|---|---|
| oqparam() | oqparam[()] | JSON blob decoded to dict |
| sitecol() | sitecol/{sids,lat,lon,...} | Parallel 1-D arrays → DataFrame |
| hcurves_rlzs() | hcurves-rlzs | Returns {rlz-N: arr(sites, imts, levels)} |
| gsim_branches() | full_lt/gsim_lt | {branch_id: uncertainty_str} |
| source_branches() | full_lt/source_model_lt | {idx: sm_lt_path_str} (values used as source_map keys) |
| realizations() | full_lt/sm_data + full_lt/gsim_lt | List of _RlzRecord(source_path, gsim_path, ordinal) |
| disagg_rlzs(kind, ...) | disagg-rlzs/<kind>, disagg-bins/*, best_rlzs | Returns a _DisaggExtract proxy |

The class is tested against fixtures generated by OQ 3.19.1–3.25.1. It is also tested against the 3.25.1 openquake.calculators.extract.Extractor for structural and numerical equivalence.

HDF5 layout reference

oqparam

Scalar dataset holding a UTF-8 JSON blob of the full OqParam dict. Read with:

cfg = json.loads(f['oqparam'][()].decode())

Relevant keys: calculation_mode, hazard_imtls, iml_disagg, disagg_outputs. Cross-version alias: some older OQ versions may use intensity_measure_types_and_levels instead of hazard_imtls; the reader resolves this transparently.
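The alias handling described above can be sketched as follows. This is a minimal illustration, not the reader's actual code: decode_oqparam and resolve_imtls are hypothetical helper names, and the lookup order (hazard_imtls first, then the older key) is an assumption.

```python
import json

# Keys tried in order; the second is the older alias mentioned above.
# The ordering is an assumption made for this sketch.
_IMTL_KEYS = ("hazard_imtls", "intensity_measure_types_and_levels")


def decode_oqparam(raw: bytes) -> dict:
    """Decode the UTF-8 JSON blob read from the scalar oqparam dataset."""
    return json.loads(raw.decode("utf-8"))


def resolve_imtls(cfg: dict):
    """Return the IMT levels entry under whichever key is present."""
    for key in _IMTL_KEYS:
        if key in cfg:
            return cfg[key]
    raise KeyError(f"none of {_IMTL_KEYS} found in oqparam")
```

With an open h5py file f, the raw bytes would come from f['oqparam'][()] as in the line above, so callers never need to know which key a given OQ version wrote.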

sitecol/

Parallel 1-D datasets per field: sids, lon, lat, depth, vs30, vs30measured, z1pt0, z2pt5, backarc. N rows = number of sites.

full_lt/

| Dataset | dtype | Content |
|---|---|---|
| gsim_lt | compound (trt, branch, uncertainty, weight) | One row per GMM branch. uncertainty is the raw [ClassName]\nparam=val GSIM string as bytes. Both the raw and OQ-normalised form produce identical nzshm_model hash digests. |
| source_model_lt | compound (branchset, branch, utype, uvalue, weight) | branch column = sm_lt_path string (e.g. [dmgeologic, ...]). |
| sm_data | compound (name, weight, path, samples) | path = sm_lt_path; samples = number of realizations for this source model. |

Realizations are reconstructed by iterating over sm_data and, for each source model, taking the next N rows of gsim_lt in declaration order, where N is that model's samples value. This matches OQ's enumeration for number_of_logic_tree_samples = 0 (full enumeration).
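A minimal sketch of that reconstruction, working on plain Python sequences rather than the HDF5 compound rows. RlzRecord mirrors the field names of _RlzRecord from the method table, but the function name, the simplified inputs, and the use of the gsim_lt branch id as gsim_path are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class RlzRecord(NamedTuple):
    source_path: str
    gsim_path: str
    ordinal: int


def enumerate_realizations(
    sm_data: Sequence[Tuple[str, int]],  # (path, samples) per source model
    gsim_branches: Sequence[str],        # gsim_lt branch ids, declaration order
) -> list:
    """Pair each source model with the next `samples` gsim_lt rows."""
    records = []
    cursor = 0  # position of the next unread gsim_lt row
    for source_path, samples in sm_data:
        for branch in gsim_branches[cursor:cursor + samples]:
            # The ordinal is simply the running count of realizations.
            records.append(RlzRecord(source_path, branch, len(records)))
        cursor += samples
    return records
```

For example, two source models with samples of 2 and 1 and three gsim_lt rows yield three records with ordinals 0, 1 and 2.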

hcurves-rlzs

Shape (n_sites, n_rlz, n_imts, n_levels). Carries a json attribute whose shape_descr lists axis names; the imt key gives ordered IMT names.
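Splitting this array into the per-realization mapping that hcurves_rlzs() returns might look like the sketch below. The function name is hypothetical, and the exact key formatting (rlz-N without zero padding) is an assumption based on the method table.

```python
import numpy as np


def split_hcurves_by_rlz(hcurves) -> dict:
    """Turn (n_sites, n_rlz, n_imts, n_levels) into {rlz-N: (sites, imts, levels)}."""
    arr = np.asarray(hcurves)
    if arr.ndim != 4:
        raise ValueError(f"expected a 4-D array, got shape {arr.shape}")
    # Axis 1 is the realization axis; indexing it drops that dimension.
    return {f"rlz-{i}": arr[:, i, :, :] for i in range(arr.shape[1])}
```

The input would typically be the full dataset read with f['hcurves-rlzs'][()] on an open h5py file f.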

disagg-rlzs/<kind>

Shape (n_sites, <kind_axes>, n_imt, n_poe, n_rlz) where <kind_axes> expands the underscore-separated kind name (e.g. TRT_Mag_Dist_Eps → trt, mag, dist, eps). No json attribute: the axes are inferred from the kind string.
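Inferring the axis names from the kind string can be sketched as below. The function name and the labels used for the fixed axes (site, imt, poe, rlz) are assumptions chosen for this illustration.

```python
def disagg_axes(kind: str) -> list:
    """Expand a disagg kind name into the full ordered list of axis names."""
    # e.g. "Mag_Dist" -> ["mag", "dist"]
    kind_axes = [part.lower() for part in kind.split("_")]
    # Fixed axes surround the kind-specific ones, matching the shape above.
    return ["site", *kind_axes, "imt", "poe", "rlz"]
```

A list built this way can stand in for the shape_descr that hcurves-rlzs carries in its json attribute.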

disagg-bins/{Axis} contains bin edges (numeric axes) or labels (TRT). The reader computes bin centres as (edges[:-1] + edges[1:]) / 2.
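As a small worked instance of the bin-centre formula (the function name and the input validation are illustrative additions):

```python
import numpy as np


def bin_centres(edges) -> np.ndarray:
    """Midpoints of consecutive bin edges: (edges[:-1] + edges[1:]) / 2."""
    edges = np.asarray(edges, dtype=float)
    if edges.ndim != 1 or edges.size < 2:
        raise ValueError("need a 1-D array with at least two edges")
    return (edges[:-1] + edges[1:]) / 2
```

N + 1 edges always produce N centres, so the result length matches the corresponding axis of the disagg array.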

The rlz axis ordering follows best_rlzs[site_idx] — an integer array giving the ordinal of each rlz in the disagg result in descending-weight order.
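One way to attach realization ordinals to the rlz axis is sketched below. It assumes the rlz axis is the last axis of the per-site array, as in the shape given above; the function name and the dict return type are hypothetical.

```python
import numpy as np


def disagg_by_ordinal(site_array, best_rlzs_row) -> dict:
    """Map each realization ordinal to its slice of one site's disagg array.

    site_array:    disagg values for one site, rlz as the last axis
    best_rlzs_row: best_rlzs[site_idx], ordinals in rlz-axis order
    """
    site_array = np.asarray(site_array)
    ordinals = [int(o) for o in best_rlzs_row]
    if site_array.shape[-1] != len(ordinals):
        raise ValueError("rlz axis length does not match best_rlzs row")
    # Position k on the rlz axis belongs to the realization ordinals[k].
    return {o: site_array[..., k] for k, o in enumerate(ordinals)}
```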

Cross-version fixture matrix

Generating fixtures

A developer may want to generate new test fixtures in two situations: when OqHdf5Reader gains functionality that reads features not present in the existing fixtures, or when support for a new OpenQuake version is needed. The tests in tests/oq_import/test_cross_version_fixtures.py and tests/oq_import/test_extractor_snapshot_cross_version.py then use these fixtures to check that OqHdf5Reader continues to behave as expected.

Prerequisites: Docker installed and openquake/engine:<ver> images pullable.

OQ job inputs live in scripts/oq_input/ (committed):

scripts/oq_input/
  sources/          ← shared NSHM source model
  gsim_model.xml    ← shared GSIM logic tree
  job_classical.ini
  job_disagg.ini
  sites_classical.csv
  sites_disagg.csv

Both calc modes mount this directory as /job inside the container and run the appropriate ini file. export_dir = /tmp is set in both ini files so OQ can write CSV exports to /tmp without touching the read-only mount.

uv run python scripts/regen_oq_fixtures.py --mode both

This will:

  1. Pull openquake/engine:<ver> for each version in OQ_VERSIONS.
  2. Detect the image entrypoint (older images use /bin/bash -c; newer ones use ./oq-start.sh) and build the docker CMD accordingly.
  3. Run oq engine --run /job/job_{classical,disagg}.ini inside the container.
  4. Use docker cp (host-side) to pull the resulting calc_*.hdf5 out of the stopped container, which avoids all container-side write-permission issues.
  5. Write tests/fixtures/oq_cross_version/{classical,disaggregation}/oq_<ver>/calc.hdf5 alongside a manifest.json recording the image digest, generation timestamp and file checksum.
  6. Skip any (version, mode) pair whose manifest.json already exists and whose hdf5_sha256 still matches.
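The skip check in the last step could be implemented along these lines. This is a sketch rather than the script's actual code: the function names are hypothetical, and only the manifest.json, calc.hdf5 and hdf5_sha256 names come from the description above.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large HDF5 files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixture_is_current(fixture_dir: Path) -> bool:
    """True when manifest.json exists and its hdf5_sha256 matches calc.hdf5."""
    manifest_path = fixture_dir / "manifest.json"
    hdf5_path = fixture_dir / "calc.hdf5"
    if not (manifest_path.is_file() and hdf5_path.is_file()):
        return False
    manifest = json.loads(manifest_path.read_text())
    return manifest.get("hdf5_sha256") == sha256_of(hdf5_path)
```

A regeneration loop would skip a (version, mode) pair when this returns True, unless --force is given.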

Flags:

  • --version 3.25.1: regenerate a single version
  • --mode classical|disaggregation|both
  • --force: overwrite existing fixtures
  • --dry-run: print docker commands without running them

Extractor snapshots

Each fixture directory also contains two pre-baked snapshot files captured from the canonical OQ Extractor running inside the same Docker image that produced calc.hdf5:

| File | Contents |
|---|---|
| extractor_snapshot.npz | Numpy arrays: sitecol__lat/lon/vs30, per-rlz hcurves_rlzs__rlz_NNN (classical), disagg__array (disagg). Load with np.load(..., allow_pickle=False). |
| extractor_snapshot.json | Non-array metadata: oqparam_json, realizations, hcurves_rlzs_keys, disagg kind/imt/shape_descr/rlz_labels/disagg_bins. |

The snapshot is the within-version numerical ground truth consumed by tests/oq_import/test_extractor_snapshot_cross_version.py — no host-side OQ install needed at test time. manifest.json records extractor_snapshot_npz_sha256 and extractor_snapshot_json_sha256 for integrity checking.

Snapshots are generated automatically by regen_oq_fixtures.py (a second docker run step after the OQ calculation). If snapshots are missing (e.g. for fixtures created before this feature), the corresponding tests skip with an actionable message.

Adding a new OQ version

  1. Append the version string to OQ_VERSIONS in scripts/regen_oq_fixtures.py.
  2. Run uv run python scripts/regen_oq_fixtures.py --mode both --version <new_ver>.
  3. Commit the new calc.hdf5, extractor_snapshot.npz, extractor_snapshot.json, and manifest.json.
  4. Run uv run pytest tests/oq_import/test_cross_version_fixtures.py tests/oq_import/test_extractor_snapshot_cross_version.py -v — new tests are auto-discovered from the fixture directory.
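Auto-discovery from the fixture directory can be sketched as below, assuming the oq_<ver>/calc.hdf5 layout described earlier. The function name and return shape are hypothetical and not taken from the test modules.

```python
from pathlib import Path


def discover_fixtures(root: Path, mode: str) -> list:
    """List (version, calc.hdf5 path) pairs found under root/<mode>/oq_*/."""
    mode_dir = root / mode
    if not mode_dir.is_dir():
        return []
    found = []
    # sorted() orders directory names lexically, not by version number.
    for fixture_dir in sorted(mode_dir.glob("oq_*")):
        hdf5_path = fixture_dir / "calc.hdf5"
        if hdf5_path.is_file():
            found.append((fixture_dir.name.removeprefix("oq_"), hdf5_path))
    return found
```

A list like this can be passed to pytest.mark.parametrize, so a newly committed oq_<ver> directory produces new test cases without any change to the test code.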

Inspecting a fixture by hand

uv run python -c "
import h5py, json
with h5py.File('tests/fixtures/oq_cross_version/disaggregation/oq_3.25.1/calc.hdf5') as f:
    f.visit(print)
    print(json.loads(f['oqparam'][()].decode())['calculation_mode'])
"

Compatibility testing

Two complementary suites compare OqHdf5Reader against the canonical OQ Extractor:

tests/oq_import/test_extractor_compat.py — runs OqHdf5Reader and openquake.calculators.extract.Extractor live, side-by-side on the committed classical and disagg fixtures, asserting numerical and structural identity for every field including bins_digest and end-to-end RecordBatch output. Opt-in because it pulls openquake-engine==3.25.1 (~200 MB); normal uv run pytest skips all tests via a HAVE_OQ guard.

tests/oq_import/test_extractor_snapshot_cross_version.py — compares OqHdf5Reader against the pre-baked Extractor snapshots for all seven OQ versions. No host-side OQ install needed; runs in normal uv run pytest. Covers oqparam, sitecol, realizations, hcurves_rlzs (classical), and disagg_rlzs (disaggregation) for each version. Tests skip gracefully if a snapshot is absent.
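A guard such as HAVE_OQ is commonly computed without importing the optional package. The sketch below shows one way to do that; it is an assumption about how such a guard could be written, not the contents of test_extractor_compat.py, and have_module is a hypothetical helper.

```python
import importlib.util


def have_module(name: str) -> bool:
    """Report whether a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


# True only when openquake-engine is installed in the current environment.
HAVE_OQ = have_module("openquake")
```

A test module could then set pytestmark = pytest.mark.skipif(not HAVE_OQ, reason="openquake-engine not installed") so that every test in it is skipped in a normal environment.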

Running

uv run tox -e oq-compat

Or without tox:

uv sync --group oq-compat
uv run pytest tests/oq_import/test_extractor_compat.py -v

When to run

  • After any change to toshi_hazard_store/oq_import/h5py_reader.py.
  • After bumping the pinned OQ version in [dependency-groups] oq-compat (pyproject.toml) — confirms our reader still matches the new reference.
  • Before releasing changes that touch extract_classical_hdf5.py or extract_disagg_hdf5.py.

A failure pinpoints the exact field that drifted; fix the reader (not the test) unless the OQ Extractor behaviour itself has changed.