Disaggregation

These models represent aggregated disaggregation arrays derived from OpenQuake PSHA engine outputs.

Disaggregation Aggregate

The core model for aggregated disaggregation data, stored as PyArrow parquet datasets.

Bases: BaseModel

Aggregated disaggregation arrays across realisations.

Attributes:

  • compatible_calc_id (str) - FK for hazard-calc equivalence.
  • hazard_model_id (str) - NSHM hazard model identifier e.g. "NSHM_v1.0.4" (caller-supplied).
  • bins_digest (str) - sha256[:16] over sorted axes + sorted bin centres (compatibility key).
  • nloc_001 (str) - location string at 0.001° resolution e.g. "-38.330~175.550".
  • nloc_0 (str) - location string at 1.0° resolution (used for partitioning).
  • vs30 (int) - VS30 value in m/s.
  • imt (str) - intensity measure type label e.g. "PGA", "SA(1.0)".
  • target_aggr (str) - hazard-curve aggregation the disagg was conditioned on e.g. "mean", "0.5".
  • probability (ProbabilityEnum) - ProbabilityEnum name supplied by caller e.g. "_10_PCT_IN_50YRS".
  • imtl (float) - intensity measure level (IML) at which the disagg was computed.
  • aggr (str) - aggregation type applied across realisations e.g. "mean", "0.1".
  • disagg_bins (dict[str, list[str]]) - ordered map {axis_name: [bin_centre_str, ...]}; key order defines the axis order of disagg_values, and values are stringified bin centres.
  • disagg_values (List[float]) - flattened disaggregation array over disagg_bins axes, C-order.

Source code in toshi_hazard_store/model/hazard_models_pydantic.py
class DisaggregationAggregate(BaseModel):
    """Aggregated disaggregation arrays across realisations.

    Attributes:
        compatible_calc_id: FK for hazard-calc equivalence.
        hazard_model_id: NSHM hazard model identifier e.g. "NSHM_v1.0.4" (caller-supplied).
        bins_digest: sha256[:16] over sorted axes + sorted bin centres (compatibility key).
        nloc_001: location string at 0.001° resolution e.g. "-38.330~175.550".
        nloc_0: location string at 1.0° resolution (used for partitioning).
        vs30: VS30 value in m/s.
        imt: intensity measure type label e.g. "PGA", "SA(1.0)".
        target_aggr: hazard-curve aggregation the disagg was conditioned on e.g. "mean", "0.5".
        probability: ProbabilityEnum name supplied by caller e.g. "_10_PCT_IN_50YRS".
        imtl: IML at which the disagg was computed.
        aggr: aggregation type applied across realisations e.g. "mean", "0.1".
        disagg_bins: ordered map ``{axis_name: [bin_centre_str, ...]}`` — key order
            defines the axis order of ``disagg_values``; values are stringified bin centres.
        disagg_values: flattened disaggregation array over ``disagg_bins`` axes, C-order.
    """

    compatible_calc_id: str
    hazard_model_id: str
    bins_digest: str
    nloc_001: str
    nloc_0: str
    vs30: int
    imt: str
    target_aggr: str
    probability: ProbabilityEnum
    imtl: float
    aggr: str
    disagg_bins: dict[str, list[str]]
    disagg_values: List[float]

    @field_serializer("probability")
    def serialize_probability(self, value: ProbabilityEnum) -> str:
        return value.name

    @field_validator("disagg_bins")
    @classmethod
    def validate_bins_nonempty(cls, value: dict) -> dict:
        if not value:
            raise ValueError("disagg_bins must not be empty")
        return value

    @model_validator(mode="after")
    def validate_values_shape(self) -> "DisaggregationAggregate":
        expected = prod(len(v) for v in self.disagg_bins.values())
        if len(self.disagg_values) != expected:
            raise ValueError(
                f"disagg_values length {len(self.disagg_values)} does not match product of bin sizes {expected}"
            )
        return self

    def to_ndarray(self):
        """Reshape disagg_values into an N-D array with axes ordered by disagg_bins keys."""
        from toshi_hazard_store.model.pyarrow.disagg_reshape import reshape_disagg_values

        return reshape_disagg_values(self.disagg_values, self.disagg_bins)

    @staticmethod
    def pyarrow_schema() -> pa.schema:
        """A pyarrow schema for aggregate disaggregation datasets."""
        return get_disagg_aggregate_schema()
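
The two validators above can be illustrated without the library. This is a minimal sketch, assuming only the rule visible in the source: disagg_bins must be non-empty, and the flat length must equal the product of the bin counts. check_values_shape is a hypothetical stand-alone helper, not part of toshi_hazard_store.

```python
from math import prod


def check_values_shape(disagg_bins: dict[str, list[str]], disagg_values: list[float]) -> int:
    """Return the expected flat length, or raise if the values do not match it.

    Hypothetical helper that mirrors the model's two validators.
    """
    if not disagg_bins:
        raise ValueError("disagg_bins must not be empty")
    # One factor per axis: the number of bin centres on that axis.
    expected = prod(len(centres) for centres in disagg_bins.values())
    if len(disagg_values) != expected:
        raise ValueError(
            f"disagg_values length {len(disagg_values)} does not match product of bin sizes {expected}"
        )
    return expected


bins = {"mag": ["5.5", "6.5", "7.5"], "dist": ["10.0", "50.0"]}
expected_length = check_values_shape(bins, [0.0] * 6)  # 3 mag bins x 2 dist bins
```

Passing five values instead of six for the same bins would raise ValueError, which is the behaviour to expect when constructing a DisaggregationAggregate with mismatched data.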

PyArrow Schema

The DisaggregationAggregate model can be converted to a PyArrow schema for dataset I/O:

from toshi_hazard_store.model.hazard_models_pydantic import DisaggregationAggregate
schema = DisaggregationAggregate.pyarrow_schema()

The schema includes:

  • compatible_calc_id (string) - Compatible calculation identifier
  • hazard_model_id (string) - Model identifier (e.g., "NSHM_v1.0.4")
  • bins_digest (string) - sha256[:16] compatibility key over sorted axes + bin centres
  • nloc_001 (string) - Location at 0.001° resolution (e.g., "-41.300~174.800")
  • nloc_0 (string) - Location at 1.0° resolution (e.g., "-41.0~174.0"), used for partitioning
  • vs30 (int32) - VS30 value in m/s
  • imt (string) - Intensity measure type (e.g., "PGA", "SA(1.0)")
  • target_aggr (string) - Hazard-curve aggregation the disagg was conditioned on (e.g., "mean")
  • probability (string) - Return-period probability as a ProbabilityEnum name (e.g., "_10_PCT_IN_50YRS")
  • imtl (float) - Intensity measure level at which the disagg was computed
  • aggr (string) - Aggregation type across realisations (e.g., "mean", "0.1")
  • disagg_bins (map of string → list of string) - Ordered map of axis name to bin-centre strings; key order defines the axis order of disagg_values
  • disagg_values (list of float32) - Flattened C-order disaggregation array over disagg_bins axes

Dataset Partitioning

Disaggregation aggregate datasets use Hive-style partitioning on bins_digest / vs30 / nloc_0:

<dataset_root>/
├── bins_digest=6028db096c3a9e62/
│   ├── vs30=400/
│   │   └── nloc_0=-41.0~174.0/
│   │       └── <uuid>-part-0.parquet
│   └── vs30=1500/
│       └── nloc_0=-41.0~174.0/
│           └── <uuid>-part-0.parquet

The bins_digest partition groups rows with identical bin topology, enabling efficient filtering when querying a specific disaggregation configuration. Use the d2 query strategy for large datasets to exploit all three partition levels. The bins_digest can be obtained from the disagg_bins with toshi_hazard_store.model.revision_4.extract_disagg_hdf5.compute_bins_digest.

Note that append_models_to_dataset does not enforce this partitioning. It is left to the user to choose the partitioning, either at write time or (more usually) later, when running ths_ds_defrag.
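
To make the directory naming concrete, the sketch below builds the Hive-style directory for a single partition. It only illustrates the key=value naming shown in the tree above; hive_partition_dir is a hypothetical helper, and in practice the dataset writer creates these directories.

```python
from pathlib import PurePosixPath


def hive_partition_dir(root: str, bins_digest: str, vs30: int, nloc_0: str) -> PurePosixPath:
    """Build the Hive-style directory for one (bins_digest, vs30, nloc_0) combination.

    Hypothetical helper: each partition level is a `key=value` path segment.
    """
    return PurePosixPath(root) / f"bins_digest={bins_digest}" / f"vs30={vs30}" / f"nloc_0={nloc_0}"


partition_dir = hive_partition_dir("dataset_root", "6028db096c3a9e62", 400, "-41.0~174.0")
print(partition_dir)
# dataset_root/bins_digest=6028db096c3a9e62/vs30=400/nloc_0=-41.0~174.0
```

A query that fixes all three values only needs to read the parquet files under this one directory.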

Reshaping disagg_values

disagg_values is stored as a flat list. Use to_ndarray() to reshape it into an N-D array with axes ordered by disagg_bins:

from toshi_hazard_store import query
from toshi_hazard_store.model.constraints import ProbabilityEnum

bins = {
    "mag": ["5.5", "6.5", "7.5"],
    "dist": ["10.0", "50.0", "100.0", "200.0"],
    "eps": ["-1.0", "0.0", "1.0"],
}

for disagg in query.get_disagg_aggregates(
    location_codes=["-41.300~174.800"],
    vs30s=[400],
    hazard_model="NSHM_v1.0.4",
    imts=["PGA"],
    aggs=["mean"],
    target_aggrs=["mean"],
    probabilities=[ProbabilityEnum._10_PCT_IN_50YRS],
    disagg_bins=bins,
    strategy="d2",
):
    arr = disagg.to_ndarray()   # shape: (3, 4, 3) for mag × dist × eps
    print(arr.shape)

The flat storage form is preserved as the canonical representation; reshaping is opt-in and allocates a numpy array on demand.
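
For readers who want to index the flat list directly, the sketch below shows how C-order (row-major) storage maps a per-axis position to an offset in disagg_values. It uses only the standard library and the bins from the example above; flat_index is a hypothetical helper, not a toshi_hazard_store function.

```python
bins = {
    "mag": ["5.5", "6.5", "7.5"],
    "dist": ["10.0", "50.0", "100.0", "200.0"],
    "eps": ["-1.0", "0.0", "1.0"],
}


def flat_index(disagg_bins: dict[str, list[str]], position: dict[str, int]) -> int:
    """Return the C-order offset for a per-axis position (last axis varies fastest)."""
    index = 0
    for axis, centres in disagg_bins.items():  # dict key order is the axis order
        index = index * len(centres) + position[axis]
    return index


shape = tuple(len(centres) for centres in bins.values())  # (3, 4, 3)

# mag=6.5 (index 1), dist=100.0 (index 2), eps=1.0 (index 2)
# offset = (1 * 4 + 2) * 3 + 2 = 20
offset = flat_index(bins, {"mag": 1, "dist": 2, "eps": 2})
```

With this layout, disagg_values[offset] is the same element as arr[1, 2, 2] in the reshaped array.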

Constraint Enums

Probability Enum

Bases: Enum

Defines the values available for probabilities.

Each value is stored as a float representing the equivalent probability in 1 year.

Source code in toshi_hazard_store/model/constraints.py
class ProbabilityEnum(Enum):
    """
    Defines the values available for probabilities.

    store values as float representing probability in 1 year
    """

    _86_PCT_IN_50YRS = 3.8559e-02
    _63_PCT_IN_50YRS = 1.9689e-02
    _39_PCT_IN_50YRS = 9.8372e-03
    _18_PCT_IN_50YRS = 3.9612e-03
    _10_PCT_IN_50YRS = 2.1050e-03
    _5_PCT_IN_50YRS = 1.0253e-03
    _2_PCT_IN_50YRS = 4.0397e-04
    _1_PCT_IN_50YRS = 2.0099e-04
    _05_PCT_IN_50YRS = 1.0025e-04
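
The listed values appear consistent with converting "P in 50 years" to a one-year probability via p = 1 - (1 - P)^(1/50). This relation is inferred from the numbers rather than stated in the source, and annual_probability below is a hypothetical helper used only to check it.

```python
import math


def annual_probability(poe: float, years: float = 50.0) -> float:
    """One-year probability equivalent to probability `poe` over `years` years.

    Assumes independent years, i.e. poe = 1 - (1 - p_annual) ** years.
    """
    return 1.0 - (1.0 - poe) ** (1.0 / years)


p_10_in_50 = annual_probability(0.10)  # compare with _10_PCT_IN_50YRS = 2.1050e-03
p_2_in_50 = annual_probability(0.02)   # compare with _2_PCT_IN_50YRS = 4.0397e-04

# Both agree with the enum values to the precision at which they are listed.
matches = (
    math.isclose(p_10_in_50, 2.1050e-03, rel_tol=1e-4)
    and math.isclose(p_2_in_50, 4.0397e-04, rel_tol=1e-4)
)
```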