Disaggregation¶

These models represent aggregated disaggregation arrays derived from OpenQuake PSHA engine outputs.

Disaggregation Aggregate¶

The core model for aggregated disaggregation data, stored as PyArrow parquet datasets.

Bases: BaseModel

Aggregated disaggregation arrays across realisations.

Attributes:

Name	Type	Description
`compatible_calc_id`	`str`	FK for hazard-calc equivalence.
`hazard_model_id`	`str`	NSHM hazard model identifier e.g. "NSHM_v1.0.4" (caller-supplied).
`bins_digest`	`str`	sha256[:16] over sorted axes + sorted bin centres (compatibility key).
`nloc_001`	`str`	location string at 0.001° resolution e.g. "-38.330~175.550".
`nloc_0`	`str`	location string at 1.0° resolution (used for partitioning).
`vs30`	`int`	VS30 value in m/s.
`imt`	`str`	intensity measure type label e.g. "PGA", "SA(1.0)".
`target_aggr`	`str`	hazard-curve aggregation the disagg was conditioned on e.g. "mean", "0.5".
`probability`	`ProbabilityEnum`	ProbabilityEnum name supplied by caller e.g. "_10_PCT_IN_50YRS".
`imtl`	`float`	IML at which the disagg was computed.
`aggr`	`str`	aggregation type applied across realisations e.g. "mean", "0.1".
`disagg_bins`	`dict[str, list[str]]`	ordered map `{axis_name: [bin_centre_str, ...]}` — key order defines the axis order of `disagg_values`; values are stringified bin centres.
`disagg_values`	`List[float]`	flattened disaggregation array over `disagg_bins` axes, C-order.

Source code in toshi_hazard_store/model/hazard_models_pydantic.py

class DisaggregationAggregate(BaseModel):
    """Aggregated disaggregation arrays across realisations.

    Attributes:
        compatible_calc_id: FK for hazard-calc equivalence.
        hazard_model_id: NSHM hazard model identifier e.g. "NSHM_v1.0.4" (caller-supplied).
        bins_digest: sha256[:16] over sorted axes + sorted bin centres (compatibility key).
        nloc_001: location string at 0.001° resolution e.g. "-38.330~175.550".
        nloc_0: location string at 1.0° resolution (used for partitioning).
        vs30: VS30 value in m/s.
        imt: intensity measure type label e.g. "PGA", "SA(1.0)".
        target_aggr: hazard-curve aggregation the disagg was conditioned on e.g. "mean", "0.5".
        probability: ProbabilityEnum name supplied by caller e.g. "_10_PCT_IN_50YRS".
        imtl: IML at which the disagg was computed.
        aggr: aggregation type applied across realisations e.g. "mean", "0.1".
        disagg_bins: ordered map ``{axis_name: [bin_centre_str, ...]}`` — key order
            defines the axis order of ``disagg_values``; values are stringified bin centres.
        disagg_values: flattened disaggregation array over ``disagg_bins`` axes, C-order.
    """

    compatible_calc_id: str
    hazard_model_id: str
    bins_digest: str
    nloc_001: str
    nloc_0: str
    vs30: int
    imt: str
    target_aggr: str
    probability: ProbabilityEnum
    imtl: float
    aggr: str
    disagg_bins: dict[str, list[str]]
    disagg_values: List[float]

    @field_serializer("probability")
    def serialize_probability(self, value: ProbabilityEnum) -> str:
        return value.name

    @field_validator("disagg_bins")
    @classmethod
    def validate_bins_nonempty(cls, value: dict) -> dict:
        if not value:
            raise ValueError("disagg_bins must not be empty")
        return value

    @model_validator(mode="after")
    def validate_values_shape(self) -> "DisaggregationAggregate":
        expected = prod(len(v) for v in self.disagg_bins.values())
        if len(self.disagg_values) != expected:
            raise ValueError(
                f"disagg_values length {len(self.disagg_values)} does not match product of bin sizes {expected}"
            )
        return self

    def to_ndarray(self):
        """Reshape disagg_values into an N-D array with axes ordered by disagg_bins keys."""
        from toshi_hazard_store.model.pyarrow.disagg_reshape import reshape_disagg_values

        return reshape_disagg_values(self.disagg_values, self.disagg_bins)

    @staticmethod
    def pyarrow_schema() -> pa.Schema:
        """A pyarrow schema for aggregate disaggregation datasets."""
        return get_disagg_aggregate_schema()
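The length check in validate_values_shape and the C-order layout of disagg_values can be illustrated without the library. The following is a minimal sketch using only the standard library; `flat_index` is a hypothetical helper introduced for illustration and is not part of toshi_hazard_store, and the values are placeholder data.

```python
from math import prod

# Example bins; key order defines the axis order (mag, dist, eps).
disagg_bins = {
    "mag": ["5.5", "6.5", "7.5"],
    "dist": ["10.0", "50.0", "100.0", "200.0"],
    "eps": ["-1.0", "0.0", "1.0"],
}
shape = tuple(len(centres) for centres in disagg_bins.values())

# The model requires len(disagg_values) == product of the bin counts.
expected_len = prod(shape)
disagg_values = [float(i) for i in range(expected_len)]  # placeholder data


def flat_index(multi_index, shape):
    """Hypothetical helper: C-order (row-major) offset, last axis varies fastest."""
    offset = 0
    for i, n in zip(multi_index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for axis of size {n}")
        offset = offset * n + i
    return offset


# Look up the cell for mag="6.5" (1), dist="100.0" (2), eps="0.0" (1).
idx = flat_index((1, 2, 1), shape)
value = disagg_values[idx]
```

Because the last axis varies fastest, neighbouring entries in the flat list differ only in their `eps` bin until the `dist` index rolls over.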

PyArrow Schema¶

The DisaggregationAggregate model exposes a PyArrow schema for dataset I/O through its pyarrow_schema() static method:

from toshi_hazard_store.model.hazard_models_pydantic import DisaggregationAggregate
schema = DisaggregationAggregate.pyarrow_schema()

The schema includes:

compatible_calc_id (string) - Compatible calculation identifier
hazard_model_id (string) - Model identifier (e.g., "NSHM_v1.0.4")
bins_digest (string) - sha256[:16] compatibility key over sorted axes + bin centres
nloc_001 (string) - Location to 3 decimal places (e.g., "-41.300~174.800")
nloc_0 (string) - Location at 1.0° resolution (e.g., "-41.0~174.0"), used for partitioning
vs30 (int32) - VS30 value
imt (string) - Intensity measure type (e.g., "PGA", "SA(1.0)")
target_aggr (string) - Hazard-curve aggregation the disagg was conditioned on (e.g., "mean")
probability (string) - Return-period probability as a ProbabilityEnum name (e.g., "_10_PCT_IN_50YRS")
imtl (float) - Intensity measure level at which the disagg was computed
aggr (string) - Aggregation type across realisations (e.g., "mean", "0.1")
disagg_bins (map of string → list of string) - Ordered map of axis name to bin-centre strings; key order defines the axis order of disagg_values
disagg_values (list of float32) - Flattened C-order disaggregation array over disagg_bins axes

Dataset Partitioning¶

Disaggregation aggregate datasets use Hive-style partitioning on bins_digest / vs30 / nloc_0:

<dataset_root>/
├── bins_digest=6028db096c3a9e62/
│   ├── vs30=400/
│   │   └── nloc_0=-41.0~174.0/
│   │       └── <uuid>-part-0.parquet
│   └── vs30=1500/
│       └── nloc_0=-41.0~174.0/
│           └── <uuid>-part-0.parquet

The bins_digest partition groups rows with identical bin topology, enabling efficient filtering when querying a specific disaggregation configuration. Use the d2 query strategy for large datasets to exploit all three partition levels. The bins_digest can be obtained from the disagg_bins with toshi_hazard_store.model.revision_4.extract_disagg_hdf5.compute_bins_digest.
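To illustrate the idea of a bin-topology key, here is a minimal sketch of a digest that is insensitive to axis order and to the order of bin centres within an axis. The serialisation (JSON over sorted axis names and string-sorted centres) and the name `sketch_bins_digest` are assumptions made for illustration; the result is not expected to match the value produced by compute_bins_digest, which should be used for real datasets.

```python
import hashlib
import json


def sketch_bins_digest(disagg_bins: dict) -> str:
    """Hypothetical stand-in: 16 hex chars of sha256 over a canonical form."""
    # Assumption: sort axes by name and centres as strings, then encode as JSON.
    canonical = [[axis, sorted(disagg_bins[axis])] for axis in sorted(disagg_bins)]
    payload = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


a = sketch_bins_digest({"mag": ["5.5", "6.5"], "dist": ["10.0", "50.0"]})
b = sketch_bins_digest({"dist": ["50.0", "10.0"], "mag": ["6.5", "5.5"]})
c = sketch_bins_digest({"mag": ["5.5", "6.5"], "dist": ["10.0", "100.0"]})
```

Here `a` and `b` are equal because they describe the same bins in a different order, while `c` differs because one bin centre has changed.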

Note that append_models_to_dataset does not enforce this partitioning; it is left to the user to dictate the partitioning, either at write time or (more usually) after running ths_ds_defrag.
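A Hive-style partition path is a chain of `key=value` directories, so the layout shown above can be composed and parsed with plain string handling. The sketch below uses only the standard library; `partition_path` and `partition_values` are hypothetical helpers for illustration, not toshi_hazard_store functions.

```python
from pathlib import PurePosixPath


def partition_path(root: str, bins_digest: str, vs30: int, nloc_0: str) -> PurePosixPath:
    """Hypothetical helper: build the bins_digest / vs30 / nloc_0 directory chain."""
    return (
        PurePosixPath(root)
        / f"bins_digest={bins_digest}"
        / f"vs30={vs30}"
        / f"nloc_0={nloc_0}"
    )


def partition_values(path: PurePosixPath) -> dict:
    """Hypothetical helper: recover the key=value pairs from a partition path."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


path = partition_path("dataset_root", "6028db096c3a9e62", 400, "-41.0~174.0")
values = partition_values(path)
```

Values recovered from a path are strings, so `vs30` comes back as `"400"` and needs converting if an integer is required.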

Reshaping disagg_values¶

disagg_values is stored as a flat list. Use to_ndarray() to reshape it into an N-D array with axes ordered by disagg_bins:

from toshi_hazard_store import query
from toshi_hazard_store.model.constraints import ProbabilityEnum

bins = {
    "mag": ["5.5", "6.5", "7.5"],
    "dist": ["10.0", "50.0", "100.0", "200.0"],
    "eps": ["-1.0", "0.0", "1.0"],
}

for disagg in query.get_disagg_aggregates(
    location_codes=["-41.300~174.800"],
    vs30s=[400],
    hazard_model="NSHM_v1.0.4",
    imts=["PGA"],
    aggs=["mean"],
    target_aggrs=["mean"],
    probabilities=[ProbabilityEnum._10_PCT_IN_50YRS],
    disagg_bins=bins,
    strategy="d2",
):
    arr = disagg.to_ndarray()   # shape: (3, 4, 3) for mag × dist × eps
    print(arr.shape)

The flat storage form is preserved as the canonical representation; reshaping is opt-in and allocates a numpy array on demand.
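When only the dominant scenario is needed, the flat list can also be inspected directly without allocating an array. The following sketch finds the largest cell and maps it back to bin centres; `unravel` is a hypothetical helper written for illustration, and the values are placeholder data rather than query results.

```python
def unravel(offset, shape):
    """Hypothetical helper: inverse of C-order flattening (offset -> index per axis)."""
    idx = []
    for n in reversed(shape):
        offset, i = divmod(offset, n)
        idx.append(i)
    return tuple(reversed(idx))


bins = {
    "mag": ["5.5", "6.5", "7.5"],
    "dist": ["10.0", "50.0", "100.0", "200.0"],
    "eps": ["-1.0", "0.0", "1.0"],
}
shape = tuple(len(centres) for centres in bins.values())

# Placeholder data: one cell dominates the disaggregation.
values = [0.0] * 36
values[19] = 0.25

# Offset of the largest contribution, then the bin centre on each axis.
peak = max(range(len(values)), key=values.__getitem__)
peak_index = unravel(peak, shape)
modal_bin = {axis: centres[i] for (axis, centres), i in zip(bins.items(), peak_index)}
```

The pairing of `bins.items()` with `peak_index` relies on the key order of the bins map defining the axis order, as described for disagg_bins.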

Constraint Enums¶

Probability Enum¶

Bases: Enum

Defines the values available for probabilities.

Values are stored as floats representing the probability in 1 year.

Source code in toshi_hazard_store/model/constraints.py

class ProbabilityEnum(Enum):
    """
    Defines the values available for probabilities.

    store values as float representing probability in 1 year
    """

    _86_PCT_IN_50YRS = 3.8559e-02
    _63_PCT_IN_50YRS = 1.9689e-02
    _39_PCT_IN_50YRS = 9.8372e-03
    _18_PCT_IN_50YRS = 3.9612e-03
    _10_PCT_IN_50YRS = 2.1050e-03
    _5_PCT_IN_50YRS = 1.0253e-03
    _2_PCT_IN_50YRS = 4.0397e-04
    _1_PCT_IN_50YRS = 2.0099e-04
    _05_PCT_IN_50YRS = 1.0025e-04
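The values listed above agree, to the precision shown, with converting a probability of exceedance P over T = 50 years into a 1-year probability using 1 - (1 - P)^(1/T). This is an observation about the numbers rather than a statement of how the library derives them, and `annual_probability` is a hypothetical helper for illustration.

```python
import math


def annual_probability(poe: float, years: float = 50.0) -> float:
    """Hypothetical helper: 1-year probability equivalent to `poe` over `years`."""
    return 1.0 - (1.0 - poe) ** (1.0 / years)


# Compare against the enum values quoted in the listing above.
checks = {
    0.86: 3.8559e-02,  # _86_PCT_IN_50YRS
    0.10: 2.1050e-03,  # _10_PCT_IN_50YRS
    0.02: 4.0397e-04,  # _2_PCT_IN_50YRS
}
matches = {
    poe: math.isclose(annual_probability(poe), listed, rel_tol=1e-4)
    for poe, listed in checks.items()
}
```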