Testing iceberg and pyiceberg

using scripts/ths_iceberg.py

Test 1 local catalog (sqlite)

one location, 2 vs30

with:

aggr_uri = "s3://ths-dataset-prod/NZSHM22_AGG"
fltr = pc.field("nloc_001") == "-41.300~174.800"

chrisbc@MLX01 toshi-hazard-store % time poetry run python toshi_hazard_store/scripts/ths_iceberg.py
Opened pyarrow table in 10.826197
created iceberg table in 0.104256
Saved 1080 rows to iceberg table in 0.057077
poetry run python toshi_hazard_store/scripts/ths_iceberg.py  3.43s user 3.68s system 59% cpu 12.010 total

all locations, 1 vs30

with:

# fltr = pc.field("nloc_001") == "-41.200~174.800"
fltr = pc.field("vs30") == 400

chrisbc@MLX01 toshi-hazard-store % time poetry run python toshi_hazard_store/scripts/ths_iceberg.py
Opened pyarrow table in 21.961363
created iceberg table in 0.123352
Saved 2020140 rows to iceberg table in 15.482653
poetry run python toshi_hazard_store/scripts/ths_iceberg.py  51.75s user 13.20s system 165% cpu 39.259 total

Test 2 S3 GP catalog (in-memory)

one location, 2 vs30

with:

aggr_uri = "s3://ths-dataset-prod/NZSHM22_AGG"
fltr = pc.field("nloc_001") == "-41.300~174.800"
catalog_uri = "s3://ths-poc-arrow-test/ICEBERG_CATALOG"

time poetry run python toshi_hazard_store/scripts/ths_iceberg.py
Opened pyarrow table in 12.224047
created iceberg table in 2.071799
Saved 1080 rows to iceberg table in 3.24654
poetry run python toshi_hazard_store/scripts/ths_iceberg.py  3.54s user 3.81s system 38% cpu 18.862 total

Test 2 S3 Table catalog (in-memory)

one location, 2 vs30

with:

aggr_uri = "s3://ths-dataset-prod/NZSHM22_AGG"
fltr = pc.field("nloc_001") == "-41.300~174.800"
catalog_uri = "s3://ths-poc-iceberg"

chrisbc@MLX01 toshi-hazard-store % time poetry run python toshi_hazard_store/scripts/ths_iceberg.py
Opened pyarrow table in 11.932467
Unable to resolve region for bucket ths-poc-iceberg
Traceback (most recent call last):
  File "/Users/Shared/DEV/GNS/LIB/toshi-hazard-store/toshi_hazard_store/scripts/ths_iceberg.py", line 55, in <module>
    import_to_iceberg()
  File "/Users/Shared/DEV/GNS/LIB/toshi-hazard-store/toshi_hazard_store/scripts/ths_iceberg.py", line 41, in import_to_iceberg
    icetable = catalog.create_table("DEFAULT.aggr", schema=dt0.schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chrisbc/Library/Caches/pypoetry/virtualenvs/toshi-hazard-store--2imPOIE-py3.12/lib/python3.12/site-packages/pyiceberg/catalog/sql.py", line 217, in create_table
    self._write_metadata(metadata, io, metadata_location)
  File "/Users/chrisbc/Library/Caches/pypoetry/virtualenvs/toshi-hazard-store--2imPOIE-py3.12/lib/python3.12/site-packages/pyiceberg/catalog/__init__.py", line 939, in _write_metadata
    ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
  File "/Users/chrisbc/Library/Caches/pypoetry/virtualenvs/toshi-hazard-store--2imPOIE-py3.12/lib/python3.12/site-packages/pyiceberg/serializers.py", line 130, in table_metadata
    with output_file.create(overwrite=overwrite) as output_stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chrisbc/Library/Caches/pypoetry/virtualenvs/toshi-hazard-store--2imPOIE-py3.12/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 341, in create
    output_file = self._filesystem.open_output_stream(self._path, buffer_size=self._buffer_size)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 885, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When initiating multiple part upload for key 'DEFAULT.db/aggr/metadata/00000-414345b9-2ea8-46fe-ace5-636a7ddadf41.metadata.json' in bucket 'ths-poc-iceberg': AWS Error NO_SUCH_BUCKET during CreateMultipartUpload operation: The specified bucket does not exist
poetry run python toshi_hazard_store/scripts/ths_iceberg.py  3.49s user 3.73s system 45% cpu 15.735 total

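THIS WON'T WORK: the SQL catalog cannot write its table metadata to ths-poc-iceberg through pyarrow's S3 filesystem, which reports NO_SUCH_BUCKET.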

Test 3 using AWS Glue

following https://aws.amazon.com/blogs/storage/access-data-in-amazon-s3-tables-using-pyiceberg-through-the-aws-glue-iceberg-rest-endpoint/

with:

REGION = 'ap-southeast-2'
CATALOG = 's3tablescatalog'
DATABASE = 'ths_poc_iceberg_db'  # DATABASE -> Namespace in pyiceberg terms
TABLE_BUCKET = 'ths-poc-iceberg'
TABLE_NAME = 'AGGR'
time poetry run python toshi_hazard_store/scripts/ths_iceberg.py
Opened pyarrow table in 24.090168
created iceberg table in 1.939119
Saved 2020140 rows to iceberg table in 42.426441
       compatible_calc_id hazard_model_id  aggr                                             values      imt         nloc_001  vs30       nloc_0
0                 NZSHM22     NSHM_v1.0.4  mean  [0.051696815, 0.05169638, 0.051679485, 0.05160...      PGA  -34.300~172.900   400  -34.0~173.0
1                 NZSHM22     NSHM_v1.0.4  mean  [0.052680086, 0.052679673, 0.052663658, 0.0525...      PGA  -34.300~173.000   400  -34.0~173.0
2                 NZSHM22     NSHM_v1.0.4  mean  [0.05444223, 0.054441817, 0.05442487, 0.054347...      PGA  -34.300~173.100   400  -34.0~173.0
3                 NZSHM22     NSHM_v1.0.4  mean  [0.048430245, 0.04842985, 0.04841446, 0.048345...      PGA  -34.400~172.600   400  -34.0~173.0
4                 NZSHM22     NSHM_v1.0.4  mean  [0.05057133, 0.050570954, 0.050555933, 0.05048...      PGA  -34.400~172.700   400  -34.0~173.0
...                   ...             ...   ...                                                ...      ...              ...   ...          ...
101002            NZSHM22     NSHM_v1.0.4  mean  [0.17536972, 0.13379893, 0.098121226, 0.079893...  SA(7.5)  -46.600~169.500   400  -47.0~170.0
101003            NZSHM22     NSHM_v1.0.4  mean  [0.1706562, 0.1300151, 0.0952137, 0.07742944, ...  SA(7.5)  -46.600~169.600   400  -47.0~170.0
101004            NZSHM22     NSHM_v1.0.4  mean  [0.16497253, 0.12532957, 0.091469646, 0.074174...  SA(7.5)  -46.600~169.700   400  -47.0~170.0
101005            NZSHM22     NSHM_v1.0.4  mean  [0.15930109, 0.120670915, 0.0876214, 0.0707053...  SA(7.5)  -46.600~169.800   400  -47.0~170.0
101006            NZSHM22     NSHM_v1.0.4  mean  [0.16793892, 0.12838548, 0.09438368, 0.0769320...  SA(7.5)  -46.700~169.500   400  -47.0~170.0

[101007 rows x 8 columns]
poetry run python toshi_hazard_store/scripts/ths_iceberg.py  56.79s user 16.48s system 86% cpu 1:24.27 total
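For reference, the REGION, CATALOG and TABLE_BUCKET settings above map onto pyiceberg REST catalog properties roughly as sketched below. The property names are as recalled from the linked AWS post and should be checked against it; the helper name and the account ID are placeholders, not values from this test.

```python
def glue_rest_catalog_properties(region: str, account_id: str, table_bucket: str) -> dict[str, str]:
    """Build pyiceberg properties for the AWS Glue Iceberg REST endpoint (hypothetical helper)."""
    return {
        "type": "rest",
        # Warehouse is "<account id>:s3tablescatalog/<table bucket name>".
        "warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Requests to the Glue endpoint are SigV4-signed.
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }


# "123456789012" is a placeholder AWS account ID.
props = glue_rest_catalog_properties("ap-southeast-2", "123456789012", "ths-poc-iceberg")
print(props["warehouse"])

# With AWS credentials available, the catalog would then be opened with:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("s3tablescatalog", **props)
```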

Test 4 query performance comparison

Here's a simple test retrieving the typical 80 user curves for NSHM Kororaa.

chrisbc@MLX01 toshi-hazard-store % time poetry run python toshi_hazard_store/scripts/ths_iceberg.py
opened dateset in 0.573172
opened table in 0.538331
(80, 8)
>>>>>
Queried pyarrow table in 0.002323 secs
Total 1.113826 secs
>>>>>

opened catalog in 0.390378
opened table in 1.20442
(80, 4)
>>>>>
Queried iceberg table in 15.744698 secs
Total 17.339496 secs
>>>>>
poetry run python toshi_hazard_store/scripts/ths_iceberg.py  2.03s user 3.88s system 30% cpu 19.316 total

The pyarrow query is now far faster (0.002 s versus 15.7 s for the iceberg scan) because it uses partitioning. Note that the slower setup step (opening the dataset/table) will be cached.
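To put numbers on the gap, the ratios can be worked out from the timings printed in the run above. This is just arithmetic on the logged values; the variable names are illustrative.

```python
# Timings copied from the run above (seconds).
pyarrow_query, pyarrow_total = 0.002323, 1.113826
iceberg_query, iceberg_total = 15.744698, 17.339496

# Ratio of the query step alone, and of the whole open-plus-query sequence.
query_ratio = iceberg_query / pyarrow_query
total_ratio = iceberg_total / pyarrow_total

print(f"query only: {query_ratio:.0f}x, end to end: {total_ratio:.1f}x")
```

The query step alone differs by a factor of several thousand, while end to end the difference is closer to 15 times, because opening the dataset and table dominates the pyarrow total.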

With the three THS options

THS_DATASET_AGGR_URI=s3://ths-dataset-prod/NZSHM22_AGG poetry run python toshi_hazard_store/scripts/ths_iceberg.py
opened dateset in 0.771697
opened table in 0.888011
(80, 8)
>>>>>
Queried pyarrow table in 0.007486 secs
Total 1.667194 secs
>>>>>

opened catalog in 0.726231
opened table in 1.431485
(80, 4)
>>>>>
Queried iceberg table in 12.177356 secs
Total 14.335072 secs
>>>>>

>>>>>
Total for Function get_hazard_curves_naive 1.904679 secs
>>>>>

>>>>>
Total for Function get_hazard_curves_by_vs30 1.062653 secs
>>>>>

>>>>>
Total for Function get_hazard_curves_by_vs30_nloc0 1.032392 secs
>>>>>

THS_DATASET_AGGR_URI=s3://ths-dataset-prod/NZSHM22_AGG poetry run python   2.41s user 3.98s system 29% cpu 21.331 total
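Per-function totals like the ones above can be produced with a small timing decorator. This is a sketch of one way to do it, not necessarily how ths_iceberg.py does it; the decorator name and the placeholder function body are assumptions.

```python
import functools
import time


def timed(func):
    """Print the wall-clock time of each call in the log format used above."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Report even if the wrapped function raises.
            elapsed = time.perf_counter() - start
            print(">>>>>")
            print(f"Total for Function {func.__name__} {elapsed:.6f} secs")
            print(">>>>>")

    return wrapper


@timed
def get_hazard_curves_naive():
    # Placeholder body; the real query lives in toshi-hazard-store.
    return list(range(80))


curves = get_hazard_curves_naive()
```

Using time.perf_counter rather than time.time avoids jumps from system clock adjustments on short measurements.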