# PyArrow 15 to 19: baseline performance tests (March 2025)

We started using pyarrow at version 14/15; the current release is now at least 19.0.1. The runs below compare 15.0.2 against 19.0.1 on the same workloads.
## Filter

Use `scripts/ths_r4_filter_dataset.py` to produce a filtered dataset from a larger one.

### Version 15.0.2

```
time poetry run python scripts/ths_r4_filter_dataset.py WORKING/ARROW/THS_R4_HIVE WORKING/ARROW/TMP --verbose
using pyarrow version 15.0.2
((nloc_0 == "-37.0~175.0") and (nloc_001 == "-36.852~174.763"))
...
((nloc_0 == "-46.0~171.0") and (nloc_001 == "-45.874~170.504"))
filter 12 locations to WORKING/ARROW

real 2m34.869s
user 20m43.290s
sys 1m11.017s
```
### Version 19.0.1

```
time poetry run python scripts/ths_r4_filter_dataset.py WORKING/ARROW/THS_R4_HIVE WORKING/ARROW/TMP --verbose
using pyarrow version 19.0.1
((nloc_0 == "-37.0~175.0") and (nloc_001 == "-36.852~174.763"))
...
((nloc_0 == "-46.0~171.0") and (nloc_001 == "-45.874~170.504"))
filter 12 locations to WORKING/ARROW/TMP

real 2m29.080s
user 19m51.944s
sys 1m11.387s
```
## Defrag / reorg

Use `scripts/ths_r4_defrag.py` to compact / restructure partitioning. This is single-threaded.
### Reorg small

V1: approx. 4m30s
#### Run 1

##### Version 15

```
chrisbc@tryharder-ubuntu:/GNSDATA/LIB/toshi-hazard-store$ time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG --verbose
using pyarrow version 15.0.2
partitions: []
partition WORKING/ARROW/TMP/nloc_0=-41.0~175.0
...
compacted WORKING/ARROW/TMP/nloc_0=-37.0~175.0
compacted 12 partitions for WORKING/ARROW

real 4m38.415s
user 15m11.949s
sys 11m2.965s
```
##### Version 19.0.1

```
time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG --verbose
using pyarrow version 19.0.1
partitions: []
compacted WORKING/ARROW/TMP/nloc_0=-41.0~175.0 has disk size: 494MB
...
pyarrow RSS memory: 466MB
compacted 12 partitions for WORKING/ARROW

real 4m9.106s
user 14m18.134s
sys 10m32.759s
```
#### Run 2

```
chrisbc@tryharder-ubuntu:/GNSDATA/LIB/toshi-hazard-store$ time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG --verbose
using pyarrow version 15.0.2
partitions: []
compacted WORKING/ARROW/TMP/nloc_0=-41.0~175.0
RSS: 494MB
...
compacted 12 partitions for WORKING/ARROW

real 4m32.951s
user 15m4.936s
sys 11m4.036s
```
#### Run 3

```
chrisbc@tryharder-ubuntu:/GNSDATA/LIB/toshi-hazard-store$ time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG -p vs30 --verbose
using pyarrow version 15.0.2
partitions: ['vs30']
compacted WORKING/ARROW/TMP/nloc_0=-41.0~175.0;
...
compacted 12 partitions for WORKING/ARROW

real 4m17.809s
user 15m27.989s
sys 10m44.570s
```
#### Run 4

```
chrisbc@tryharder-ubuntu:/GNSDATA/LIB/toshi-hazard-store$ time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG -p vs30 --verbose
using pyarrow version 15.0.2
partitions: ['vs30']
compacted WORKING/ARROW/TMP/nloc_0=-41.0~175.0 has disk size: 523MB
...
pyarrow RSS memory: 494MB
compacted 12 partitions for WORKING/ARROW

real 4m20.796s
user 15m36.431s
sys 10m48.709s
```
#### Run 5

##### Version 15.0.2

```
chrisbc@tryharder-ubuntu:/GNSDATA/LIB/toshi-hazard-store$ time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG -p 'vs30, imt' --verbose
using pyarrow version 15.0.2
partitions: ['vs30', 'imt']
pyarrow RSS memory: 494MB
compacted 12 partitions for WORKING/ARROW

real 0m55.419s
user 8m40.011s
sys 0m39.568s
```
##### Version 19.0.1

```
time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG -p 'vs30, imt' --verbose
using pyarrow version 19.0.1
partitions: ['vs30', 'imt']
compacted WORKING/ARROW/TMP/nloc_0=-41.0~175.0 has disk size: 494MB
...
compacted WORKING/ARROW/TMP/nloc_0=-37.0~175.0 has disk size: 494MB
pyarrow RSS memory: 466MB
compacted 12 partitions for WORKING/ARROW

real 0m48.516s
user 8m8.439s
sys 0m27.097s
```
### Reorg larger

NOTE: this is the F32 dataset, which is much more compact and therefore faster to process, despite holding many more values.
#### Version 15.0.2

```
$ time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/THS_R4_F32 WORKING/ARROW/TMP_DEFRAG -p 'vs30, imt' --verbose
using pyarrow version 15.0.2
partitions: ['vs30', 'imt']
compacted WORKING/ARROW/THS_R4_F32/nloc_0=-41.0~175.0 has disk size: 500MB
...
pyarrow RSS memory: 27MB
compacted 64 partitions for WORKING/ARROW

real 6m17.818s
user 48m52.176s
sys 3m5.129s
```
#### Version 19.0.1

```
time poetry run python scripts/ths_r4_defrag.py WORKING/ARROW/TMP WORKING/ARROW/TMP_DEFRAG --verbose
using pyarrow version 19.0.1
partitions: []
compacted WORKING/ARROW/TMP/nloc_0=-41.0~175.0 has disk size: 494MB
...
pyarrow RSS memory: 46MB
compacted 64 partitions for WORKING/ARROW

real 5m49.718s
user 45m26.746s
sys 2m35.294s
```