Multi-threaded writing of output files #107

Open

gessulat opened this issue Sep 21, 2023 · 2 comments

Comments


gessulat commented Sep 21, 2023

For very large datasets, single-threaded I/O is currently a speed bottleneck.
PyArrow datasets natively support both of the tasks below (see the sketch after this list):

Tasks

  • reading partitioned input data
  • writing partitioned output data
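
A minimal sketch of the kind of partitioned read/write PyArrow supports; the paths and the `experiment` partition column are placeholders, not Mokapot's actual schema:

```python
import pyarrow.dataset as ds

# Read a hive-partitioned dataset; fragments are scanned across
# multiple threads by default.
dataset = ds.dataset("psms_in/", format="parquet", partitioning="hive")
table = dataset.to_table()

# Write the output partitioned by a column; pyarrow emits one directory
# per partition value and parallelizes the file writes.
ds.write_dataset(
    table,
    "psms_out/",
    format="parquet",
    partitioning=["experiment"],  # placeholder partition key
    partitioning_flavor="hive",
)
```
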
@gessulat (Contributor, Author) commented:

This issue is related to the upgrade to polars (#89). Optimizing reading and writing could be done independently, though.

@gessulat (Contributor, Author) commented:

To motivate this, @sambenfredj and I ran some benchmarks on a 3M-PSM Mokapot input file (tab-separated CSV), converting it to parquet with different reader and writer implementations. Note that there are several ways to read and write parquet files: you can choose among compression algorithms and among reader/writer implementations (pandas, pyarrow, and polars), and within pyarrow there are again multiple read and write options. That's why the read_speed plot is crowded; I didn't have time to clean it up, sorry!

All timings below are in seconds.
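
Roughly, the comparison looks like this (a sketch only; the file names and the codec list are illustrative, not the exact benchmark code):

```python
import time

import polars as pl

# Placeholder path for the 3M-PSM tab-separated input.
df = pl.read_csv("psms.tab", separator="\t")

for codec in ["zstd", "lz4", "snappy", "uncompressed"]:
    path = f"psms_{codec}.parquet"

    # Time the write for this codec.
    start = time.perf_counter()
    df.write_parquet(path, compression=codec)
    write_s = time.perf_counter() - start

    # Time the corresponding read.
    start = time.perf_counter()
    pl.read_parquet(path)
    read_s = time.perf_counter() - start

    print(f"{codec}: write {write_s:.2f}s, read {read_s:.2f}s")
```
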

TL;DR:

  • using the polars read and write implementations for parquet files with lz4 compression seems to be the optimal choice for performance
  • if possible, read row groups directly

file sizes

[plot: file_sizes]

  • zstd offers the best compression
  • lz4 offers the second-best compression and is probably acceptable, given that it yields better read/write performance

read speed

[plot: read_speed]

  • the polars reader implementations are faster than the pyarrow implementations
  • if we can read row groups directly (e.g. in a streaming setting), that would be the preferred approach (see the sketch after this list)
  • lz4 compression offers the best read speed
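
For illustration, reading row groups directly with pyarrow could look like this (a sketch; the path is a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("psms_lz4.parquet")  # placeholder path
for i in range(pf.num_row_groups):
    # Each row group decodes independently, so a streaming consumer
    # never materializes the whole file in memory.
    batch = pf.read_row_group(i)
    print(f"row group {i}: {batch.num_rows} rows")
```
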

write speed

[plot: write_speed]

  • the polars write implementation is faster than pyarrow's
  • surprisingly, lz4-compressed writing is faster than uncompressed writing, presumably because compression shrinks the number of bytes that hit the disk, so the saved I/O outweighs the extra CPU work. This finding is consistent for both the polars and pyarrow implementations.
