Multi-threaded writing of output files #107

Open

gessulat opened this issue Sep 21, 2023 · 2 comments

Comments


gessulat commented Sep 21, 2023

For very large datasets, single-threaded I/O is currently a speed bottleneck.
PyArrow datasets natively support both of the tasks below (see the sketch after this list):

Tasks

  • reading partitioned input data
  • writing partitioned output data
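
A minimal sketch of the kind of partitioned read/write PyArrow supports; the paths and the `experiment` partition column are placeholders, not Mokapot's actual schema:

```python
import pyarrow.dataset as ds

# Read a hive-partitioned dataset; fragments are scanned across
# multiple threads by default.
dataset = ds.dataset("psms_in/", format="parquet", partitioning="hive")
table = dataset.to_table()

# Write the output partitioned by a column; pyarrow emits one directory
# per partition value and parallelizes the file writes.
ds.write_dataset(
    table,
    "psms_out/",
    format="parquet",
    partitioning=["experiment"],  # placeholder partition key
    partitioning_flavor="hive",
)
```
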
@gessulat (Contributor, Author) commented:

This issue is related to the upgrade to polars (#89). Optimizing reading and writing could be done independently, though.

@gessulat (Contributor, Author) commented:

To motivate this, @sambenfredj and I ran some benchmarks on a 3M-PSM Mokapot input file (tab-separated CSV), converting it to parquet with different reader and writer implementations. Note that there are several ways to read and write parquet files: you can choose among compression algorithms and among reader/writer implementations (pandas, pyarrow, and polars), and within pyarrow there are again multiple read and write options. That's why the read_speed plot is crowded; I didn't have time to clean it up, sorry!

All timings below are in seconds.
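
Roughly, the comparison looks like this (a sketch only; the file names and the codec list are illustrative, not the exact benchmark code):

```python
import time

import polars as pl

# Placeholder path for the 3M-PSM tab-separated input.
df = pl.read_csv("psms.tab", separator="\t")

for codec in ["zstd", "lz4", "snappy", "uncompressed"]:
    path = f"psms_{codec}.parquet"

    # Time the write for this codec.
    start = time.perf_counter()
    df.write_parquet(path, compression=codec)
    write_s = time.perf_counter() - start

    # Time the corresponding read.
    start = time.perf_counter()
    pl.read_parquet(path)
    read_s = time.perf_counter() - start

    print(f"{codec}: write {write_s:.2f}s, read {read_s:.2f}s")
```
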

TL;DR:

  • using the polars read and write implementations for parquet files with lz4 compression seems to be the optimal choice for performance
  • if possible, read row groups directly

file sizes

[plot: file_sizes]

  • zstd offers the best compression
  • lz4 offers the second-best compression and is probably acceptable, given that it yields better read/write performance

read speed

[plot: read_speed]

  • the polars reader implementations are faster than the pyarrow implementations
  • if we can read row groups directly (e.g. in a streaming setting), that would be the preferred approach (see the sketch after this list)
  • lz4 compression offers the best read speed
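
For illustration, reading row groups directly with pyarrow could look like this (a sketch; the path is a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("psms_lz4.parquet")  # placeholder path
for i in range(pf.num_row_groups):
    # Each row group decodes independently, so a streaming consumer
    # never materializes the whole file in memory.
    batch = pf.read_row_group(i)
    print(f"row group {i}: {batch.num_rows} rows")
```
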

write speed

[plot: write_speed]

  • the polars write implementation is faster than pyarrow's
  • surprisingly, lz4-compressed writing is faster than uncompressed writing, presumably because compression shrinks the number of bytes that hit the disk, so the saved I/O outweighs the extra CPU work. This finding is consistent for both the polars and pyarrow implementations.
