-
Notifications
You must be signed in to change notification settings - Fork 898
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve parquet gzip compression performance using zlib-rs #7200
base: main
Are you sure you want to change the base?
Conversation
@@ -50,7 +50,7 @@ bytes = { version = "1.1", default-features = false, features = ["std"] } | |||
thrift = { version = "0.17", default-features = false } | |||
snap = { version = "1.0", default-features = false, optional = true } | |||
brotli = { version = "7.0", default-features = false, features = ["std"], optional = true } | |||
flate2 = { version = "1.0", default-features = false, features = ["rust_backend"], optional = true } | |||
flate2 = { version = "1.1", default-features = false, features = ["zlib-rs"], optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this pure-rust? Does this compile for wasm32-unknown-unknown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. its written in pure rust.
In my fork I can see wasm32 pipeline not failing https://github.com/psvri/arrow-rs/actions/runs/13547936797/job/37864105483
It seems like this would be another good reason to 🤔 |
I merged this PR up from main to rerun the tests as I think the failing CI check was resoled |
Hi @psvri -- thank you for this contribution I noticed that this PR reports benchmark numbers. What benchmark were these? Are they from zilb-rs? Did you run any benchmarks for the parquet crate itself (as in from the https://github.com/apache/arrow-rs/tree/main/parquet/benches directory)?
|
Hello @alamb I got these numbers by running gzip benchmarks from this file https://github.com/apache/arrow-rs/blob/main/parquet/benches/compression.rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again @psvri
I also ran the benchmarks on this branch and saw similar improvements as you reported (up to 96% faster)
++ critcmp main improve_parquet_gzip_compressions_2
group improve_parquet_gzip_compressions_2 main
----- ----------------------------------- ----
compress GZIP(GzipLevel(6)) - alphanumeric 1.00 27.4±0.06ms ? ?/sec 1.46 40.1±0.21ms ? ?/sec
compress GZIP(GzipLevel(6)) - words 1.00 31.5±0.09ms ? ?/sec 1.55 49.0±0.09ms ? ?/sec
decompress GZIP(GzipLevel(6)) - alphanumeric 1.00 3.0±0.01ms ? ?/sec 1.12 3.4±0.01ms ? ?/sec
decompress GZIP(GzipLevel(6)) - words 1.00 1275.7±5.24µs ? ?/sec 1.96 2.5±0.01ms ? ?/sec
Let's hold this one until the next major parquet release (eta in April) to minimize the chance of disruption. I think we can merge the PR after I cut the next maintenance release in a few days
The MSRV action is failing since I haven't updated the rust version from 1.70 to 1.75 in my branch. Would it be possible to update MSRV in the next major version ? |
I think we should I think we are blocked on someone writing down a policy. |
Which issue does this PR close?
Closes #.
Rationale for this change
We will use zlib-rs for delfate operations which has much better performance than the current one. I can see ~10%-47% performance improvement in various scenarios
perf numbers
What changes are included in this PR?
I have updated the flate library to use zlib-rs backend. This does mean that we need to bump our MSRV to 1.75 . So I dont expect the PR to merged immediately until we sort out #181
Also we allow gzip level 10 in our parquet implementation , which is non complaint gzip level as explained here . Hence I have also changed max gzip level to 9.
Are there any user-facing changes?
Yes, max gzip level is now 9 in parquet.