Change default parquet compression format from Snappy to LZ4
Snappy's status as the default is probably just historical: Snappy had
better Java support, and LZ4 wasn't always available in systems like
Spark. Today Spark and other systems support LZ4 as well, and LZ4
generally performs a bit better, especially on decompression.

This is a significant change, but the only reason not to make it is
historical, and I don't think that's a good enough reason these days.
mrocklin committed Nov 24, 2023
1 parent da7dc67 commit 312fb0f
Showing 1 changed file with 5 additions and 5 deletions.
dask/dataframe/io/parquet/core.py
@@ -698,7 +698,7 @@ def to_parquet(
     df,
     path,
     engine="auto",
-    compression="snappy",
+    compression="lz4",
     write_index=True,
     append=False,
     overwrite=False,
@@ -729,10 +729,10 @@ def to_parquet(
     engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
         Parquet library to use. Defaults to 'auto', which uses ``pyarrow`` if
         it is installed, and falls back to ``fastparquet`` otherwise.
-    compression : string or dict, default 'snappy'
-        Either a string like ``"snappy"`` or a dictionary mapping column names
-        to compressors like ``{"name": "gzip", "values": "snappy"}``. Defaults
-        to ``"snappy"``.
+    compression : string or dict, default 'lz4'
+        Either a string like ``"lz4"`` or a dictionary mapping column names
+        to compressors like ``{"name": "gzip", "values": "lz4"}``. Defaults
+        to ``"lz4"``.
     write_index : boolean, default True
         Whether or not to write the index. Defaults to True.
     append : bool, default False
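
For context, a minimal usage sketch of the new default and of the per-column
dictionary form described in the docstring. The output paths and example data
here are illustrative, not part of the commit:

import pandas as pd
import dask.dataframe as dd

# Small example frame using the column names from the docstring example.
df = dd.from_pandas(
    pd.DataFrame({"name": ["a", "b", "c"], "values": [1, 2, 3]}),
    npartitions=1,
)

# After this commit these two calls are equivalent: LZ4 is the default.
# (Output paths below are hypothetical.)
df.to_parquet("output-default/")
df.to_parquet("output-lz4/", compression="lz4")

# Per-column compression via a dict, as described in the docstring.
df.to_parquet("output-mixed/", compression={"name": "gzip", "values": "lz4"})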
