Replies: 2 comments
-
Hi @hahax46, we use the same defaults as the Arrow Parquet C++ library, which are partly documented here: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties. I believe all columns use dictionary encoding by default, and no encryption is used. You can also see some of the default writer settings in our test here: https://github.com/G-Research/ParquetSharp/blob/master/csharp.test/TestWriterProperties.cs

One difference between ParquetSharp and the C++ library is that we default the compression to Snappy in most of the `ParquetFileWriter` constructors (see `csharp/ParquetFileWriter.cs`, lines 22 to 26 at 62ae832), although if you use one of the constructor overloads that accepts a `WriterProperties` instance, the settings from those properties are used instead.

If you care about file size it's definitely worth experimenting with different compression and encoding settings. For example, it's often worth disabling dictionary encoding for floating point data columns and enabling byte stream split encoding, as in the sketch below. It doesn't look like we document how to do this, but there's an example in a test here: `csharp.test/TestWriterProperties.cs`, lines 173 to 177 at 62ae832.
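To make that concrete, here's a minimal sketch of tuning these settings with ParquetSharp's `WriterPropertiesBuilder`. The column names, the sample data, and the choice of Zstd compression are assumptions made up for this example, not library defaults:

```csharp
using ParquetSharp;

// Hypothetical schema: an integer id column and a float data column.
var columns = new Column[]
{
    new Column<int>("id"),
    new Column<float>("value"),
};

// Override the defaults: compress with Zstd (an arbitrary choice for this
// example; ParquetSharp's own default is Snappy), and for the floating
// point column disable dictionary encoding and use byte stream split
// encoding instead.
using var properties = new WriterPropertiesBuilder()
    .Compression(Compression.Zstd)
    .DisableDictionary("value")
    .Encoding("value", Encoding.ByteStreamSplit)
    .Build();

using var writer = new ParquetFileWriter("example.parquet", columns, properties);
using var rowGroup = writer.AppendRowGroup();

using (var idWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    idWriter.WriteBatch(new[] { 1, 2, 3 });
}
using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(new[] { 0.5f, 1.5f, 2.5f });
}

writer.Close();
```

Whether a given combination actually shrinks your files depends heavily on the data, so it's worth measuring a few codec and encoding combinations on a representative sample.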
-
I wrote this F# console app to parse a bunch of CMS Medicare Part D historical data: about 36 GB across ten files, written out to roughly 5.4 GB across 500 or so files in Parquet format. https://gist.github.com/houstonhaynes/222075b037749918520dfd610b636b6a
-
Hi, I'm new to Parquet and I'm building an app to write huge amounts of data to Parquet files. Can I ask what the default encoding, encryption, and compression will be if I don't set any configuration when using this library? I noticed the documentation doesn't state any of these, and the examples are quite minimal.