Replies: 2 comments
-
Hi @hahax46, we use the same defaults as the Arrow Parquet C++ library, which are partly documented here: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties. I believe all columns use dictionary encoding by default, and no encryption is used. You can also see some of the default writer settings in our test here: https://github.com/G-Research/ParquetSharp/blob/master/csharp.test/TestWriterProperties.cs

One difference between ParquetSharp and the C++ library is that we default the compression to Snappy in most of the `ParquetFileWriter` constructors (see `csharp/ParquetFileWriter.cs`, lines 22 to 26 at 62ae832), although if you use one of the constructor overloads that accepts a `WriterProperties` instance, the settings from those properties are used instead.

If you care about file size it's definitely worth experimenting with different compression and encoding settings. For example, it's often worth disabling dictionary encoding for floating point data columns and enabling byte stream split encoding, as in the sketch below. It doesn't look like we document how to do this, but there's an example in a test here: `csharp.test/TestWriterProperties.cs`, lines 173 to 177 at 62ae832.
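To make that concrete, here's a minimal sketch of tuning these settings with ParquetSharp's `WriterPropertiesBuilder`. The column names, the sample data, and the choice of Zstd compression are assumptions made up for this example, not library defaults:

```csharp
using ParquetSharp;

// Hypothetical schema: an integer id column and a float data column.
var columns = new Column[]
{
    new Column<int>("id"),
    new Column<float>("value"),
};

// Override the defaults: compress with Zstd (an arbitrary choice for this
// example; ParquetSharp's own default is Snappy), and for the floating
// point column disable dictionary encoding and use byte stream split
// encoding instead.
using var properties = new WriterPropertiesBuilder()
    .Compression(Compression.Zstd)
    .DisableDictionary("value")
    .Encoding("value", Encoding.ByteStreamSplit)
    .Build();

using var writer = new ParquetFileWriter("example.parquet", columns, properties);
using var rowGroup = writer.AppendRowGroup();

using (var idWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    idWriter.WriteBatch(new[] { 1, 2, 3 });
}
using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(new[] { 0.5f, 1.5f, 2.5f });
}

writer.Close();
```

Whether a given combination actually shrinks your files depends heavily on the data, so it's worth measuring a few codec and encoding combinations on a representative sample.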
-
I wrote this F# console app to parse a bunch of CMS Medicare Part D historical data: about 36 GB across ten files, written out to roughly 5.4 GB across 500 or so files in Parquet format. https://gist.github.com/houstonhaynes/222075b037749918520dfd610b636b6a
-
Hi, I'm new to Parquet and I'm building an app to write huge amounts of data to Parquet files. Can I ask what the default encoding, encryption, and compression will be if I don't set any configuration when using this library? I noticed the documentation doesn't state any of these, and the examples are quite minimal.