Patched `40.0.0` with Parquet memory limiting #37

alamb · 2023-05-30T15:18:46Z

This PR contains a patched version of 40.0.0 that backports the fix for apache#3871 and other related parquet changes so that we can use it in IOx - https://github.com/influxdata/influxdb_iox/pull/7880

It starts with the parquet 40.0.0 release and cherry-picks the following commits. All of the git cherry-picks applied cleanly (I didn't need to resolve any conflicts).

3adca53 - metadata
58e2c1c - splice column
17ca4d5 - Debug impls
56437cc - Default for writer props
aa799f0 - Send
3e5b07a - more Send
6959b4b - metrics
741244d - fixed-size list support
ea00892 - memory accounting

* Initial implementation for writing fixed-size lists to Parquet. The implementation still needs tests. The implementation uses a new `write_fixed_size_list` method instead of `write_list`. This is done to avoid the overhead of needlessly calculating list offsets.
* Initial implementation for reading fixed-size lists from Parquet. The implementation still needs tests.
* Added tests for fixed-size list writer. Fixed bugs in implementation found via tests.
* Added tests for fixed-size list reader. Fixed bugs in implementation found via tests.
* Added correct behavior for writing empty fixed-length lists. Writer now emits the correct definition levels for empty lists. Added empty list unit test.
* Added correct behavior for reading empty fixed-length lists. Reader now handles empty list definition levels correctly. Added empty list unit test.
* Fixed linter warnings.
* Added license header to fixed_size_list_array.rs
* Added fixed-size list reader tests from PR review.
* Added fixed-size reader row length sanity checks.
* Simplified fixed-size list case in LevelInfoBuilder constructor.
* Removed dynamic dispatch inside fixed-length list writer.
* Expanded list of structs test for fixed-size list writer.
* Reverted expected levels in fixed-size list writer test.
* Fixed linter warnings.
* Updated list size check in fixed-size list reader. Converted the check to return an error instead of panicking.
* Small tweak to row length check in fixed-size list reader.
* Fixed bug in fixed-size list level encoding. Writer now correctly handles child arrays with variable row length. Added new unit test to verify the new behavior is correct.
* Added fixed-size list reader test. Test verifies that reader handles child arrays with variable length correctly.
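The commit message above says a dedicated `write_fixed_size_list` path avoids computing list offsets. A minimal sketch of why that is possible, using two hypothetical helper functions that are not part of the `parquet` crate: when every row has the same length, the child range follows from the row index alone, whereas a variable-size list has to consult an offsets buffer.

```rust
use std::ops::Range;

/// Child-value range for `row` in a fixed-size list where every row holds
/// exactly `size` elements. No offsets buffer is needed.
/// (Hypothetical helper for illustration only.)
fn fixed_size_child_range(row: usize, size: usize) -> Range<usize> {
    row * size..(row + 1) * size
}

/// Child-value range for `row` in a variable-size list. This needs an offsets
/// buffer with one more entry than there are rows.
/// (Hypothetical helper for illustration only.)
fn variable_size_child_range(offsets: &[usize], row: usize) -> Range<usize> {
    offsets[row]..offsets[row + 1]
}
```

For a list with three elements per row, both helpers agree on row 2, but only the second one has to read `offsets`.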

* chore: add docs, part of #37
  - add pragma `#![warn(missing_docs)]` to `arrow`, `arrow-arith`, `arrow-avro`
  - add docs to the same to remove lint warnings
* chore: add docs, part of #37
  - add pragma `#![warn(missing_docs)]` to `arrow-buffer`, `arrow-cast`, `arrow-csv`
  - add docs to the same to remove lint warnings
* chore: update docs, resolve PR comments

* chore: add docs, part of #37
  - add pragma `#![warn(missing_docs)]` to the following
    - `arrow-array`
    - `arrow-cast`
    - `arrow-csv`
    - `arrow-data`
    - `arrow-json`
    - `arrow-ord`
    - `arrow-pyarrow-integration-testing`
    - `arrow-row`
    - `arrow-schema`
    - `arrow-select`
    - `arrow-string`
    - `arrow`
    - `parquet_derive`
  - add docs to those that generated lint warnings
  - Remove `bitflags` workaround in `arrow-schema`

    At some point, a change in `bitflags v2.3.0` started generating lint warnings in `arrow-schema`. This was handled using a [workaround](apache#4233) ([Issue](bitflags/bitflags#356)). `bitflags v2.3.1` fixed the issue, so the workaround is no longer needed.
* fix: resolve comments on PR apache#6433

* chore: add docs, part of #37
  - add pragma `#![warn(missing_docs)]` to the following
    - `arrow-flight`
    - `arrow-ipc`
    - `arrow-integration-test`
    - `arrow-integration-testing`
    - `object_store`
  - also document the caveat with using level 10 GZIP compression in parquet. See apache#6282.
* chore: resolve PR comments from apache#6453

- add pragma `#![warn(missing_docs)]` to `parquet`

  This is the final component in the effort to make Arrow fully documented. The entire project now generates warnings for missing docs, if any.
- `arrow-flight`: replace `tonic`'s deprecated `compile_with_config` with the suggested method
- new deprecation: The following types were not used anywhere and were possibly strays. They've been marked as deprecated and will be removed in future versions.
  - `parquet::data_types::SliceAsBytesDataType`
  - `parquet::column::writer::Level`
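The commit messages above combine two mechanisms: the `missing_docs` lint and the `#[deprecated]` attribute. A minimal sketch of both, using a hypothetical module and hypothetical item names rather than anything from the real crates; the lint is applied to one module here with an outer attribute, while the commits apply it crate-wide with `#![warn(missing_docs)]`.

```rust
/// Hypothetical module used only to illustrate the two attributes.
#[warn(missing_docs)] // every public item in this module must carry a doc comment
pub mod sketch_levels {
    /// Definition or repetition level stored as a small integer.
    pub type SketchLevel = i16;

    /// Stray alias kept for compatibility; use sites get a compiler warning.
    #[deprecated(note = "unused; will be removed in a future version")]
    pub type StrayAlias = i16;

    /// Returns the largest level in `levels`, or `None` if the slice is empty.
    pub fn max_level(levels: &[SketchLevel]) -> Option<SketchLevel> {
        levels.iter().copied().max()
    }
}
```

Deleting any of the doc comments inside `sketch_levels` should produce a `missing_docs` warning, and naming `StrayAlias` elsewhere should produce a deprecation warning carrying the `note` text.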

tustvold and others added 9 commits May 30, 2023 10:57

Convert parquet metadata back to builders (apache#4265)

aef9ab0

Add splice column API (apache#4155) (apache#4269)

62c6cbb

* Add splice column API (apache#4155)
* Review feedback
* Re-encode offset index

Add Debug impls for ArrowWriter and SerializedFileWriter (apache#4278)

4cd418c

* Add `Debug` impls for writers
* Improve display

Make GenericColumnWriter Send (apache#4287)

2f805db

feat(api!): make ArrowArrayStreamReader Send (apache#4232)

b0b2cf7

* feat(api!): make ArrowArrayStreamReader Send
* simplify ptr handling
* rename pyarrow traits to conform to guidelines
* pr feedback
* remove dangling Box::from_raw
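Two of the commits above make a type `Send`. A minimal sketch of what that property buys, using a hypothetical `SketchReader` type (not `ArrowArrayStreamReader` or `GenericColumnWriter`): a `Send` value can be moved into another thread, and a small generic function can check the bound at compile time.

```rust
use std::sync::Arc;
use std::thread;

/// Compiles only when `T: Send`; returns `true` so it can sit in an assertion.
fn is_send<T: Send>() -> bool {
    true
}

/// Hypothetical reader that shares its schema through `Arc`, which keeps it `Send`.
/// (Holding an `Rc` or a raw pointer instead would make it `!Send` by default.)
struct SketchReader {
    schema: Arc<String>,
}

/// Moves the reader into a worker thread and returns a copy of its schema name.
/// `thread::spawn` requires the closure, and so the reader, to be `Send`.
fn read_on_thread(reader: SketchReader) -> String {
    thread::spawn(move || reader.schema.as_str().to_owned())
        .join()
        .expect("worker thread panicked")
}
```

If `schema` were changed to `std::rc::Rc<String>`, both `is_send::<SketchReader>()` and `read_on_thread` would stop compiling, which is the kind of restriction these commits remove for the real types.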

Only increment metrics for data pages (apache#4285)

a257a93

Derive Default for WriterProperties (apache#4268)

feea2bc

* Derive Default for WriterProperties
* Review feedback
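A minimal sketch of what deriving `Default` provides, using a hypothetical `SketchWriterProps` struct whose fields are invented for illustration and do not mirror the real `WriterProperties`: callers get a ready-made default value and can override single fields with struct update syntax.

```rust
/// Hypothetical, simplified writer properties (not the real `WriterProperties`).
#[derive(Debug, Default, Clone, PartialEq)]
struct SketchWriterProps {
    /// Optional "created by" string recorded in the file footer.
    created_by: Option<String>,
    /// Optional cap on rows per row group; `None` means "no explicit cap".
    max_row_group_rows: Option<usize>,
    /// Whether dictionary encoding is requested.
    dictionary_enabled: bool,
}

/// Builds properties that differ from the derived default in one field only.
fn props_with_dictionary() -> SketchWriterProps {
    SketchWriterProps {
        dictionary_enabled: true,
        ..Default::default()
    }
}
```

The derived implementation sets each field to its own type's default (`None` and `false` here), so it only fits when those per-field defaults are the intended ones.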

Improve ArrowWriter memory usage: Buffer Pages in ArrowWriter instead of RecordBatch (apache#3871) (apache#4280)

4a59200

* Buffer Pages in ArrowWriter instead of RecordBatch (apache#3871)
* Review feedback
* Improved memory accounting
* Clippy
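The commit title above describes buffering encoded pages rather than whole `RecordBatch`es, together with memory accounting. A minimal sketch of that general idea, assuming hypothetical `EncodedPage` and `PageBuffer` types and a caller-chosen byte limit; it is not the `ArrowWriter` implementation, only an illustration of how a running byte count lets a caller decide when to flush.

```rust
/// Hypothetical stand-in for one encoded Parquet page.
struct EncodedPage {
    bytes: Vec<u8>,
}

/// Hypothetical buffer that holds encoded pages and tracks the bytes they occupy.
#[derive(Default)]
struct PageBuffer {
    pages: Vec<EncodedPage>,
    buffered_bytes: usize,
}

impl PageBuffer {
    /// Buffers one page and updates the running byte count.
    fn push(&mut self, page: EncodedPage) {
        self.buffered_bytes += page.bytes.len();
        self.pages.push(page);
    }

    /// Bytes currently held by buffered pages.
    fn memory_size(&self) -> usize {
        self.buffered_bytes
    }

    /// Hands back all buffered pages (for example, to write them out) and resets the count.
    fn flush(&mut self) -> Vec<EncodedPage> {
        self.buffered_bytes = 0;
        std::mem::take(&mut self.pages)
    }
}

/// Pushes `page`, flushing first if adding it would exceed `limit` bytes.
/// Returns how many pages were flushed (0 if no flush was needed).
fn push_with_limit(buf: &mut PageBuffer, page: EncodedPage, limit: usize) -> usize {
    let mut flushed = 0;
    if buf.memory_size() > 0 && buf.memory_size() + page.bytes.len() > limit {
        flushed = buf.flush().len();
    }
    buf.push(page);
    flushed
}
```

Because encoded pages are usually smaller than the in-memory batches they came from, counting their bytes gives a tighter bound on buffered memory than counting rows would.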

github-actions bot added arrow parquet parquet-derive labels May 30, 2023
