Patched 40.0.0 with Parquet memory limiting #37

Open
wants to merge 9 commits into
base: alamb/40.0.0_base


alamb
Owner

@alamb alamb commented May 30, 2023

This PR contains a patched version of parquet 40.0.0 that backports the fix for apache#3871, along with other related parquet changes, so that we can use it in IOx: https://github.com/influxdata/influxdb_iox/pull/7880

It starts with the parquet 40.0.0 release and cherry-picks the following commits. All git cherry-picks applied cleanly (I didn't need to resolve any conflicts).

3adca53 - metadata
58e2c1c - splice column
17ca4d5 - Debug impls
56437cc - default for writer props
aa799f0 - Send
3e5b07a - more Send
6959b4b - metrics
741244d - fixed-size list support
ea00892 - memory accounting

tustvold and others added 9 commits May 30, 2023 10:57
* Add splice column API (apache#4155)

* Review feedback

* Re-encode offset index
…e#4278)

* Add `Debug` impls for writers

* Improve display
* feat(api make ArrowArrayStreamReader Send

* simplify ptr handling

* rename pyarrow traits to conform to guidelines

* pr feedback

* remove dangling Box::from_raw
* Derive Default for WriterProperties

* Review feedback
* Initial implementation for writing fixed-size lists to Parquet.

The implementation still needs tests.
The implementation uses a new `write_fixed_size_list` method instead of `write_list`.
This is done to avoid the overhead of needlessly calculating list offsets.

* Initial implementation for reading fixed-size lists from Parquet.

The implementation still needs tests.

* Added tests for fixed-size list writer.

Fixed bugs in implementation found via tests.

* Added tests for fixed-size list reader.

Fixed bugs in implementation found via tests.

* Added correct behavior for writing empty fixed-length lists.

Writer now emits the correct definition levels for empty lists.
Added empty list unit test.

* Added correct behavior for reading empty fixed-length lists.

Reader now handles empty list definition levels correctly.
Added empty list unit test.

* Fixed linter warnings.

* Added license header to fixed_size_list_array.rs

* Added fixed-size list reader tests from PR review.

* Added fixed-size reader row length sanity checks.

* Simplified fixed-size list case in LevelInfoBuilder constructor.

* Removed dynamic dispatch inside fixed-length list writer.

* Expanded list of structs test for fixed-size list writer.

* Reverted expected levels in fixed-size list writer test.

* Fixed linter warnings.

* Updated list size check in fixed-size list reader.

Converted the check to return an error instead of panicking.

* Small tweak to row length check in fixed-size list reader.

* Fixed bug in fixed-size list level encoding.

Writer now correctly handles child arrays with variable row length.
Added new unit test to verify the new behavior is correct.

* Added fixed-size list reader test.

Test verifies that reader handles child arrays with variable length correctly.
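
The rationale given above for a dedicated `write_fixed_size_list` path (avoiding needless list-offset calculation) can be illustrated with a small sketch. This is not the parquet crate's implementation; the helper names `list_value_range` and `fixed_size_list_value_range` are hypothetical and exist only to show why a fixed-size list needs no offsets buffer.

```rust
use std::ops::Range;

/// Variable-length list: the child-value range for `row` has to be read
/// from an offsets buffer (hypothetical helper, for illustration only).
pub fn list_value_range(offsets: &[i32], row: usize) -> Range<usize> {
    offsets[row] as usize..offsets[row + 1] as usize
}

/// Fixed-size list: the same range follows from the row index and the
/// list size alone, so no offsets need to be built or stored
/// (hypothetical helper, for illustration only).
pub fn fixed_size_list_value_range(size: usize, row: usize) -> Range<usize> {
    row * size..(row + 1) * size
}
```

For a list column where every row holds three values, both helpers yield the same range for a given row, but only the first one requires an offsets array to exist.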
…ad of RecordBatch (apache#3871) (apache#4280)

* Buffer Pages in ArrowWriter instead of RecordBatch (apache#3871)

* Review feedback

* Improved memory accounting

* Clippy
alamb pushed a commit that referenced this pull request Sep 23, 2024
* chore: add docs, part of #37
- add pragma `#![warn(missing_docs)]` to `arrow`, `arrow-arith`, `arrow-avro`
- add docs to the same to remove lint warnings

* chore: add docs, part of #37
- add pragma `#![warn(missing_docs)]` to `arrow-buffer`, `arrow-cast`, `arrow-csv`
- add docs to the same to remove lint warnings

* chore: update docs, resolve PR comments
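
A minimal sketch of what the `#![warn(missing_docs)]` pragma from these commits does. In a real crate the attribute sits at the top of the crate root (for example `lib.rs`); here it is scoped to an inline module so the snippet stands alone. The module, function, and struct names are hypothetical and not taken from the arrow crates.

```rust
/// Hypothetical module standing in for a crate root such as `lib.rs`.
pub mod documented {
    // Inner attribute: enables the lint for everything in this module.
    // In a real crate it would be the first line of `lib.rs`.
    #![warn(missing_docs)]

    /// Returns the number of bytes needed to store `len` 32-bit values.
    pub fn bytes_for_i32(len: usize) -> usize {
        len * std::mem::size_of::<i32>()
    }

    /// A hypothetical, documented public type.
    pub struct Example {
        /// Public fields need documentation as well.
        pub rows: usize,
    }
}
```

In a library crate, deleting any of the `///` comments on the public items above would be expected to produce a `missing_docs` warning rather than a build failure, which is why the lint can be enabled one crate at a time.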
alamb pushed a commit that referenced this pull request Sep 24, 2024
* chore: add docs, part of #37

- add pragma `#![warn(missing_docs)]` to the following
  - `arrow-array`
  - `arrow-cast`
  - `arrow-csv`
  - `arrow-data`
  - `arrow-json`
  - `arrow-ord`
  - `arrow-pyarrow-integration-testing`
  - `arrow-row`
  - `arrow-schema`
  - `arrow-select`
  - `arrow-string`
  - `arrow`
  - `parquet_derive`

- add docs to those that generated lint warnings

- Remove `bitflags` workaround in `arrow-schema`
A change in `bitflags v2.3.0` started generating lint warnings in `arrow-schema`. This was handled using a [workaround](apache#4233); see the upstream [issue](bitflags/bitflags#356).

`bitflags v2.3.1` fixed the issue, so the workaround is no longer needed.

* fix: resolve comments on PR apache#6433
alamb pushed a commit that referenced this pull request Oct 1, 2024
* chore: add docs, part of #37
- add pragma `#![warn(missing_docs)]` to the following
  - `arrow-flight`
  - `arrow-ipc`
  - `arrow-integration-test`
  - `arrow-integration-testing`
  - `object_store`

- also document the caveat with using level 10 GZIP compression in
  parquet. See apache#6282.

* chore: resolve PR comments from apache#6453
alamb pushed a commit that referenced this pull request Oct 2, 2024
- add pragma `#![warn(missing_docs)]` to `parquet`

This is the final component in the effort to make Arrow
fully documented. The entire project now generates warnings
for missing docs, if any.

- `arrow-flight`: replace `tonic`'s deprecated `compile_with_config`
with suggested method

- new deprecation:
The following types were not used anywhere and were possibly strays.
They've been marked as deprecated and will be removed in future
versions.

- `parquet::data_types::SliceAsBytesDataType`
- `parquet::column::writer::Level`
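
As a rough sketch of how a stray public type can be marked deprecated before removal, the standard `#[deprecated]` attribute can be used as below. `StrayType` and `still_compiles` are hypothetical names, and the `since` value is a placeholder; the actual version and wording used in the parquet crate are not stated here.

```rust
/// Hypothetical stand-in for an unused public type.
#[deprecated(since = "0.0.0", note = "unused; will be removed in a future version")]
pub struct StrayType;

/// Code that still has to mention the deprecated item can silence the
/// resulting warning locally instead of crate-wide.
#[allow(deprecated)]
pub fn still_compiles() -> usize {
    // A unit struct is zero-sized, so this returns 0.
    std::mem::size_of::<StrayType>()
}
```

Downstream users who reference a type marked this way get a compiler warning carrying the `note` text, which gives them a release cycle to migrate before the type is removed.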
3 participants