Add support for file row numbers in Parquet readers #7307
Which issue does this PR close?
Closes #7299.
What changes are included in this PR?
In this PR we:
- Extend `ArrowReaderBuilder` to set a `row_number_column`, used to extend the read `RecordBatch`es with an additional column containing file row numbers.
- Add an `ArrayReader` to the vector of `ArrayReader`s reading columns from the Parquet file, if the `row_number_column` is set in the reader configuration. This is a `RowNumberReader`, a special `ArrayReader` that reads no data from the Parquet pages but uses the first row numbers in the `RowGroupMetaData` to keep track of progress.

The `RowGroupMetaData::first_row_number` is an `Option<i64>`, since the row number may be unknown (I encountered an instance of this when trying to integrate this PR in delta-rs), and it's better to use `None` than some special integer value.

The performance impact of this PR should be negligible when the row number column is not set. The only additional overhead would be tracking the `first_row_number` of each row group.

Are there any user-facing changes?
We add an additional public method:
- `ArrowReaderBuilder::with_row_number_column`

There are a few breaking changes, as we touch a few public interfaces:
- `RowGroupMetaData::from_thrift` and `RowGroupMetaData::from_thrift_encrypted` take an additional parameter, `first_row_number: Option<i64>`.
- `RowGroups` has an additional method, `RowGroups::row_groups`. Potentially this method could replace the `RowGroups::num_rows` method or provide a default implementation for it.
- `ParquetError::RowGroupMetaDataMissingRowNumber` is added as a new error variant.

I'm very open to suggestions on how to reduce the number of breaking changes.
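
For reviewers, here is a rough standalone sketch of the row-number tracking idea. It is not the code in this PR: `RowGroupInfo`, `MissingRowNumber`, `assign_first_row_numbers`, `RowNumberTracker` and `next_batch` are hypothetical names made up for illustration, and the sketch assumes that first row numbers are a running sum of row counts over all row groups in file order. It only shows how row numbers can be produced from metadata alone, and how a `None` first row number could surface as an error rather than a sentinel value.

```rust
/// Hypothetical stand-in for the two `RowGroupMetaData` fields used here.
#[derive(Debug, Clone, Copy)]
struct RowGroupInfo {
    /// File-level index of the first row in this row group, if known.
    first_row_number: Option<i64>,
    /// Number of rows in this row group.
    num_rows: i64,
}

/// Hypothetical error, mirroring the idea of a "missing row number" variant.
#[derive(Debug, PartialEq)]
struct MissingRowNumber {
    row_group_index: usize,
}

/// Assumption: first row numbers are the running sum of the row counts of
/// all row groups in the file, taken in file order.
fn assign_first_row_numbers(num_rows: &[i64]) -> Vec<RowGroupInfo> {
    let mut next = 0i64;
    num_rows
        .iter()
        .map(|&n| {
            let info = RowGroupInfo { first_row_number: Some(next), num_rows: n };
            next += n;
            info
        })
        .collect()
}

/// Produces file row numbers for the selected row groups without reading
/// any page data; only the metadata above is consulted.
struct RowNumberTracker {
    row_groups: Vec<RowGroupInfo>,
    /// Index of the row group currently being consumed.
    group_idx: usize,
    /// Rows already emitted from the current row group.
    offset: i64,
}

impl RowNumberTracker {
    fn new(row_groups: Vec<RowGroupInfo>) -> Self {
        Self { row_groups, group_idx: 0, offset: 0 }
    }

    /// Returns up to `batch_size` row numbers, continuing across row groups.
    /// Fails if a row group that is actually read has no first row number.
    fn next_batch(&mut self, batch_size: usize) -> Result<Vec<i64>, MissingRowNumber> {
        let mut out = Vec::with_capacity(batch_size);
        while out.len() < batch_size && self.group_idx < self.row_groups.len() {
            let rg = self.row_groups[self.group_idx];
            let first = rg
                .first_row_number
                .ok_or(MissingRowNumber { row_group_index: self.group_idx })?;
            let remaining = rg.num_rows - self.offset;
            let take = remaining.min((batch_size - out.len()) as i64);
            out.extend((0..take).map(|i| first + self.offset + i));
            self.offset += take;
            if self.offset == rg.num_rows {
                // Current row group is exhausted; move on to the next one.
                self.group_idx += 1;
                self.offset = 0;
            }
        }
        Ok(out)
    }
}
```

With row groups of 3, 2 and 4 rows, selecting only the first and third group and reading batches of 4 yields `[0, 1, 2, 5]` and then `[6, 7, 8]`: the numbers stay file-level even though the middle row group is skipped, which is why each row group needs its own first row number.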