Add support for file row numbers in Parquet readers #7307

Open
wants to merge 1 commit into
base: main

@jkylling jkylling commented Mar 18, 2025

Which issue does this PR close?

Closes #7299.

What changes are included in this PR?

In this PR we:

  • Add a row_number_column option to the ArrowReaderBuilder. When it is set, every RecordBatch that is read is extended with an additional column containing file row numbers.
  • Keep track of the first row number in each row group in the file. This is computed from the file metadata.
  • Add an ArrayReader to the vector of ArrayReaders reading columns from the Parquet file, if the row_number_column is set in the reader configuration. This is a RowNumberReader, which is a special ArrayReader. It reads no data from the Parquet pages, but uses the first row numbers in the RowGroupMetaData to keep track of progress.
  • Add some basic tests and fuzz tests of the functionality.
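
To make the bookkeeping in the list above concrete, here is a minimal, self-contained sketch of the idea. It does not use the parquet crate: `RowGroupInfo`, `with_first_row_numbers`, and `RowNumberSketch` are hypothetical names made up for illustration, and the actual RowNumberReader in this PR may be structured differently. The sketch assumes the first row number of a row group is the number of rows in all preceding row groups.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the per-row-group metadata described above.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupInfo {
    num_rows: i64,
    /// `None` when the position of the row group in the file is unknown.
    first_row_number: Option<i64>,
}

/// Give each row group the count of rows that precede it in the file.
fn with_first_row_numbers(num_rows_per_group: &[i64]) -> Vec<RowGroupInfo> {
    let mut next = 0i64;
    num_rows_per_group
        .iter()
        .map(|&num_rows| {
            let info = RowGroupInfo { num_rows, first_row_number: Some(next) };
            next += num_rows;
            info
        })
        .collect()
}

/// Produces file row numbers for the selected row groups without touching page data.
struct RowNumberSketch {
    remaining: std::vec::IntoIter<Range<i64>>,
    current: Range<i64>,
}

impl RowNumberSketch {
    /// Fails if any selected row group has no known first row number.
    fn try_new(selected: &[RowGroupInfo]) -> Result<Self, String> {
        let mut ranges = Vec::with_capacity(selected.len());
        for rg in selected {
            let first = rg
                .first_row_number
                .ok_or_else(|| "row group is missing its first row number".to_string())?;
            ranges.push(first..first + rg.num_rows);
        }
        Ok(Self { remaining: ranges.into_iter(), current: 0..0 })
    }

    /// Return up to `batch_size` row numbers, crossing row group boundaries as needed.
    fn read_batch(&mut self, batch_size: usize) -> Vec<i64> {
        let mut out = Vec::with_capacity(batch_size);
        while out.len() < batch_size {
            if self.current.is_empty() {
                match self.remaining.next() {
                    Some(range) => self.current = range,
                    None => break, // all selected row groups are exhausted
                }
                continue;
            }
            let left_in_group = (self.current.end - self.current.start) as usize;
            let take = (batch_size - out.len()).min(left_in_group);
            let end = self.current.start + take as i64;
            out.extend(self.current.start..end);
            self.current.start = end;
        }
        out
    }
}
```

With row groups of 3, 2, and 4 rows, selecting only the first and third yields the row numbers 0, 1, 2, 5, 6, 7, 8: the numbers refer to positions in the file, so skipped row groups leave gaps.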

RowGroupMetaData::first_row_number is an Option<i64> because the row number can be unknown (I encountered an instance of this when trying to integrate this PR in delta-rs), and None expresses that more clearly than a special integer value would.

The performance impact of this PR should be negligible when the row number column is not set. The only additional overhead would be the tracking of the first_row_number of each row group.

Are there any user-facing changes?

We add an additional public method:

  • ArrowReaderBuilder::with_row_number_column

There are a few breaking changes as we touch a few public interfaces:

  • RowGroupMetaData::from_thrift and RowGroupMetaData::from_thrift_encrypted take an additional parameter first_row_number: Option<i64>.
  • The trait RowGroups has an additional method RowGroups::row_groups. Potentially this method could replace the RowGroups::num_rows method or provide a default implementation for it.
  • An additional error variant ParquetError::RowGroupMetaDataMissingRowNumber.
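
As a rough illustration of the default-implementation idea mentioned for RowGroups::num_rows, the sketch below derives the row count from the row group metadata. `RowGroupsSketch`, `RowGroupMeta`, and `InMemoryRowGroups` are hypothetical, simplified stand-ins and not the parquet crate's definitions; the real trait has other methods and its row_groups signature may differ.

```rust
/// Hypothetical per-row-group metadata; only the row count matters here.
struct RowGroupMeta {
    num_rows: i64,
}

/// Simplified stand-in for the `RowGroups` trait (an assumption, not the real trait).
trait RowGroupsSketch {
    /// The new required method: metadata for each row group being read.
    fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMeta> + '_>;

    /// Default implementation derived from `row_groups`, so implementors
    /// would no longer have to supply the total themselves.
    fn num_rows(&self) -> usize {
        self.row_groups().map(|rg| rg.num_rows as usize).sum()
    }
}

/// Toy implementor that only provides `row_groups`.
struct InMemoryRowGroups {
    groups: Vec<RowGroupMeta>,
}

impl RowGroupsSketch for InMemoryRowGroups {
    fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMeta> + '_> {
        Box::new(self.groups.iter())
    }
}
```

A default like this would only remove the need to implement num_rows; adding row_groups as a required method would still break existing implementors of the trait.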

I'm very open to suggestions on how to reduce the number of breaking changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 18, 2025
Successfully merging this pull request may close these issues.

Return file row number in Parquet readers