Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RecordReader trait and proc macro to implement it for a struct #4773

Merged
merged 16 commits into from
Oct 30, 2023
Merged

Add RecordReader trait and proc macro to implement it for a struct #4773

merged 16 commits into from
Oct 30, 2023

Conversation

Joseph-Rance
Copy link
Contributor

Which issue does this PR close?

Closes #4772.

What changes are included in this PR?

  • Added a parquet::record::RecordReader trait similar to the existing parquet::record::RecordWriter

  • Added a parquet_derive::ParquetRecordReader derive macro to implement the new RecordReader trait for slices of structs, similar to the existing parquet_derive::ParquetRecordWriter macro

  • Added tests to parquet_field.rs for reader_snipped similar to the existing tests for writer_snipped

  • Added a test to parqued_derive_test to test writing a struct slice to a parquet file, reading it again, and then checking the output is the same as the original struct.

  • Not all types supported by ParquetRecordWriter are supported by ParquetRecordReader yet. Specifically there is no support for options or references. References is difficult because using Rc would mean we have to manually annotate types for the compiler which requires quite a lot of matching on the Field's ty field. See: https://github.com/Joseph-Rance/arrow-rs/blob/f49472eb93c4bf27aa20d8040c57539fc73d1059/parquet_derive/src/parquet_field.rs#L452-L454

  • Updated the readme to reflect the above point

Are there any user-facing changes?

  • Users will now have access to the parquet::record::RecordReader trait and the parquet_derive::ParquetRecordReader proc macro

@github-actions github-actions bot added parquet Changes to the parquet crate parquet-derive labels Sep 4, 2023
@Joseph-Rance
Copy link
Contributor Author

@tustvold would it be possible to get a review on this? Apologies if I am pinging the wrong person. Please do let me know if there is anything further I need to do to get this ready for review

@tustvold
Copy link
Contributor

tustvold commented Sep 11, 2023

Apologies, it is on my list to review, but I'm not very familiar with the non-arrow APIs so need longer to sit down and get the necessary context. Development has largely focused over the last couple of years on the columnar, arrow-based APIs as they have better performance characteristics, so I lack any immediate context 😅

@Joseph-Rance
Copy link
Contributor Author

No problem. Just wanted to make sure this PR hadn't been missed

@Joseph-Rance
Copy link
Contributor Author

@tustvold Apologies for pinging you again. I was wondering if you could estimate a possible timeline for review of this PR?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done an initial review, overall this looks very nice, I would still encourage people who care about performance to use the columnar APIs, but I can see the appeal/utility of a row-API.

Overall this looks well tested and seems to make sense. My only comment is that the logic that converts from columnar to the row representation I think could be made easier to follow and a bit better documented

parquet_derive/src/parquet_field.rs Outdated Show resolved Hide resolved
parquet_derive/src/parquet_field.rs Outdated Show resolved Hide resolved
parquet_derive/src/parquet_field.rs Outdated Show resolved Hide resolved
parquet_derive/src/parquet_field.rs Show resolved Hide resolved
parquet_derive/src/parquet_field.rs Outdated Show resolved Hide resolved
parquet_derive/src/parquet_field.rs Outdated Show resolved Hide resolved
parquet_derive/src/parquet_field.rs Outdated Show resolved Hide resolved
parquet/src/record/record_reader.rs Show resolved Hide resolved
@Joseph-Rance
Copy link
Contributor Author

@tustvold I've resolved most of your comments. Left two open:

  • For the references, I think there isn't anything more to do, but I left it open in case I have misunderstood the situation
  • For the other comment, I wonder if you have any opinions on what I should write?

parquet_derive/README.md Outdated Show resolved Hide resolved
@Joseph-Rance
Copy link
Contributor Author

Any idea why clippy is suddenly failing on code that (I think) has nothing to do with this PR? Can I ignore these issues or is it due to something I have changed?

@tustvold
Copy link
Contributor

You probably need to merge the latest master to pick up fixes for the lints added in the latest Rust version

@Joseph-Rance
Copy link
Contributor Author

@tustvold could you approve the workflows please?

@Joseph-Rance
Copy link
Contributor Author

Joseph-Rance commented Oct 20, 2023

@tustvold workflows should pass now (apologies for fmt failing the previous time)

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me, I have to confess to not being sure that it will be possible to evolve this interface to support nested data, but perhaps that isn't a problem. We don't any row/record-oriented APIs that support nesting, so perhaps this isn't an issue

}
Some(ThirdPartyType::ChronoNaiveDate) => {
quote! {
::chrono::naive::NaiveDate::from_num_days_from_ce_opt(vals[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this logic different to above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure exactly what your question related to, but this logic for getting the NaiveDate is just the inverse of above. However, there is no function from_signed_duration_since that mirrors the signed_duration_since function we are using above, so I had to use from_num_days_from_ce_opt, which uses a different date to measure days from. That's why we need to add on the number of days between that other date and 01/01/1970.

parquet_derive/src/lib.rs Outdated Show resolved Hide resolved
parquet_derive/src/lib.rs Outdated Show resolved Hide resolved
parquet_derive/README.md Outdated Show resolved Hide resolved
@tustvold tustvold merged commit d9aaa43 into apache:master Oct 30, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

There exists a ParquetRecordWriter proc macro in parquet_derive, but ParquetRecordReader is missing
2 participants