-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP Upgrade to arrow-rs/parquet 54.0.0
#13663
Conversation
bd6517c
to
44bb39d
Compare
54.0.0
3c5f100
to
51fdc1e
Compare
@@ -1777,6 +1777,8 @@ fn round_trip_datatype() { | |||
} | |||
} | |||
|
|||
// TODO file a ticket about handling deprecated dict_id attributes | |||
#[allow(deprecated)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what to do here:
- Should we remove the
field_id
from DataFusion serialization code (as it is planned for removal eventually in DataFusion)? - Should we un-deprecate
field_id
in arrow-rs? - Something else?
@avantgardnerio, @thinkharderdev and @Dandandan , this test was added in
However, @brancz's upstream PR in arrow-rs has deprecated dict_id
on Field
and all related APIs with the idea it will be eventually removed (and become a detail of the IPC transport rather than visible in the Schema)
- Add deprecation warnings for everything related to
dict_id
arrow-rs#6873 upstream in arrow-rs
@thinkharderdev mentioned apache/arrow-rs#6873 (comment)
If people are using the dictionary ID it's probably for something that should properly be done with kv metadata instead.
So one possibility could be to provide some more documentation / writeup as an upgrade guide for arrow explaining how to use field metadata instead ?
Please advise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If people are using the dictionary ID it's probably for something that should properly be done with kv metadata instead
I mostly meant that if you give people 32 bits of space someone, somewhere will use it as metadata if they don't need it for anything else. I don't really have any particular use case in mind.
I don't have any objection to removing the field from datafusion serde. When we added that test we were still fighting to try and ensure unique dict IDs but we've long since moved on and found better ways around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe on second thought we should keep it around until it is fully removed from upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds like a reasonable approach. I'll update the comments to match if no one else has a usecase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Also, #8457 (comment) suggests it was never important to set the dict id directly, just that it wasn't sent correctly on the wire and already called out what to do once we remove the dict id. There were in fact a number of dict fixes that were necessary for us to correctly communicate the dicts correctly over the wire as well.
51fdc1e
to
fb080a2
Compare
fb080a2
to
2fe785b
Compare
@@ -459,10 +459,6 @@ config_namespace! { | |||
/// default parquet writer setting | |||
pub statistics_enabled: Option<String>, transform = str::to_lowercase, default = Some("page".into()) | |||
|
|||
/// (writing) Sets max statistics size for any column. If NULL, uses | |||
/// default parquet writer setting | |||
pub max_statistics_size: Option<usize>, default = Some(4096) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think simply removing the config option is pretty rough as you end up with errors liek this
External error: query failed: DataFusion error: Invalid or Unsupported Configuration: Config value "max_statistics_size" not found on ParquetOptions
The best thing to do would be to mark the config setting deprecated, but there wasn't an obvious way to do this with the existing macros. I will futz with the macros and see if I can make it deprecated somehow
@@ -321,6 +321,8 @@ impl TryFrom<&protobuf::Field> for Field { | |||
fn try_from(field: &protobuf::Field) -> Result<Self, Self::Error> { | |||
let datatype = field.arrow_type.as_deref().required("arrow_type")?; | |||
let field = if field.dict_id != 0 { | |||
// TODO file a ticket about handling deprecated dict_id attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to update per apache/arrow-rs#6885 I think
I would appreciate this PR getting in before the next datafusion release 🙏 |
|
||
/// Sets max statistics size for the column path. If NULL, uses | ||
/// default parquet options | ||
pub max_statistics_size: Option<usize>, default = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a mistake to remove this field directly -- we should put it back in and figure out some way to mark it deprecated
Yes I hope to -- if anyone has time to help get it in shape / make a new PR it would be most appreciated I think a good first step would be to upgrade DataFusion to use the new arrow / parquet versions and sprinkle some Then we can file follow on tickets to sort out exactly how to deal with the deprecated APIs as a follow on PR I'll try to work on this over the next week if no one beats me to it |
I added this to the list of items that i think we should fix before releasing version 45.0.0: |
Do you still need help on this ? I can help to file a MR tmr to upgrade arrow, parquet & pyo3 for the project. Let me know |
@Owen-CH-Leung -- Yes, please I am still looking for help. Feel free to create a PR and use any/all fo the code from this one as necessary. Thank you 🙏 I will post on this PR when I begin to work on it if you haven't beaten me to it |
Which issue does this PR close?
Related to
54.0.0
(December 2024) arrow-rs#6342Rationale for this change
Prepare DataFusion for the next arrow-rs release
Also, help test to ensure that the next arrow-rs release has no regressions by testing with DataFusion before we release it
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?