-
Notifications
You must be signed in to change notification settings - Fork 802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try casting structs by name before by position #6726
base: master
Are you sure you want to change the base?
Try casting structs by name before by position #6726
Conversation
Alternative is to let the user decide on which approach to use. But, we need to pass |
I'm not sure about this, casting by name comes with its own sets of issues, from performance to potential ambiguity. I wonder if this is really the right level to be implementing this logic, or whether this really belongs in some sort of higher-level schema adapter. Tagging @alamb as there is a fair amount of such schema munging logic in DF. Regardless this is a breaking API change as written |
I agree it is non obvious and this the user should have to explicitly request this behavior
I personally think it makes sense in the cast kernel as it is a common operation (e.g. to convert structs with the same fields but in different order) However that being said, a
I agree. I also think it would be nice to avoid a breaking change if not necessary and I don't think it is necessary in this case. Tto make it not a breaking change I think we could introduce |
Right, and then you're doing potentially non-trivial work per batch, matching columns, etc... as opposed to per-schema. For example DF's SchemaAdapter computes the mapping once and can then apply that to multiple batches. I guess what I am suggesting is that rather than baking this into the cast kernel, if we add a first-party schema adapter into arrow-rs. Potentially upstreaming the one already in DF with some modifications? |
For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself): We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs |
100%, and I think the level of complexity that component has grown highlights my concern of trying to shoehorn all of this logic into the cast kernel directly. |
I filed #6735 to track this idea |
Sorry for all this back and forth @aykut-bozkurt -- what do you think about the discussion so far? |
That makes sense to me. The proposal lets us do even more useful things (e.g. handle missing columns) with hopefully better performance. |
Which issue does this PR close?
Part of #4908.
Rationale for this change
Supports more flexible struct to struct conversions.
What changes are included in this PR?
Struct to struct cast already works by casting each field by position (
to_fields.zip(from_fields)
). #5221This PR adds
cast by name
method and if cast by name is not possible, it fallbacks tocast by position
(current method).We can first try casting by matching field names if both structs have the same field names and the corresponding fields can be cast. If casting by name is not possible, it can fall back to casting by position.
Are there any user-facing changes?
No breaking api change. But user behavior may change. The struct, that is previously casted via position, could cast by name now.