-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix pylibcudf to_arrow with multiple nested data types #17504
Fix pylibcudf to_arrow with multiple nested data types #17504
Conversation
def _table_to_schema(Table tbl, metadata): | ||
if metadata is None: | ||
metadata = [ColumnMetadata() for _ in range(len(tbl.columns()))] | ||
metadata = [_maybe_create_nested_column_metadata(col) for col in tbl.columns()] | ||
metadata = [ColumnMetadata(m) if isinstance(m, str) else m for m in metadata] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what metadata
is here but I suspect this could use the same treatment? cc @vyasr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is Arrow metadata. Can we change this into if/else block? Like:
if metadata is None:
metadata = [_maybe_create_nested_column_metadata(col) for col in tbl.columns()]
else:
metadata = [_maybe_create_nested_column_metadata(m) if isinstance(m, str) else m for m in metadata]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks. Changed it to use this if/else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this didn't end up working.
From test_quantiles.py
, the current invocation of pylibcudf.interop.to_arrow
with a pylibcudf.Table
involves metadata
being the associated column names (i.e. list of strings) so I suppose the prior usage was OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh I'm sorry, too many notifications. I missed this one.
Yes, the metadata is an instance of https://github.com/vyasr/cudf/blob/chore/update_ci_jobs/cpp/include/cudf/interop.hpp#L104, which we expose in Python here. I'm not sure if the question here is still relevant, but in general the reason that we need this parameter is to handle nested struct names. Almost every other form of metadata passed this way is redundant. I'll be working to get rid of this in the future using pyarrow APIs if possible so that we have parameter-free conversion to/from arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. I think in regards to this change there shouldn't be any modifications needed based on your answer
/merge |
Description
Fixes the following case
Checklist