-
Notifications
You must be signed in to change notification settings - Fork 817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: export record batch through stream #4806
Conversation
As a temporary solution until apache/arrow-rs#4806 is released
let class = module.getattr("RecordBatch")?; | ||
let args = (py_arrays,); | ||
let kwargs = PyDict::new(py); | ||
kwargs.set_item("schema", py_schema)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't this work? The schema is provided with the arrays? Is there some limitation of pyarrow's from_arrays method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, PyArrow considers there to be a mismatch if passed a schema with an extension type but the arrays passed are all storage arrays.
I created an issue in PyArrow's tracker to fix this:
apache/arrow#37669
In theory, this code should be fine, so we can consider this a workaround for a bug in PyArrow 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could link to the upstream bug in the code? So that we can potentially avoid this at some point in the future
storage = pa.FixedSizeListArray.from_arrays(inner, 6) | ||
f32_array = pa.ExtensionArray.from_storage(tensor_type, storage) | ||
|
||
# Round-tripping as an array gives back storage type, because arrow-rs has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I must be missing something fundamental here, an array can only have the storage type??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean because the extension metadata is in the field metadata, and thus separate from the Array? Arrays typically have a DataType associated with it, so this depends on whether your Arrow implementation has an Extension
variant of DataType
or not. Arrow C++ and arrow2 do, arrow-rs doesn't.
It actually looks like Arrow C++ exports DataType
with the extension metadata:
https://github.com/apache/arrow/blob/b7581fee01ed0d111d5a0361c2f05779aa3c33e8/cpp/src/arrow/c/bridge.cc#L189
https://github.com/apache/arrow/blob/b7581fee01ed0d111d5a0361c2f05779aa3c33e8/cpp/src/arrow/c/bridge.cc#L243-L252
So if arrow-rs had an Extension
variant of data type, we could do the same and thus arrays themselves could be exported as extension arrays. Of course, that opens whole different can or worms, as discussed in #4472
tensor_type = pa.fixed_shape_tensor(pa.float32(), [2, 3]) | ||
inner = pa.array([float(x) for x in range(1, 7)] + [None] * 12, pa.float32()) | ||
storage = pa.FixedSizeListArray.from_arrays(inner, 6) | ||
f32_array = pa.ExtensionArray.from_storage(tensor_type, storage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as an observation, it is rather strange to me that extension arrays would be a first-class abstraction and in so doing obfuscate the underlying storage type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is rather strange to me that extension arrays would be a first-class abstraction and in so doing obfuscate the underlying storage type
I think for systems where RecordBatch is a type exposed to end-users at the interface, the obfuscation is a feature, not a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is unfortunate that arrow doesn't have a clear separation between logical and physical types. IMO DataType is a physical type, whereas extension types are definitely in the category of logical types, it's a bit of a mess 😅
Which issue does this PR close?
Closes #4805.
Rationale for this change
To export extension arrays properly, we need to expose the arrays together with the schema, not separately.
What changes are included in this PR?
RecordBatch is now exported to PyArrow using an array stream rather than exporting each array individually.
Are there any user-facing changes?