Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349
ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349
Changes from 6 commits
ab214f4
c48fac2
e3e498c
102c117
947b5d1
83f5c40
7c33fa0
51787e7
437485e
73b9b2b
8cf1dd0
d1282ca
a554294
69a0922
3070274
f648447
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe simplify a bit and remove line 1439?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on
return_pyarrow
vsreturn_capsule
? I'm not sure which is clearerThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't return a capsule exactly, but a small wrapper object with the
__arrow_c_stream__
method that will return the capsule.Side question: would returning the capsule directly be more useful for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(although it would be nice to have keyword name that is "positive" about what you want, instead of indicating of not wanting pyarrow .. I just don't have good ideas for that at the moment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more along the lines of what I was trying to ask.
I think having a wrapper object is preferable than raw capsules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could maybe also consider removing the option entirely, and always returns the capsule wrapper object. This API is still quite new and more an advanced API, so we might be OK with still changing it.
Taking the example from the docstring:
If we would just always return the wrapper object as implemented in the PR right now, that means that people using this right now and expecting a pyarrow RecordBachReader (or an iterable of pyarrow Tables), would need to change their code to:
One problem with this however is that this requires pyarrow >= 15.0, so that's not ideal.
We could also add the basic methods of a RecordBatchreader to the wrapper object (
__enter__
,__iter__
,read_next_batch
,read_all
,schema
, etc, just calling the equivalent method of the underlying pyarrow.RecordBatchReader, which we can instantiate lazily only when one of those methods is used, i.e. which still allows you to convert it with__arrow_c_stream__
without requiring pyarrow)That keeps compatibility for people using the Python APIs of a RecordBatchReader (like in the code snippet above, to iterate through the stream). However, that unfortunately still doesn't work when passing this reader object to some other functionality of pyarrow that can accept a reader but passes that through to C++ code (eg writing a dataset with a RecordBatchReader as input)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also just check if pyarrow is available, if so return a RecordBatchReader (which will also have the
__arrow_c_stream__
method for recent pyarrow), and if not fallback on returning the wrapper object.That keeps backwards compatibility perfectly, while allowing to use it without pyarrow (at the expense of a slightly confusing API / different behaviour just depending on whether another package is installed or not)
Although that might also be an extra indirection in case you don't need the pyarrow RecordBatchReader. Because what will be returned from
__arrow_c_stream__
on the returned reader will be a ArrowArrayStream backed by Arrow wrapping the GDAL stream.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the extra indirection would be a problem.
I tend to think that having multiple possible return types is much more confusion than it's worth for most users. I'm not really a fan of having it depend only on if pyarrow is installed.
If we're going to have one function, having a
return_pyarrow=True
I think is the best option. It's probably too much duplication here, but otherwise I would've considered anopen_arrow_pycapsule_interface()
or similar functionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silently returning a different type depending on installed Python packages is generally error-prone for users. I would recommend avoiding this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points, agreed that it's better to not do any "smart" return value depending on what is installed.
I updated the PR (and added some docs) for the current state of having a
return_pyarrow
keyword with a default of True.I am personally also still fine with changing the default to False (so you have to ask explicitly for a pyarrow object (to avoid the one line to do it yourself), which makes the keyword also about what you want, not what you want to avoid)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keyword mentioned in the changelog is
use_pyarrow
(which I think I prefer) but the implementation isreturn_pyarrow