Adopt Arrow PyCapsule Interface and remove required pyarrow dependency #55
Comments
Hi! This sounds very interesting. I will look into this, anything that makes interoperation smoother is always welcome ;)
Both pyo3-arrow and pyo3-geoarrow will automatically look for the Arrow PyCapsule Interface, so they'll work with any Arrow-exporting library out of the box. Let me know if you need any pointers!
It turns out we actually have a use case for this sooner than I thought. We were trying to use h3ronpy in something deployed to AWS Lambda, but we can't because AWS Lambda has a hard limit of 250MB for the deployment size. Pyarrow takes up 129MB on its own, plus another 41MB for numpy, and that put us at about 275MB. It's slightly more invasive to take out pyarrow as a required dependency because your code has a Python side as well as a Rust side, so some operations use Arrow from Python and not just from Rust. Since you generally just use pyarrow to represent Arrow data, would you be open to depending on arro3 instead of pyarrow? It's ~120MB smaller in package size while still providing zero-copy data transfer to polars, pyarrow, or any other Python Arrow library.
There is a rough draft in #63 - the pyarrow feature of the arrow Rust crate is already gone, but pyarrow is still used in a few places in the Python code. This still needs some work, many things are broken ;) I switched to arro3 and it worked mostly fine. As h3arrow only operates on Arrow Arrays and not on chunked arrays, I implemented a somewhat hacky wrapper called Is there any reason the
I also started hacking on some ideas at the same time 😄 , see #64
The way I've tended to handle this is with
It partially is, transparently. If the input is a numeric, c-contiguous buffer protocol object, it will automatically get converted (zero-copy!) into a non-null Arrow array, as long as the I only wanted to enable zero-copy conversions in the
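To make the "numeric, c-contiguous buffer protocol object" condition concrete, here is a minimal pure-Python sketch of that kind of check. It is not pyo3-arrow's actual logic (that lives on the Rust side); the function name `is_zero_copy_candidate` and the set of accepted format codes are assumptions for illustration only.

```python
from array import array

# Native struct format codes for fixed-width integers and floats.
# Assumption: this set is illustrative, not the list pyo3-arrow accepts.
NUMERIC_FORMATS = set("bBhHiIlLqQfd")


def is_zero_copy_candidate(obj) -> bool:
    """Return True if obj exposes a numeric, C-contiguous buffer."""
    try:
        view = memoryview(obj)
    except TypeError:
        return False  # obj does not implement the buffer protocol
    with view:
        # Strided views (e.g. every second element) are not C-contiguous
        # and could not be wrapped without copying.
        return view.c_contiguous and view.format in NUMERIC_FORMATS


floats = array("d", [1.0, 2.0, 3.0])
strided = memoryview(b"abcdef")[::2]
```

Here `is_zero_copy_candidate(floats)` is true, while a plain list or the strided view is rejected.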
Nice, interesting approach, this seems to make things a bit easier as it pushes the array validation further down the pyo3 API. I will give this a try and build on top of your PR. I somehow missed So the buffer protocol can only be used when requiring at least I would like to avoid most Python code. Dealing with the different dataframe APIs is quite annoying and time consuming. An exception to this would be the integration into Polars expressions and maybe a bit of the geopandas stuff.
💯 I feel that the best way to make clean pyo3 libraries is to push as much as you can into the I'll try to clean it up a bit more the next couple hours
It makes it clean!
Yes, or you can turn off
Hey @nmandery @kylebarron, I've been impressed with h3ronpy's performance. I felt I should chime in that this would be highly beneficial for the deployments of the space2stats project because it would allow us to stick to our current deployment pipelines with Lambda. Is there anything I can support you with? Do we have any idea of the timeline for these changes? It would be helpful from a planning perspective.
So far there is no defined timeline for this. I would like to get it implemented rather quickly, but currently there is not too much spare time for this. @kylebarron Do you have any other plans you want to implement in #64? Otherwise I can take over there. After that, help from @zacharyDez would also be very welcome ;)
I'd like to get back to #64 in the next few days
Fixed in #64
👋
About a year ago, the Arrow project created a new spec, the Arrow PyCapsule Interface, which standardizes the way Python libraries exchange Arrow data. This means that a whole ecosystem of Python Arrow libraries can interoperate without any knowledge of each other and without needing to use pyarrow as an intermediary.
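As a rough illustration of why no intermediary is needed: the interface is a set of dunder methods (`__arrow_c_schema__`, `__arrow_c_array__`, `__arrow_c_stream__`) that a producer defines and a consumer looks for by duck typing. The sketch below only detects which method an object offers; `arrow_export_kind` and `FakeArrowArray` are hypothetical names for illustration, and a real producer would return PyCapsule objects rather than raise.

```python
from typing import Any


def arrow_export_kind(obj: Any) -> str:
    """Classify obj by the Arrow PyCapsule export method it offers."""
    # Checking "array" before "stream" is an arbitrary choice for this
    # sketch; a real consumer picks whichever form it actually needs.
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    if hasattr(obj, "__arrow_c_schema__"):
        return "schema"
    return "none"


class FakeArrowArray:
    """Hypothetical producer; a real one returns (schema, array) capsules."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


kind = arrow_export_kind(FakeArrowArray())
```

The consumer never imports the producer's library; it only asks whether the method exists.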
I've been working to promote its usage throughout the ecosystem: apache/arrow#39195 (comment).
I've also been working to create helper libraries for Rust-Python to make it easier to use this interface. I created pyo3-arrow to give one-line interoperability for Arrow data between Python and Rust. As of a few days ago, pyo3-arrow also accepts buffer protocol objects (kylebarron/arro3#204) so that you can access a Rust arrow Array that is backed by e.g. numpy memory.
Right now, pyarrow is a required dependency and is used for all Python Arrow interop, but we can:
polars.Series) with only a single code path
Any thoughts?