Adopt Arrow PyCapsule Interface and remove required pyarrow dependency #55

Closed
kylebarron opened this issue Oct 2, 2024 · 11 comments

@kylebarron
Contributor

👋

About a year ago, the Arrow project created a new spec, the Arrow PyCapsule Interface, which standardizes the way Python libraries exchange Arrow data. This means that a whole ecosystem of Python Arrow libraries can interoperate without any knowledge of each other and without needing to use pyarrow as an intermediary.

I've been working to promote its usage throughout the ecosystem: apache/arrow#39195 (comment).

I've also been working to create helper libraries for Rust-Python to make it easier to use this interface. I created pyo3-arrow to give one-line interoperability for Arrow data between Python and Rust. As of a few days ago, pyo3-arrow also accepts buffer protocol objects (kylebarron/arro3#204) so that you can access a Rust arrow Array that is backed by e.g. numpy memory.

Right now, pyarrow is a required dependency and is used for all Python Arrow interop, but we can:

  1. allow other libraries' data input (e.g. polars.Series) with only a single code path
  2. potentially remove pyarrow as a dependency altogether (it's a massive dependency that bloats environments)
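To make the "single code path" idea concrete, here is a minimal Python sketch of how a consumer could detect Arrow-exporting objects by duck typing. The dunder method names come from the Arrow PyCapsule Interface spec; `arrow_export_kind`, `FakeArray`, and `FakeChunked` are hypothetical names used only for illustration and are not part of any library.

```python
def arrow_export_kind(obj):
    """Return which Arrow PyCapsule export an object advertises, if any."""
    # The dunder names below are defined by the Arrow PyCapsule Interface.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    if hasattr(obj, "__arrow_c_schema__"):
        return "schema"
    return None


class FakeArray:
    """Hypothetical stand-in for a contiguous Arrow array producer."""

    def __arrow_c_array__(self, requested_schema=None):
        # A real producer returns a (schema capsule, array capsule) tuple.
        raise NotImplementedError("illustration only")


class FakeChunked:
    """Hypothetical stand-in for a chunked or streaming Arrow producer."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer returns a single stream capsule.
        raise NotImplementedError("illustration only")


print(arrow_export_kind(FakeArray()))
print(arrow_export_kind(FakeChunked()))
print(arrow_export_kind([1, 2, 3]))
```

The consumer never imports the producer's library; it only checks for the protocol methods, which is what lets one code path serve many input types.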

Any thoughts?

@nmandery
Owner

nmandery commented Oct 4, 2024

Hi!

This sounds very interesting. I will look into this; anything that makes interoperation smoother is always welcome ;)

@kylebarron
Contributor Author

Both pyo3-arrow and pyo3-geoarrow will automatically look for the Arrow PyCapsule Interface, so they'll work with any Arrow-exporting library out of the box. Let me know if you need any pointers!

@kylebarron
Contributor Author

kylebarron commented Oct 8, 2024

It turns out we actually have a use case for this sooner than I thought. We were trying to use h3ronpy in something deployed to AWS Lambda, but we can't because AWS Lambda has a hard limit of 250MB for the deployment size. pyarrow takes up 129MB on its own, plus another 41MB for numpy, which put us at about 275MB.

Taking pyarrow out as a required dependency is slightly more invasive here because your code has a Python side as well as a Rust side, so some operations use Arrow from Python and not just from Rust.

Since you generally just use pyarrow to represent Arrow data, would you be open to depending on arro3 instead of pyarrow? It's ~120MB smaller in package size while still providing zero-copy data transfer to polars, pyarrow, or any other Python Arrow library.
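A quick back-of-the-envelope check of those numbers, as a sketch. All figures are the approximate sizes quoted in this comment, and the estimate assumes the ~120MB saving applies directly to the ~275MB total; real package sizes vary by version and platform.

```python
LAMBDA_LIMIT_MB = 250        # deployment size limit quoted above
TOTAL_WITH_PYARROW_MB = 275  # approximate total quoted above
PYARROW_MB = 129
NUMPY_MB = 41
ARRO3_SAVING_MB = 120        # approximate saving from swapping pyarrow for arro3

# pyarrow and numpy together account for most of the package.
arrow_stack_mb = PYARROW_MB + NUMPY_MB                         # 170
over_limit_mb = TOTAL_WITH_PYARROW_MB - LAMBDA_LIMIT_MB        # 25

# Assumption: everything else in the deployment stays the same size.
estimated_total_mb = TOTAL_WITH_PYARROW_MB - ARRO3_SAVING_MB   # 155
headroom_mb = LAMBDA_LIMIT_MB - estimated_total_mb             # 95

print(f"over the limit today by ~{over_limit_mb}MB")
print(f"estimated total with arro3: ~{estimated_total_mb}MB")
print(f"headroom under the limit: ~{headroom_mb}MB")
```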

@nmandery
Owner

nmandery commented Oct 8, 2024

There is a rough draft in #63. The pyarrow feature of the arrow Rust crate is already gone, but pyarrow is still used in a few places in the Python code. This still needs some work; many things are broken ;)

I switched to arro3 and it worked mostly fine. As h3arrow only operates on Arrow Arrays and not on chunked arrays, I implemented a somewhat hacky wrapper called PyConcatedArray which uses arrow compute concat to build a single array. Certainly not optimal, but it works for now.
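For readers unfamiliar with the pattern, the concatenation idea can be sketched in plain Python. This is only an analogy: stdlib `array.array` stands in for Arrow arrays, `concat_chunks` is a hypothetical helper, and the actual wrapper lives in Rust and uses Arrow's concat kernel.

```python
from array import array


def concat_chunks(chunks):
    """Copy every chunk into one contiguous array (this is not zero-copy)."""
    chunks = list(chunks)
    if not chunks:
        raise ValueError("need at least one chunk to infer the element type")
    out = array(chunks[0].typecode)
    for chunk in chunks:
        if chunk.typecode != out.typecode:
            raise TypeError("all chunks must share one element type")
        out.extend(chunk)  # copies the chunk's values into the output
    return out


# "Q" is an unsigned 64-bit integer; the values are arbitrary placeholders.
chunked = [array("Q", [1, 2]), array("Q", [3]), array("Q", [4, 5])]
single = concat_chunks(chunked)
print(list(single))
```

The cost of this approach is one full copy of the input, which is why it works but is not optimal for large chunked inputs.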

Is there any reason the from_numpy function of pyo3-arrow is not exposed in the Rust API? It would be helpful for accepting numpy input without explicit conversion by the user.

@kylebarron
Contributor Author

kylebarron commented Oct 8, 2024

I also started hacking on some ideas at the same time 😄, see #64

> As h3arrow only operates on Arrow Arrays and not on chunked arrays, I implemented a somewhat hacky wrapper called PyConcatedArray which uses arrow compute concat to build a single array.

The way I've tended to handle this is with AnyArray, which allows either array or stream input. And then the compute function can operate on one or the other: https://github.com/kylebarron/arro3/blob/0829e34fe250314c2e068ff86e3c5e7ad003d607/arro3-compute/src/boolean.rs#L11-L32 (the stream variant is slightly more complex because I'm doing stream processing there instead of materializing the stream in memory as a chunked array).
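As a rough illustration of that shape (not the linked Rust code), here is a pure-Python sketch of a function that accepts either a single array or a stream of chunks and mirrors the input kind in its output. `apply_elementwise` and `_apply_to_stream` are hypothetical names, and stdlib `array.array` stands in for an Arrow array.

```python
from array import array


def apply_elementwise(data, func):
    """Accept one array or an iterable of chunks; mirror the input kind."""
    if isinstance(data, array):
        # Array input: compute eagerly and return one result.
        return [func(value) for value in data]
    # Stream input: return a lazy iterator, one result per chunk.
    return _apply_to_stream(data, func)


def _apply_to_stream(chunks, func):
    # Each chunk is processed only when the caller asks for it, so the
    # whole stream is never materialized in memory at once.
    for chunk in chunks:
        yield [func(value) for value in chunk]


def is_positive(value):
    return value > 0


print(apply_elementwise(array("q", [1, -2, 3]), is_positive))

stream = iter([array("q", [1, -2]), array("q", [3])])
for result in apply_elementwise(stream, is_positive):
    print(result)
```

Compared with concatenating first, the stream branch avoids the up-front copy and keeps only one chunk in flight at a time.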

> Is there any reason the from_numpy function of pyo3-arrow is not exposed in the Rust API? It would be helpful for accepting numpy input without explicit conversion by the user.

It partially is, transparently. If the input is a numeric, C-contiguous buffer protocol object, it is automatically converted (zero-copy!) into a non-null Arrow array, as long as the buffer_protocol feature is enabled. That feature is on by default, but it rules out abi3 builds targeting Python versions before 3.11: https://github.com/kylebarron/arro3/blob/0829e34fe250314c2e068ff86e3c5e7ad003d607/pyo3-arrow/src/ffi/from_python/array.rs#L15-L18
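The two preconditions named here (numeric and C-contiguous) can be checked from Python with the stdlib `memoryview`. This is a sketch under assumptions: `looks_zero_copy_convertible` and the set of accepted format codes are hypothetical and may not match the exact checks pyo3-arrow performs.

```python
from array import array

# Assumption: struct-style format codes for integer and float element types.
_NUMERIC_FORMATS = set("bBhHiIlLqQnNefd")


def looks_zero_copy_convertible(obj):
    """Return True if obj exposes a numeric, C-contiguous buffer."""
    try:
        view = memoryview(obj)
    except TypeError:
        return False  # obj does not implement the buffer protocol
    with view:
        # Drop an optional byte-order prefix such as "<" or "=".
        fmt = view.format.lstrip("@=<>!")
        return view.c_contiguous and fmt in _NUMERIC_FORMATS


values = array("d", [1.0, 2.0, 3.0])
print(looks_zero_copy_convertible(values))
print(looks_zero_copy_convertible([1.0, 2.0, 3.0]))
print(looks_zero_copy_convertible(memoryview(b"abcdef")[::2]))

# Zero-copy means the view shares memory with its source:
view = memoryview(values)
values[0] = 42.0
print(view[0])
```

A plain list fails because it has no buffer, and the strided view fails because its elements are not adjacent in memory.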

I only wanted to enable zero-copy conversions in the FromPyObject impl. In theory we could open up something like from_numpy, but for now that's considered the private API of arro3, even though the code actually lives inside pyo3-arrow.

@nmandery
Owner

nmandery commented Oct 8, 2024

Nice, interesting approach. This seems to make things a bit easier, as it pushes the array validation further down the pyo3 API. I will give this a try and build on top of your PR.

I somehow missed AnyArray when looking through the implementation, otherwise I would have used that. +1

So the buffer protocol can only be used when requiring at least abi3-py311.

I would like to avoid most Python code. Dealing with the different dataframe APIs is quite annoying and time-consuming. An exception to this would be the integration into Polars expressions and maybe a bit of the geopandas stuff.

@kylebarron
Contributor Author

> Nice, interesting approach. This seems to make things a bit easier, as it pushes the array validation further down the pyo3 API. I will give this a try and build on top of your PR.

💯

I feel that the best way to make clean pyo3 libraries is to push as much as you can into the FromPyObject impl.

I'll try to clean it up a bit more over the next couple of hours.

> I somehow missed AnyArray when looking through the implementation, otherwise I would have used that. +1

It makes it clean!

> So the buffer protocol can only be used when requiring at least abi3-py311.

Yes, or you can turn off abi3 and build per-version Python wheels, which is what I do. maturin still makes that easy; it just takes slightly longer to build the wheels.

@zacdezgeo

Hey @nmandery @kylebarron, I've been impressed with h3ronpy's performance. I felt I should chime in that this would be highly beneficial for the deployments of the space2stats project because it would allow us to stick with our current deployment pipelines on Lambda. Is there anything I can support you with? Do we have any idea of the timeline for these changes? It would be helpful from a planning perspective.

@nmandery
Owner

So far there is no defined timeline for this. I would like to get it implemented rather quickly, but currently there is not too much spare time for this.

@kylebarron Do you have anything else you want to implement in #64? Otherwise I can take over there. After that, help from @zacharyDez would also be very welcome ;)

@kylebarron
Contributor Author

I'd like to get back to #64 in the next few days

@nmandery
Owner

Fixed in #64


3 participants