ENH: support the Arrow PyCapsule Interface on pandas.DataFrame (export) #56587
Conversation
Very cool. Is this ready for review or sitting in draft for further development?
Whatever is here is already ready for review. As mentioned in the top post, we should also consider adding the array and schema protocol methods. I should probably also add some details about how the conversion is done (essentially the defaults of the pyarrow conversion).
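Roughly, a sketch of what the export does under the hood (not the exact pandas code, and the function name here is hypothetical; it only illustrates that the conversion goes through pyarrow's default pandas handling):

import pandas as pd
import pyarrow as pa

def dataframe_to_stream_capsule(df: pd.DataFrame, requested_schema=None):
    # Hypothetical standalone version of DataFrame.__arrow_c_stream__:
    # convert with pyarrow's default pandas handling, then let pyarrow
    # produce the "arrow_array_stream" capsule.
    table = pa.Table.from_pandas(df)
    if requested_schema is not None:
        # the dunder receives a schema *capsule*; import it first
        requested_schema = pa.Schema._import_from_c_capsule(requested_schema)
        table = table.cast(requested_schema)  # assumption: cast to the requested schema
    return table.__arrow_c_stream__()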
df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

capsule = df.__arrow_c_stream__()
assert (
Can we use ctypes to test this a little more deeply? I had in mind something like this:
import pyarrow as pa
import ctypes

tbl = pa.Table.from_pydict({"col": [1, 2, 3]})
stream = tbl.__arrow_c_stream__()

class ArrowSchema(ctypes.Structure):
    pass

ArrowSchema._fields_ = [
    ("format", ctypes.POINTER(ctypes.c_char)),
    ("name", ctypes.POINTER(ctypes.c_char)),
    ("metadata", ctypes.POINTER(ctypes.c_char)),
    ("flags", ctypes.c_int64),  # int64_t in the Arrow C data interface
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    # NB there are more members
    # not sure how to define release callback, but probably not important
]

ctypes.pythonapi.PyCapsule_GetName.restype = ctypes.c_char_p
ctypes.pythonapi.PyCapsule_GetName.argtypes = [ctypes.py_object]
nm = ctypes.pythonapi.PyCapsule_GetName(stream)
# assert nm == b"array_schema"  # TODO: this actually returns arrow_array_stream

capsule_name = ctypes.create_string_buffer("arrow_array_stream".encode())
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# TODO: not sure why the below isn't working?
# void_ptr = ctypes.pythonapi.PyCapsule_GetPointer(
#     stream,
#     capsule_name
# )
# obj = ctypes.cast(void_ptr, ctypes.POINTER(ArrowSchema))[0]
# assert obj.n_children == 1
I commented out things that weren't working. I'm a little less sure what is going on in the last section, but at the very least there is a problem with the capsule name: it returns b"arrow_array_stream" yet the documentation says it should be "arrow_schema".
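For reference, the capsule names defined by the Arrow PyCapsule Interface specification (the next reply resolves the confusion: __arrow_c_stream__ returns a stream capsule, not a schema capsule):

# Capsule names per the Arrow PyCapsule Interface specification
EXPECTED_CAPSULE_NAMES = {
    "__arrow_c_schema__": b"arrow_schema",
    "__arrow_c_array__": (b"arrow_schema", b"arrow_array"),  # returns a pair of capsules
    "__arrow_c_stream__": b"arrow_array_stream",
}
# DataFrame.__arrow_c_stream__() returns a stream capsule, so
# b"arrow_array_stream" is in fact the expected name here.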
Ignore what I said before - I mistakenly didn't realize this was returning a stream. This all looks good to me - I think the ctypes would get a little too wonky to deal with. Here's something I stubbed out, but I'm not sure how ctypes would sanely deal with struct members that are function pointers. Probably too much detail for us to get into on our end:
import pyarrow as pa
import ctypes

tbl = pa.Table.from_pydict({"col": [1, 2, 3]})
stream = tbl.__arrow_c_stream__()

class ArrowSchema(ctypes.Structure):
    pass

class ArrowArray(ctypes.Structure):
    pass

class ArrowArrayStream(ctypes.Structure):
    pass

schema_release_func = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrowSchema._fields_ = [
    ("format", ctypes.POINTER(ctypes.c_char)),
    ("name", ctypes.POINTER(ctypes.c_char)),
    ("metadata", ctypes.POINTER(ctypes.c_char)),
    ("flags", ctypes.c_int64),  # int64_t in the Arrow C data interface
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", schema_release_func),
]

array_release_func = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", array_release_func),
]

get_schema_func = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ArrowArrayStream), ctypes.POINTER(ArrowSchema))
get_next_func = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ArrowArrayStream), ctypes.POINTER(ArrowArray))
get_last_error_func = ctypes.CFUNCTYPE(ctypes.c_char_p, ctypes.POINTER(ArrowArrayStream))
stream_release_func = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArrayStream))
ArrowArrayStream._fields_ = [
    ("get_schema", get_schema_func),
    ("get_next", get_next_func),
    ("get_last_error", get_last_error_func),
    ("release", stream_release_func),
]

ctypes.pythonapi.PyCapsule_GetName.restype = ctypes.c_char_p
ctypes.pythonapi.PyCapsule_GetName.argtypes = [ctypes.py_object]
nm = ctypes.pythonapi.PyCapsule_GetName(stream)
assert nm == b"arrow_array_stream"

capsule_name = ctypes.create_string_buffer("arrow_array_stream".encode())
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]
void_ptr = ctypes.pythonapi.PyCapsule_GetPointer(
    stream,
    capsule_name,
)
stream_obj = ctypes.cast(void_ptr, ctypes.POINTER(ArrowArrayStream))[0]
I also think that, because we use pyarrow here, such detailed testing isn't necessary. We can assume that the struct's content is thoroughly tested on the Arrow side, and we mostly need to test that we return the correct capsule (and there is already a test that checks the capsule name with ctypes.pythonapi.PyCapsule_IsValid).
If at some point we implement our own version of the C Data Interface, then it will certainly need a lot more testing.
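A minimal sketch of that kind of capsule-name check (not the actual pandas test; it assumes a pandas build that includes this PR):

import ctypes
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
capsule = df.__arrow_c_stream__()

# PyCapsule_IsValid returns 1 when the capsule carries the given name
ctypes.pythonapi.PyCapsule_IsValid.restype = ctypes.c_int
ctypes.pythonapi.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
assert ctypes.pythonapi.PyCapsule_IsValid(capsule, b"arrow_array_stream") == 1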
anything missing to be able to merge this?
""" | ||
pa = import_optional_dependency("pyarrow", min_version="14.0.0") | ||
if requested_schema is not None: | ||
requested_schema = pa.Schema._import_from_c_capsule(requested_schema) |
Question: will _import_from_c_capsule become public in the future?
Can't you use pa.schema() directly instead of pa.Schema._import_from_c_capsule?
I don't think it will necessarily become public in its current form (but the _import_from_c version has been used in many other external projects, so we won't just change those methods in pyarrow).
Can't you use pa.schema() directly instead of pa.Schema._import_from_c_capsule?
Not directly, because we get a capsule here (we are inside the low-level dunder), and pa.schema() doesn't accept capsules, only objects implementing __arrow_c_schema__. Of course we could have a small wrapper object that has the dunder method and returns the capsule, if we want to avoid using _import_from_c_capsule.
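As a sketch of that idea (a hypothetical helper, not part of this PR): a tiny wrapper exposing __arrow_c_schema__ would let pa.schema() consume the capsule without touching the private method.

import pyarrow as pa

class _SchemaCapsuleWrapper:
    # Hypothetical helper: holds a raw "arrow_schema" capsule and exposes
    # the dunder that pa.schema() knows how to consume.
    def __init__(self, schema_capsule):
        self._schema_capsule = schema_capsule

    def __arrow_c_schema__(self):
        return self._schema_capsule

# inside __arrow_c_stream__, instead of pa.Schema._import_from_c_capsule:
# requested_schema = pa.schema(_SchemaCapsuleWrapper(requested_schema))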
I brought up in the past on the pyarrow side whether we need an "official" way to import capsules (see the last paragraph of apache/arrow#38010), but we should maybe discuss that a bit more (or whether we just "bless" _import_from_c_capsule as the official way to do this).
Merging then, as discussed.
…ace on pandas.DataFrame (export)
…Interface on pandas.DataFrame (export)) (#56944) Backport PR #56587: ENH: support the Arrow PyCapsule Interface on pandas.DataFrame (export) Co-authored-by: Joris Van den Bossche <[email protected]>
…t) (pandas-dev#56587) * ENH: support the Arrow PyCapsule Interface on pandas.DataFrame * expand documentation on how index is handled
See apache/arrow#39195 for some context, and https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html for the new Arrow specification.
For now this PR just implements the stream support on DataFrame (using pyarrow under the hood to do the actual conversion). We should also consider adding the array and schema protocol methods.
We could add similar methods on Series, but that has less of an exact equivalent in Arrow terms (e.g. it would lose the index).
This PR also only implements exporting a pandas DataFrame through the protocol, not adding support to our constructors to consume (import) any object supporting the protocol.
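For illustration, a consumer-side sketch (assumes a pandas build that includes this PR; pyarrow stands in for any Arrow-capable consumer of the protocol):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# Export side: the DataFrame hands out an Arrow C stream capsule.
capsule = df.__arrow_c_stream__()

# Consumer side: any Arrow-capable library can ingest the DataFrame through
# the protocol; here pyarrow is used as the consumer (pa.table() also has
# its own dedicated pandas support, so this line works either way).
tbl = pa.table(df)
print(tbl.schema)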