[Python] Arrow PyCapsule Protocol: standard way to get the schema of a "data" (array or stream) object? #39689
Comments
I think it's reasonable to allow for objects to expose a schema, so long as it's clear what the expectations are (whether this is expected to be a simple accessor, or if it may block, perform I/O, raise exceptions, etc.).
It seems like we need to differentiate between "object is a schema" and "object has a schema". One way to do that would be to create an alternative variant of …
I'm not sure how this would work. If you're a pyarrow consumer and want to create a …
I admit I was scratching my head for how the …
That's a good point. For that use case, we indeed want …
Maybe pandas should create a new class for its dtypes, though, that does expose this? Do you know what other libraries have this same limitation? On the con about accepting a stream/array where a schema is expected, I would like to add that if in the future we add dunders like …
I just noticed that … One of the problems is that there are two reasons you might want to call `__arrow_c_schema__()`.

You might want to use the second version if you are a consumer that doesn't understand one of the new types that were just added to the spec and doesn't have the ability to cast. For example:

```python
import nanoarrow

def split_lines(array):
    # Inspect the producer's schema before asking for any data
    schema_src = array.__arrow_c_schema__()
    if nanoarrow.c_schema_view(schema_src).type == "string_view":
        schema_src, array_src = array.__arrow_c_array__(requested_schema=nanoarrow.string())
    else:
        schema_src, array_src = array.__arrow_c_array__()
        if nanoarrow.c_schema_view(schema_src).type != "string":
            raise TypeError("array must be string or string_view")
```

In that case, you really do need the ability to get the data type from the producer in the event you have to request something else. This type of negotiation is (in my view) far superior to maintaining a spec for keyword arguments to `__arrow_c_array__()`.

You might want to use the first one if you have a function like:

```python
def cast(array, schema):
    schema_dst = schema.__arrow_c_schema__()
    schema_src, array_src = array.__arrow_c_array__()
    # ...do some casting stuff, maybe in C
```

Here, it would be very strange if you could pass a data object as `schema`:

```python
def cast(array, schema):
    if hasattr(schema, "__arrow_c_array__") or hasattr(schema, "__arrow_c_stream__"):
        raise TypeError("Can't pass array-like object as schema")
    schema_dst = schema.__arrow_c_schema__()
    schema_src, array_src = array.__arrow_c_array__()
    # ...do some casting stuff, maybe in C
```

I will probably bake this in to …
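For what it's worth, the guard above composes naturally with `pa.schema(..)`, which accepts objects implementing `__arrow_c_schema__` (as discussed elsewhere in this issue). A minimal sketch: the `schema_only` helper is hypothetical, and a recent pyarrow (>= 15) is assumed:

```python
import pyarrow as pa

def schema_only(obj):
    # Reject objects that are also "data objects" per the discussion above.
    if hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__"):
        raise TypeError("Can't pass array-like object as schema")
    return pa.schema(obj)  # consumes obj.__arrow_c_schema__()

schema_only(pa.schema({"x": pa.string()}))  # ok: a plain schema object
# schema_only(pa.table({"x": ["a"]}))       # would raise: Table exposes __arrow_c_stream__
```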
It looks like there's consensus here? If I'm understanding it right: defining a "data object" as one that has either an `__arrow_c_array__` or `__arrow_c_stream__` method, …
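If that consensus holds, i.e. a data object should also expose `__arrow_c_schema__` as a cheap accessor, a producer could look roughly like the sketch below. `MyColumn` is a hypothetical class that wraps a pyarrow array purely for illustration (assuming pyarrow >= 14, which implements all of these dunders):

```python
import pyarrow as pa

class MyColumn:
    def __init__(self, arr: pa.Array):
        self._arr = arr

    def __arrow_c_schema__(self):
        # Cheap accessor: expose the type without materializing any data.
        return self._arr.type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        # Delegate the data export (and any requested cast) to pyarrow.
        return self._arr.__arrow_c_array__(requested_schema=requested_schema)
```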
I'm implementing the PyCapsule Interface for polars, and this was brought up because polars uses newer view types internally and was unsure what to export: pola-rs/polars#17676 (comment)
One other question about schema negotiation: it seems most helpful when Arrow adds new data types to the spec. I.e. some libraries might be on older Arrow format versions and not yet support string view and binary view types. In that case, the consumer might want to ask the producer to cast to standard string and binary types. But this does rely on the consumer being able to interpret the producer's schema, right? A library that supports only an older Arrow format version, and thus doesn't support view types, might error just in reading the schema produced by a newer version? So this on its own doesn't solve cross-version Arrow format issues, right?
I think we had assumed that producers would always produce the most compatible output possible by default unless requested otherwise, although it is probably more natural for a producer to want to produce the output that involves the least amount of copying (which would lead to a situation like the one you described). We still might need a request flag (like PyBUF_SIMPLE), since it is reasonable that a producer would only want to export the exact layout they have (as opposed to doing extra work to make it potentially easier to consume).
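To make that trade-off concrete, here is a sketch of a producer that defaults to the most compatible layout and only exports its native view layout when asked. Everything here is hypothetical (the `ViewColumn` class and its defaults are illustrative, not part of any spec); nanoarrow is used for schema inspection as in the earlier comment, and pyarrow >= 16 is assumed for `string_view` support:

```python
import nanoarrow
import pyarrow as pa

class ViewColumn:
    def __init__(self, arr: pa.Array):
        self._arr = arr  # native layout: pa.string_view()

    def __arrow_c_schema__(self):
        return self._arr.type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        if requested_schema is not None:
            requested = nanoarrow.c_schema_view(requested_schema).type
        else:
            requested = "string"  # most-compatible default, costs a copy
        if requested == "string_view":
            return self._arr.__arrow_c_array__()  # zero-copy native export
        # Best effort for anything else: cast to plain string.
        return self._arr.cast(pa.string()).__arrow_c_array__()
```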
Follow-up discussion on the Arrow PyCapsule Protocol semantics added in #37797 (and overview issue promoting it: #39195). Current docs: https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html
This topic came up on the PR itself as well. I brought it up in #37797 (review), and then we mostly discussed this (with eventually removing `__arrow_c_schema__` from the array) in the thread at #37797 (comment).

Rephrasing my question from the PR discussion:
So in the merged implementation of the protocol in pyarrow itself, we cleanly separated this: the Array/ChunkedArray/RecordBatch/Table classes have `__arrow_c_array/stream__`, and the DataType/Field/Schema classes have `__arrow_c_schema__`.
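For reference, that separation is easy to verify interactively; a small check, assuming pyarrow >= 14:

```python
import pyarrow as pa

arr = pa.array([1, 2, 3])
print(hasattr(arr, "__arrow_c_array__"))          # True: data object
print(hasattr(arr.type, "__arrow_c_schema__"))    # True: schema object

tbl = pa.table({"x": arr})
print(hasattr(tbl, "__arrow_c_stream__"))         # True: data object
print(hasattr(tbl.schema, "__arrow_c_schema__"))  # True: schema object
```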
But not all libraries have a clear concept of a "schema", or at least not as an accessible/dedicated Python object.
For example, for two cases for which I have an open PR to add the protocol: a pandas.DataFrame does have a `.dtypes` attribute, but that's not a custom object that can expose the schema protocol (it's just a plain Series with data types as the values) (pandas-dev/pandas#56587); and the interchange protocol DataFrame object only exposes column names, and you need to access a column itself to get the dtype, which then is a plain Python tuple (so again not something to which the dunder could be added, and it is also not at the dataframe level) (data-apis/dataframe-api#342).

Personally, I think it would be useful to have the ability to inspect the schema of a "data" object before asking for the actual data. For pyarrow objects you could check the `.type` or `.schema` attributes and then get `__arrow_c_schema__`, but that again puts something library-specific in the middle, which we want to avoid.

Summarizing the different arguments from our earlier thread about having `__arrow_c_schema__` on an array/stream object:

Pro:
- If you want to use `requested_schema`, you first need to know the schema you would get, before you can create your desired schema to pass to `__arrow_c_array/stream__`.
Con:
- `pa.schema(..)` would work and return a schema (although sidenote from myself: if we want, we can still disallow this, and only accept objects that only have `__arrow_c_schema__` in `pa.schema(..)`).

I think it would be nice if we can have some guidance for projects about what the best practice is.
(Right now I was planning to add `__arrow_c_schema__` in the above-mentioned PRs, because those projects don't have a "schema" object, but ideally I can follow a recommendation, so that consumer libraries can base their usage on such an expectation of a schema being available or not.)

cc @wjones127 @pitrou @lidavidm
and also cc @kylebarron and @WillAyd as I know you both have been experimenting with the capsule protocol and might have some user experience with it