
ENH: Add support for writing more ExtensionArray/PyArrow types #257

Open
wants to merge 1 commit into main
Conversation

@himikof commented Jul 8, 2023

This PR improves the ExtensionArray writing support from #232 to handle other array types (including PyArrow ones).
Before this change, such fields were often silently and unexpectedly cast to object dtype and then to OFTString.

Instead of casting the data array to a basic numpy dtype and then using the data array dtype to infer the OGR type in the Cython code, the high-level code now infers the equivalent numpy dtype and passes it into Cython, leaving the ExtensionArray data as-is. As a side benefit, field data arrays are now never copied during writing.

Adding DTYPE_OGR_FIELD_TYPES["string"] lets the infer_field_types function handle the unambiguous string-data case, where no additional type inference is needed.
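To illustrate what the direct lookup buys over object-dtype inference (the mapping contents and the infer_field_type helper below are hypothetical stand-ins, not pyogrio's actual tables):

```python
# Hypothetical sketch: a direct "string" entry avoids per-value inference.
# The dict contents and helper name are illustrative, not pyogrio's real code.
DTYPE_OGR_FIELD_TYPES = {
    "int64": "OFTInteger64",
    "float64": "OFTReal",
    "string": "OFTString",  # new entry: unambiguous string data
    "object": None,         # ambiguous: would require inspecting values
}

def infer_field_type(dtype_name):
    ogr_type = DTYPE_OGR_FIELD_TYPES.get(dtype_name)
    if ogr_type is None:
        # object dtype: each value would need to be inspected to pick a type
        raise NotImplementedError("per-value inference needed")
    return ogr_type

print(infer_field_type("string"))  # → OFTString, no extra inference
```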

@jorisvandenbossche (Member) left a comment

Thanks for the contribution!

I think it might be worth checking the performance of passing down the ArrowExtensionArray as is to the cython code, versus passing down the equivalent numpy array (at least for simple dtypes like int and float). Given that we repeatedly index into that array while iterating over the records, it might be that indexing into a numpy array is faster and it could be worth it to actually do the conversion.

One difficulty is actually getting that numpy array. Pandas has a pyarrow_array_to_numpy_and_mask utility function. Maybe that could be used through __from_arrow__, so one option could be something like:

In [33]: s = pd.Series([1, 2, None], dtype="int64[pyarrow]")

In [34]: pa_arr = s.array._pa_array

In [35]: dtype_mapping = {
    ...:         pa.int8(): pd.Int8Dtype(),
    ...:         pa.int16(): pd.Int16Dtype(),
    ...:         pa.int32(): pd.Int32Dtype(),
    ...:         pa.int64(): pd.Int64Dtype(),
    ...:         pa.uint8(): pd.UInt8Dtype(),
    ...:         pa.uint16(): pd.UInt16Dtype(),
    ...:         pa.uint32(): pd.UInt32Dtype(),
    ...:         pa.uint64(): pd.UInt64Dtype(),
    ...:         pa.bool_(): pd.BooleanDtype(),
    ...:         pa.string(): pd.StringDtype(),
    ...:         pa.float32(): pd.Float32Dtype(),
    ...:         pa.float64(): pd.Float64Dtype(),
    ...:     }

In [36]: masked_arr = dtype_mapping[pa_arr.type].__from_arrow__(pa_arr)

In [37]: masked_arr
Out[37]: 
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

In [38]: masked_arr._data
Out[38]: array([1, 2, 0])

In [39]: masked_arr._mask
Out[39]: array([False, False,  True])

(I am not sure there is an official public method to get the pyarrow Array from the ArrowExtensionArray)

Also for the StringArray (which is currently already working), I expect a single conversion to numpy object array (i.e. converting to Python strings in one go) will be more efficient than iteratively getting a Python string one at a time.
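A minimal sketch of that one-go conversion, assuming `to_numpy` with `na_value` (available on pandas extension arrays since 1.0):

```python
# Convert a pandas StringArray to an object ndarray in a single pass,
# rather than pulling out one Python string per record.
import pandas as pd

s = pd.array(["a", "b", None], dtype="string")

# One-shot conversion; missing values become None.
obj = s.to_numpy(dtype=object, na_value=None)

# The per-record loop can then index a plain object ndarray,
# instead of calling StringArray.__getitem__ each time.
```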

@jorisvandenbossche (Member) commented:

(I am not sure there is an official public method to get the pyarrow Array from the ArrowExtensionArray)

Probably just pa.array(s.array); that should give you the underlying pyarrow Array zero-copy.

@himikof (Author) commented Jul 11, 2023

Thank you for telling me about pyarrow_array_to_numpy_and_mask and __from_arrow__, I totally missed them!
I must admit I did not think much about the Python code overhead while trying to ensure correct behavior.

Now that I have thought more about how to implement this efficiently, I wonder whether making the Cython code directly support a pyarrow.Array/pyarrow.ChunkedArray as field_data would be better. It should be just as efficient as an ndarray for simple integer/float types (if done carefully), and could be made much more efficient for strings (no need to allocate and encode a str; the underlying UTF-8 data can be accessed directly). It would also allow implementing OFTFooList/OFTBinary writing support in a straightforward way.

So I think the following top-level logic would be needed:

  1. If the array is one of the Pandas masked arrays, use
    field_data=col._data, field_dtype=col.dtype.numpy_dtype, field_mask=col.mask
  2. If the array is an ArrowExtensionArray, use
    field_data=col.__arrow_array__(), field_dtype=col.dtype, field_mask=None.
    The masking will be handled internally in this case.
  3. If the array is some other unknown kind of ExtensionArray, use a slower fallback logic of passing it directly as field_data (accepting the overhead as unavoidable).
  4. If the array is a regular ndarray, use field_data=col, field_dtype=col.dtype, field_mask=None as before.
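The four cases above could be sketched in plain Python as follows. The `prepare_field` name and the return convention are illustrative assumptions, not pyogrio's actual API, and the BaseMaskedArray import is a private pandas path used here only for illustration:

```python
# Sketch of the proposed (field_data, field_dtype, field_mask) dispatch.
# Names are illustrative; pyogrio's real code may differ.
import numpy as np
import pandas as pd
from pandas.core.arrays import BaseMaskedArray  # private pandas import

def prepare_field(arr):
    # 1. pandas masked arrays (IntegerArray, FloatingArray, BooleanArray):
    #    pass the raw data, its numpy dtype, and the NA mask separately.
    if isinstance(arr, BaseMaskedArray):
        return arr._data, arr.dtype.numpy_dtype, arr._mask
    # 2. pyarrow-backed extension arrays (pandas >= 2.0): hand the Arrow
    #    data to Cython; masking is handled internally there.
    if isinstance(arr, pd.arrays.ArrowExtensionArray):
        return arr.__arrow_array__(), arr.dtype, None
    # 3. any other ExtensionArray: slow fallback, pass it through as-is.
    if isinstance(arr, pd.api.extensions.ExtensionArray):
        return arr, arr.dtype, None
    # 4. plain ndarray: unchanged behavior.
    return arr, arr.dtype, None

data, dtype, mask = prepare_field(pd.array([1, 2, None], dtype="Int64"))
d2, dt2, m2 = prepare_field(np.array([1.0, 2.0]))
```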

The Cython code can do one of 3 things to access the values:

  1. Optionally use the Cython typed memoryview (with fused type generics) to access a value of a primitive type from ndarray/pyarrow.Array. This would be even faster than the current __getitem__ call.
  2. Use ndarray.__getitem__ as before (or ExtensionArray.__getitem__ in the slower fallback case).
  3. Use pyarrow.Array.__getitem__/pyarrow.ChunkedArray.__getitem__ and process the returned Scalar. This will make it easy to support types like OFTIntegerList and OFTBinary, and to implement a one-copy UTF-8-optimized OFTString path.

If this sounds reasonable I can try to implement this alternative instead.
