clib.conversion._to_numpy: Add tests for pandas.Series with pandas string dtype #3607
@@ -1475,7 +1475,7 @@ def virtualfile_from_vectors(
# 2 columns contains coordinates like longitude, latitude, or datetime string
# types.
for col, array in enumerate(arrays[2:]):
if pd.api.types.is_string_dtype(array.dtype):
if np.issubdtype(array.dtype, np.str_):
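For illustration, here is a minimal sketch of what the `np.issubdtype(..., np.str_)` check distinguishes. The sample arrays and the `is_str` dictionary are made up for this example and are not part of the PR.

```python
import numpy as np

# Hypothetical sample arrays, for illustration only.
unicode_array = np.array(["abc", "defg", "12345"])  # fixed-width unicode dtype
object_array = np.array(["abc", "defg", "12345"], dtype=object)
float_array = np.array([1.0, 2.0, 3.0])

# Only the fixed-width unicode array counts as a np.str_ subtype;
# an object array holding Python strings does not.
is_str = {
    "unicode": bool(np.issubdtype(unicode_array.dtype, np.str_)),
    "object": bool(np.issubdtype(object_array.dtype, np.str_)),
    "float": bool(np.issubdtype(float_array.dtype, np.str_)),
}
print(is_str)
```

With this check, an object array only takes the string branch if it has been converted to `np.str_` beforehand.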
@@ -1506,9 +1506,9 @@ def virtualfile_from_vectors(
strings = string_arrays[0]
elif len(string_arrays) > 1:
strings = np.array(
[" ".join(vals) for vals in zip(*string_arrays, strict=True)]
[" ".join(vals) for vals in zip(*string_arrays, strict=True)],
dtype=np.str_,
Specifying dtype is not necessary here, but I feel it's good to state explicitly that we're expecting a np.str_ array here.
@@ -175,6 +179,11 @@ def _to_numpy(data: Any) -> np.ndarray:
else:
vec_dtype = str(getattr(data, "dtype", ""))
array = np.ascontiguousarray(data, dtype=dtypes.get(vec_dtype))

# Check if a np.object_ array can be converted to np.str_.
This is necessary to support a pd.Series of strings such as:
x = pd.Series(["abc", "defg", "12345"], dtype=None)
x = pd.Series(["abc", "defg", "12345"], dtype=np.str_)
x = pd.Series(["abc", "defg", "12345"], dtype="U10")
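A minimal sketch of the conversion idea, using NumPy only. It assumes the series above arrives as an object array of Python strings (as with pandas' object-backed storage); the try/except fallback is an illustration, not necessarily the exact code in this PR.

```python
import numpy as np

# Assumption: an object array of Python strings stands in for the
# result of converting one of the pd.Series above.
array = np.ascontiguousarray(np.array(["abc", "defg", "12345"], dtype=object))

# Try to convert an object array to a fixed-width unicode array and
# keep the original array if the conversion is not possible.
if array.dtype == np.object_:
    try:
        array = np.ascontiguousarray(array, dtype=np.str_)
    except (TypeError, ValueError):
        pass  # leave the object array untouched

print(array.dtype.kind)
```

After the conversion the array passes the `np.issubdtype(array.dtype, np.str_)` check used in `virtualfile_from_vectors`.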
)
strings = np.asanyarray(a=strings, dtype=np.str_)
Description of proposed changes
Add tests for pandas.Series with string dtype. Six cases are tested:
1. dtype=None
2. dtype=np.str_
3. dtype="U10"
4. dtype="string[python]"
5. dtype="string[pyarrow]"
6. dtype="string[pyarrow_numpy]"
None of them can be converted to np.str_ directly. Cases 4-6 can be fixed by 01ba317, and cases 1-3 can be fixed by dac7e8e.