Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clib.converison._to_numpy: Add tests for pandas.Series with pandas numeric dtypes #3584

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

seisman
Copy link
Member

@seisman seisman commented Nov 5, 2024

This PR adds tests for pandas.Series with pandas/pyarrow numeric dtypes (in 7222db2).

Tests pass with pandas 2.2 but fail with pandas 2.1. Check https://github.com/GenericMappingTools/pygmt/actions/runs/11714111582/job/32628172796?pr=3584 for details.

The failures are due to upstream pandas v2.2 changes. Previously, all pandas nullable dtypes are converted to np.object_ dtype. Since pandas v2.2, the new rules are (source: https://pandas.pydata.org/docs/whatsnew/v2.2.0.html#to-numpy-for-numpy-nullable-and-arrow-types-converts-to-suitable-numpy-dtype):

  • float dtypes are cast to NumPy floats
  • integer dtypes without missing values are cast to NumPy integer dtypes
  • integer dtypes with missing values are cast to NumPy float dtypes and NaN is used as missing value indicator
  • boolean dtypes without missing values are cast to NumPy bool dtype
  • boolean dtypes with missing values keep object dtype
  • datetime and timedelta types are cast to Numpy datetime64 and timedelta64 types respectively and NaT is used as missing value indicator

Following the first three points, we add the workaround for pandas<2.2 to explicitly specify how to map pandas dtypes into numpy dtypes (a4f15cd). It is an improved version of PR #3505.

Here is the minimal example to understand the behavior change:

With pandas 2.1

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: x = pd.Series([1, 2, 3], dtype=pd.Int32Dtype())

In [4]: x.dtype
Out[4]: Int32Dtype()

In [5]: np.ascontiguousarray(x)
Out[5]: array([1, 2, 3], dtype=object)

In [6]: x = pd.Series([1, pd.NA, 3], dtype=pd.Int32Dtype())

In [7]: np.ascontiguousarray(x)
Out[7]: array([1, <NA>, 3], dtype=object)

With pandas 2.2

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: x = pd.Series([1, 2, 3], dtype=pd.Int32Dtype())

In [4]: np.ascontiguousarray(x)
Out[4]: array([1, 2, 3], dtype=int32)

In [5]: x = pd.Series([1, pd.NA, 3], dtype=pd.Int32Dtype())

In [6]: np.ascontiguousarray(x)
Out[6]: array([ 1., nan,  3.])

Related to #3581, #2848, #3513, #3583.

Wait for #3583.

@weiji14 weiji14 changed the title WIP: clib.converison._to_numpy: Add tests for panda.Series with pandas numeric dtypes WIP: clib.converison._to_numpy: Add tests for pandas.Series with pandas numeric dtypes Nov 6, 2024
@seisman
Copy link
Member Author

seisman commented Nov 7, 2024

As explained in the top post, currently this PR does two things:

  • Add tests for pandas dtypes
  • Improve the pandas->numpy dtype conversion for pd.Series with NA values (pandas<2.2)

After finishing this PR, we have two options:

  1. Change the PR title to something like "Improve the conversion of pandas dtypes with missing values for pandas<=2.1" and mark it as a "bug".
  2. Cherry-pick the changes in a4f15cd, and open a separate PR so that we'll have two commits in the main branch, one for the new workaround, and one for the newly added tests.

I'm inclined to option 2.

@weiji14
Copy link
Member

weiji14 commented Nov 7, 2024

2. Cherry-pick the changes in a4f15cd, and open a separate PR so that we'll have two commits in the main branch, one for the new workaround, and one for the newly added tests.

I'm inclined to option 2.

Yeah, agree to option 2. Keep the unit tests here, and open a separate PR with the code changes from a4f15cd

Base automatically changed from to_numpy/numpy_numeric to main November 7, 2024 10:38
@seisman seisman changed the title WIP: clib.converison._to_numpy: Add tests for pandas.Series with pandas numeric dtypes clib.converison._to_numpy: Add tests for pandas.Series with pandas numeric dtypes Nov 7, 2024
@seisman seisman marked this pull request as ready for review November 7, 2024 10:41
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
@seisman seisman marked this pull request as draft November 8, 2024 02:49
@seisman seisman added the needs review This PR has higher priority and needs review. label Nov 9, 2024
@seisman seisman marked this pull request as ready for review November 9, 2024 08:40
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 10, 2024
@seisman seisman added the needs review This PR has higher priority and needs review. label Nov 10, 2024
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 10, 2024
@seisman
Copy link
Member Author

seisman commented Nov 12, 2024

Commit 0615b5b catches bugs of the improved workaround for pandas<2.2. Will see what's happening and how to make the workaround more robust.

@seisman seisman marked this pull request as draft November 12, 2024 01:15
@seisman
Copy link
Member Author

seisman commented Nov 12, 2024

I prefer to have other PRs like #3610, #3609 reviewed and merged first, to ensure the workaround in this PR is robust and has no side-effects on other dtypes.

@seisman seisman marked this pull request as ready for review November 15, 2024 05:27
@seisman seisman added the needs review This PR has higher priority and needs review. label Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. skip-changelog Skip adding Pull Request to changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants