
clib.conversion: Deal with np.object dtype in vectors_to_arrays and deprecate the array_to_datetime function #3507

Draft · wants to merge 24 commits into base: main

Conversation

@seisman (Member) commented Oct 11, 2024

Description of proposed changes

In the virtualfile_from_vectors function, we call vectors_to_arrays to convert a sequence of 1-D vectors (list, numpy array, pandas.Series or similar) to a list of 1-D numpy arrays. The converted 1-D numpy arrays can have various dtypes. Some dtypes can be recognized and converted to GMT's supported data types, e.g.,

DTYPES = {
    np.int8: "GMT_CHAR",
    np.int16: "GMT_SHORT",
    np.int32: "GMT_INT",
    np.int64: "GMT_LONG",
    np.uint8: "GMT_UCHAR",
    np.uint16: "GMT_USHORT",
    np.uint32: "GMT_UINT",
    np.uint64: "GMT_ULONG",
    np.float32: "GMT_FLOAT",
    np.float64: "GMT_DOUBLE",
    np.str_: "GMT_TEXT",
    np.datetime64: "GMT_DATETIME",
    np.timedelta64: "GMT_LONG",
}
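As an aside, looking up the GMT type for a recognized array is just indexing this mapping by the array's scalar type. A minimal sketch (illustrative subset, not PyGMT's actual code):

```python
import numpy as np

# Illustrative subset of the DTYPES mapping above; keys are numpy scalar types.
DTYPES = {
    np.int64: "GMT_LONG",
    np.float64: "GMT_DOUBLE",
    np.str_: "GMT_TEXT",
}

array = np.array([1.5, 2.5])
gmt_type = DTYPES[array.dtype.type]  # "GMT_DOUBLE" for a float64 array
```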

while others can't, e.g.,

def test_put_vector_invalid_dtype():
    """
    Check that it fails with an exception for invalid data types.
    """
    for dtype in [
        np.bool_,
        np.bytes_,
        np.csingle,
        np.cdouble,
        np.clongdouble,
        np.half,
        np.longdouble,
        np.object_,
    ]:
        ...  # (test body elided in this excerpt)

The np.object_ dtype is special, since values of any data type can be stored in an array with np.object_ dtype. For this reason, we have to handle np.object_ carefully in several places:

(1) In _check_dtype_and_dim, if the array dtype is np.object_, we need to check if the array can be converted to np.datetime64 dtype (by calling array_to_datetime). If yes, the function returns GMT_DATETIME.

pygmt/pygmt/clib/session.py

Lines 864 to 866 in c2e429c

if array.dtype.type is np.object_:
    # Try to convert unknown object type to np.datetime64
    array = array_to_datetime(array)

(2) In virtualfile_from_vectors, after calling _check_dtype_and_dim, we know that the array can be recognized as GMT's GMT_DATETIME type (which means the array either is already np.datetime64 or is np.object_ but convertible to np.datetime64). We still need to call array_to_datetime again to convert the array to np.datetime64.

https://github.com/GenericMappingTools/pygmt/blob/c2e429c0262f4dd49a87be711cfa0883eebb408e/pygmt/clib/session.py#L918C6-L920C2

(3) Again, in virtualfile_from_vectors, we call pd.api.types.is_string_dtype to determine if an array has a string dtype (np.object_ is recognized as a string dtype).

pygmt/pygmt/clib/session.py

Lines 1391 to 1392 in c2e429c

if pd.api.types.is_string_dtype(array.dtype):
    columns = col + 2

Instead of dealing with the np.object_ dtype in several different places, we can handle it once in the vectors_to_arrays function. This can be done by adding the following lines to this function (see the changed files in this PR for details):

        if array.dtype.type == np.object_:
            try:
                array = np.ascontiguousarray(array, dtype=np.datetime64)
            except ValueError:
                array = np.ascontiguousarray(array, dtype=np.str_)

With the above lines, we can simplify the _check_dtype_and_dim/put_vector/virtualfile_from_vectors functions.
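The proposed fallback logic can be sketched as a standalone function (the function name here is hypothetical; in the PR the logic lives inside vectors_to_arrays):

```python
import numpy as np

def convert_object_array(array: np.ndarray) -> np.ndarray:
    """Sketch of the np.object_ handling above: try np.datetime64 first,
    fall back to np.str_ if the values are not parseable as datetimes."""
    if array.dtype.type is np.object_:
        try:
            array = np.ascontiguousarray(array, dtype=np.datetime64)
        except ValueError:
            array = np.ascontiguousarray(array, dtype=np.str_)
    return array

dates = convert_object_array(np.array(["2021-01-01", "2021-06-15"], dtype=object))
texts = convert_object_array(np.array(["foo", "bar"], dtype=object))
# dates ends up with a datetime64 dtype; texts with a numpy string dtype
```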

@seisman seisman added the run/benchmark Trigger the benchmark workflow in PRs label Oct 11, 2024
@@ -1388,7 +1378,7 @@ def virtualfile_from_vectors(self, *vectors):
             # Assumes that first 2 columns contains coordinates like longitude
             # latitude, or datetime string types.
             for col, array in enumerate(arrays[2:]):
-                if pd.api.types.is_string_dtype(array.dtype):
+                if array.dtype.type == np.str_:
Member Author:
pd.api.types.is_string_dtype was added in PR #684. Before PR #684, we used np.issubdtype(array.dtype, np.str_), which was added in #520. Any specific reason that we should use np.issubdtype(array.dtype, np.str_), not array.dtype.type == np.str_?

Member:

I think we'll need to check if this can handle pandas.StringDtype and pyarrow.StringArray (xref #2933).

Member Author:

Both can be converted to the numpy string dtype by the vectors_to_arrays function, so in virtualfile_from_vectors, we only need to check for np.str_:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import pyarrow as pa

In [4]: x = pd.Series(["abc", "defghi"], dtype="string")

In [5]: np.asarray(x)
Out[5]: array(['abc', 'defghi'], dtype=object)

In [6]: np.asarray(x, dtype=str)
Out[6]: array(['abc', 'defghi'], dtype='<U6')

In [7]: y = pa.array(["abc", "defghi"])

In [8]: np.asarray(y)
Out[8]: array(['abc', 'defghi'], dtype=object)

In [9]: np.asarray(y, dtype=str)
Out[9]: array(['abc', 'defghi'], dtype='<U6')

Member Author:

The main idea of this PR is to let vectors_to_arrays do the conversion from any dtype (including pd.StringDtype and pa.StringArray) into numpy dtypes, so that virtualfile_from_vectors only needs to care about mapping numpy dtypes to GMT data types.

For any special dtype that we know how to convert to a numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:

dtypes = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
}

Member Author (@seisman, Oct 14, 2024):

In 83673cf, I've moved most of the doctests into a separate test file pygmt/tests/test_clib_vectors_to_arrays.py. A test test_vectors_to_arrays_pandas_string is added to check that the function can handle pd.StringDtype correctly.

Member Author:

For any special dtypes that we know how to convert it to numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:

dtypes = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
}

Based on the tests below, I think we should add the entry "string": np.str_ to the dictionary:

In [1]: import pandas as pd

In [2]: x = pd.Series(["abc", "12345"])

In [3]: x.dtype
Out[3]: dtype('O')

In [4]: str(x.dtype)
Out[4]: 'object'

In [5]: x = pd.Series(["abc", "12345"], dtype="string")

In [6]: x.dtype
Out[6]: string[python]

In [7]: str(x.dtype)
Out[7]: 'string'

In [8]: x = pd.Series(["abc", "12345"], dtype="string[pyarrow]")

In [9]: x.dtype
Out[9]: string[pyarrow]

In [10]: str(x.dtype)
Out[10]: 'string'

In [11]: import pyarrow as pa

In [12]: x = pa.array(["abc", "defghi"])

In [13]: x.type
Out[13]: DataType(string)

In [14]: str(x.type)
Out[14]: 'string'
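Putting the two observations together, a dtype-name-keyed conversion table could look like the following sketch (names are assumptions for illustration, not PyGMT's actual implementation):

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from dtype names to numpy target dtypes, extending the
# #2845 dictionary with the "string" entry proposed above.
DTYPE_CONVERSIONS = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
    "string": np.str_,
}

def to_numpy_array(vector) -> np.ndarray:
    """Convert a vector to a contiguous numpy array, mapping known special
    dtype names to numpy dtypes and leaving everything else to numpy."""
    target = DTYPE_CONVERSIONS.get(str(getattr(vector, "dtype", "")))
    return np.ascontiguousarray(vector, dtype=target)

result = to_numpy_array(pd.Series(["abc", "12345"], dtype="string"))
# result has a numpy string dtype rather than object
```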

Member Author:

In PR #2774, we only checked whether PyGMT supports pandas with the pyarrow backend, but didn't check whether raw pyarrow arrays work. For example, for a pyarrow date32 array, we need to check array.type rather than array.dtype:

In [1]: import datetime

In [2]: import pyarrow as pa

In [3]: x = pa.array([datetime.date(2020, 1, 1), datetime.date(2021, 12, 31)])

In [4]: str(x.type)
Out[4]: 'date32[day]'
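Since pyarrow arrays expose .type while numpy and pandas expose .dtype, a unified check would have to probe both attributes. A sketch (the helper name is hypothetical, and a stand-in class is used instead of a real pyarrow array):

```python
import numpy as np

def dtype_name(vector) -> str:
    """Return the dtype name, probing .dtype (numpy/pandas) then .type (pyarrow)."""
    dtype = getattr(vector, "dtype", None)
    if dtype is None:
        dtype = getattr(vector, "type", None)
    return str(dtype)

class FakeDate32Array:
    """Stand-in for a pyarrow array, which has .type but no .dtype."""
    type = "date32[day]"

assert dtype_name(np.array([1.0])) == "float64"
assert dtype_name(FakeDate32Array()) == "date32[day]"
```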

Member:

For example, for a pyarrow date32 array, we need to check array.type

Yes, raw pyarrow arrays of date32/date64 type are not supported yet. That's why it's marked 🚧 in #2800 (I was planning to modify array_to_datetime to handle it).

@@ -313,6 +320,11 @@ def array_to_datetime(array: Sequence[Any]) -> np.ndarray:

If the input array is not in legal datetime formats, raise a ValueError exception.

.. deprecated:: 0.14.0
Member Author:

After this PR, array_to_datetime is no longer used, but I still want to keep it so that we know what kinds of datetime formats np.asarray(array, dtype=np.datetime64) can support.

@seisman seisman changed the title WIP: Refactor vectors_to_arrays and deprecate the array_to_datetime function clib.conversion: Deal with np.object dtype in vectors_to_arrays and deprecate the array_to_datetime function Oct 13, 2024
@seisman seisman added maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. labels Oct 13, 2024
@seisman seisman added this to the 0.14.0 milestone Oct 13, 2024
@seisman seisman marked this pull request as ready for review October 13, 2024 16:24
@seisman seisman removed the needs review This PR has higher priority and needs review. label Oct 15, 2024
@seisman seisman marked this pull request as draft October 16, 2024 03:21
@seisman seisman marked this pull request as ready for review October 17, 2024 09:45
@seisman (Member Author) commented Oct 17, 2024

I've reverted some changes in e26afbf so that this PR can focus on refactoring the existing code without introducing more enhancements.

We will revisit the issues/enhancements mentioned in #3507 (comment) later.

@seisman seisman added the needs review This PR has higher priority and needs review. label Oct 17, 2024
pygmt/clib/conversion.py (outdated review thread, resolved)
@seisman seisman removed the needs review This PR has higher priority and needs review. label Oct 23, 2024
@seisman seisman marked this pull request as draft October 23, 2024 14:47
"""
vectors = [
pd.Series(["abc", "defhig"]),
pd.Series(["abcdef", "123456"], dtype="string"),
Member:

There are 3 'string' dtypes in pandas now according to https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.StringDtype.html:

  • string or string[python] (default as of Pandas 2.x)
  • string[pyarrow]
  • string[pyarrow_numpy]

I think the expectation is that in pandas 3.0, the default StringDtype will change from string[python] to string[pyarrow] (see https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html). So maybe we should explicitly test for string[python] here? Or maybe test for all three variants.
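A check along those lines might look like the following sketch (the pyarrow-backed dtypes additionally require pyarrow to be installed, so only the always-available variants are exercised here):

```python
import numpy as np
import pandas as pd

def check_string_dtype(dtype: str) -> np.ndarray:
    """Convert a pandas string Series of the given dtype to a numpy str_ array."""
    series = pd.Series(["abc", "defghi"], dtype=dtype)
    return np.ascontiguousarray(series, dtype=np.str_)

# "string[pyarrow]" and "string[pyarrow_numpy]" could be added to this list
# when pyarrow is available.
for dtype in ["string", "string[python]"]:
    result = check_string_dtype(dtype)
    assert result.dtype.type is np.str_
```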

Member Author:

Done in d8777e5.

@seisman seisman added the needs review This PR has higher priority and needs review. label Oct 28, 2024
@seisman seisman marked this pull request as ready for review October 28, 2024 04:33
@seisman seisman self-assigned this Oct 30, 2024
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 5, 2024
)
def test_vectors_to_arrays_pandas_string(dtype):
"""
Test the vectors_to_arrays function with pandas strings.
Member:

Suggested change
-    Test the vectors_to_arrays function with pandas strings.
+    Test the vectors_to_arrays function with pandas.Series objects of dtype
+    string[python], string[pyarrow] and string[pyarrow_numpy].

Comment on lines +291 to +292
The function is no longer used in the PyGMT project, but we keep this function
to docuemnt and test the supported datetime types.
Member:

Suggested change
-    The function is no longer used in the PyGMT project, but we keep this function
-    to docuemnt and test the supported datetime types.
+    This function is no longer used in the PyGMT project, but we keep it to document
+    and test supported datetime types.

@seisman (Member Author) commented Nov 6, 2024

Will revisit changes in this PR after #3583, #3584, #3585.

@seisman seisman marked this pull request as draft November 6, 2024 02:40
Labels
maintenance Boring but important stuff for the core devs run/benchmark Trigger the benchmark workflow in PRs