clib.conversion: Deal with np.object dtype in vectors_to_arrays and deprecate the array_to_datetime function #3507
base: main
Conversation
pygmt/clib/session.py
Outdated
@@ -1388,7 +1378,7 @@ def virtualfile_from_vectors(self, *vectors):
             # Assumes that first 2 columns contains coordinates like longitude
             # latitude, or datetime string types.
             for col, array in enumerate(arrays[2:]):
-                if pd.api.types.is_string_dtype(array.dtype):
+                if array.dtype.type == np.str_:
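The strictness difference between the two checks matters here: pd.api.types.is_string_dtype also reports True for the object dtype, while the np.str_ check only matches true numpy fixed-width string arrays. A minimal demonstration (my own sketch, not part of the PR):

```python
import numpy as np
import pandas as pd

obj_arr = np.array(["a", 1], dtype=object)  # heterogeneous object array
str_arr = np.array(["a", "bc"])             # numpy fixed-width string array ('<U2')

# The old check treats the object dtype itself as string-like:
print(pd.api.types.is_string_dtype(obj_arr.dtype))  # True
# The new strict check only matches numpy fixed-width string arrays:
print(obj_arr.dtype.type == np.str_)  # False
print(str_arr.dtype.type == np.str_)  # True
```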
I think we'll need to check if this can handle pandas.StringDtype and pyarrow.StringArray (xref #2933).
Both can be converted to the numpy string dtype by the vectors_to_arrays method, so in virtualfile_from_vectors, we only need to check np.str_:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import pyarrow as pa
In [4]: x = pd.Series(["abc", "defghi"], dtype="string")
In [5]: np.asarray(x)
Out[5]: array(['abc', 'defghi'], dtype=object)
In [6]: np.asarray(x, dtype=str)
Out[6]: array(['abc', 'defghi'], dtype='<U6')
In [7]: y = pa.array(["abc", "defghi"])
In [8]: np.asarray(y)
Out[8]: array(['abc', 'defghi'], dtype=object)
In [9]: np.asarray(y, dtype=str)
Out[9]: array(['abc', 'defghi'], dtype='<U6')
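In other words, np.asarray alone yields an object-dtype array for both containers, but passing dtype=str forces a fixed-width numpy string array. A hypothetical helper (the name to_numpy_string is mine, not PyGMT's) could wrap this:

```python
import numpy as np
import pandas as pd

def to_numpy_string(vector):
    """Coerce a string-like vector (pandas StringDtype, pyarrow string,
    plain list) to a numpy fixed-width string ('<U*') array."""
    return np.asarray(vector, dtype=str)

x = pd.Series(["abc", "defghi"], dtype="string")
arr = to_numpy_string(x)
print(arr.dtype.type == np.str_)  # True, so the np.str_ check above matches
```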
The main idea of this PR is to let vectors_to_arrays do the conversion work from any dtype (including pd.StringDtype and pa.StringArray) into numpy dtypes, so that virtualfile_from_vectors only needs to care about how to map numpy dtypes into GMT data types.
For any special dtype that we know how to convert to a numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:
pygmt/pygmt/clib/conversion.py
Lines 208 to 211 in c2e429c
dtypes = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
}
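Such a mapping could be consulted during conversion roughly like this (a sketch under my own naming, not the PR's actual code):

```python
import numpy as np

# Mapping of special dtype names to target numpy dtypes, mirroring the
# date32[day]/date64[ms] entries quoted above.
DTYPES = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
}

def convert_vector(vector):
    """Convert one vector to a contiguous numpy array, applying the map."""
    target = DTYPES.get(str(getattr(vector, "dtype", "")))  # None if unmapped
    return np.ascontiguousarray(vector, dtype=target)

print(convert_vector(np.array([1.0, 2.0])).dtype)  # float64 (unmapped passthrough)
```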
In 83673cf, I've moved most of the doctests into a separate test file pygmt/tests/test_clib_vectors_to_arrays.py. A test test_vectors_to_arrays_pandas_string is added to check that the function can handle pd.StringDtype correctly.
For any special dtypes that we know how to convert it to numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:
pygmt/pygmt/clib/conversion.py
Lines 208 to 211 in c2e429c
dtypes = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
}
Based on the tests below, I think we should add the entry "string": np.str_ to the dictionary:
In [1]: import pandas as pd
In [2]: x = pd.Series(["abc", "12345"])
In [3]: x.dtype
Out[3]: dtype('O')
In [4]: str(x.dtype)
Out[4]: 'object'
In [5]: x = pd.Series(["abc", "12345"], dtype="string")
In [6]: x.dtype
Out[6]: string[python]
In [7]: str(x.dtype)
Out[7]: 'string'
In [8]: x = pd.Series(["abc", "12345"], dtype="string[pyarrow]")
In [9]: x.dtype
Out[9]: string[pyarrow]
In [10]: str(x.dtype)
Out[10]: 'string'
In [11]: import pyarrow as pa
In [12]: x = pa.array(["abc", "defghi"])
In [13]: x.type
Out[13]: DataType(string)
In [14]: str(x.type)
Out[14]: 'string'
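Since str(dtype) is 'string' for both the python and pyarrow backends, the suggested entry would let one lookup cover them all. A sketch of how the extended mapping would behave (assumes the dtypes dictionary shown earlier; not the PR's actual code):

```python
import numpy as np
import pandas as pd

dtypes = {
    "date32[day][pyarrow]": np.datetime64,
    "date64[ms][pyarrow]": np.datetime64,
    "string": np.str_,  # suggested new entry
}

x = pd.Series(["abc", "12345"], dtype="string")
target = dtypes.get(str(x.dtype))            # str(x.dtype) == 'string'
arr = np.ascontiguousarray(x, dtype=target)
print(arr.dtype)  # a fixed-width numpy string dtype such as '<U5'
```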
In PR #2774, we only checked if PyGMT supports pandas with the pyarrow backend, but didn't check if raw pyarrow arrays work. For example, for a pyarrow date32 array, we need to check array.type rather than array.dtype:
In [1]: import datetime
In [2]: import pyarrow as pa
In [3]: x = pa.array([datetime.date(2020, 1, 1), datetime.date(2021, 12, 31)])
In [4]: str(x.type)
Out[4]: 'date32[day]'
For example, for a pyarrow date32 array, we need to check array.type

Yes, raw pyarrow arrays of date32/date64 type are not supported yet. That's why it's marked 🚧 in #2800 (I was planning to modify array_to_datetime to handle it).
@@ -313,6 +320,11 @@ def array_to_datetime(array: Sequence[Any]) -> np.ndarray:

     If the input array is not in legal datetime formats, raise a ValueError exception.

+    .. deprecated:: 0.14.0
After this PR, array_to_datetime is no longer used, but I still want to keep this function so that we know what kinds of datetime formats np.asarray(array, dtype=np.datetime64) can support.
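For reference, the inputs np.asarray(array, dtype=np.datetime64) handles directly include ISO 8601 strings of varying precision, e.g. (a standalone illustration, not PyGMT code):

```python
import numpy as np

# ISO 8601 strings of mixed precision; numpy parses them and picks the
# finest time unit for the resulting array.
data = ["2021-01-01", "2021-01-01T12:00", "2021-01-01T12:34:56.789"]
result = np.asarray(data, dtype=np.datetime64)
print(result.dtype.kind)  # 'M', i.e. a datetime64 array
```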
I've reverted some changes in e26afbf so that this PR can focus on refactoring the existing code without introducing more enhancements. We will revisit the issues/enhancements mentioned in #3507 (comment) later.
""" | ||
vectors = [ | ||
pd.Series(["abc", "defhig"]), | ||
pd.Series(["abcdef", "123456"], dtype="string"), |
There are 3 'string' dtypes in pandas now according to https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.StringDtype.html:
- string or string[python] (default as of pandas 2.x)
- string[pyarrow]
- string[pyarrow_numpy]

I think the expectation is that in pandas 3.0, the default StringDtype will change from string[python] to string[pyarrow] (see https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html). So maybe we should explicitly test for string[python] here? Or maybe test for all three variants.
Done in d8777e5.
)
def test_vectors_to_arrays_pandas_string(dtype):
    """
    Test the vectors_to_arrays function with pandas strings.
Suggested change:
-    Test the vectors_to_arrays function with pandas strings.
+    Test the vectors_to_arrays function with pandas.Series objects of dtype
+    string[python], string[pyarrow] and string[pyarrow_numpy].
    The function is no longer used in the PyGMT project, but we keep this function
    to docuemnt and test the supported datetime types.
Suggested change:
-    The function is no longer used in the PyGMT project, but we keep this function
-    to docuemnt and test the supported datetime types.
+    This function is no longer used in the PyGMT project, but we keep it to document
+    and test supported datetime types.
Description of proposed changes
In the virtualfile_from_vectors function, we call vectors_to_arrays to convert a sequence of 1-D vectors (list, numpy array, pandas.Series or similar) to a list of 1-D numpy arrays. The converted 1-D numpy arrays can have various dtypes. Some dtypes can be recognized and converted to GMT's supported data types, e.g.,
pygmt/pygmt/clib/session.py
Lines 81 to 95 in c2e429c
while others can't, e.g.,
pygmt/pygmt/tests/test_clib_put_vector.py
Lines 203 to 216 in c2e429c
The np.object_ dtype is a special one, since any data type can be stored in an np.object_ array. For this reason, we have to handle np.object_ carefully in several places:

(1) In _check_dtype_and_dim, if the array dtype is np.object_, we need to check if the array can be converted to the np.datetime64 dtype (by calling array_to_datetime). If yes, the function returns GMT_DATETIME.
pygmt/pygmt/clib/session.py
Lines 864 to 866 in c2e429c
(2) In virtualfile_from_vectors, after calling _check_dtype_and_dim, we know that the array can be recognized as GMT's GMT_DATETIME type (which actually means the array either is already np.datetime64, or is np.object_ but can be converted to np.datetime64). We still need to call array_to_datetime again to convert the array to np.datetime64.
https://github.com/GenericMappingTools/pygmt/blob/c2e429c0262f4dd49a87be711cfa0883eebb408e/pygmt/clib/session.py#L918C6-L920C2
(3) Again, in virtualfile_from_vectors, we call pd.api.types.is_string_dtype to determine if an array has a string dtype (np.object_ is recognized as a string dtype).
pygmt/pygmt/clib/session.py
Lines 1391 to 1392 in c2e429c
Instead of dealing with the np.object_ dtype in different places, we can deal with it once in the vectors_to_arrays function, by adding a few lines to that function (see the changed files in this PR for details). With that change, we can simplify the _check_dtype_and_dim/put_vector/virtualfile_from_vectors functions.