BUG: subtle conversion issue for to_datetime #60371

Open
2 of 3 tasks
michael72 opened this issue Nov 20, 2024 · 4 comments
Labels
Bug · Datetime (Datetime data dtype) · Needs Discussion (Requires discussion from core team before further action)

Comments

@michael72
michael72 commented Nov 20, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
df = pd.DataFrame({"date": dates})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df.dtypes["date"]) # datetime64[ns]

df["date"] = df["date"].astype("datetime64[ms]")
print(df.dtypes["date"]) # datetime64[ms]
print(df) # dates are OK
#         date
# 0 2024-01-01
# 1 2024-01-02
# 2 2024-01-03
# up to now the data was actually read from a parquet file
# where the date column was datetime64[ms]

df["date"] = pd.to_datetime(df["date"], unit="ns")
# the dtype is still datetime64[ms], so this assertion fails
assert df.dtypes["date"] == "datetime64[ns]"

Issue Description

This is a somewhat contrived case, but we hit a very subtle bug when reading parquet data whose ds column was stored as datetime64[ms] (previously str) while we actually needed the result as datetime64[ns].
to_datetime converts other input types, but when the underlying type is already datetime64 it does not change the unit, and it does so silently.

Expected Behavior

I would expect either an exception stating that the type is already datetime64, or a result with the requested unit (here datetime64[ns]) instead of the unit of the input (here datetime64[ms]).

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Linux
OS-release : 6.8.0-48-generic
Version : #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.1.3
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : None
IPython : 8.29.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.10.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : 3.9.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : 2024.10.0
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

michael72 added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Nov 20, 2024
@michael72
Author

issue in prophet that stems from this one:
facebook/prophet#2529

@jorisvandenbossche
Member

@michael72 this is a subtle issue and something that caused confusion before, if I recall correctly.

The main explanation for the behaviour is that the unit keyword denotes the resolution of the input, not of the output (the docstring also says that to some extent, although reading it now, it's not super clear), and so it is only used for numeric input.
So essentially, in the example above it is just ignored because the input is not numeric, and since the input is already datetime64, it is returned as is.
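A minimal sketch of that distinction, assuming pandas 2.x: the epoch values below are made up for illustration, and the dtype noted in the last comment is the one reported in this issue for pandas 2.2.3, not something guaranteed for other versions.

```python
import pandas as pd

# Numeric input: `unit` tells to_datetime how to interpret the numbers.
from_ms = pd.to_datetime(pd.Series([1704067200000]), unit="ms")
from_s = pd.to_datetime(pd.Series([1704067200]), unit="s")
print(from_ms[0], from_s[0])  # the same instant, read at two input resolutions

# datetime64 input: `unit` is not used, so the values come back unchanged.
already_dt = pd.Series(pd.to_datetime(["2024-01-01"])).astype("datetime64[ms]")
result = pd.to_datetime(already_dt, unit="ns")
print(result.dtype)  # reported as datetime64[ms] in this issue (pandas 2.2.3)
```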

Thus, as far as I know, this is the expected behaviour. Still, I agree it is confusing: as long as we only supported ns that was probably OK, but now that we support multiple units, it is very natural to think that specifying unit means the output unit. I think it would be nice if we could, for example, change the default for unit to None (essentially making it required to specify it if you want to use it), so that we can raise an informative error message when the user passes it in the wrong case.

If you want to convert an existing datetime64 column to a different unit, you can use the general astype or the more specific as_unit() (available on DatetimeIndex, or on Series through the .dt accessor).
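Both suggestions can be sketched as follows, assuming pandas 2.0 or later (where as_unit is available); the sample dates are illustrative only.

```python
import pandas as pd

# A datetime64[ms] column, similar to what the parquet file in this issue held.
s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"])).astype("datetime64[ms]")

# Option 1: the general astype
via_astype = s.astype("datetime64[ns]")

# Option 2: the datetime-specific as_unit, through the .dt accessor on a Series
via_as_unit = s.dt.as_unit("ns")

print(via_astype.dtype, via_as_unit.dtype)
```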

@rhshadrach
Member

Agreed the docs could be clarified here. Perhaps we could also rename unit? Not sure if that'd be worth the churn. Another possibility would be to raise if unit is provided but not used because the input is not numeric.

rhshadrach added the Needs Discussion (Requires discussion from core team before further action) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 21, 2024
@michael72
Author

@jorisvandenbossche Thanks for the explanation! Yes, from the documentation it is not obvious that unit refers to the unit of the arg. But silently ignoring it when converting a datetime64 to a datetime64 may be too lenient and can result in unexpected behavior. Maybe other libraries already depend on the conversion being lenient, I don't know. A warning would be nice (if that is possible) in case the arg is already a datetime64.
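Until something like that is decided, a user-side guard could look roughly like the sketch below. The name to_datetime_checked and the warning text are hypothetical and not part of pandas; the wrapper only illustrates the idea of warning when unit is passed together with input that is already datetime64.

```python
import warnings

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def to_datetime_checked(arg, unit=None, **kwargs):
    """Hypothetical wrapper (not a pandas API): warn when `unit` would be ignored."""
    dtype = getattr(arg, "dtype", None)
    if unit is not None and dtype is not None and is_datetime64_any_dtype(dtype):
        warnings.warn(
            f"unit={unit!r} is ignored because the input is already {dtype}; "
            "use .dt.as_unit() or .astype() to change the resolution",
            UserWarning,
            stacklevel=2,
        )
        return pd.to_datetime(arg, **kwargs)
    if unit is None:
        return pd.to_datetime(arg, **kwargs)
    return pd.to_datetime(arg, unit=unit, **kwargs)
```

Numeric input passes through to pd.to_datetime unchanged, so existing calls such as to_datetime_checked(pd.Series([1704067200]), unit="s") keep working.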


4 participants