TST: Add test for pd.read_csv date parsing not working with dtype_backend="pyarrow" and missing values #60286

Conversation

KevsterAmp
Contributor

@KevsterAmp
Contributor Author

I tried using

    assert pd.api.types.is_datetime64_any_dtype(df["date"])

but it seems the code checks don't allow it. I'm not sure whether this assertion is correct:

    assert (df["date"].dtype) == "datetime64[s]"
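For reference, the two checks above differ in strictness. The sketch below assumes pandas 2.0 or later and uses a made-up Series, not the read_csv result from this PR. It suggests that is_datetime64_any_dtype accepts any datetime64 resolution, while comparing the dtype against a string pins one specific resolution.

```python
import pandas as pd

# Illustrative data only; not the frame produced by read_csv in the PR.
dates = pd.Series(pd.to_datetime(["2025-12-20", None, "2020-12-31"]))
dates_s = dates.astype("datetime64[s]")  # force second resolution explicitly

# Unit-agnostic check: true for any datetime64 resolution.
any_unit_ok = pd.api.types.is_datetime64_any_dtype(dates_s)

# Exact check: true only for the named resolution.
is_seconds = dates_s.dtype == "datetime64[s]"
is_nanoseconds = dates_s.dtype == "datetime64[ns]"

print(any_unit_ok, is_seconds, is_nanoseconds)
```

The string comparison is the stricter of the two, so it is the one that would catch a change in resolution.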

df = pd.read_csv(
StringIO(data), parse_dates=["date"], dayfirst=True, dtype_backend="pyarrow"
)
assert (df["date"].dtype) == "datetime64[s]"
Member

Can you build an expected DataFrame and use tm.assert_frame_equal?
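A minimal sketch of that pattern, assuming the public pandas.testing.assert_frame_equal (inside the pandas test suite, tm.assert_frame_equal plays this role). The two frames here are illustrative stand-ins; in the PR, the result frame would come from pd.read_csv.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative frames; in the PR, `result` would come from pd.read_csv.
result = pd.DataFrame({"date": pd.to_datetime(["2025-12-20", None, "2020-12-31"])})
expected = pd.DataFrame({"date": pd.to_datetime(["2025-12-20", None, "2020-12-31"])})

# Raises AssertionError with a detailed message on a mismatch
# (values, dtypes, index, columns); returns None when the frames match.
assert_frame_equal(result, expected)
```

Missing values in the same positions compare as equal here, so a NaT row does not need special handling.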

Contributor Author

I'm struggling a bit with the dtype casting. I tried two methods:

    # put dtype string[pyarrow] on the Series
    expected = pd.DataFrame(
        {
            "date": pd.Series(
                pd.to_datetime(["20/12/2025", pd.NaT, "31/12/2020"], dayfirst=True),
            ),
            "id": pd.Series(["a", "b", "c"], dtype="string[pyarrow]"),
        },
    )
    
    ###############
    
    # cast dtype using .astype()
    expected["id"] = expected["id"].astype("string[pyarrow]")

This returns the following error:

E       AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="id") are different
E
E       Attribute "dtype" are different
E       [left]:  StringDtype(storage=pyarrow, na_value=<NA>)
E       [right]: string[pyarrow]

As a band-aid fix, I also tried casting the same column in the df variable to string[pyarrow].

@td.skip_if_no("pyarrow")
def test_pyarrow_read_csv_datetime_dtype():
    data = "date,id\n20/12/2025,a\n,b\n31/12/2020,c"
    df = pd.read_csv(
        StringIO(data), parse_dates=["date"], dayfirst=True, dtype_backend="pyarrow"
    )
    expected = pd.DataFrame(
        {
            "date": pd.Series(
                pd.to_datetime(["20/12/2025", pd.NaT, "31/12/2020"], dayfirst=True),
            ),
            "id": pd.Series(["a", "b", "c"], dtype="string[pyarrow]"),
        },
    )
    expected["id"] = expected["id"].astype("string[pyarrow]")
    df["id"] = df["id"].astype("string[pyarrow]")

    assert tm.assert_frame_equal(expected, df)
    assert (df["date"].dtype) == "datetime64[s]"

But for some reason, pytest returns:

>       assert tm.assert_frame_equal(expected, df)
E       AssertionError

It's hard to tell what exactly the error is, since the message isn't verbose.

Member

Based on simplifying the bug report, I don't think we need the string column, only the "date" column.

Contributor Author

@mroeschke - Tried this:

@td.skip_if_no("pyarrow")
def test_pyarrow_read_csv_datetime_dtype():
    # GH 59904
    data = '"date"\n"20/12/2025"\n""\n"31/12/2020"'
    result = pd.read_csv(
        StringIO(data), parse_dates=["date"], dayfirst=True, dtype_backend="pyarrow"
    )
    expected_dict = {
        "date": pd.Series(
            pd.to_datetime(["20/12/2025", pd.NaT, "31/12/2020"], dayfirst=True)
        )
    }
    expected = pd.DataFrame(expected_dict)

    assert (result["date"].dtype) == "datetime64[s]"
    assert tm.assert_frame_equal(expected, result)

It still raises an AssertionError:

>       assert tm.assert_frame_equal(expected, result)
E       AssertionError

pandas/tests/io/test_common.py:696: AssertionError

Contributor Author

@KevsterAmp, Nov 14, 2024

Finally saw the problem lol
tm.assert_frame_equal should be called without a leading assert. It returns None when the frames match, so assert None always fails; that's why it was showing a bare AssertionError 😆

    assert tm.assert_frame_equal(expect, result) # raises AssertionError

    tm.assert_frame_equal(expect, result) # passes

Fixed it now and the test is passing.
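The pitfall can be reproduced without pandas. In the sketch below, check_equal is a hypothetical stand-in for an assertion helper such as tm.assert_frame_equal: it raises on a mismatch and otherwise returns None, which is falsy.

```python
def check_equal(left, right):
    """Hypothetical assertion helper: raises on mismatch, returns None otherwise."""
    if left != right:
        raise AssertionError(f"{left!r} != {right!r}")


# Correct usage: call the helper directly; it raises only on a mismatch.
check_equal([1, 2, 3], [1, 2, 3])

# Incorrect usage: the helper passes and returns None, then `assert None`
# fails with an AssertionError that carries no message.
try:
    assert check_equal([1, 2, 3], [1, 2, 3])
except AssertionError as err:
    bare_message = str(err)

print(repr(bare_message))  # ''
```

The empty message is why pytest showed only a bare AssertionError: the failure came from the outer assert, not from the comparison itself.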

@mroeschke changed the title on Nov 12, 2024:
    from: BUG: Add test for pd.read_csv date parsing not working with dtype_backend="pyarrow" and missing values
    to:   TST: Add test for pd.read_csv date parsing not working with dtype_backend="pyarrow" and missing values
@mroeschke added the Testing (pandas testing functions or related to the test suite) label on Nov 12, 2024
@rhshadrach added the IO CSV (read_csv, to_csv), Arrow (pyarrow functionality), and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels on Nov 12, 2024
)
expect = pd.DataFrame({"date": expect_data})

assert (result["date"].dtype) == "datetime64[s]"
Member

Suggested change
assert (result["date"].dtype) == "datetime64[s]"

This is already checked by assert_frame_equal.
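As a rough illustration of that point, assuming pandas 2.0 or later and the public pandas.testing.assert_frame_equal: with its default check_dtype=True, two frames that hold the same timestamps at different resolutions are reported as different. The frames are illustrative and are not the ones from the test.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same values, different datetime resolution (illustrative data).
frame_ns = pd.DataFrame(
    {"date": pd.to_datetime(["2025-12-20", "2020-12-31"]).astype("datetime64[ns]")}
)
frame_s = frame_ns.astype({"date": "datetime64[s]"})

try:
    assert_frame_equal(frame_ns, frame_s)  # check_dtype=True by default
    dtype_mismatch_detected = False
except AssertionError:
    dtype_mismatch_detected = True

print(dtype_mismatch_detected)
```

Because the frame comparison already fails on a dtype mismatch, a separate dtype assertion next to it is redundant.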

Comment on lines 686 to 688
expect_data = pd.Series(
pd.to_datetime(["20/12/2025", pd.NaT, "31/12/2020"], dayfirst=True)
)
Member

Suggested change
expect_data = pd.Series(
pd.to_datetime(["20/12/2025", pd.NaT, "31/12/2020"], dayfirst=True)
)
expect_data = pd.to_datetime(["20/12/2025", pd.NaT, "31/12/2020"], dayfirst=True)

@mroeschke added this to the 3.0 milestone on Nov 15, 2024
@mroeschke merged commit 63d3971 into pandas-dev:main on Nov 15, 2024
51 checks passed
@mroeschke
Member

Thanks @KevsterAmp

Labels
Arrow (pyarrow functionality), IO CSV (read_csv, to_csv), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Testing (pandas testing functions or related to the test suite)
Development

Successfully merging this pull request may close these issues.

BUG: pd.read_csv date parsing not working with dtype_backend="pyarrow" and missing values
3 participants