TST (string dtype): resolve all xfails in IO parser tests #60321

jorisvandenbossche · 2024-11-15T09:52:41Z

There are two remaining xfails: one related to invalid unicode (which errors when using the pyarrow-backed string dtype, so we should probably fall back to object dtype for that case), and another about specifying the names keyword with the pyarrow engine giving object-dtype columns.

xref #54792

WillAyd · 2024-11-15T15:51:07Z

pandas/tests/io/parser/common/test_chunksize.py

@@ -260,8 +257,12 @@ def test_warn_if_chunks_have_mismatched_type(all_parsers):
            "Specify dtype option on import or set low_memory=False.",
            buf,
        )
-
-    assert df.a.dtype == object
+    if parser.engine == "c" and parser.low_memory:


Shouldn't low_memory still be using the proper data type? Or why would that stick to object?

I am not super familiar with the parser code, but I think that with the low memory parser, parsing is done in chunks, and so if the inference changes later on, you end up with chunks with different types, and then get object dtype as a result.

In the test here, we have a column with mostly integers and only a few strings in the middle. With the default parser, the dtype is decided from the values in the full column, so it becomes string. Chunk by chunk, however, some chunks are inferred as integer and some as string.
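
To make the reasoning above concrete, here is a toy model in plain Python. It is not pandas' parser code: `infer_kind`, `convert`, `parse_in_chunks` and the `chunk_rows` value are hypothetical names and numbers chosen for illustration. It only shows how per-chunk inference can disagree with whole-column inference on a column of mostly integers with a few strings in the middle.

```python
def infer_kind(values):
    """Return "int" if every raw field parses as an integer, else "str"."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return "str"
    return "int"


def convert(values):
    """Infer one kind for the given values and convert them accordingly."""
    kind = infer_kind(values)
    converted = [int(v) for v in values] if kind == "int" else list(values)
    return kind, converted


def parse_in_chunks(values, chunk_rows):
    """Infer and convert each chunk on its own, then glue the chunks together."""
    kinds, out = [], []
    for start in range(0, len(values), chunk_rows):
        kind, converted = convert(values[start:start + chunk_rows])
        kinds.append(kind)
        out.extend(converted)
    # If the chunks disagree, the only common container is a generic "object" one.
    result_kind = kinds[0] if len(set(kinds)) == 1 else "object"
    return result_kind, out


# Mostly integers, with two strings in the middle (same shape as the test data).
integers = [str(i) for i in range(10)]
column = integers + ["a", "b"] + integers

whole_kind, whole_values = convert(column)                         # one pass over everything
chunk_kind, chunk_values = parse_in_chunks(column, chunk_rows=8)   # hypothetical chunk size

print(whole_kind)                                          # str
print(chunk_kind)                                          # object
print(sorted({type(v).__name__ for v in chunk_values}))    # ['int', 'str']
```

In this sketch the whole-column pass yields a single string kind, while the chunked pass ends up with a mix of `int` and `str` values, which is why the assertion in the diff keeps object dtype for the C engine with `low_memory`.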


WillAyd · 2024-11-15T16:14:51Z

Ah I see...yea that's a weird one

lumberbot-app · 2024-11-15T16:15:46Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 ee3c18f51b393893ed6e31214c7be2f9427ce0c9

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #60321: TST (string dtype): resolve all xfails in IO parser tests'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60321-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60321 on branch 2.3.x (TST (string dtype): resolve all xfails in IO parser tests)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

WillAyd · 2024-11-15T16:45:48Z

Will backport this one too


jorisvandenbossche · 2024-11-15T19:55:49Z

Manual backport -> #60330

#60330)
* Backport PR #60321: TST (string dtype): resolve all xfails in IO parser tests (cherry picked from commit ee3c18f)
* BUG: Avoid RangeIndex conversion in read_csv if dtype is specified (#59316)
Co-authored-by: Joris Van den Bossche
Co-authored-by: Matthew Roeschke

TST (string dtype): resolve all xfails in IO parser tests

d79e6ca

jorisvandenbossche added the IO CSV (read_csv, to_csv) and Strings (String extension data type and string data) labels Nov 15, 2024

jorisvandenbossche added this to the 2.3 milestone Nov 15, 2024

WillAyd requested changes Nov 15, 2024


WillAyd approved these changes Nov 15, 2024


WillAyd merged commit ee3c18f into pandas-dev:main Nov 15, 2024
56 of 57 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Nov 15, 2024

jorisvandenbossche deleted the string-dtype-tests-io-parser branch November 15, 2024 16:18

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Nov 15, 2024

Backport PR pandas-dev#60321: TST (string dtype): resolve all xfails …

284e359

…in IO parser tests (cherry picked from commit ee3c18f)

jorisvandenbossche mentioned this pull request Nov 15, 2024

Backport PR #60321: TST (string dtype): resolve all xfails in IO pars… #60330

Merged

jorisvandenbossche removed the Still Needs Manual Backport label Nov 15, 2024
