TST (string dtype): resolve all xfails in IO parser tests #60321

Merged

jorisvandenbossche
Member

Two xfails remain: one related to invalid unicode (it errors when using the pyarrow-backed string dtype, so we should probably fall back to object dtype for that case), and another where specifying the names keyword with the pyarrow engine gives object-dtype columns.

xref #54792

@jorisvandenbossche jorisvandenbossche added IO CSV read_csv, to_csv Strings String extension data type and string data labels Nov 15, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 15, 2024
@@ -260,8 +257,12 @@ def test_warn_if_chunks_have_mismatched_type(all_parsers):
"Specify dtype option on import or set low_memory=False.",
buf,
)

assert df.a.dtype == object
if parser.engine == "c" and parser.low_memory:
Member

Shouldn't low_memory still be using the proper data type? Or why would that stick to object?

Member Author

I am not super familiar with the parser code, but I think that with the low memory parser, parsing is done in chunks, and so if the inference changes later on, you end up with chunks with different types, and then get object dtype as a result.

In the test here, we have a column with mostly integers and only a few strings in the middle. So with the default parser, it will decide based on the values in the full column that the dtype should be string. But chunk by chunk, you get some chunks as integer and some as string.
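To make the chunking explanation above concrete, here is a minimal pure-Python sketch of the idea. It is a toy model only, not the pandas parser code: the helper names (`infer_kind`, `infer_whole_column`, `infer_chunked`), the `chunk_size` value, and the rule "mixed chunk types collapse to object" are assumptions made for illustration.

```python
from typing import List


def infer_kind(values: List[str]) -> str:
    """Return 'int64' if every token parses as an integer, else 'str'."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return "str"
    return "int64"


def infer_whole_column(values: List[str]) -> str:
    """Model of inference over the full column at once."""
    return infer_kind(values)


def infer_chunked(values: List[str], chunk_size: int) -> str:
    """Model of chunk-by-chunk inference: each chunk is inferred on its own.

    If the chunks disagree, the combined column is reported as 'object'
    (an assumption standing in for what the low-memory path ends up with).
    """
    kinds = {
        infer_kind(values[i : i + chunk_size])
        for i in range(0, len(values), chunk_size)
    }
    return kinds.pop() if len(kinds) == 1 else "object"


# Mostly integers, with a few strings in the middle (hypothetical data).
column = [str(i) for i in range(10)] + ["a", "b"] + [str(i) for i in range(10)]

print(infer_whole_column(column))           # str
print(infer_chunked(column, chunk_size=5))  # object
```

In this model the whole-column pass sees the strings and settles on a single string type, while the chunked pass produces integer chunks and one string chunk, so the only common type left is object.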

Member

Ah I see...yea that's a weird one

@WillAyd WillAyd merged commit ee3c18f into pandas-dev:main Nov 15, 2024
56 of 57 checks passed
lumberbot-app bot commented Nov 15, 2024

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
    git checkout 2.3.x
    git pull
  2. Cherry pick the first parent branch of this PR on top of the older branch:
    git cherry-pick -x -m1 ee3c18f51b393893ed6e31214c7be2f9427ce0c9
  3. You will likely have some merge/cherry-pick conflict here, fix them and commit:
    git commit -am 'Backport PR #60321: TST (string dtype): resolve all xfails in IO parser tests'
  4. Push to a named branch:
    git push YOURFORK 2.3.x:auto-backport-of-pr-60321-on-2.3.x
  5. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60321 on branch 2.3.x (TST (string dtype): resolve all xfails in IO parser tests)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-tests-io-parser branch November 15, 2024 16:18
@WillAyd
Member

WillAyd commented Nov 15, 2024

Will backport this one too

@jorisvandenbossche
Member Author

Manual backport -> #60330

jorisvandenbossche added a commit that referenced this pull request Nov 18, 2024
#60330)

* Backport PR #60321: TST (string dtype): resolve all xfails in IO parser tests

(cherry picked from commit ee3c18f)

* BUG: Avoid RangeIndex conversion in read_csv if dtype is specified (#59316)


Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
Labels
IO CSV read_csv, to_csv Strings String extension data type and string data