Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: centralize pyarrow Table to pandas conversions and types_mapper handling #60324

Merged

Conversation

jorisvandenbossche
Copy link
Member

We have defined this logic in several places, so defining one helper function to reuse that for the different IO formats.

@jorisvandenbossche jorisvandenbossche added the Arrow pyarrow functionality label Nov 15, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 15, 2024
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change overall; just some questions

types_mapper: type[pd.ArrowDtype] | None | Callable
if dtype_backend == "numpy_nullable":
mapping = _arrow_dtype_mapping()
if null_to_int64:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is for compatability, but is this a feature or a bug that the CSV reader does this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, I didn't look in detail at why this is happening in the CSV reader, for now just wanded to keep the same behaviour (but this is certainly an ugly keyword, the problem with centralizing the conversion as I am doing, it's not otherwise possible to change it on the CSV side)

pandas/io/sql.py Outdated
df = cur.fetch_arrow_table().to_pandas(types_mapper=mapping)
pa_table = cur.fetch_arrow_table()
dtype_backend = (
lib.no_default if dtype_backend == "numpy" else dtype_backend
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any harm to forwarding along the argument of "numpy"? Seems a little strange to handle this one-off in some invocations

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also strange that we use "numpy" as the default in the SQL code, and no_default for all other places.
Ideally we would solve that inconsistency as well, I think.

Now, I can certainly forward it and handle it inside arrow_table_to_pandas (I could also change the default here, and just additionally accept "numpy" for back compat)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that seems like an issue in the SQL code. Looks like the read_sql API uses lib.no_default so that must be getting messed up in the internals. Let's keep this PR the way it is and track in #60326

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening that issue.
Now, in the end I moved it to the utility anyway to deal with the typing issues (the type checkers don't understand that I removed "numpy" from the options ..)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Still something to be cleaned up there, though not urgent for now

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ex code-checks fail (didn't look at it)

@WillAyd WillAyd merged commit 12d6f60 into pandas-dev:main Nov 15, 2024
51 checks passed
Copy link

lumberbot-app bot commented Nov 15, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 12d6f602eea98275553ac456f90201151b1f9bf8
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #60324: REF: centralize pyarrow Table to pandas conversions and types_mapper handling'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60324-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60324 on branch 2.3.x (REF: centralize pyarrow Table to pandas conversions and types_mapper handling)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@WillAyd
Copy link
Member

WillAyd commented Nov 15, 2024

Will backport

@WillAyd
Copy link
Member

WillAyd commented Nov 15, 2024

Hmm looking at this some more @jorisvandenbossche it looks like the 2.3 branch in parquet.py requires there to be a split_blocks argument; I suppose we need to add that to the function here too?

@jorisvandenbossche jorisvandenbossche deleted the arrow-to_pandas-dtype-mapping branch November 15, 2024 19:53
@jorisvandenbossche
Copy link
Member Author

Yeah, either just add that, or add a generic to_pandas_kwargs that get passed through, I would say

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Nov 15, 2024
… conversions and types_mapper handling

(cherry picked from commit 12d6f60)
@WillAyd
Copy link
Member

WillAyd commented Nov 15, 2024

Backport PR #60332

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Nov 15, 2024
… conversions and types_mapper handling

(cherry picked from commit 12d6f60)
jorisvandenbossche added a commit that referenced this pull request Nov 16, 2024
…ns and types_mapper handling (#60332)

(cherry picked from commit 12d6f60)

Co-authored-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants