REF: centralize pyarrow Table to pandas conversions and types_mapper handling #60324

jorisvandenbossche · 2024-11-15T14:13:11Z

This logic is currently defined in several places, so this PR defines one helper function that the different IO formats can reuse.

WillAyd

Nice change overall; just some questions

WillAyd · 2024-11-15T15:07:23Z

pandas/io/_util.py

+    types_mapper: type[pd.ArrowDtype] | None | Callable
+    if dtype_backend == "numpy_nullable":
+        mapping = _arrow_dtype_mapping()
+        if null_to_int64:


I realize this is for compatibility, but is this a feature or a bug that the CSV reader does this?

No idea, I didn't look in detail at why this is happening in the CSV reader; for now I just wanted to keep the same behaviour. This is certainly an ugly keyword, but that is the problem with centralizing the conversion as I am doing: it's not otherwise possible to change it on the CSV side.

WillAyd · 2024-11-15T15:08:43Z

pandas/io/sql.py

-            df = cur.fetch_arrow_table().to_pandas(types_mapper=mapping)
+            pa_table = cur.fetch_arrow_table()
+            dtype_backend = (
+                lib.no_default if dtype_backend == "numpy" else dtype_backend


Is there any harm to forwarding along the argument of "numpy"? Seems a little strange to handle this one-off in some invocations

It's also strange that we use "numpy" as the default in the SQL code, and no_default for all other places.
Ideally we would solve that inconsistency as well, I think.

Now, I can certainly forward it and handle it inside arrow_table_to_pandas (I could also change the default here, and just additionally accept "numpy" for back compat)

Hmm that seems like an issue in the SQL code. Looks like the read_sql API uses lib.no_default so that must be getting messed up in the internals. Let's keep this PR the way it is and track in #60326

Thanks for opening that issue.
Now, in the end I moved it to the utility anyway to deal with the typing issues (the type checkers don't understand that I removed "numpy" from the options).

Sounds good. Still something to be cleaned up there, though not urgent for now
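The inconsistency discussed in this thread ("numpy" as the default in the SQL code, a sentinel everywhere else) can be illustrated with a small normalization step. This is a sketch only: `NO_DEFAULT` is a hypothetical stand-in for pandas' internal `lib.no_default`, and `normalize_dtype_backend` is an invented name, not a pandas function.

```python
class _NoDefault:
    """Hypothetical sentinel standing in for pandas' internal lib.no_default."""

    def __repr__(self) -> str:
        return "<no_default>"


NO_DEFAULT = _NoDefault()


def normalize_dtype_backend(dtype_backend):
    """Treat the legacy "numpy" spelling and the sentinel as the same default."""
    if dtype_backend is NO_DEFAULT or dtype_backend == "numpy":
        return NO_DEFAULT
    if dtype_backend in ("numpy_nullable", "pyarrow"):
        return dtype_backend
    raise ValueError(f"unsupported dtype_backend: {dtype_backend!r}")


# Both spellings of "no backend requested" collapse to one value.
same_default = normalize_dtype_backend("numpy") is normalize_dtype_backend(NO_DEFAULT)
```

Doing this once inside the shared helper, rather than at each call site, means callers can keep passing whatever default they already use while the conversion logic sees a single representation.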

WillAyd

lgtm ex code-checks fail (didn't look at it)

lumberbot-app · 2024-11-15T18:09:14Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 12d6f602eea98275553ac456f90201151b1f9bf8

You will likely have some merge/cherry-pick conflicts here; fix them and commit:

git commit -am 'Backport PR #60324: REF: centralize pyarrow Table to pandas conversions and types_mapper handling'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60324-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60324 on branch 2.3.x (REF: centralize pyarrow Table to pandas conversions and types_mapper handling)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

WillAyd · 2024-11-15T18:21:18Z

Will backport

WillAyd · 2024-11-15T18:24:22Z

Hmm looking at this some more @jorisvandenbossche it looks like the 2.3 branch in parquet.py requires there to be a split_blocks argument; I suppose we need to add that to the function here too?

jorisvandenbossche · 2024-11-15T19:54:20Z

Yeah, either just add that, or add a generic to_pandas_kwargs that gets passed through, I would say

… conversions and types_mapper handling (cherry picked from commit 12d6f60)

WillAyd · 2024-11-15T21:33:11Z

Backport PR #60332

…ns and types_mapper handling (#60332) (cherry picked from commit 12d6f60) Co-authored-by: Joris Van den Bossche <[email protected]>

REF: centralize pyarrow Table to pandas conversions and types_mapper …

d834e46

…handling

jorisvandenbossche added the Arrow pyarrow functionality label Nov 15, 2024

jorisvandenbossche added this to the 2.3 milestone Nov 15, 2024

jorisvandenbossche requested a review from WillAyd November 15, 2024 14:13

fix typing and sql default backend case

7996aa3

WillAyd reviewed Nov 15, 2024

WillAyd mentioned this pull request Nov 15, 2024

REF: dtype_backend argument in sql module mixes lib.no_default and numpy #60326

Open

WillAyd approved these changes Nov 15, 2024

try fix typing

a655347

WillAyd merged commit 12d6f60 into pandas-dev:main Nov 15, 2024
51 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Nov 15, 2024

jorisvandenbossche deleted the arrow-to_pandas-dtype-mapping branch November 15, 2024 19:53

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Nov 15, 2024

Backport PR pandas-dev#60324: REF: centralize pyarrow Table to pandas…

c071e47

… conversions and types_mapper handling (cherry picked from commit 12d6f60)

jorisvandenbossche removed the Still Needs Manual Backport label Nov 15, 2024

jorisvandenbossche mentioned this pull request Nov 15, 2024

Backport PR #60324: REF: centralize pyarrow Table to pandas conversions and types_mapper handling #60332

Merged

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Nov 15, 2024

Backport PR pandas-dev#60324: REF: centralize pyarrow Table to pandas…

890206d

… conversions and types_mapper handling (cherry picked from commit 12d6f60)
