Access Frame attributes instead of ColumnAccessor attributes when available #16652

mroeschke · 2024-08-23T23:23:12Z

Description

There are some places where a public object like DataFrame or Index accesses a ColumnAccessor attribute even though the same information is available through an attribute on a shared base class (like Frame).

In an effort to access the ColumnAccessor less, this PR replaces usages of `._data.attribute` with a Frame-specific attribute.
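
A minimal sketch of the pattern described above, using hypothetical stand-in classes (`ToyColumnAccessor`, `ToyFrame`, `ToyDataFrame`) rather than cuDF's real ones. The base class wraps the accessor once, and subclasses go through the base-class properties instead of reaching into `._data` themselves. The property names `_column_names` and `_columns` mirror those that appear in this PR's diff; everything else is an assumption for illustration.

```python
class ToyColumnAccessor:
    """Hypothetical stand-in for ColumnAccessor: an ordered name -> column mapping."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def names(self):
        return tuple(self._data.keys())

    @property
    def columns(self):
        return tuple(self._data.values())


class ToyFrame:
    """Hypothetical base class that owns the accessor and exposes it via properties."""

    def __init__(self, data):
        self._data = ToyColumnAccessor(data)

    @property
    def _column_names(self):
        return self._data.names

    @property
    def _columns(self):
        return self._data.columns


class ToyDataFrame(ToyFrame):
    def describe_columns(self):
        # Before: zip(self._data.names, self._data.columns)
        # After: go through the base-class attributes instead.
        return [
            f"{name}: {len(col)} rows"
            for name, col in zip(self._column_names, self._columns)
        ]


df = ToyDataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(df.describe_columns())
```

With this layout, a change to how the accessor stores its columns only has to be reflected in the base-class properties, not in every subclass method.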

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Discovered in #16652, `DataFrame.iloc/loc.__setitem__` with a non-cupy type, e.g. `"category"`, failed because the indexing path unconditionally tries to `cupy.asarray` the value to be set, which only accepts types recognized by cupy. We can skip this `asarray` if we have a numpy/pandas/cudf object.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16677

galipremsagar · 2024-08-30T03:51:09Z

python/cudf/cudf/_lib/csv.pyx

@@ -273,8 +273,7 @@ def read_csv(
        elif isinstance(dtype, abc.Collection):
            for index, col_dtype in enumerate(dtype):
                if isinstance(cudf.dtype(col_dtype), cudf.CategoricalDtype):
-                    col_name = df._data.names[index]
-                    df._data[col_name] = df._data[col_name].astype(col_dtype)
+                    df.iloc[:, index] = df.iloc[:, index].astype(col_dtype)


Won't this change end up being expensive, since we are going through the public DataFrame API and typecasting a Series, which also returns the index?

Yeah, this is quite a lot more expensive. We're replacing two dict lookups with quite a lot of work to determine how to do those two dict lookups.

Yeah that's fair. I'll revert this change
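
To make the cost argument concrete, here is a rough sketch of what the original two lines amount to, written against a hypothetical dict-backed stand-in (`ToyAccessor`) rather than cuDF's actual `ColumnAccessor`. The helper `cast_column_at` and the use of plain lists as columns are assumptions for illustration only.

```python
class ToyAccessor:
    """Hypothetical dict-backed stand-in for a column accessor."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def names(self):
        return tuple(self._data)

    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, column):
        self._data[name] = column


def cast_column_at(accessor, index, cast):
    """Cast the column at position `index` and store it back under the same label."""
    name = accessor.names[index]  # positional lookup of the label
    # One dict read and one dict write, mirroring the original two lines.
    accessor[name] = [cast(value) for value in accessor[name]]


acc = ToyAccessor({"a": [1, 2], "b": [3, 4]})
cast_column_at(acc, 1, str)
print(acc["b"])
```

Going through a public indexer such as `iloc` would reach the same mapping eventually, but only after resolving the indexer arguments and building intermediate objects, which is the overhead the reviewers point out.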

wence-

Some very small nits, in addition to @galipremsagar's comment.

wence- · 2024-08-30T08:01:27Z


Yeah, this is quite a lot more expensive. We're replacing two dict lookups with quite a lot of work to determine how to do those two dict lookups.

wence- · 2024-08-30T08:20:18Z

python/cudf/cudf/core/frame.py

+        return zip(self._column_names, self._columns)
+
+    @property
+    def _dtypes(self) -> abc.Generator:


Suggested change

-    def _dtypes(self) -> abc.Generator:
+    def _dtypes(self) -> abc.Generator[tuple[Hashable, Dtype], None, None]:

? can't remember what the type of col.dtype is

Thanks. Yup it's Dtype

wence- · 2024-08-30T08:22:06Z

python/cudf/cudf/core/frame.py

@@ -75,8 +75,13 @@ def _columns(self) -> tuple[ColumnBase, ...]:
        return self._data.columns

    @property
-    def _dtypes(self) -> abc.Iterable:
-        return zip(self._data.names, (col.dtype for col in self._data.columns))
+    def _column_labels_and_values(self) -> abc.Iterable:


Suggested change

-    def _column_labels_and_values(self) -> abc.Iterable:
+    def _column_labels_and_values(self) -> abc.Iterable[tuple[Hashable, Dtype]]:

?

The second argument in the tuple should be ColumnBase, but thanks!
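
For reference, the two annotations discussed in these threads can be sketched as follows. `ToyColumn` and `ToyFrame` are hypothetical stand-ins, and `Dtype` is aliased to `str` purely so the example runs without cuDF; the real properties live on `Frame` and use `ColumnBase` and cuDF's own `Dtype`. The sketch assumes Python 3.9 or later for the subscripted `collections.abc` generics.

```python
from collections.abc import Generator, Hashable, Iterable
from dataclasses import dataclass

Dtype = str  # assumption: placeholder for cuDF's Dtype alias


@dataclass
class ToyColumn:
    """Hypothetical stand-in for ColumnBase."""

    dtype: Dtype
    values: list


class ToyFrame:
    def __init__(self, data: dict[Hashable, ToyColumn]) -> None:
        self._data = data

    @property
    def _column_labels_and_values(self) -> Iterable[tuple[Hashable, ToyColumn]]:
        # zip() returns an iterator of (label, column) pairs.
        return zip(self._data.keys(), self._data.values())

    @property
    def _dtypes(self) -> Generator[tuple[Hashable, Dtype], None, None]:
        # A generator expression yields (label, dtype) pairs lazily.
        return (
            (label, col.dtype) for label, col in self._column_labels_and_values
        )


frame = ToyFrame(
    {
        "a": ToyColumn("int64", [1, 2]),
        "b": ToyColumn("category", ["x", "y"]),
    }
)
print(dict(frame._dtypes))
```

The second element of the tuple differs between the two properties: a column object for `_column_labels_and_values`, and that column's dtype for `_dtypes`.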

galipremsagar · 2024-09-19T21:40:36Z

/merge

…ilable (rapidsai#16652)

There are some places where a public object like `DataFrame` or `Index` accesses a `ColumnAccessor` attribute even though the same information is available through an attribute on a shared base class (like `Frame`).

In an effort to access the `ColumnAccessor` less, replaced usages of `._data.attribute` with a `Frame`-specific attribute.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16652

Access Frame attributes instead of ColumnAccessor attributes when available

0ef3dfb

mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 23, 2024

mroeschke requested a review from a team as a code owner August 23, 2024 23:23

mroeschke requested review from wence- and brandon-b-miller August 23, 2024 23:23

mroeschke added 4 commits August 26, 2024 11:19

Merge remote-tracking branch 'upstream/branch-24.10' into ref/frame/attributes

b12f34f

Revert some attributes

9358b0f

Merge remote-tracking branch 'upstream/branch-24.10' into ref/frame/attributes

bf7b7d7

Fix parquet bug

f0ea329

mroeschke mentioned this pull request Aug 28, 2024

Fix loc/iloc.__setitem__[:, loc] with non cupy types #16677

Merged

3 tasks

mroeschke added 5 commits August 28, 2024 12:05

Merge remote-tracking branch 'upstream/branch-24.10' into ref/frame/attributes

b731708

Merge remote-tracking branch 'upstream/branch-24.10' into ref/frame/attributes

afdf88e

Typo and bug

71cb94b

Merge remote-tracking branch 'upstream/branch-24.10' into ref/frame/attributes

5fdb504

Another parquet fix

c6da4c3

galipremsagar reviewed Aug 30, 2024

wence- reviewed Aug 30, 2024

mroeschke added 2 commits August 30, 2024 11:40

Merge remote-tracking branch 'upstream/branch-24.10' into ref/frame/attributes

69a567b

Typing and don't use public API

1bffbe5

vyasr requested review from galipremsagar and wence- September 3, 2024 18:23

Merge branch 'branch-24.10' into ref/frame/attributes

9b43ce6

galipremsagar approved these changes Sep 19, 2024

rapids-bot bot merged commit d63ca6a into rapidsai:branch-24.10 Sep 19, 2024
94 checks passed

mroeschke deleted the ref/frame/attributes branch September 25, 2024 18:08
