Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects #17629

Merged: 27 commits, Jan 29, 2025
Changes from 8 commits
Commits
b5eea1f
Add a public api to get fast slow objects
galipremsagar Dec 19, 2024
7bc76e5
Merge remote-tracking branch 'upstream/branch-25.02' into 17524
galipremsagar Jan 24, 2025
3cdfe94
update names and add fast paths
galipremsagar Jan 24, 2025
34375dc
centralize logic
galipremsagar Jan 25, 2025
31f9e99
fix
galipremsagar Jan 25, 2025
72ba73f
cleanup
galipremsagar Jan 25, 2025
3fd679f
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 25, 2025
37764c2
Apply suggestions from code review
galipremsagar Jan 25, 2025
bbe0fa4
Apply suggestions from code review
galipremsagar Jan 27, 2025
cf6888f
wrap result
galipremsagar Jan 27, 2025
528c189
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 27, 2025
6b744f4
Update faq.md
galipremsagar Jan 27, 2025
9ec0215
style
galipremsagar Jan 27, 2025
a3c49fd
add is_cudf_pandas.. APIs
galipremsagar Jan 27, 2025
3f06e70
update docs
galipremsagar Jan 27, 2025
95ba799
Merge remote-tracking branch 'upstream/branch-25.02' into 17524
galipremsagar Jan 27, 2025
a2e97f5
Apply suggestions from code review
galipremsagar Jan 27, 2025
b96f8ff
Apply suggestions from code review
galipremsagar Jan 27, 2025
0044d8f
update API
galipremsagar Jan 28, 2025
a77fecc
revert cudf.pandas spilling into cudf
galipremsagar Jan 28, 2025
aafbd31
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
f15d37b
Update docs/cudf/source/cudf_pandas/faq.md
galipremsagar Jan 28, 2025
f94253d
cleanup
galipremsagar Jan 28, 2025
daaa5df
update api
galipremsagar Jan 28, 2025
eaa23cb
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
9c58a71
cleanup
galipremsagar Jan 28, 2025
69837f3
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
24 changes: 24 additions & 0 deletions docs/cudf/source/cudf_pandas/faq.md
@@ -142,6 +142,30 @@ cuDF (learn more in [this
blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
provides a similar configuration-based plugin for Spark.


## How can I access the underlying GPU or CPU objects?

When working with `cudf.pandas` proxy objects, it is sometimes necessary to get true `cudf` or `pandas` objects that reside on GPU or CPU.
For example, this can be used to ensure that GPU-aware libraries that support both `cudf` and `pandas` can use the `cudf`-optimized code paths that keep data on GPU when processing `cudf.pandas` objects.
Otherwise, the library might use less-optimized CPU code because it thinks that the `cudf.pandas` object is a plain `pandas` dataframe.

The following methods can be used to retrieve the actual `cudf` or `pandas` objects:

- `as_gpu_object()`: This method returns the `cudf` object from the proxy.
- `as_cpu_object()`: This method returns the `pandas` object from the proxy.

Here is an example of how to use these methods:

```python
# Assuming `proxy_obj` is a cudf.pandas proxy object
cudf_obj = proxy_obj.as_gpu_object()
pandas_obj = proxy_obj.as_cpu_object()

# Now you can use `cudf_obj` and `pandas_obj` with libraries that are cudf or pandas aware
```

Be aware that if `cudf.pandas` objects are converted to their underlying `cudf` or `pandas` types, the `cudf.pandas` proxy no longer controls them. This means that automatic conversion between GPU and CPU types and automatic fallback from GPU to CPU functionality will not occur.
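
The caveat above can be illustrated with a toy model. This is a minimal sketch, not `cudf.pandas` code: `FastFrame`, `SlowFrame`, and `ToyProxy` are hypothetical stand-ins that only mimic the idea of a proxy that falls back from a fast object to a slow one. Only the method names `as_gpu_object()` and `as_cpu_object()` come from the text above.

```python
class FastFrame:
    """Toy stand-in for a GPU object that implements only some operations."""

    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)


class SlowFrame(FastFrame):
    """Toy stand-in for a CPU object that implements everything."""

    def median(self):
        ordered = sorted(self.values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


class ToyProxy:
    """Toy proxy: tries the fast object first, then falls back to the slow one."""

    def __init__(self, values):
        self._fast = FastFrame(values)

    def as_gpu_object(self):
        return self._fast

    def as_cpu_object(self):
        return SlowFrame(self._fast.values)

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the proxy fails.
        fast = self._fast
        if hasattr(fast, name):
            return getattr(fast, name)
        return getattr(self.as_cpu_object(), name)


proxy = ToyProxy([3, 1, 2])
proxy_median = proxy.median()  # served by the slow object via fallback

gpu_obj = proxy.as_gpu_object()
try:
    gpu_obj.median()
    fell_back = True
except AttributeError:
    # The extracted object is no longer wrapped, so nothing falls back for it.
    fell_back = False
```

In this toy, `proxy.median()` succeeds because the proxy routes the call to the slow object, while the same call on the extracted fast object raises `AttributeError`.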

(are-there-any-known-limitations)=
## Are there any known limitations?

2 changes: 1 addition & 1 deletion python/cudf/cudf/core/column/column.py
@@ -1251,7 +1251,7 @@ def as_categorical_column(self, dtype) -> ColumnBase:
)

# Categories must be unique and sorted in ascending order.
cats = self.unique().sort_values().astype(self.dtype)
cats = self.unique().sort_values()
label_dtype = min_unsigned_type(len(cats))
labels = self._label_encoding(
cats=cats, dtype=label_dtype, na_sentinel=cudf.Scalar(1)
15 changes: 15 additions & 0 deletions python/cudf/cudf/core/dataframe.py
@@ -96,6 +96,7 @@
from cudf.utils.utils import (
GetAttrGetItemMixin,
_external_only_api,
_extract_from_proxy,
_is_null_host_scalar,
)

@@ -707,9 +708,23 @@ def __init__(
if copy is not None:
raise NotImplementedError("copy is not currently implemented.")
super().__init__({}, index=cudf.Index([]))

if nan_as_null is no_default:
nan_as_null = not cudf.get_option("mode.pandas_compatible")

if cudf.get_option("mode.pandas_compatible"):
data, data_extracted = _extract_from_proxy(data)
index, index_extracted = _extract_from_proxy(index)
columns, columns_extracted = _extract_from_proxy(
columns, fast=False
)
if (
(data is None or data_extracted)
and (index is None or index_extracted)
and (columns is None or columns_extracted)
) and (dtype is None and copy is None):
self.__dict__.update(data.__dict__)
Contributor

Could we use _mimic_inplace instead?

Contributor Author

I feel this is much more lightweight than using _mimic_inplace, and it also copies cached attributes.

return
if isinstance(columns, (Series, cudf.BaseIndex)):
columns = columns.to_pandas()
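
The fast path discussed in this hunk adopts another instance's state with `self.__dict__.update(data.__dict__)`. The following is a minimal sketch of that general Python technique, assuming a hypothetical `Frame` class with a hypothetical `_cache` attribute. It is not the cudf implementation and says nothing about what `_mimic_inplace` does.

```python
class Frame:
    """Toy container with a lazily filled cache (stand-in, not cudf code)."""

    def __init__(self, data=None):
        if isinstance(data, Frame):
            # Fast path: adopt every instance attribute of `data`, including
            # anything it has already cached. This is a shallow copy of the
            # attribute dictionary, so the attribute values are shared.
            self.__dict__.update(data.__dict__)
            return
        self._data = dict(data or {})
        self._cache = {}

    def nrows(self):
        if "nrows" not in self._cache:
            lengths = [len(col) for col in self._data.values()]
            self._cache["nrows"] = max(lengths, default=0)
        return self._cache["nrows"]


original = Frame({"a": [1, 2, 3]})
original.nrows()            # populates original._cache
adopted = Frame(original)   # takes the fast path

shares_cache = adopted._cache is original._cache
cached_rows = adopted._cache.get("nrows")
```

Because the update is shallow, the new instance and the original refer to the same underlying objects. That keeps the path cheap and carries cached values over, but a later in-place mutation through one instance is visible through the other.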

12 changes: 11 additions & 1 deletion python/cudf/cudf/core/index.py
@@ -57,7 +57,11 @@
is_mixed_with_object_dtype,
)
from cudf.utils.performance_tracking import _performance_tracking
from cudf.utils.utils import _warn_no_dask_cudf, search_range
from cudf.utils.utils import (
_extract_from_proxy,
_warn_no_dask_cudf,
search_range,
)

if TYPE_CHECKING:
from collections.abc import Generator, Iterable
@@ -1067,6 +1071,12 @@ class Index(SingleColumnFrame, BaseIndex, metaclass=IndexMeta):
@_performance_tracking
def __init__(self, data, **kwargs):
name = _getdefault_name(data, name=kwargs.get("name"))
if cudf.get_option("mode.pandas_compatible"):
data, data_extracted = _extract_from_proxy(data)
if data_extracted and len(kwargs) == 0:
self.__dict__.update(data.__dict__)
return

super().__init__({name: data})

@_performance_tracking
17 changes: 17 additions & 0 deletions python/cudf/cudf/core/series.py
@@ -68,6 +68,9 @@
to_cudf_compatible_scalar,
)
from cudf.utils.performance_tracking import _performance_tracking
from cudf.utils.utils import (
_extract_from_proxy,
)

if TYPE_CHECKING:
from collections.abc import MutableMapping
@@ -626,6 +629,20 @@ def __init__(
):
if nan_as_null is no_default:
nan_as_null = not cudf.get_option("mode.pandas_compatible")

if cudf.get_option("mode.pandas_compatible"):
data, data_extracted = _extract_from_proxy(data)
index, _ = _extract_from_proxy(index)

if (
data_extracted
and index is None
and dtype is None
and name is None
and copy is False
):
self.__dict__.update(data.__dict__)
return
index_from_data = None
name_from_data = None
if data is None:
9 changes: 8 additions & 1 deletion python/cudf/cudf/pandas/_wrappers/pandas.py
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import abc
@@ -266,6 +266,12 @@ def custom_repr_html(obj):
html_formatter.for_type(DataFrame, custom_repr_html)


def _Series_dtype(self):
# Fast-path to extract dtype from the current
# object without round-tripping through the slow<->fast
return self._fsproxy_wrapped.dtype


Series = make_final_proxy_type(
"Series",
cudf.Series,
@@ -285,6 +291,7 @@ def custom_repr_html(obj):
"_constructor": _FastSlowAttribute("_constructor"),
"_constructor_expanddim": _FastSlowAttribute("_constructor_expanddim"),
"_accessors": set(),
"dtype": _Series_dtype,
},
)

10 changes: 9 additions & 1 deletion python/cudf/cudf/pandas/fast_slow_proxy.py
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

@@ -204,6 +204,12 @@ def _fsproxy_fast_to_slow(self):
return fast_to_slow(self._fsproxy_wrapped)
return self._fsproxy_wrapped

def as_gpu_object(self):
return self._fsproxy_slow_to_fast()

Comment on lines +207 to +208
Contributor

Does it make sense to call isinstance_cudf_pandas inside this function and raise a friendly error message on False?

def as_cpu_object(self):
return self._fsproxy_fast_to_slow()

@property # type: ignore
def _fsproxy_state(self) -> _State:
return (
@@ -221,6 +227,8 @@ def _fsproxy_state(self) -> _State:
"_fsproxy_slow_type": slow_type,
"_fsproxy_slow_to_fast": _fsproxy_slow_to_fast,
"_fsproxy_fast_to_slow": _fsproxy_fast_to_slow,
"as_gpu_object": as_gpu_object,
"as_cpu_object": as_cpu_object,
"_fsproxy_state": _fsproxy_state,
}
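
One possible reading of the review question above about a friendly error is a small checking wrapper around the new method. This is a sketch under assumptions: `as_gpu_object_checked` and `StubProxy` are hypothetical names, and the check uses plain duck typing rather than `isinstance_cudf_pandas`, whose signature is not shown in this diff.

```python
def as_gpu_object_checked(obj):
    """Hypothetical helper: return the GPU object or raise a readable error."""
    method = getattr(obj, "as_gpu_object", None)
    if not callable(method):
        raise TypeError(
            "expected a cudf.pandas proxy object exposing as_gpu_object(), "
            f"got {type(obj).__name__}"
        )
    return method()


class StubProxy:
    """Hypothetical stand-in for a proxy; returns a placeholder value."""

    def as_gpu_object(self):
        return "gpu-object"


result = as_gpu_object_checked(StubProxy())

try:
    as_gpu_object_checked([1, 2, 3])
    message = None
except TypeError as exc:
    # A plain list has no as_gpu_object(), so the helper names the bad type.
    message = str(exc)
```

A wrapper like this turns a bare `AttributeError` on a non-proxy argument into a message that names the offending type.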

16 changes: 15 additions & 1 deletion python/cudf/cudf/utils/utils.py
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2024, NVIDIA CORPORATION.
# Copyright (c) 2020-2025, NVIDIA CORPORATION.
from __future__ import annotations

import decimal
@@ -451,3 +451,17 @@ def _datetime_timedelta_find_and_replace(
except TypeError:
result_col = original_column.copy(deep=True)
return result_col # type: ignore


def _extract_from_proxy(proxy, fast=True):
"""
Extract the object from a proxy object.
"""
try:
return (
(proxy.as_gpu_object(), True)
if fast
else (proxy.as_cpu_object(), True)
)
except AttributeError:
return (proxy, False)
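
To make the `(object, extracted)` return contract concrete, the sketch below repeats the function body from this hunk and exercises it with stand-ins. `StubProxy` and `BrokenProxy` are hypothetical classes introduced only for illustration. No cudf import is needed because the function relies purely on duck typing.

```python
def _extract_from_proxy(proxy, fast=True):
    """Same body as the function in the diff above."""
    try:
        return (
            (proxy.as_gpu_object(), True)
            if fast
            else (proxy.as_cpu_object(), True)
        )
    except AttributeError:
        return (proxy, False)


class StubProxy:
    """Hypothetical stand-in exposing the two public accessor methods."""

    def as_gpu_object(self):
        return "gpu"

    def as_cpu_object(self):
        return "cpu"


class BrokenProxy:
    """Hypothetical proxy whose accessor fails internally."""

    def as_gpu_object(self):
        # An AttributeError raised *inside* the method is caught as well.
        return self.missing_attribute


fast_result = _extract_from_proxy(StubProxy())
slow_result = _extract_from_proxy(StubProxy(), fast=False)
plain_result = _extract_from_proxy([1, 2])
none_result = _extract_from_proxy(None)

broken = BrokenProxy()
broken_result = _extract_from_proxy(broken)
```

Non-proxy inputs, including `None`, come back unchanged with `False`, which is why the `DataFrame` constructor hunk checks `data is None or data_extracted`. The `BrokenProxy` case shows a side effect of catching `AttributeError` broadly: a failure inside the accessor is reported as "not a proxy" rather than raised.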
8 changes: 7 additions & 1 deletion python/cudf/cudf_pandas_tests/test_cudf_pandas.py
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

@@ -1885,3 +1885,9 @@ def test_dataframe_setitem():
new_df = df + 1
df[df.columns] = new_df
tm.assert_equal(df, new_df)


def test_dataframe_get_fast_slow_methods():
df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
assert isinstance(df.as_gpu_object(), cudf.DataFrame)
assert isinstance(df.as_cpu_object(), pd.DataFrame)