Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects #17629

Merged Jan 29, 2025 (27 commits)
Changes from 21 commits

Commits
b5eea1f
Add a public api to get fast slow objects
galipremsagar Dec 19, 2024
7bc76e5
Merge remote-tracking branch 'upstream/branch-25.02' into 17524
galipremsagar Jan 24, 2025
3cdfe94
update names and add fast paths
galipremsagar Jan 24, 2025
34375dc
centralize logic
galipremsagar Jan 25, 2025
31f9e99
fix
galipremsagar Jan 25, 2025
72ba73f
cleanup
galipremsagar Jan 25, 2025
3fd679f
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 25, 2025
37764c2
Apply suggestions from code review
galipremsagar Jan 25, 2025
bbe0fa4
Apply suggestions from code review
galipremsagar Jan 27, 2025
cf6888f
wrap result
galipremsagar Jan 27, 2025
528c189
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 27, 2025
6b744f4
Update faq.md
galipremsagar Jan 27, 2025
9ec0215
style
galipremsagar Jan 27, 2025
a3c49fd
add is_cudf_pandas.. APIs
galipremsagar Jan 27, 2025
3f06e70
update docs
galipremsagar Jan 27, 2025
95ba799
Merge remote-tracking branch 'upstream/branch-25.02' into 17524
galipremsagar Jan 27, 2025
a2e97f5
Apply suggestions from code review
galipremsagar Jan 27, 2025
b96f8ff
Apply suggestions from code review
galipremsagar Jan 27, 2025
0044d8f
update API
galipremsagar Jan 28, 2025
a77fecc
revert cudf.pandas spilling into cudf
galipremsagar Jan 28, 2025
aafbd31
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
f15d37b
Update docs/cudf/source/cudf_pandas/faq.md
galipremsagar Jan 28, 2025
f94253d
cleanup
galipremsagar Jan 28, 2025
daaa5df
update api
galipremsagar Jan 28, 2025
eaa23cb
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
9c58a71
cleanup
galipremsagar Jan 28, 2025
69837f3
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
46 changes: 46 additions & 0 deletions docs/cudf/source/cudf_pandas/faq.md
@@ -142,6 +142,52 @@ cuDF (learn more in [this
blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
provides a similar configuration-based plugin for Spark.

## How do I know if an object is a `cudf.pandas` proxy object?

To determine if an object is a `cudf.pandas` proxy object, you can use the `isinstance_cudf_pandas` API. This function checks if the given object is a proxy object that wraps either a `cudf` or `pandas` object. Here is an example of how to use this API:

```python
from cudf.pandas import isinstance_cudf_pandas

obj = ...  # Your object here
if isinstance_cudf_pandas(obj, "Series"):
    print("The object is a cudf.pandas proxy Series object.")
else:
    print("The object is not a cudf.pandas proxy Series object.")
```

Review comment (Contributor, on the `isinstance_cudf_pandas(obj, "Series")` line): Normally `isinstance` takes a class, not a string. Is that an option here, too?

Reply (Contributor Author): Responded here: #17629 (comment)

To detect `Series`, `DataFrame`, `Index`, and `ndarray` objects separately, you can pass the type names as the second parameter:

* `isinstance_cudf_pandas(obj, "Series")`: Detects if the object is a `cudf.pandas` proxy `Series`.
* `isinstance_cudf_pandas(obj, "DataFrame")`: Detects if the object is a `cudf.pandas` proxy `DataFrame`.
* `isinstance_cudf_pandas(obj, "Index")`: Detects if the object is a `cudf.pandas` proxy `Index`.
* `isinstance_cudf_pandas(obj, "ndarray")`: Detects if the object is a `cudf.pandas` proxy `ndarray`.
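
As a rough illustration of how a name-based check like this behaves, the sketch below mimics it without importing `cudf`. `_FakeProxyBase`, the stand-in `Series` class, `NonProxySeries`, `is_proxy_like`, and `isinstance_proxy_by_name` are all hypothetical names invented for this example; only the comparison `obj.__class__.__name__ == type_name` mirrors the implementation shown in this PR's diff.

```python
class _FakeProxyBase:
    """Hypothetical stand-in for the cudf.pandas proxy machinery."""


class Series(_FakeProxyBase):
    """Hypothetical stand-in for a cudf.pandas proxy Series."""


# A non-proxy class that happens to share the name "Series" (hypothetical).
NonProxySeries = type("Series", (), {})


def is_proxy_like(obj):
    # Assumption: plays the role of cudf.pandas.is_proxy_object.
    return isinstance(obj, _FakeProxyBase)


def isinstance_proxy_by_name(obj, type_name):
    # Same shape as the PR's check: proxy test AND class-name comparison.
    return is_proxy_like(obj) and obj.__class__.__name__ == type_name


print(isinstance_proxy_by_name(Series(), "Series"))          # True
print(isinstance_proxy_by_name(Series(), "DataFrame"))       # False
print(isinstance_proxy_by_name(NonProxySeries(), "Series"))  # False
```

The last call suggests why the proxy test comes first: a matching class name alone would not distinguish a proxy from an unrelated class with the same name.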

## How can I access the underlying GPU or CPU objects?

When working with `cudf.pandas` proxy objects, it is sometimes necessary to get true `cudf` or `pandas` objects that reside on GPU or CPU.
For example, this can be used to ensure that GPU-aware libraries that support both `cudf` and `pandas` can use the `cudf`-optimized code paths that keep data on GPU when processing `cudf.pandas` objects.
Otherwise, the library might use less-optimized CPU code because it thinks that the `cudf.pandas` object is a plain `pandas` dataframe.

The following methods can be used to retrieve the actual `cudf` or `pandas` objects:

- `as_gpu_object()`: This method returns the `cudf` object from the proxy.
- `as_cpu_object()`: This method returns the `pandas` object from the proxy.

If `as_gpu_object()` is called on a proxy array, it returns a `cupy` array, and `as_cpu_object()` returns a `numpy` array.

Here is an example of how to use these methods:

```python
# Assuming `proxy_obj` is a cudf.pandas proxy object
cudf_obj = proxy_obj.as_gpu_object()
pandas_obj = proxy_obj.as_cpu_object()

# Now you can use `cudf_obj` and `pandas_obj` with libraries that are cudf or pandas aware
```

Be aware that if `cudf.pandas` objects are converted to their underlying `cudf` or `pandas` types, the `cudf.pandas` proxy no longer controls them. This means that automatic conversion between GPU and CPU types and automatic fallback from GPU to CPU functionality will not occur.
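
To make this caveat concrete, here is a heavily simplified sketch that does not use `cudf` at all. `FakeGpuFrame`, `FakeCpuFrame`, `FakeProxy`, and `exotic_op` are hypothetical stand-ins, and the fallback logic is only an approximation of what a proxy does; the point is that once the underlying object is extracted, nothing retries the operation on CPU.

```python
class FakeGpuFrame:
    """Hypothetical stand-in for a cudf object that lacks some operation."""

    def exotic_op(self):
        raise NotImplementedError("exotic_op is not available on GPU")


class FakeCpuFrame:
    """Hypothetical stand-in for the equivalent pandas object."""

    def exotic_op(self):
        return "computed on CPU"


class FakeProxy:
    """Heavily simplified, hypothetical stand-in for a cudf.pandas proxy."""

    def __init__(self):
        self._gpu = FakeGpuFrame()
        self._cpu = FakeCpuFrame()

    def as_gpu_object(self):
        return self._gpu

    def exotic_op(self):
        # Try the GPU implementation first, then fall back to the CPU one.
        try:
            return self._gpu.exotic_op()
        except NotImplementedError:
            return self._cpu.exotic_op()


proxy = FakeProxy()
print(proxy.exotic_op())  # the proxy falls back to the CPU implementation

gpu_obj = proxy.as_gpu_object()
try:
    gpu_obj.exotic_op()  # no proxy in the way, so no fallback happens
except NotImplementedError as exc:
    print(f"unwrapped object raised: {exc}")
```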

(are-there-any-known-limitations)=
## Are there any known limitations?

2 changes: 1 addition & 1 deletion python/cudf/cudf/core/column/column.py
@@ -1251,7 +1251,7 @@ def as_categorical_column(self, dtype) -> ColumnBase:
             )
 
         # Categories must be unique and sorted in ascending order.
-        cats = self.unique().sort_values().astype(self.dtype)
+        cats = self.unique().sort_values()
         label_dtype = min_unsigned_type(len(cats))
         labels = self._label_encoding(
             cats=cats, dtype=label_dtype, na_sentinel=cudf.Scalar(1)
1 change: 1 addition & 0 deletions python/cudf/cudf/core/dataframe.py
@@ -707,6 +707,7 @@ def __init__(
         if copy is not None:
             raise NotImplementedError("copy is not currently implemented.")
         super().__init__({}, index=cudf.Index([]))
+
         if nan_as_null is no_default:
             nan_as_null = not cudf.get_option("mode.pandas_compatible")

6 changes: 5 additions & 1 deletion python/cudf/cudf/core/index.py
@@ -57,7 +57,10 @@
     is_mixed_with_object_dtype,
 )
 from cudf.utils.performance_tracking import _performance_tracking
-from cudf.utils.utils import _warn_no_dask_cudf, search_range
+from cudf.utils.utils import (
+    _warn_no_dask_cudf,
+    search_range,
+)

 if TYPE_CHECKING:
     from collections.abc import Generator, Iterable
@@ -1067,6 +1070,7 @@ class Index(SingleColumnFrame, BaseIndex, metaclass=IndexMeta):
     @_performance_tracking
     def __init__(self, data, **kwargs):
         name = _getdefault_name(data, name=kwargs.get("name"))
+
         super().__init__({name: data})

     @_performance_tracking
1 change: 1 addition & 0 deletions python/cudf/cudf/core/series.py
@@ -625,6 +625,7 @@ def __init__(
     ):
         if nan_as_null is no_default:
             nan_as_null = not cudf.get_option("mode.pandas_compatible")
+
         index_from_data = None
         name_from_data = None
         if data is None:
5 changes: 4 additions & 1 deletion python/cudf/cudf/pandas/__init__.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
 # All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

@@ -8,6 +8,9 @@
 import pylibcudf
 import rmm.mr

+from ._wrappers.pandas import (
+    isinstance_cudf_pandas,
+)
 from .fast_slow_proxy import is_proxy_object
 from .magics import load_ipython_extension
 from .profiler import Profiler
6 changes: 5 additions & 1 deletion python/cudf/cudf/pandas/_wrappers/numpy.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
 # All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

@@ -176,3 +176,7 @@ def ndarray__array_ufunc__(self, ufunc, method, *inputs, **kwargs):
     cupy._core.flags.Flags,
     _numpy_flagsobj,
 )
+
+
+def is_cudf_pandas_ndarray(obj):
+    return is_proxy_object(obj) and isinstance(obj, ndarray)

Review comment (Contributor): Can we remove this?

Reply (Contributor Author): Removed.


15 changes: 14 additions & 1 deletion python/cudf/cudf/pandas/_wrappers/pandas.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
 # All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 import abc
@@ -35,7 +35,9 @@
     _fast_slow_function_call,
     _FastSlowAttribute,
     _FunctionProxy,
+    _maybe_wrap_result,
     _Unusable,
+    is_proxy_object,
     make_final_proxy_type as _make_final_proxy_type,
     make_intermediate_proxy_type as _make_intermediate_proxy_type,
     register_proxy_func,
@@ -266,6 +268,12 @@ def custom_repr_html(obj):
     html_formatter.for_type(DataFrame, custom_repr_html)


+def _Series_dtype(self):
+    # Fast-path to extract dtype from the current
+    # object without round-tripping through the slow<->fast
+    return _maybe_wrap_result(self._fsproxy_wrapped.dtype, None)
+
+
 Series = make_final_proxy_type(
     "Series",
     cudf.Series,
@@ -285,6 +293,7 @@ def custom_repr_html(obj):
         "_constructor": _FastSlowAttribute("_constructor"),
         "_constructor_expanddim": _FastSlowAttribute("_constructor_expanddim"),
         "_accessors": set(),
+        "dtype": _Series_dtype,
     },
 )

@@ -1704,6 +1713,10 @@ def holiday_calendar_factory_wrapper(*args, **kwargs):
 )


+def isinstance_cudf_pandas(obj, type_name):
+    return is_proxy_object(obj) and obj.__class__.__name__ == type_name
+
Review comment (Contributor): Is the second half of this sufficiently precise? Can we rely on names alone? I would like to use some kind of "real" check like `isinstance` if that is allowable.

Reply (Contributor Author): I think allowing real types will make this API more complex and confusing. We don't want to allow `isinstance_cudf_pandas(obj, pd.DataFrame/cudf.DataFrame)`, as users might get confused about what they are actually checking when `cudf.pandas` is enabled versus disabled.

xgboost has this kind of string-based check: https://github.com/dmlc/xgboost/pull/11014/files#diff-bf11fe0b3133c5b20253ea67b82c3e576513c8079f3be355fc323e3e903d989cR852

Reply (@bdice, Contributor, Jan 28, 2025): Typically string-based checks are used as a way to avoid requiring an import of that package. We could still accept a class in the API and do a string-based check of `obj.__class__.__name__ == type.__name__`.

Reply (Contributor Author): Updated the API to use this approach. 👍
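
For readers following the thread, the class-accepting variant suggested above could look roughly like the sketch below. This is not taken from the merged code: `class_name_matches` and the stand-in `DataFrame` and `Series` classes are hypothetical, the proxy test (`is_proxy_object`) is left out to keep the example self-contained, and accepting a string as well as a class is an extra assumption made here for illustration.

```python
def class_name_matches(obj, expected):
    """Compare obj's class name with a class or a string (hypothetical helper)."""
    # Accept either a class (use its __name__) or a plain string.
    expected_name = expected if isinstance(expected, str) else expected.__name__
    return obj.__class__.__name__ == expected_name


class DataFrame:
    """Hypothetical stand-in; in practice this could be a real DataFrame class."""


class Series:
    """Hypothetical stand-in for a Series class."""


print(class_name_matches(DataFrame(), DataFrame))    # True
print(class_name_matches(DataFrame(), "DataFrame"))  # True
print(class_name_matches(DataFrame(), Series))       # False
```

Because only names are compared, the caller can pass whichever class they already have imported, which keeps the call site looking like a normal `isinstance` check while the comparison itself stays string-based.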


# timestamps and timedeltas are not proxied, but non-proxied
# pandas types are currently not picklable. Thus, we define
# custom reducer/unpicker functions for these types:
10 changes: 9 additions & 1 deletion python/cudf/cudf/pandas/fast_slow_proxy.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
 # All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

@@ -204,6 +204,12 @@ def _fsproxy_fast_to_slow(self):
             return fast_to_slow(self._fsproxy_wrapped)
         return self._fsproxy_wrapped

+    def as_gpu_object(self):
+        return self._fsproxy_slow_to_fast()
+

Review comment (Contributor, on lines +207 to +208): Does it make sense to call `isinstance_cudf_pandas` inside this function and raise a friendly error message on False?

+    def as_cpu_object(self):
+        return self._fsproxy_fast_to_slow()
+
     @property  # type: ignore
     def _fsproxy_state(self) -> _State:
         return (
@@ -221,6 +227,8 @@ def _fsproxy_state(self) -> _State:
         "_fsproxy_slow_type": slow_type,
         "_fsproxy_slow_to_fast": _fsproxy_slow_to_fast,
         "_fsproxy_fast_to_slow": _fsproxy_fast_to_slow,
+        "as_gpu_object": as_gpu_object,
+        "as_cpu_object": as_cpu_object,
         "_fsproxy_state": _fsproxy_state,
     }

16 changes: 15 additions & 1 deletion python/cudf/cudf/utils/utils.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2020-2024, NVIDIA CORPORATION.
+# Copyright (c) 2020-2025, NVIDIA CORPORATION.
 from __future__ import annotations

 import decimal
@@ -451,3 +451,17 @@ def _datetime_timedelta_find_and_replace(
     except TypeError:
         result_col = original_column.copy(deep=True)
     return result_col  # type: ignore
+
+
+def _extract_from_proxy(proxy: Any, fast: bool = True) -> tuple[Any, bool]:
+    """
+    Extract the object from a proxy object.
+    """
+    try:
+        return (
+            (proxy.as_gpu_object(), True)
+            if fast
+            else (proxy.as_cpu_object(), True)
+        )
+    except AttributeError:
+        return (proxy, False)
34 changes: 34 additions & 0 deletions python/cudf/cudf_pandas_tests/test_cudf_pandas.py
@@ -65,6 +65,10 @@
     get_calendar,
 )

+from cudf.pandas import (
+    isinstance_cudf_pandas,
+)
+
 # Accelerated pandas has the real pandas and cudf modules as attributes
 pd = xpd._fsproxy_slow
 cudf = xpd._fsproxy_fast
@@ -1885,3 +1889,33 @@ def test_dataframe_setitem():
     new_df = df + 1
     df[df.columns] = new_df
     tm.assert_equal(df, new_df)
+
+
+def test_dataframe_get_fast_slow_methods():
+    df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
+    assert isinstance(df.as_gpu_object(), cudf.DataFrame)
+    assert isinstance(df.as_cpu_object(), pd.DataFrame)
+
+
+def test_is_cudf_pandas():
+    s = xpd.Series([1, 2, 3])
+    df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
+    index = xpd.Index([1, 2, 3])
+
+    assert isinstance_cudf_pandas(s, "Series")
+    assert isinstance_cudf_pandas(df, "DataFrame")
+    assert isinstance_cudf_pandas(index, "Index")
+    assert isinstance_cudf_pandas(index.values, "ndarray")
+
+    for obj in [s, df, index, index.values]:
+        assert not isinstance_cudf_pandas(obj._fsproxy_slow, "Series")
+        assert not isinstance_cudf_pandas(obj._fsproxy_fast, "Series")
+
+        assert not isinstance_cudf_pandas(obj._fsproxy_slow, "DataFrame")
+        assert not isinstance_cudf_pandas(obj._fsproxy_fast, "DataFrame")
+
+        assert not isinstance_cudf_pandas(obj._fsproxy_slow, "Index")
+        assert not isinstance_cudf_pandas(obj._fsproxy_fast, "Index")
+
+        assert not isinstance_cudf_pandas(obj._fsproxy_slow, "ndarray")
+        assert not isinstance_cudf_pandas(obj._fsproxy_fast, "ndarray")