Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects #17629

Merged
merged 27 commits into from
Jan 29, 2025
Changes from 8 commits
Commits
b5eea1f
Add a public api to get fast slow objects
galipremsagar Dec 19, 2024
7bc76e5
Merge remote-tracking branch 'upstream/branch-25.02' into 17524
galipremsagar Jan 24, 2025
3cdfe94
update names and add fast paths
galipremsagar Jan 24, 2025
34375dc
centralize logic
galipremsagar Jan 25, 2025
31f9e99
fix
galipremsagar Jan 25, 2025
72ba73f
cleanup
galipremsagar Jan 25, 2025
3fd679f
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 25, 2025
37764c2
Apply suggestions from code review
galipremsagar Jan 25, 2025
bbe0fa4
Apply suggestions from code review
galipremsagar Jan 27, 2025
cf6888f
wrap result
galipremsagar Jan 27, 2025
528c189
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 27, 2025
6b744f4
Update faq.md
galipremsagar Jan 27, 2025
9ec0215
style
galipremsagar Jan 27, 2025
a3c49fd
add is_cudf_pandas.. APIs
galipremsagar Jan 27, 2025
3f06e70
update docs
galipremsagar Jan 27, 2025
95ba799
Merge remote-tracking branch 'upstream/branch-25.02' into 17524
galipremsagar Jan 27, 2025
a2e97f5
Apply suggestions from code review
galipremsagar Jan 27, 2025
b96f8ff
Apply suggestions from code review
galipremsagar Jan 27, 2025
0044d8f
update API
galipremsagar Jan 28, 2025
a77fecc
revert cudf.pandas spilling into cudf
galipremsagar Jan 28, 2025
aafbd31
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
f15d37b
Update docs/cudf/source/cudf_pandas/faq.md
galipremsagar Jan 28, 2025
f94253d
cleanup
galipremsagar Jan 28, 2025
daaa5df
update api
galipremsagar Jan 28, 2025
eaa23cb
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
9c58a71
cleanup
galipremsagar Jan 28, 2025
69837f3
Merge branch 'branch-25.02' into 17524
galipremsagar Jan 28, 2025
22 changes: 22 additions & 0 deletions docs/cudf/source/cudf_pandas/faq.md
@@ -142,6 +142,26 @@ cuDF (learn more in [this
blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
provides a similar configuration-based plugin for Spark.

## How do I know if an object is a `cudf.pandas` proxy object?

To determine if an object is a `cudf.pandas` proxy object, you can use the `is_cudf_pandas_obj` API. This function checks if the given object is a proxy object that wraps either a `cudf` or `pandas` object. Here is an example of how to use this API:

```python
from cudf.pandas import is_cudf_pandas_obj

obj = ... # Your object here
if is_cudf_pandas_obj(obj):
    print("The object is a cudf.pandas proxy object.")
else:
    print("The object is not a cudf.pandas proxy object.")
```

Separate APIs detect proxy `Series`, `DataFrame`, `Index`, and `ndarray` objects:

* `is_cudf_pandas_series`: Detects if the object is a `cudf.pandas` proxy `Series`.
* `is_cudf_pandas_dataframe`: Detects if the object is a `cudf.pandas` proxy `DataFrame`.
* `is_cudf_pandas_index`: Detects if the object is a `cudf.pandas` proxy `Index`.
* `is_cudf_pandas_nd_array`: Detects if the object is a `cudf.pandas` proxy `ndarray`.
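
As a rough illustration of how these predicates might be combined, the sketch below maps an object to a short label. The helper name `classify` and the label strings are hypothetical (they are not part of `cudf.pandas`); the imported predicates are the ones listed above, and the `ImportError` fallback exists only so the snippet also runs where `cudf` is not installed.

```python
def classify(obj):
    """Return a short label for `obj` (hypothetical helper, for illustration)."""
    try:
        from cudf.pandas import (
            is_cudf_pandas_dataframe,
            is_cudf_pandas_index,
            is_cudf_pandas_nd_array,
            is_cudf_pandas_obj,
            is_cudf_pandas_series,
        )
    except ImportError:
        # cudf (or a release that ships these helpers) is not available,
        # so nothing can be a cudf.pandas proxy here.
        return "not a proxy"

    # Cheap general check first, then the type-specific predicates.
    if not is_cudf_pandas_obj(obj):
        return "not a proxy"
    if is_cudf_pandas_dataframe(obj):
        return "proxy DataFrame"
    if is_cudf_pandas_series(obj):
        return "proxy Series"
    if is_cudf_pandas_index(obj):
        return "proxy Index"
    if is_cudf_pandas_nd_array(obj):
        return "proxy ndarray"
    return "other proxy"
```

A plain Python object such as a list is expected to fall into the `"not a proxy"` branch whether or not `cudf` is installed.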

## How can I access the underlying GPU or CPU objects?

@@ -154,6 +174,8 @@ The following methods can be used to retrieve the actual `cudf` or `pandas` obje
- `as_gpu_object()`: This method returns the `cudf` object from the proxy.
- `as_cpu_object()`: This method returns the `pandas` object from the proxy.

If `as_gpu_object()` is called on a proxy array, it returns a `cupy` array, and `as_cpu_object()` returns a `numpy` array.

Here is an example of how to use these methods:

```python
9 changes: 8 additions & 1 deletion python/cudf/cudf/pandas/__init__.py
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

@@ -8,6 +8,13 @@
import pylibcudf
import rmm.mr

from ._wrappers.numpy import is_cudf_pandas_nd_array
from ._wrappers.pandas import (
    is_cudf_pandas_dataframe,
    is_cudf_pandas_index,
    is_cudf_pandas_obj,
    is_cudf_pandas_series,
)
from .fast_slow_proxy import is_proxy_object
from .magics import load_ipython_extension
from .profiler import Profiler
6 changes: 5 additions & 1 deletion python/cudf/cudf/pandas/_wrappers/numpy.py
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

@@ -176,3 +176,7 @@ def ndarray__array_ufunc__(self, ufunc, method, *inputs, **kwargs):
cupy._core.flags.Flags,
_numpy_flagsobj,
)


def is_cudf_pandas_nd_array(obj):
    return is_proxy_object(obj) and isinstance(obj, ndarray)
20 changes: 19 additions & 1 deletion python/cudf/cudf/pandas/_wrappers/pandas.py
@@ -35,7 +35,9 @@
    _fast_slow_function_call,
    _FastSlowAttribute,
    _FunctionProxy,
    _maybe_wrap_result,
    _Unusable,
    is_proxy_object,
    make_final_proxy_type as _make_final_proxy_type,
    make_intermediate_proxy_type as _make_intermediate_proxy_type,
    register_proxy_func,
@@ -269,7 +271,7 @@ def custom_repr_html(obj):
def _Series_dtype(self):
# Fast-path to extract dtype from the current
# object without round-tripping through the slow<->fast
return self._fsproxy_wrapped.dtype
return _maybe_wrap_result(self._fsproxy_wrapped.dtype, None)


Series = make_final_proxy_type(
@@ -1711,6 +1713,22 @@ def holiday_calendar_factory_wrapper(*args, **kwargs):
)


def is_cudf_pandas_obj(obj):
    return is_proxy_object(obj)


def is_cudf_pandas_dataframe(obj):
    return is_proxy_object(obj) and isinstance(obj, DataFrame)


def is_cudf_pandas_series(obj):
    return is_proxy_object(obj) and isinstance(obj, Series)


def is_cudf_pandas_index(obj):
    return is_proxy_object(obj) and isinstance(obj, Index)
Contributor

Are these functions actually useful? It seems like it would be better to just tell the user how to check this themselves with is_proxy_object + isinstance.

Contributor Author

Yes, we could do that, but my intention here was to have a single API that informs users. Otherwise this ends up as a verbose pattern in libraries that are cudf/cudf.pandas aware.

Contributor

My suggestion in a thread above was to instead implement a single method (bikeshed the name as much as you wish, I'm mostly concerned with suggesting an implementation)

def isinstance_cudf_pandas(obj, type):
    return is_proxy_object(obj) and isinstance(obj, type)

so that they don't need to be aware of is_proxy_object and it "feels like" they're using isinstance.
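
For illustration only, here is a self-contained sketch of the single-helper idea suggested above. Everything except the body of `isinstance_cudf_pandas` is a hypothetical stand-in: `_ProxyMeta`, `ProxySeries`, `ProxyDataFrame`, `SlowSeries`, `SlowDataFrame`, and this toy `is_proxy_object` are not the real `cudf.pandas` implementation. The sketch assumes a proxy type whose `isinstance` check also accepts the unwrapped object, which is the situation where the extra `is_proxy_object` guard matters.

```python
class _ProxyMeta(type):
    """Hypothetical metaclass: instance checks also accept the wrapped type."""

    def __instancecheck__(cls, instance):
        slow = getattr(cls, "_slow_type", None)
        return type.__instancecheck__(cls, instance) or (
            slow is not None and isinstance(instance, slow)
        )


class SlowSeries:
    """Stand-in for an unwrapped (CPU) Series."""


class SlowDataFrame:
    """Stand-in for an unwrapped (CPU) DataFrame."""


class ProxySeries(metaclass=_ProxyMeta):
    _slow_type = SlowSeries


class ProxyDataFrame(metaclass=_ProxyMeta):
    _slow_type = SlowDataFrame


def is_proxy_object(obj):
    # Toy version: an object is a proxy if its class was built by _ProxyMeta.
    return isinstance(type(obj), _ProxyMeta)


def isinstance_cudf_pandas(obj, type_):
    # Same logic as the suggestion; `type_` avoids shadowing the builtin.
    return is_proxy_object(obj) and isinstance(obj, type_)
```

In this toy setup, `isinstance(SlowSeries(), ProxySeries)` is true, while `isinstance_cudf_pandas(SlowSeries(), ProxySeries)` is false because the unwrapped object fails the proxy check.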



# timestamps and timedeltas are not proxied, but non-proxied
# pandas types are currently not picklable. Thus, we define
# custom reducer/unpicker functions for these types:
2 changes: 1 addition & 1 deletion python/cudf/cudf/utils/utils.py
@@ -453,7 +453,7 @@ def _datetime_timedelta_find_and_replace(
return result_col # type: ignore


def _extract_from_proxy(proxy, fast=True):
def _extract_from_proxy(proxy: Any, fast: bool = True) -> tuple[Any, bool]:
"""
Extract the object from a proxy object.
"""
39 changes: 39 additions & 0 deletions python/cudf/cudf_pandas_tests/test_cudf_pandas.py
@@ -65,6 +65,14 @@
get_calendar,
)

from cudf.pandas import (
    is_cudf_pandas_dataframe,
    is_cudf_pandas_index,
    is_cudf_pandas_nd_array,
    is_cudf_pandas_obj,
    is_cudf_pandas_series,
)

# Accelerated pandas has the real pandas and cudf modules as attributes
pd = xpd._fsproxy_slow
cudf = xpd._fsproxy_fast
@@ -1891,3 +1899,34 @@ def test_dataframe_get_fast_slow_methods():
    df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
    assert isinstance(df.as_gpu_object(), cudf.DataFrame)
    assert isinstance(df.as_cpu_object(), pd.DataFrame)


def test_is_cudf_pandas():
    s = xpd.Series([1, 2, 3])
    df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
    index = xpd.Index([1, 2, 3])
    assert is_cudf_pandas_obj(s)
    assert is_cudf_pandas_obj(df)
    assert is_cudf_pandas_obj(index)
    assert is_cudf_pandas_obj(index.values)

    assert is_cudf_pandas_series(s)
    assert is_cudf_pandas_dataframe(df)
    assert is_cudf_pandas_index(index)
    assert is_cudf_pandas_nd_array(index.values)

    for obj in [s, df, index, index.values]:
        assert not is_cudf_pandas_obj(obj._fsproxy_slow)
        assert not is_cudf_pandas_obj(obj._fsproxy_fast)

        assert not is_cudf_pandas_series(obj._fsproxy_slow)
        assert not is_cudf_pandas_series(obj._fsproxy_fast)

        assert not is_cudf_pandas_dataframe(obj._fsproxy_slow)
        assert not is_cudf_pandas_dataframe(obj._fsproxy_fast)

        assert not is_cudf_pandas_index(obj._fsproxy_slow)
        assert not is_cudf_pandas_index(obj._fsproxy_fast)

        assert not is_cudf_pandas_nd_array(obj._fsproxy_slow)
        assert not is_cudf_pandas_nd_array(obj._fsproxy_fast)