[BUG] Merge fails on 24.10 nightly with Failed to generate metadata for RenameAxis(frame=Merge(...), index=None). #16892

Closed
praateekmahajan opened this issue Sep 24, 2024 · 1 comment · Fixed by #16899
Labels
bug Something isn't working

praateekmahajan commented Sep 24, 2024

Describe the bug

When the two sides of a merge have different index names (left._meta.index_name != right._meta.index_name), the behavior in dask-expr has changed; see https://github.com/dask/dask-expr/pull/1121/files

This raises RuntimeError: Failed to generate metadata for RenameAxis(frame=Merge(75f6fd3), index=None). This operation may not be supported by the current backend. The full stack trace and a screenshot of a debug checkpoint in dask_expr/_collection.py are below.

Steps/Code to reproduce bug
We ran into this in crossfit when running our pytest suite. Here is a repro that uses two functions from the crossfit library (namely sample_raw and reset_global_index). I imagine a simpler repro is possible, but this is what I was able to get in the time I had.

import dask_cudf
from crossfit.dataset.beir.raw import sample_raw
from crossfit.dataset.beir.load import reset_global_index
import os

dataset_name = "nq"
out_dir = None
blocksize = 2**30
raw_path = sample_raw(dataset_name, out_dir=out_dir, overwrite=False)


qrels_files = [
    f for f in os.listdir(os.path.join(raw_path, "qrels")) if f.endswith(".tsv")
]
qrels_file = qrels_files[0]

qrels_dtypes = {"query-id": "str", "corpus-id": "str", "score": "int32"}


queries_ddf = dask_cudf.read_json(
    os.path.join(raw_path, "queries.jsonl"),
    lines=True,
    blocksize=blocksize,
    dtype={"_id": "string", "text": "string"},
)[["_id", "text"]]
# Without this reset_global_index call, the merge below works fine.
queries_ddf = reset_global_index(queries_ddf)


qrels_ddf = dask_cudf.read_csv(
    os.path.join(raw_path, "qrels", qrels_file),
    sep="\t",
    dtype=qrels_dtypes,
)[["query-id", "corpus-id", "score"]]


qrels_ddf.merge(
    queries_ddf,
    left_on="query-id",
    right_on="_id",
    how="left",
)

print("Success")

Expected behavior
Before the 24.10 nightly, the merge worked as expected.

Installed crossfit using pip (i.e., cudf and the other packages below were installed via pip):

cudf-cu12==24.10.0a373
dask==2024.9.0
dask-cuda==24.10.0a22
dask-cudf-cu12==24.10.0a373
dask-expr==1.1.14
libcudf-cu12==24.10.0a373
pylibcudf-cu12==24.10.0a373
raft-dask-cu12==24.10.0a38
rapids-dask-dependency==24.10.0a8

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context

Traceback (most recent call last):
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/utils.py", line 228, in __getattr__
    return self[key]
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1347, in __getitem__
    out = self._get_columns_by_label(arg)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/frame.py", line 358, in _get_columns_by_label
    return self._from_data_like_self(self._data.select_by_label(labels))
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/column_accessor.py", line 401, in select_by_label
    return self._select_by_label_grouped(key)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/column_accessor.py", line 563, in _select_by_label_grouped
    result = self._grouped_data[key]
KeyError: 'rename_axis'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_core.py", line 470, in __getattr__
    return object.__getattribute__(self, key)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_expr.py", line 496, in _meta
    return self.operation(*args, **self._kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask/utils.py", line 1241, in __call__
    return getattr(__obj, self.method)(*args, **kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/utils.py", line 230, in __getattr__
    raise AttributeError(
AttributeError: DataFrame object has no attribute rename_axis

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_collection.py", line 4799, in new_collection
    meta = expr._meta
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_core.py", line 475, in __getattr__
    raise RuntimeError(
RuntimeError: Failed to generate metadata for RenameAxis(frame=Merge(75f6fd3), index=None). This operation may not be supported by the current backend.
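
The three chained tracebacks above can be read from the inside out: the frame falls back to a column lookup for the unknown attribute (KeyError), converts that into an AttributeError, and the metadata step then wraps it in a RuntimeError. Below is a minimal, stdlib-only sketch of that chain. The names `ColumnStore` and `generate_meta` are hypothetical stand-ins for illustration only; this is not the cudf or dask-expr code.

```python
class ColumnStore:
    """Hypothetical stand-in for a frame that resolves unknown attributes as column labels."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # Raises KeyError when the label is not a column.
        return self._columns[key]

    def __getattr__(self, key):
        # Only called when normal attribute lookup has already failed.
        if key.startswith("_"):
            raise AttributeError(key)
        try:
            return self[key]
        except KeyError:
            # Implicit chaining: the KeyError becomes this error's __context__.
            raise AttributeError(f"DataFrame object has no attribute {key}")


def generate_meta(frame, method, **kwargs):
    """Hypothetical stand-in for metadata generation: call `method` on a small frame."""
    try:
        return getattr(frame, method)(**kwargs)
    except AttributeError as err:
        raise RuntimeError(
            f"Failed to generate metadata for {method}. "
            "This operation may not be supported by the current backend."
        ) from err


meta = ColumnStore({"query-id": [], "corpus-id": [], "score": []})
try:
    generate_meta(meta, "rename_axis", index=None)
except RuntimeError as err:
    attr_err = err.__cause__
    chain = [
        type(err).__name__,
        type(attr_err).__name__,
        type(attr_err.__context__).__name__,
    ]
    print(chain)
```

Read this way, the KeyError: 'rename_axis' at the top of the trace is only a side effect of the attribute fallback; the actionable part is the AttributeError for the missing method.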

[Image: screenshot of the debug checkpoint in dask_expr/_collection.py]

@rjzamora
Member

Thanks for raising this issue, @praateekmahajan. Hopefully this will be resolved by #16899.

rapids-bot bot pushed a commit that referenced this issue Sep 25, 2024
…16899)

See #16895
Closes #16892

Dask-expr uses `rename_axis`, which is not supported by cudf yet. This is a temporary workaround until #16895 is resolved.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16899
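
For readers wondering what a workaround for a missing `rename_axis` can look like in general, here is a rough sketch of one possible fallback pattern: use `rename_axis` when the frame provides it, otherwise copy the frame and set the index name directly. This is an assumption-laden illustration, not the change made in #16899. The helper `rename_index_compat` and the `StubIndex`/`StubFrame` classes are hypothetical, and the fallback assumes the frame has a `copy()` method and an index with a settable `name`.

```python
def rename_index_compat(frame, name=None):
    """Return a frame whose index is named `name` (hypothetical helper).

    Prefers the frame's own rename_axis; otherwise falls back to
    copying the frame and assigning the index name.
    """
    rename_axis = getattr(frame, "rename_axis", None)
    if callable(rename_axis):
        return rename_axis(index=name)
    out = frame.copy()  # assumption: the backend frame supports copy()
    out.index.name = name  # assumption: the index name is settable
    return out


class StubIndex:
    """Hypothetical index that only carries a name."""

    def __init__(self, name):
        self.name = name


class StubFrame:
    """Hypothetical frame without rename_axis, used to exercise the fallback."""

    def __init__(self, index_name):
        self.index = StubIndex(index_name)

    def copy(self):
        return StubFrame(self.index.name)


original = StubFrame("global_id")
renamed = rename_index_compat(original)  # name=None clears the index name
print(original.index.name, renamed.index.name)
```

The feature check keeps the native method in use wherever it exists, so a fallback like this would become a no-op once the backend gains `rename_axis`.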