Fix image curation on latest RAPIDS #458

Merged
merged 3 commits into from
Jan 3, 2025
6 changes: 5 additions & 1 deletion nemo_curator/datasets/image_text_pair_dataset.py
@@ -79,7 +79,11 @@ def from_webdataset(cls, path: str, id_col: str):
path (str): The path to the WebDataset-like format on disk or cloud storage.
id_col (str): The column storing the unique identifier for each record.
"""
-metadata = dask_cudf.read_parquet(path)
+metadata = dask_cudf.read_parquet(path, split_row_groups=False)
+# TODO: This is a hack to ensure that the number of partitions is not combined
+# and remain the same as the number of shards.
+# DEBUG: Why is this happening?
+metadata = metadata.repartition(npartitions=metadata.npartitions)
metadata = metadata.map_partitions(cls._sort_partition, id_col=id_col)

tar_files = cls._get_tar_files(path)
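
The TODO above says the partition count must stay equal to the shard count. A minimal sketch of why that could matter, assuming (the diff does not confirm this) that each metadata partition is later matched by position with one tar shard. The function name `pair_partitions_with_shards`, the file names, and the `key` column are hypothetical, and plain Python lists stand in for dask_cudf partitions:

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def pair_partitions_with_shards(
    partitions: Sequence[List[Row]],
    tar_files: Sequence[str],
    id_col: str,
) -> List[Tuple[str, List[Row]]]:
    """Pair each metadata partition with the tar shard at the same index.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    if len(partitions) != len(tar_files):
        # If the reader merges several shards into one partition, the
        # positional pairing below would silently misalign, so fail loudly.
        raise ValueError(
            f"{len(partitions)} metadata partitions vs {len(tar_files)} tar shards"
        )
    # Sort each partition by id, standing in for the per-partition sort in the diff.
    return [
        (tar, sorted(rows, key=lambda row: row[id_col]))
        for tar, rows in zip(tar_files, partitions)
    ]


shards = ["00000.tar", "00001.tar"]
parts = [[{"key": "b"}, {"key": "a"}], [{"key": "c"}]]
paired = pair_partitions_with_shards(parts, shards, id_col="key")
```

With one partition per shard the pairing succeeds. If the two partitions were merged into one, the length check would raise instead of pairing rows with the wrong tar file.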
2 changes: 1 addition & 1 deletion nemo_curator/image/classifiers/base.py
@@ -17,9 +17,9 @@
import cudf
import cupy as cp
import torch
+from crossfit.backend.cudf.series import create_list_series_from_1d_or_2d_ar

from nemo_curator.datasets import ImageTextPairDataset
-from nemo_curator.utils.cudf_utils import create_list_series_from_1d_or_2d_ar
from nemo_curator.utils.distributed_utils import load_object_on_worker


2 changes: 1 addition & 1 deletion nemo_curator/image/embedders/base.py
@@ -16,11 +16,11 @@

import cupy as cp
import torch
+from crossfit.backend.cudf.series import create_list_series_from_1d_or_2d_ar
from tqdm import tqdm

from nemo_curator.datasets import ImageTextPairDataset
from nemo_curator.image.classifiers import ImageClassifier
-from nemo_curator.utils.cudf_utils import create_list_series_from_1d_or_2d_ar
from nemo_curator.utils.distributed_utils import load_object_on_worker


46 changes: 0 additions & 46 deletions nemo_curator/utils/cudf_utils.py
This file was deleted.