Repartition on dask-pandas is failing
Signed-off-by: Vibhu Jawa <[email protected]>
VibhuJawa committed Jan 9, 2025
1 parent eaeaee2 commit 9b81751
Showing 1 changed file with 0 additions and 4 deletions.
4 changes: 0 additions & 4 deletions nemo_curator/modules/semantic_dedup.py
@@ -338,14 +338,10 @@ def __call__(self, embeddings_dataset: DocumentDataset):

        with performance_report_if_with_ts_suffix(self.profile_dir, "clustering-model"):
            embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]

            embeddings_df = embeddings_df.repartition(
                partition_size=self.partition_size
            )
            embeddings_df = embeddings_df.to_backend("pandas").persist()
            # embeddings_df = embeddings_df.repartition(
            #     partition_size=self.partition_size
            # )
            embeddings_df = embeddings_df.to_backend("cudf")

            cupy_darr = embeddings_df.map_partitions(