Merge branch 'NVIDIA:main' into vjawa/fix_semdedup_on_nightly
VibhuJawa authored Jan 9, 2025
2 parents e65d304 + 9c8f185 commit eaeaee2
Showing 19 changed files with 270 additions and 139 deletions.
8 changes: 6 additions & 2 deletions .github/workflows/release.yml
@@ -25,10 +25,13 @@ on:
required: true
default: true
type: boolean

version-bump-branch:
type: string
required: true
description: Branch to target for version bump
jobs:
release:
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.17.4
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.18.4
with:
release-ref: ${{ inputs.release-ref }}
image-name: nemo_curator_container
@@ -43,6 +46,7 @@ jobs:
container-workdir: /opt/NeMo-Curator
library-name: NeMo Curator
dry-run: ${{ inputs.dry-run }}
version-bump-branch: ${{ inputs.version-bump-branch }}
secrets:
TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
18 changes: 15 additions & 3 deletions CHANGELOG.md
@@ -1,6 +1,18 @@
# Changelog

## NeMo Curator 0.5.0
## NVIDIA NeMo Curator 0.6.0

- Synthetic Data Generation for Text Retrieval
  - LLM-based Filters
    - Easiness
    - Answerability
  - Q&A Retrieval Generation Pipeline
- Parallel Dataset Curation for Machine Translation
  - Load/Write Bitext Files
  - Heuristic filtering (Histogram, Length Ratio)
  - Classifier filtering (Comet, Cometoid)

## NVIDIA NeMo Curator 0.5.0

### Highlights

@@ -16,15 +28,15 @@

**Full Changelog**: <https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0>

## NeMo Curator 0.4.1
## NVIDIA NeMo Curator 0.4.1

## What's Changed

* Add spacy<3.8 pin to r0.4.1 by @ayushdg in <https://github.com/NVIDIA/NeMo-Curator/pull/279>

**Full Changelog**: <https://github.com/NVIDIA/NeMo-Curator/compare/v0.4.0...v0.4.1>

## NeMo Curator 0.4.0
## NVIDIA NeMo Curator 0.4.0

## Highlights

6 changes: 4 additions & 2 deletions docs/user-guide/documentdataset.rst
@@ -68,14 +68,16 @@ Let's walk through this code line by line.
"books_dataset/books_02.jsonl"]
* ``books = DocumentDataset.read_json(files, add_filename=True)`` This will read the files listed into memory.
The ``add_filename=True`` option preserves the name of the shard (``books_00.jsonl``, ``books_01.jsonl``, etc.) as an additional ``filename`` field.
When the dataset is written back to disk, this option (in conjunction with the ``write_to_filename`` option) ensure that documents stay in their original shard.
The ``add_filename=True`` option preserves the name of the shard (``books_00.jsonl``, ``books_01.jsonl``, etc.) as an additional ``file_name`` field.
When the dataset is written back to disk, this option (in conjunction with the ``write_to_filename`` option and ``filename_col``) ensures that documents stay in their original shard.
This can be useful for manually inspecting the results of filtering shard by shard.
The ``add_filename`` option can also be used as a string, in which case it will be used as the name of the column (instead of the default ``file_name``).
* ``filter_step = ...`` This constructs and applies a heuristic filter for the length of the document.
More information is provided in the filtering page of the documentation.
* ``long_books.to_json("long_books/", write_to_filename=True)`` This writes the filtered dataset to a new directory.
As mentioned above, the ``write_to_filename=True`` preserves the sharding of the dataset.
If the dataset was not read in with ``add_filename=True``, setting ``write_to_filename=True`` will throw an error.
If the dataset was read with ``add_filename="path"``, then ``filename_col="path"`` must be set along with ``write_to_filename=True`` when writing.

``DocumentDataset`` is just a wrapper around a `Dask dataframe <https://docs.dask.org/en/stable/dataframe.html>`_.
The underlying dataframe can be accessed with the ``DocumentDataset.df`` member variable.
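The updated guide ties `add_filename` on the read side to `filename_col` on the write side. Below is a minimal sketch of that round trip; it assumes the `ScoreFilter`/`WordCountFilter` pair used elsewhere in the filtering docs and a `filename_col` keyword on `to_json`, as the text above describes.

```python
# Minimal sketch of the add_filename / filename_col round trip (assumptions:
# ScoreFilter and WordCountFilter as in the filtering docs, and a filename_col
# keyword on to_json as described in the updated guide).
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter

files = ["books_dataset/books_00.jsonl", "books_dataset/books_01.jsonl"]

# Keep the shard name in a custom column called "path" instead of the default "file_name".
books = DocumentDataset.read_json(files, add_filename="path")

# Heuristic filter on document length, as in the guide's walkthrough.
filter_step = ScoreFilter(WordCountFilter(min_words=80), text_field="text")
long_books = filter_step(books)

# The dataset was read with add_filename="path", so the same column name is
# passed back when writing one output file per original shard.
long_books.to_json("long_books/", write_to_filename=True, filename_col="path")
```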
35 changes: 10 additions & 25 deletions nemo_curator/classifiers/base.py
@@ -121,44 +121,29 @@ def _run_classifier_helper(
prob_col: str = None,
) -> "dask_cudf.DataFrame":

keep_prob = prob_col is not None
prob_internal_col = "_prob"
# TODO: Make crossfit handle this cleanly
pred_internal_col = "labels"
df["sliced_text"] = df[text_field].str.slice(0, max_chars)
if prob_col:
df[prob_col] = 0
else:
prob_col = "_prob"

columns_to_keep_list = df.columns.to_list()
columns_to_keep_list.remove("sliced_text")

classifier_pipe = op.Sequential(
op.Tokenizer(model, cols=["sliced_text"], tokenizer_type="default"),
op.Tokenizer(
model, cols=[text_field], tokenizer_type="default", max_chars=max_chars
),
op.Predictor(
model,
sorted_data_loader=True,
batch_size=batch_size,
pred_output_col=prob_internal_col,
pred_output_col=prob_col,
),
op.Labeler(labels, cols=[prob_col], suffix=label_col),
repartition=df.npartitions,
keep_cols=columns_to_keep_list,
)
df = classifier_pipe(df)

# TODO: Make crossfit handle this cleanly
# to prevent the labeler from dropping the prob_internal_col
# and combine it into a single step
labeling_pipe = op.Sequential(
op.Labeler(labels, cols=[prob_internal_col]),
keep_cols=columns_to_keep_list + [prob_internal_col],
)
df = labeling_pipe(df)

if keep_prob:
df = df.rename(
columns={prob_internal_col: prob_col, pred_internal_col: label_col},
)
else:
df = df.rename(columns={pred_internal_col: label_col})
df = df.drop(columns=[prob_internal_col])

return df


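Since the rendered diff interleaves removed and added lines, here is a sketch of the consolidated helper after this commit, pieced together from the added lines above. The `from crossfit import op` import path is an assumption; the `op.*` calls and their arguments are taken from the diff.

```python
# Sketch of the classifier helper after this commit (reconstructed from the
# added lines above, not copied verbatim from the file).
from typing import List

from crossfit import op  # assumed import path for the crossfit op namespace


def run_classifier_helper(df, model, labels: List[str], text_field: str,
                          label_col: str, max_chars: int, batch_size: int,
                          prob_col: str = None):
    if prob_col:
        df[prob_col] = 0      # caller wants probabilities kept under this name
    else:
        prob_col = "_prob"    # internal-only probability column

    columns_to_keep_list = df.columns.to_list()

    # Truncation (max_chars), prediction, and labeling now run in one pipeline,
    # so the separate labeling pass and the column renaming step are gone.
    classifier_pipe = op.Sequential(
        op.Tokenizer(model, cols=[text_field], tokenizer_type="default",
                     max_chars=max_chars),
        op.Predictor(model, sorted_data_loader=True, batch_size=batch_size,
                     pred_output_col=prob_col),
        op.Labeler(labels, cols=[prob_col], suffix=label_col),
        repartition=df.npartitions,
        keep_cols=columns_to_keep_list,
    )
    return classifier_pipe(df)
```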
24 changes: 15 additions & 9 deletions nemo_curator/datasets/doc_dataset.py
@@ -52,7 +52,7 @@ def read_json(
backend: Literal["pandas", "cudf"] = "pandas",
files_per_partition: Optional[int] = None,
blocksize: Optional[str] = "1gb",
add_filename: bool = False,
add_filename: Union[bool, str] = False,
input_meta: Union[str, dict] = None,
columns: Optional[List[str]] = None,
**kwargs,
@@ -64,7 +64,9 @@
input_files: The path of the input file(s).
backend: The backend to use for reading the data.
files_per_partition: The number of files to read per partition.
add_filename: Whether to add a "file_name" column to the DataFrame.
add_filename: Whether to add a filename column to the DataFrame.
If True, a new column is added to the DataFrame called `file_name`.
If str, sets new column name. Default is False.
input_meta: A dictionary or a string formatted as a dictionary, which outlines
the field names and their respective data types within the JSONL input file.
columns: If not None, only these columns will be read from the file.
@@ -91,7 +93,7 @@ def read_parquet(
backend: Literal["pandas", "cudf"] = "pandas",
files_per_partition: Optional[int] = None,
blocksize: Optional[str] = "1gb",
add_filename=False,
add_filename: Union[bool, str] = False,
columns: Optional[List[str]] = None,
**kwargs,
) -> "DocumentDataset":
@@ -102,7 +104,9 @@
input_files: The path of the input file(s).
backend: The backend to use for reading the data.
files_per_partition: The number of files to read per partition.
add_filename: Whether to add a "file_name" column to the DataFrame.
add_filename: Whether to add a filename column to the DataFrame.
If True, a new column is added to the DataFrame called `file_name`.
If str, sets new column name. Default is False.
columns: If not None, only these columns will be read from the file.
There is a significant performance gain when specifying columns for Parquet files.
@@ -135,7 +139,9 @@ def read_pickle(
input_files: The path of the input file(s).
backend: The backend to use for reading the data.
files_per_partition: The number of files to read per partition.
add_filename: Whether to add a "file_name" column to the DataFrame.
add_filename: Whether to add a filename column to the DataFrame.
If True, a new column is added to the DataFrame called `file_name`.
If str, sets new column name. Default is False.
columns: If not None, only these columns will be read from the file.
"""
@@ -152,7 +158,7 @@
def to_json(
self,
output_path: str,
write_to_filename: bool = False,
write_to_filename: Union[bool, str] = False,
keep_filename_column: bool = False,
):
"""
@@ -170,7 +176,7 @@ def to_parquet(
def to_parquet(
self,
output_path: str,
write_to_filename: bool = False,
write_to_filename: Union[bool, str] = False,
keep_filename_column: bool = False,
):
"""
@@ -188,7 +194,7 @@ def to_pickle(
def to_pickle(
self,
output_path: str,
write_to_filename: bool = False,
write_to_filename: Union[bool, str] = False,
):
raise NotImplementedError("DocumentDataset does not support to_pickle yet")

@@ -234,7 +240,7 @@ def _read_json_or_parquet(
input_files: Union[str, List[str]],
file_type: str,
backend: Literal["cudf", "pandas"],
add_filename: bool,
add_filename: Union[bool, str] = False,
files_per_partition: Optional[int] = None,
blocksize: Optional[str] = None,
input_meta: Union[str, dict] = None,
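The `read_parquet` docstring above notes a significant performance gain when columns are specified up front. A hypothetical usage sketch (paths and column names are placeholders):

```python
# Hypothetical usage sketch: restricting `columns` lets the Parquet reader skip
# unneeded column chunks, and `add_filename` as a string names the shard column.
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_parquet(
    ["shards/part_000.parquet", "shards/part_001.parquet"],  # placeholder paths
    backend="pandas",
    add_filename="source_shard",   # custom column name instead of "file_name"
    columns=["id", "text"],        # only these columns are read from disk
)
print(dataset.df.columns)
```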
17 changes: 11 additions & 6 deletions nemo_curator/datasets/parallel_dataset.py
@@ -1,11 +1,11 @@
import csv
from typing import List, Optional, Tuple, Union
from typing import List, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.distributed_utils import _resolve_filename_col, write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

@@ -31,7 +31,7 @@ def read_simple_bitext(
src_lang: str,
tgt_lang: str,
backend: str = "pandas",
add_filename: bool = False,
add_filename: Union[bool, str] = False,
npartitions: int = 16,
):
"""See `read_single_simple_bitext_file_pair` docstring for what "simple_bitext" means and usage of other parameters.
@@ -99,7 +99,7 @@ def read_single_simple_bitext_file_pair(
tgt_lang: str,
doc_id: str = None,
backend: str = "cudf",
add_filename: bool = False,
add_filename: Union[bool, str] = False,
) -> Union[dd.DataFrame, "dask_cudf.DataFrame"]:
"""This function reads a pair of "simple bitext" files into a pandas DataFrame.
A simple bitext is a common data format in machine translation.
@@ -129,7 +129,10 @@ def read_single_simple_bitext_file_pair(
tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. 'en')
doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
backend (str, optional): Backend of the data frame. Defaults to "cudf".
add_filename (bool, optional): Add "file_name" as an extra field to every segment in the file. Defaults to False.
add_filename (Union[bool, str]): Whether to add a filename column to the DataFrame.
If True, a new column is added to the DataFrame called `file_name`.
If str, sets new column name. Default is False.
Returns:
Union[dd.DataFrame, dask_cudf.DataFrame]
@@ -162,6 +165,8 @@ def read_single_simple_bitext_file_pair(
df_combined["tgt_lang"] = tgt_lang

if add_filename:
df_combined["file_name"] = remove_path_extension(src_input_file)
df_combined[_resolve_filename_col(add_filename)] = remove_path_extension(
src_input_file
)

return df_combined
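`_resolve_filename_col`, imported from `nemo_curator.utils.distributed_utils`, maps the bool-or-string `add_filename` value to a concrete column name. A hypothetical re-implementation of the behaviour implied by the docstrings and the call site above:

```python
# Hypothetical sketch of the helper's behaviour, inferred from the docstrings
# ("If True ... called `file_name`; if str, sets new column name") and the call
# above. The real implementation lives in nemo_curator.utils.distributed_utils.
from typing import Union


def resolve_filename_col(add_filename: Union[bool, str]) -> Union[str, bool]:
    if add_filename is True:
        return "file_name"          # default column name
    if isinstance(add_filename, str):
        return add_filename         # caller-chosen column name
    return False                    # no filename column requested


assert resolve_filename_col(True) == "file_name"
assert resolve_filename_col("source_shard") == "source_shard"
```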
1 change: 1 addition & 0 deletions nemo_curator/download/arxiv.py
@@ -415,6 +415,7 @@ def download_arxiv(
output_type=output_type,
keep_raw_download=keep_raw_download,
force_download=force_download,
filename_col="file_name",
)

return dataset
1 change: 1 addition & 0 deletions nemo_curator/download/commoncrawl.py
@@ -442,6 +442,7 @@ def download_common_crawl(
output_type=output_type,
keep_raw_download=keep_raw_download,
force_download=force_download,
filename_col="file_name",
)

return dataset
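`download_common_crawl` now forwards `filename_col="file_name"` to `download_and_extract`, so each extracted record keeps its shard name in that column. A hedged usage sketch (snapshot names and output path are placeholders; argument order follows the public download docs):

```python
# Hedged usage sketch: argument order follows the NeMo Curator download docs;
# the snapshot range and output path are placeholders.
from nemo_curator.download import download_common_crawl

dataset = download_common_crawl(
    "/output/common_crawl/",  # where extracted JSONL shards are written
    "2020-50",                # starting Common Crawl snapshot
    "2020-50",                # ending Common Crawl snapshot
    output_type="jsonl",
)

# Each record carries a "file_name" column, because download_common_crawl
# passes filename_col="file_name" through to download_and_extract.
print(dataset.df.head())
```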
16 changes: 12 additions & 4 deletions nemo_curator/download/doc_builder.py
@@ -112,12 +112,16 @@ def _download_and_extract_single_partition(
keep_raw_download: bool,
force_download: bool,
input_meta: Union[str, dict] = None,
filename_col: str = "file_name",
) -> pd.DataFrame:
url, output_path = paths

if os.path.exists(output_path) and not force_download:
partition = read_single_partition(
[output_path], backend="pandas", filetype=output_type, add_filename=True
[output_path],
backend="pandas",
filetype=output_type,
add_filename=filename_col,
)
return partition

@@ -141,8 +145,10 @@ def _download_and_extract_single_partition(
partition = pd.DataFrame(records)
filename = os.path.basename(output_path)
output_dir = os.path.dirname(output_path)
partition["file_name"] = filename
single_partition_write_with_filename(partition, output_dir, output_type=output_type)
partition[filename_col] = filename
single_partition_write_with_filename(
partition, output_dir, output_type=output_type, filename_col=filename_col
)
if not keep_raw_download:
os.remove(downloaded_file)

@@ -160,6 +166,7 @@ def download_and_extract(
keep_raw_download=False,
force_download=False,
input_meta: Union[str, dict] = None,
filename_col: str = "file_name",
) -> DocumentDataset:
"""
Downloads and extracts a dataset into a format accepted by the NeMo Curator
Expand All @@ -178,7 +185,7 @@ def download_and_extract(
directly read from them instead.
input_meta: A dictionary or a string formatted as a dictionary, which outlines
the field names and their respective data types within the JSONL input file.
filename_col: The name of the column that contains the filename. Default is "file_name".
Returns:
A DocumentDataset of the downloaded data
"""
@@ -202,6 +209,7 @@
force_download=force_download,
enforce_metadata=False,
input_meta=input_meta,
filename_col=filename_col,
meta=output_format,
)

1 change: 1 addition & 0 deletions nemo_curator/download/wikipedia.py
@@ -811,6 +811,7 @@ def download_wikipedia(
output_type=output_type,
keep_raw_download=keep_raw_download,
force_download=force_download,
filename_col="file_name",
)

return dataset
9 changes: 5 additions & 4 deletions nemo_curator/modules/__init__.py
@@ -25,7 +25,6 @@
from .config import FuzzyDuplicatesConfig, SemDedupConfig
from .dataset_ops import blend_datasets, Shuffle
from .exact_dedup import ExactDuplicates
from .filter import Filter, Score, ScoreFilter, ParallelScoreFilter
from .meta import Sequential
from .modify import Modify
from .task import TaskDecontamination
@@ -39,9 +38,7 @@
BucketsToEdges = gpu_only_import_from(
"nemo_curator.modules.fuzzy_dedup", "BucketsToEdges"
)
# Pytorch related imports must come after all imports that require cugraph,
# because of context cleanup issues b/w pytorch and cugraph
# See this issue: https://github.com/rapidsai/cugraph/issues/2718

SemDedup = gpu_only_import_from("nemo_curator.modules.semantic_dedup", "SemDedup")
EmbeddingCreator = gpu_only_import_from(
"nemo_curator.modules.semantic_dedup", "EmbeddingCreator"
@@ -52,6 +49,10 @@
SemanticClusterLevelDedup = gpu_only_import_from(
"nemo_curator.modules.semantic_dedup", "SemanticClusterLevelDedup"
)
# Pytorch related imports must come after all imports that require cugraph,
# because of context cleanup issues b/w pytorch and cugraph
# See this issue: https://github.com/rapidsai/cugraph/issues/2718
from .filter import Filter, Score, ScoreFilter, ParallelScoreFilter

__all__ = [
"ExactDuplicates",