Add support for parallel data curation (#193)

* add data interface to read simple bitext
* adding ParallelScoreFilter
* add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
* allow ParallelScoreFilter to take different filters for source and target
* add JointScoreFilter and LengthRatioFilter
* [WIP] add heuristic filter w/o test
* merge with main
* add test for histogram filter, fix a few bugs
* length ratio, joint score filter testing
* fix typing in joint test
* add a fake comet qe filter as an initial step
* [WIP] adding bitext cleaning tutorial
* [WIP] fixing example
* fix slow histogram filter, fix faulty bitext loading
* tutorial running
* [WIP] documentation of bitext tutorial
* add tested version of comet-qe filter
* fix ParallelDataset bug where a single file name is not accepted and the dataset is sometimes turned into its parent class by mistake; add write-to-simple-bitext functionality; update bitext tutorial
* add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing
* remove debug print line
* add comet filter to tutorial
* refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
* use refactored qe filter
* wrap_qe_input should be a static method
* use conditional import for comet, formatting changes
* [WIP] add cometoid
* [WIP] attempt to resolve device conflict but is failing
* [WIP] playing with cometoid arguments
* [WIP] -d 0 doesn't look necessary
* tested arguments for Cometoid
* use proper safe import, make sure test doesn't crash without comet/pymarian
* falling back to comet for the tutorial since that's easier to set up, update README
* give credit to original fairseq implementation of histogram filtering, run black formatter
* fix pre-commit complaint
* fix small bug
* fix another occurrence of the same bug
* introduce shard limit for a single PyMarian API call to avoid memory leakage
* repartition after reading simple bitext data
* -d 0 is actually needed for pymarian
* remove duplicate LengthRatioFilter definition
* refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
* [WIP] addressed comments in #193 apart from resolving .iloc pattern, test currently failing
* refactor to resolve .loc pattern, test passing
* add missing file
* revert changes in setup.py
* fix a small bug in ParallelDataset, explain why repartition is disabled, fix tutorial
* add API guide, small change on bitext/parallel score filter docstrings
* fix read_simple_bitext test issues
* reinstate dependencies lost during merging
* re-enable multiple partitions for simple bitext, add parallel write
* handle the case where a filename is not supplied in the dataframe, make the logic clearer
* address other minor comments in the PR, fix segment order scrambling
* fix test errors, add bitext dependencies
* add back more missing imports
* add bitext to [all] in .toml, add platformdirs as dependency
* merge upstream, remove old bitext requirement list
* delete requirement file again

---------

Signed-off-by: Shuoyang Ding <[email protected]>
Co-authored-by: nverma1 <[email protected]>
shuoyangd and nverma1 authored Nov 27, 2024
1 parent b15b08a commit 3d14b0d
Showing 23 changed files with 1,490 additions and 30 deletions.
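
For orientation, the end-to-end flow this PR enables looks roughly like the sketch below: load a pair of line-aligned text files into a `ParallelDataset`, then write the result back out as bitext. This is a minimal sketch assuming hypothetical file paths; it uses only the `read_simple_bitext` and `to_bitext` signatures introduced in this commit.

```python
from nemo_curator.datasets import ParallelDataset

# Hypothetical inputs: two line-aligned plain-text files with a shared
# prefix and language-code extensions (see parallel_dataset.py below).
dataset = ParallelDataset.read_simple_bitext(
    src_input_files="data/train.de",
    tgt_input_files="data/train.en",
    src_lang="de",
    tgt_lang="en",
    backend="pandas",
    add_filename=True,
)

# ... filtering steps would go here ...

# Write the curated segments back out as simple bitext files,
# grouped by their original filename.
dataset.to_bitext("curated/", write_to_filename=True)
```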
4 changes: 3 additions & 1 deletion docs/user-guide/api/datasets.rst
@@ -9,10 +9,12 @@ DocumentDataset
.. autoclass:: nemo_curator.datasets.DocumentDataset
   :members:

.. autoclass:: nemo_curator.datasets.ParallelDataset
   :members:

-------------------------------
ImageTextPairDataset
-------------------------------

.. autoclass:: nemo_curator.datasets.ImageTextPairDataset
   :members:
20 changes: 20 additions & 0 deletions docs/user-guide/api/filters.rst
@@ -10,6 +10,10 @@ Base Class
   :members:
   :member-order: bysource

.. autoclass:: nemo_curator.filters.BitextFilter
   :members:
   :member-order: bysource

.. autofunction:: nemo_curator.filters.import_filter

------------------------------
@@ -40,6 +44,14 @@ FastText Filters
   :members:
   :member-order: bysource

------------------------------
Quality Estimation Filters
------------------------------

.. autoclass:: nemo_curator.filters.QualityEstimationFilter
   :members:
   :member-order: bysource

------------------------------
Heuristic Filters
------------------------------
@@ -132,6 +144,14 @@ Heuristic Filters
   :members:
   :member-order: bysource

.. autoclass:: nemo_curator.filters.HistogramFilter
   :members:
   :member-order: bysource

.. autoclass:: nemo_curator.filters.LengthRatioFilter
   :members:
   :member-order: bysource

------------------------------
Code Filters
------------------------------
3 changes: 2 additions & 1 deletion nemo_curator/datasets/__init__.py
@@ -15,9 +15,10 @@
from nemo_curator.utils.import_utils import image_only_import_from

from .doc_dataset import DocumentDataset
from .parallel_dataset import ParallelDataset

ImageTextPairDataset = image_only_import_from(
"nemo_curator.datasets.image_text_pair_dataset", "ImageTextPairDataset"
)

__all__ = ["DocumentDataset", "ImageTextPairDataset"]
__all__ = ["DocumentDataset", "ImageTextPairDataset", "ParallelDataset"]
167 changes: 167 additions & 0 deletions nemo_curator/datasets/parallel_dataset.py
@@ -0,0 +1,167 @@
import csv
from typing import List, Optional, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")


class ParallelDataset(DocumentDataset):
    """
    An extension of the standard `DocumentDataset` with a special method that loads simple bitext.
    For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format
    and use interfaces defined in `DocumentDataset`.
    """

    def persist(self):
        return ParallelDataset(self.df.persist())

    @classmethod
    def read_simple_bitext(
        cls,
        src_input_files: Union[str, List[str]],
        tgt_input_files: Union[str, List[str]],
        src_lang: str,
        tgt_lang: str,
        backend: str = "pandas",
        add_filename: bool = False,
        npartitions: int = 16,
    ):
        """See the `read_single_simple_bitext_file_pair` docstring for what "simple bitext" means and how the other parameters are used.

        Args:
            src_input_files (Union[str, List[str]]): one or several input files, in the source language
            tgt_input_files (Union[str, List[str]]): one or several input files, in the target language

        Raises:
            TypeError: If the types of `src_input_files` and `tgt_input_files` don't agree.

        Returns:
            ParallelDataset: A `ParallelDataset` object with `self.df` holding the ingested simple bitext.
        """

        if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
            src_input_files = [src_input_files]
            tgt_input_files = [tgt_input_files]
        elif not isinstance(src_input_files, list) or not isinstance(
            tgt_input_files, list
        ):
            raise TypeError("Both file inputs must be strings or lists.")

        # Use the default doc id for now, but in the future it might be useful
        # to allow customizing the doc id by passing a prefix.
        df_files = []
        # We do not use `dd.from_map` because an individual file could be pretty large,
        # so it's not appropriate to partition based on individual files.
        # Instead, we concatenate all the individual files and then repartition.
        for src_input_file, tgt_input_file in zip(src_input_files, tgt_input_files):
            df_file = ParallelDataset.read_single_simple_bitext_file_pair(
                (src_input_file, tgt_input_file),
                src_lang=src_lang,
                tgt_lang=tgt_lang,
                backend=backend,
                add_filename=add_filename,
            )
            df_files.append(df_file)

        if backend == "cudf":
            df = cudf
        else:
            df = pd

        data = dd.from_pandas(df.concat(df_files), npartitions=npartitions)
        return cls(data)

    def to_bitext(
        self,
        output_file_dir,
        write_to_filename=False,
    ):
        """See the `nemo_curator.utils.distributed_utils.write_to_disk` docstring for parameter usage."""
        write_to_disk(
            df=self.df,
            output_file_dir=output_file_dir,
            write_to_filename=write_to_filename,
            output_type="bitext",
        )

    @staticmethod
    def read_single_simple_bitext_file_pair(
        input_file_pair: Tuple[str],
        src_lang: str,
        tgt_lang: str,
        doc_id: Optional[str] = None,
        backend: str = "cudf",
        add_filename: bool = False,
    ) -> Union[pd.DataFrame, "cudf.DataFrame"]:
        """This function reads a pair of "simple bitext" files into a pandas (or cuDF) DataFrame.
        A simple bitext is a common data format in machine translation.
        It consists of two plain text files with the same number of lines, each line pair being translations of each other. For example:

        data.de:

        ```
        Wir besitzen keine Reisetaschen aus Leder.
        Die Firma produziert Computer für den deutschen Markt.
        ...
        ```

        data.en:

        ```
        We don't own duffel bags made of leather.
        The company produces computers for the German market.
        ...
        ```

        For simplicity, we also assume that the names of the two text files share the same prefix, differing only in the language code used as the file extension.

        Args:
            input_file_pair (Tuple[str]): A pair of file paths pointing to the input files
            src_lang (str): Source language, in ISO-639-1 (two character) format (e.g. 'en')
            tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. 'de')
            doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
            backend (str, optional): Backend of the data frame. Defaults to "cudf".
            add_filename (bool, optional): Add filename as an extra field to every segment in the file. Defaults to False.

        Returns:
            Union[pd.DataFrame, cudf.DataFrame]
        """
        src_input_file, tgt_input_file = input_file_pair
        assert remove_path_extension(src_input_file) == remove_path_extension(
            tgt_input_file
        ), f"Assuming source and target filenames share a common prefix before the language code, but got {src_input_file} and {tgt_input_file}."

        if not doc_id:
            doc_id = "▁".join([src_input_file, tgt_input_file])

        if backend == "cudf":
            df = cudf
        else:
            df = pd

        df_src = df.read_csv(
            src_input_file, names=["src"], sep="\t", quoting=csv.QUOTE_NONE
        )
        df_tgt = df.read_csv(
            tgt_input_file, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE
        )
        assert len(df_src) == len(
            df_tgt
        ), f"We assume the source and target files have the same number of lines, but got {len(df_src)} and {len(df_tgt)}."
        df_combined = df.concat([df_src, df_tgt], axis=1)
        df_combined["doc_id"] = doc_id
        df_combined["src_lang"] = src_lang
        df_combined["tgt_lang"] = tgt_lang

        if add_filename:
            df_combined["filename"] = remove_path_extension(src_input_file)

        return df_combined
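
To make the format assumptions above concrete, here is a small self-contained usage sketch (the toy file names and contents are hypothetical). It exercises `read_simple_bitext` as defined above; the resulting frame carries the `src`, `tgt`, `doc_id`, `src_lang`, and `tgt_lang` columns assigned in `read_single_simple_bitext_file_pair`, plus `filename` when `add_filename=True`.

```python
import pathlib

from nemo_curator.datasets import ParallelDataset

# A tiny simple-bitext pair: shared prefix "toy", language codes as extensions.
pathlib.Path("toy.de").write_text(
    "Wir besitzen keine Reisetaschen aus Leder.\n"
    "Die Firma produziert Computer für den deutschen Markt.\n"
)
pathlib.Path("toy.en").write_text(
    "We don't own duffel bags made of leather.\n"
    "The company produces computers for the German market.\n"
)

dataset = ParallelDataset.read_simple_bitext(
    src_input_files="toy.de",
    tgt_input_files="toy.en",
    src_lang="de",
    tgt_lang="en",
    backend="pandas",
)
print(dataset.df.compute())  # columns: src, tgt, doc_id, src_lang, tgt_lang
```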
13 changes: 12 additions & 1 deletion nemo_curator/filters/__init__.py
@@ -12,7 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from .bitext_filter import BitextFilter
from .classifier_filter import (
    FastTextLangId,
    FastTextQualityFilter,
    QualityEstimationFilter,
)
from .code import (
    AlphaFilter,
    GeneralCommentToCodeFilter,
@@ -29,6 +34,8 @@
    BulletsFilter,
    CommonEnglishWordsFilter,
    EllipsisFilter,
    HistogramFilter,
    LengthRatioFilter,
    LongWordFilter,
    MeanWordLengthFilter,
    NonAlphaNumericFilter,
@@ -51,6 +58,7 @@
from .synthetic import AnswerabilityFilter, EasinessFilter

__all__ = [
"BitextFilter",
"DocumentFilter",
"import_filter",
"FastTextLangId",
@@ -85,6 +93,9 @@
"AlphaFilter",
"HTMLBoilerplateFilter",
"PerExtensionFilter",
"LengthRatioFilter",
"HistogramFilter",
"QualityEstimationFilter",
"AnswerabilityFilter",
"EasinessFilter",
]
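
The commit message above indicates that `LengthRatioFilter` and `HistogramFilter` are heuristic bitext filters, while `QualityEstimationFilter` wraps a neural QE model (COMET, or Cometoid via PyMarian). As a rough sketch of how the newly exported filters might be instantiated; the constructor arguments here are assumptions for illustration, not the confirmed API, so check each class's docstring for the real signatures:

```python
from nemo_curator.filters import HistogramFilter, LengthRatioFilter

# All arguments below are hypothetical -- consult the class docstrings
# for the actual signatures before using this.
length_ratio = LengthRatioFilter(max_ratio=3.0)  # reject pairs with an extreme source/target length ratio
histogram = HistogramFilter(lang="de")           # reject lines failing a character-histogram language check
```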