Add support for parallel data curation (#193)

* add data interface to read simple bitext
* adding ParallelScoreFilter
* add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
* allow ParallelScoreFilter to take different filters for source and target
* add JointScoreFilter and LengthRatioFilter
* [WIP] add heuristic filter w/o test
* merge with main
* add test for histogram filter, fix a few bugs
* length ratio, joint score filter testing
* fix typing in joint test
* add a fake comet qe filter as an initial step
* [WIP] adding bitext cleaning tutorial
* [WIP] fixing example
* fix slow histogram filter, fix faulty bitext loading
* tutorial running
* [WIP] documentation of bitext tutorial
* add tested version of comet-qe filter
* fix ParallelDataset bug where a single file name is not accepted and the dataset is sometimes turned into its parent class by mistake; add write-to-simple-bitext functionality; update bitext tutorial
* add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing
* remove debug print line
* add comet filter to tutorial
* refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
* use refactored qe filter
* wrap_qe_input should be a static method
* use conditional import for comet, formatting changes
* [WIP] add cometoid
* [WIP] attempt to resolve device conflict but is failing
* [WIP] playing with cometoid arguments
* [WIP] -d 0 doesn't look necessary
* tested arguments for Cometoid
* use proper safe import, make sure test doesn't crash without comet/pymarian
* falling back to comet for the tutorial since that's easier to set up, update README
* give credit to original fairseq implementation of histogram filtering, run black formatter
* fix pre-commit complaint
* fix small bug
* fix another occurrence of the same bug
* introduce shard limit for a single PyMarian API call to avoid memory leakage
* repartition after reading simple bitext data
* -d 0 is actually needed for pymarian
* remove duplicate LengthRatioFilter definition
* refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
* [WIP] addressed comments in #193 apart from resolving .iloc pattern, test currently failing
* refactor to resolve .loc pattern, test passing
* add missing file
* revert changes in setup.py
* fix a small bug in ParallelDataset, explain why repartition is disabled, fix tutorial
* add API guide, small change on bitext/parallel score filter docstrings
* fix read_simple_bitext test issues
* reinstate dependencies lost during merging
* re-enable multiple partitions for simple bitext, add parallel write
* handle the case where a filename is not supplied in the dataframe, make the logic clearer
* address other minor comments in the PR, fix segment order scrambling
* fix test errors, add bitext dependencies
* add back more missing imports
* add bitext to [all] in .toml, add platformdirs as dependency
* merge upstream, remove old bitext requirement list
* delete requirement file again

---------

Signed-off-by: Shuoyang Ding <[email protected]>
Co-authored-by: nverma1 <[email protected]>
shuoyangd and nverma1 authored Nov 27, 2024
1 parent b15b08a commit 3d14b0d
Showing 23 changed files with 1,490 additions and 30 deletions.
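
For orientation, the end-to-end flow this PR enables looks roughly like the sketch below: load a pair of line-aligned text files into a `ParallelDataset`, then write the result back out as bitext. This is a minimal sketch assuming hypothetical file paths; it uses only the `read_simple_bitext` and `to_bitext` signatures introduced in this commit.

```python
from nemo_curator.datasets import ParallelDataset

# Hypothetical inputs: two line-aligned plain-text files with a shared
# prefix and language-code extensions (see parallel_dataset.py below).
dataset = ParallelDataset.read_simple_bitext(
    src_input_files="data/train.de",
    tgt_input_files="data/train.en",
    src_lang="de",
    tgt_lang="en",
    backend="pandas",
    add_filename=True,
)

# ... filtering steps would go here ...

# Write the curated segments back out as simple bitext files,
# grouped by their original filename.
dataset.to_bitext("curated/", write_to_filename=True)
```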
4 changes: 3 additions & 1 deletion docs/user-guide/api/datasets.rst
@@ -9,10 +9,12 @@ DocumentDataset
.. autoclass:: nemo_curator.datasets.DocumentDataset
   :members:

.. autoclass:: nemo_curator.datasets.ParallelDataset
   :members:

-------------------------------
ImageTextPairDataset
-------------------------------

.. autoclass:: nemo_curator.datasets.ImageTextPairDataset
   :members:
20 changes: 20 additions & 0 deletions docs/user-guide/api/filters.rst
@@ -10,6 +10,10 @@ Base Class
   :members:
   :member-order: bysource

.. autoclass:: nemo_curator.filters.BitextFilter
   :members:
   :member-order: bysource

.. autofunction:: nemo_curator.filters.import_filter

------------------------------
@@ -40,6 +44,14 @@ FastText Filters
   :members:
   :member-order: bysource

------------------------------
Quality Estimation Filters
------------------------------

.. autoclass:: nemo_curator.filters.QualityEstimationFilter
   :members:
   :member-order: bysource

------------------------------
Heuristic Filters
------------------------------
@@ -132,6 +144,14 @@ Heuristic Filters
   :members:
   :member-order: bysource

.. autoclass:: nemo_curator.filters.HistogramFilter
   :members:
   :member-order: bysource

.. autoclass:: nemo_curator.filters.LengthRatioFilter
   :members:
   :member-order: bysource

------------------------------
Code Filters
------------------------------
3 changes: 2 additions & 1 deletion nemo_curator/datasets/__init__.py
@@ -15,9 +15,10 @@
from nemo_curator.utils.import_utils import image_only_import_from

from .doc_dataset import DocumentDataset
from .parallel_dataset import ParallelDataset

ImageTextPairDataset = image_only_import_from(
"nemo_curator.datasets.image_text_pair_dataset", "ImageTextPairDataset"
)

__all__ = ["DocumentDataset", "ImageTextPairDataset"]
__all__ = ["DocumentDataset", "ImageTextPairDataset", "ParallelDataset"]
167 changes: 167 additions & 0 deletions nemo_curator/datasets/parallel_dataset.py
@@ -0,0 +1,167 @@
import csv
from typing import List, Optional, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")


class ParallelDataset(DocumentDataset):
    """
    An extension of the standard `DocumentDataset` with a special method that loads simple bitext.
    For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format
    and use interfaces defined in `DocumentDataset`.
    """

    def persist(self):
        return ParallelDataset(self.df.persist())

    @classmethod
    def read_simple_bitext(
        cls,
        src_input_files: Union[str, List[str]],
        tgt_input_files: Union[str, List[str]],
        src_lang: str,
        tgt_lang: str,
        backend: str = "pandas",
        add_filename: bool = False,
        npartitions: int = 16,
    ):
        """See the `read_single_simple_bitext_file_pair` docstring for what "simple bitext" means and how the other parameters are used.

        Args:
            src_input_files (Union[str, List[str]]): one or several input files, in the source language
            tgt_input_files (Union[str, List[str]]): one or several input files, in the target language

        Raises:
            TypeError: If the types of `src_input_files` and `tgt_input_files` don't agree.

        Returns:
            ParallelDataset: A `ParallelDataset` object with `self.df` holding the ingested simple bitext.
        """

        if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
            src_input_files = [src_input_files]
            tgt_input_files = [tgt_input_files]
        elif not isinstance(src_input_files, list) or not isinstance(
            tgt_input_files, list
        ):
            raise TypeError("Both file inputs must be strings or lists.")

        # Use the default doc id for now, but in the future it might be useful
        # to allow customizing the doc id by passing a prefix.
        df_files = []
        # We do not use `dd.from_map` because an individual file could be pretty large,
        # so it's not appropriate to partition based on individual files.
        # Instead, we concatenate all the individual files and then repartition.
        for src_input_file, tgt_input_file in zip(src_input_files, tgt_input_files):
            df_file = ParallelDataset.read_single_simple_bitext_file_pair(
                (src_input_file, tgt_input_file),
                src_lang=src_lang,
                tgt_lang=tgt_lang,
                backend=backend,
                add_filename=add_filename,
            )
            df_files.append(df_file)

        if backend == "cudf":
            df = cudf
        else:
            df = pd

        data = dd.from_pandas(df.concat(df_files), npartitions=npartitions)
        return cls(data)

    def to_bitext(
        self,
        output_file_dir,
        write_to_filename=False,
    ):
        """See the `nemo_curator.utils.distributed_utils.write_to_disk` docstring for parameter usage."""
        write_to_disk(
            df=self.df,
            output_file_dir=output_file_dir,
            write_to_filename=write_to_filename,
            output_type="bitext",
        )

    @staticmethod
    def read_single_simple_bitext_file_pair(
        input_file_pair: Tuple[str],
        src_lang: str,
        tgt_lang: str,
        doc_id: Optional[str] = None,
        backend: str = "cudf",
        add_filename: bool = False,
    ) -> Union[pd.DataFrame, "cudf.DataFrame"]:
        """This function reads a pair of "simple bitext" files into a pandas (or cuDF) DataFrame.
        A simple bitext is a common data format in machine translation.
        It consists of two plain text files with the same number of lines, each line pair being translations of each other. For example:

        data.de:

        ```
        Wir besitzen keine Reisetaschen aus Leder.
        Die Firma produziert Computer für den deutschen Markt.
        ...
        ```

        data.en:

        ```
        We don't own duffel bags made of leather.
        The company produces computers for the German market.
        ...
        ```

        For simplicity, we also assume that the names of the two text files share the same prefix, differing only in the language code used as the file extension.

        Args:
            input_file_pair (Tuple[str]): A pair of file paths pointing to the input files
            src_lang (str): Source language, in ISO-639-1 (two character) format (e.g. 'en')
            tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. 'de')
            doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
            backend (str, optional): Backend of the data frame. Defaults to "cudf".
            add_filename (bool, optional): Add filename as an extra field to every segment in the file. Defaults to False.

        Returns:
            Union[pd.DataFrame, cudf.DataFrame]
        """
        src_input_file, tgt_input_file = input_file_pair
        assert remove_path_extension(src_input_file) == remove_path_extension(
            tgt_input_file
        ), f"Assuming source and target filenames share a common prefix before the language code, but got {src_input_file} and {tgt_input_file}."

        if not doc_id:
            doc_id = "▁".join([src_input_file, tgt_input_file])

        if backend == "cudf":
            df = cudf
        else:
            df = pd

        df_src = df.read_csv(
            src_input_file, names=["src"], sep="\t", quoting=csv.QUOTE_NONE
        )
        df_tgt = df.read_csv(
            tgt_input_file, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE
        )
        assert len(df_src) == len(
            df_tgt
        ), f"We assume the source and target files have the same number of lines, but got {len(df_src)} and {len(df_tgt)}."
        df_combined = df.concat([df_src, df_tgt], axis=1)
        df_combined["doc_id"] = doc_id
        df_combined["src_lang"] = src_lang
        df_combined["tgt_lang"] = tgt_lang

        if add_filename:
            df_combined["filename"] = remove_path_extension(src_input_file)

        return df_combined
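
To make the format assumptions above concrete, here is a small self-contained usage sketch (the toy file names and contents are hypothetical). It exercises `read_simple_bitext` as defined above; the resulting frame carries the `src`, `tgt`, `doc_id`, `src_lang`, and `tgt_lang` columns assigned in `read_single_simple_bitext_file_pair`, plus `filename` when `add_filename=True`.

```python
import pathlib

from nemo_curator.datasets import ParallelDataset

# A tiny simple-bitext pair: shared prefix "toy", language codes as extensions.
pathlib.Path("toy.de").write_text(
    "Wir besitzen keine Reisetaschen aus Leder.\n"
    "Die Firma produziert Computer für den deutschen Markt.\n"
)
pathlib.Path("toy.en").write_text(
    "We don't own duffel bags made of leather.\n"
    "The company produces computers for the German market.\n"
)

dataset = ParallelDataset.read_simple_bitext(
    src_input_files="toy.de",
    tgt_input_files="toy.en",
    src_lang="de",
    tgt_lang="en",
    backend="pandas",
)
print(dataset.df.compute())  # columns: src, tgt, doc_id, src_lang, tgt_lang
```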
13 changes: 12 additions & 1 deletion nemo_curator/filters/__init__.py
@@ -12,7 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from .bitext_filter import BitextFilter
from .classifier_filter import (
    FastTextLangId,
    FastTextQualityFilter,
    QualityEstimationFilter,
)
from .code import (
    AlphaFilter,
    GeneralCommentToCodeFilter,
@@ -29,6 +34,8 @@
    BulletsFilter,
    CommonEnglishWordsFilter,
    EllipsisFilter,
    HistogramFilter,
    LengthRatioFilter,
    LongWordFilter,
    MeanWordLengthFilter,
    NonAlphaNumericFilter,
@@ -51,6 +58,7 @@
from .synthetic import AnswerabilityFilter, EasinessFilter

__all__ = [
"BitextFilter",
"DocumentFilter",
"import_filter",
"FastTextLangId",
@@ -85,6 +93,9 @@
"AlphaFilter",
"HTMLBoilerplateFilter",
"PerExtensionFilter",
"LengthRatioFilter",
"HistogramFilter",
"QualityEstimationFilter",
"AnswerabilityFilter",
"EasinessFilter",
]
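
The commit message above indicates that `LengthRatioFilter` and `HistogramFilter` are heuristic bitext filters, while `QualityEstimationFilter` wraps a neural QE model (COMET, or Cometoid via PyMarian). As a rough sketch of how the newly exported filters might be instantiated; the constructor arguments here are assumptions for illustration, not the confirmed API, so check each class's docstring for the real signatures:

```python
from nemo_curator.filters import HistogramFilter, LengthRatioFilter

# All arguments below are hypothetical -- consult the class docstrings
# for the actual signatures before using this.
length_ratio = LengthRatioFilter(max_ratio=3.0)  # reject pairs with an extreme source/target length ratio
histogram = HistogramFilter(lang="de")           # reject lines failing a character-histogram language check
```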