Commit
Add support for parallel data curation (#193)
* add data interface to read simple bitext
* adding ParallelScoreFilter
* add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
* allow ParallelScoreFilter to take different filters for source and target
* add JointScoreFilter and LengthRatioFilter
* [WIP] add heuristic filter w/o test
* merge with main
* add test for histogram filter, fix a few bugs
* length ratio, joint score filter testing
* fix typing in joint test
* add a fake comet qe filter as an initial step
* [WIP] adding bitext cleaning tutorial
* [WIP] fixing example
* fix slow histogram filter, fix faulty bitext loading
* tutorial running
* [WIP] documentation of bitext tutorial
* add tested version of comet-qe filter
* fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial
* add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing
* remove print line for debug
* add comet filter to tutorial
* refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
* use refactored qe filter
* wrap_qe_input should be a static method
* use conditional import for comet, formatting changes
* [WIP] add cometoid
* [WIP] attempt to resolve device conflict but is failing
* [WIP] playing with cometoid arguments
* [WIP] -d 0 doesn't look necessary
* tested arguments for Cometoid
* use proper safe import, make sure test doesn't crash sans comet/pymarian
* falling back to comet for tutorial since that's easier to set up, update README
* give credit to original fairseq implementation of histogram filtering, run black formatter
* fix pre-commit complaint
* fix small bug
* fix another occurrence of the same bug
* introduce shard limit to a single PyMarian API call to avoid memory leakage
* repartition after reading simple bitext data
* -d 0 is actually needed for pymarian
* remove duplicate LengthRatioFilter definition
* refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
* [WIP] addressed comments in #193 apart from resolving .iloc pattern, test currently failing
* refactor to resolve .loc pattern, test passing
* add missing file
* revert changes in setup.py
* fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial
* add api guide, small change on bitext/parallel score filter docstring
* fix read_simple_bitext test issues
* reinstate dependencies lost during merging
* re-enable multiple partitions for simple bitext, add parallel write
* take care of the case where filename is not supplied in dataframe, make logic clearer
* address other minor comments in the PR, fix segment order scrambling
* fix test errors, add bitext dependencies
* add back more missing imports
* add bitext to [all] in .toml, add platformdirs as dependency
* merge upstream, remove old bitext requirement list
* delete requirement file again

---------

Signed-off-by: Shuoyang Ding <[email protected]>
Co-authored-by: nverma1 <[email protected]>