Add Jupyter notebook tutorial for single-node multilingual dataset #30
Conversation
@ryantwolf / @Maghoumi can we review and get this merged?
Pinging again, @Maghoumi can you help review this - this will help teams trying to create multilingual datasets with Curator!
Thanks for opening the PR @nicoleeeluo. All our PRs require commits to be verified (signed) and signed off.
@ayushdg Thanks for reminding!
Thanks so much for creating this tutorial! It's quite extensive, which is super helpful. There are a couple of changes I'd appreciate if you could make; please let me know if you have any concerns or disagree with my requests. Thanks again.
"def pre_imports():\n",
"    import cudf\n",
"\n",
"def load_dataset(input_data_dir, file_type='jsonl'):\n",
Do you think you could use the `DocumentDataset.read_json` and `DocumentDataset.read_parquet` methods we have added recently? Let me know if something about them would prevent you from using them.
Fixed
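For readers following along, the idea behind the suggestion is to replace a hand-rolled loader with the library's reader methods. A minimal stand-in sketch of reading a JSONL file of documents, using only the standard library (the actual call in NeMo Curator would be roughly `DocumentDataset.read_json(input_files)`; the exact signature is not shown in this thread, so treat it as an assumption):

```python
import json
import os
import tempfile

# Hypothetical stand-in for DocumentDataset.read_json: reads a JSON Lines
# file into a list of dicts, one per document.
def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Write a tiny sample .jsonl file and read it back.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"id": "1", "text": "first doc"}\n')
        f.write('{"id": "2", "text": "second doc"}\n')
    docs = load_jsonl(path)

print(len(docs))
```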
"source": [
"## 2. Language separation and unicode fixing\n",
"\n",
"**Note**: To run this in interactive Python, please comment out `from .code import *` and the related imports in `./nemo_curator/filters/__init__.py`"
Can you try this again with the latest version of curator? And please let us know what errors you get if you get any.
This is fixed. I will amend the notebook accordingly.
"The TH Wikipedia data does have an `id` field, but it contains only numbers. It is better to unify the `id` field into the format `<prefix>_<id>`; this way, when handling multiple datasets, we can tell which document from which dataset has been removed. This `id` is useful when running deduplication and heuristic filtering. The function we use is `AddID()`. Its arguments include:\n",
"- `id_field`: field to be added to the input .jsonl file. If the key already exists, its value is replaced.\n",
"- `id_prefix`: prefix used in the ID. Default is 'doc-id'.\n",
"- `start_index`: starting index of the ID. Default is 0"
This recently changed. The default is now `None`, and the id is considered "unordered" by default to improve speed.
Fixed the description. In the code section, I kept `start_index = 0` for easier reference.
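The `<prefix>_<id>` scheme discussed above can be sketched in plain Python. This is a hypothetical stand-in that illustrates the id format described in the tutorial, not NeMo Curator's actual `AddID` implementation:

```python
# Stand-in for AddID(): assigns sequential '<prefix>_<index>' ids,
# replacing id_field if it already exists.
def add_ids(docs, id_field="id", id_prefix="doc-id", start_index=0):
    for offset, doc in enumerate(docs):
        doc[id_field] = f"{id_prefix}_{start_index + offset}"
    return docs

# One doc without an id, one whose numeric id gets replaced.
docs = [{"text": "สวัสดี"}, {"text": "another doc", "id": "42"}]
add_ids(docs, id_prefix="TH_wiki", start_index=0)
print([d["id"] for d in docs])  # ['TH_wiki_0', 'TH_wiki_1']
```

With a unified prefix per source dataset, ids surviving deduplication and filtering still identify which dataset each document came from.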
" 2. Fuzzy deduplication\n",
"4. Heuristic filtering\n",
"\n",
"What is not included:\n",
Probably also worth mentioning that this doesn't include:
- Distributed data classification with PyTorch models
- Personally identifiable information (PII) redaction
Added
@ryantwolf Hi Ryan, I have pushed a new version that includes the fixes you mentioned.
Perfect! I have one last comment about the docker image you refer to, but other than that it looks great!
Though, as Ayush mentioned, all our PRs require commits to be verified (signed) and signed off. Quoting from him earlier:
To achieve this you need to commit with the `-sS` flags (more info in the Contributing Guide).
More details on signing commits can be found in the guide: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
Also, it looks like the style check is failing for the config file you made. You should be able to fix it by running `pip install pre-commit && pre-commit install && pre-commit run --all`. Thanks again!
" Password: <Your NGC Key>\n",
"- Get the NeMo Framework Training Container\n",
" ```bash\n",
" docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-training:24.01\n"
I'd either change this to the dev container or the 24.05 tag (which hasn't been released yet):
`docker pull nvcr.io/nvidia/nemo:dev.framework`
`docker pull nvcr.io/nvidia/nemo:24.05.framework`
since the other container versions don't have the latest version of NeMo Curator that the tutorial uses.
Fixed
Commits in this PR (Signed-off-by and Co-authored-by trailers condensed; all commits signed off by Nicole Luo):
- Fix metadata inference with pandas and dask; fix datatypes for task decontamination; use targeted import (Ryan Wolf)
- Move tokenizer import; reduce inductor threads; change env var handling and location; add comment linking issue (Ryan Wolf)
- Add fast id method; add type conversion; fix off-by-one errors in tests (Ryan Wolf)
- Move GPU imports and make them optional; move GPU dependencies to a separate install; add cuML attribution; improve install instructions and tests (Ayush Dattagupta)
- [K8s] Add a helper script to create a Dask cluster on k8s, with instructions for running a Curator workload on k8s (Terry Kong)
- Refactor common utils and remove unused code; move GPU dedup scripts into a subfolder; remove legacy script entrypoints (Ayush Dattagupta)
- Fix lang id example; add classifier unit tests (Ryan Wolf)
- Add dataset blending function and shuffle module, with tests, example script, and documentation (Ryan Wolf)
- Initial pass at fuzzy dedup API; rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig and num_hashes; add example script and config file (Ayush Dattagupta)
- Fix PII index issue; add sequential wrapper; fix PII tests (Ryan Wolf)
- Fix issue NVIDIA#43 by deleting empty files created by reshard_jsonl, and skip redundant JSON parsing of lines already in JSON format for a significant speedup (NVIDIA#57) (Miguel Martínez)
- Add a new tutorial demonstrating data curation for PEFT use cases (Mehran Maghoumi)
- Move PII constants to a separate file that does not import presidio/spacy and other GPU dependencies (Ayush Dattagupta)
- Address review feedback on the tutorial, including the fuzzy deduplication wrapper example (Nicole Luo, co-authored by Ryan Wolf)
@ryantwolf Hi Ryan, I have fixed the commits accordingly. Could you help review and see if there are any issues? Thank you!
}
],
"source": [
"client = get_client(args, args.device)\n",
Ah we just merged in a change that changes the function signature of this to no longer require an argparse object. It should be easier to use, but it does mean that you need to update it here. Ping me again when this is changed and I'll merge it in ASAP.
Thanks for noting. I have updated accordingly.
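For context, the old pattern was `get_client(args, args.device)`, and the merged change removes the argparse dependency. Since the new keyword arguments of `get_client` aren't shown in this thread (treat them as an assumption), here is a plain-Dask sketch of the equivalent local client setup:

```python
from dask.distributed import Client, LocalCluster

# In current NeMo Curator this would be roughly
#   from nemo_curator.utils.distributed_utils import get_client
#   client = get_client(...)   # no argparse namespace needed anymore
# This sketch builds the same kind of Dask client directly.
cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
client = Client(cluster)

# Sanity check: run a trivial task on the cluster.
result = client.submit(lambda x: x + 1, 41).result()

client.close()
cluster.close()
print(result)  # 42
```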
…map_bucket section (Nicole Luo)
Thanks for the tutorial! This is super great.
This PR adds a Jupyter notebook workflow for a sample curation pipeline for Thai Wikipedia data.
Modules included in this workflow are