Add pre-commit style checks #14

Merged
merged 2 commits into from
Mar 25, 2024
2 changes: 0 additions & 2 deletions .github/workflows/test.yml
@@ -40,5 +40,3 @@ jobs:
# TODO: Remove env variable when gpu dependencies are optional
run: |
RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
47 changes: 47 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,47 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

default_language_version:
  python: python3

ci:
  autofix_prs: true
  autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
  autoupdate_schedule: quarterly

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-added-large-files
        args: ['--maxkb=1000']
      - id: check-case-conflict
      - id: check-yaml
      - id: detect-private-key
      - id: end-of-file-fixer
      - id: requirements-txt-fixer
      - id: trailing-whitespace

  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
        name: Format code

  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
        name: Format imports
        exclude: docs/
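
Many of the changed lines in this PR differ only in whitespace, which is the kind of change the `trailing-whitespace` and `end-of-file-fixer` hooks make. The sketch below approximates their combined effect on one file's text. It is a simplified illustration, not the hooks' actual implementation: `fix_whitespace` is a hypothetical helper, and it also normalizes line endings to `\n`, which the real hooks may not do.

```python
def fix_whitespace(text: str) -> str:
    """Approximate trailing-whitespace + end-of-file-fixer (hypothetical helper)."""
    # Strip trailing spaces and tabs from every line.
    lines = [line.rstrip(" \t") for line in text.splitlines()]
    # Drop blank lines at the end of the file.
    while lines and lines[-1] == "":
        lines.pop()
    # A non-empty file ends with exactly one newline; an empty file stays empty.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    before = "params: \n  max_white_space_ratio: 0.25\n\n\n"
    print(repr(fix_whitespace(before)))
```

With pre-commit installed, the real hooks can be run over the whole repository with `pre-commit run --all-files`.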
3 changes: 0 additions & 3 deletions .style.yapf

This file was deleted.

2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
1. Minimize the use of ``**kwargs``.
1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
1. Classes are preferred to standalone methods.
-1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
+1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
1. Add ``__init__.py`` for every folder.
1. F-strings are prefered to formatted strings.
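
To illustrate two of the guidelines listed above (raising an error instead of using `assert`, and using f-strings), here is a minimal sketch. The function name `validate_num_shards` and its parameter are hypothetical and do not come from the repository.

```python
def validate_num_shards(num_shards: int) -> int:
    """Return num_shards unchanged if it is a valid, positive shard count."""
    # Preferred over `assert num_shards > 0`: an explicit raise still runs
    # when Python is started with -O, which strips assert statements.
    if num_shards <= 0:
        # The message is built with an f-string rather than % or str.format.
        raise ValueError(f"num_shards must be positive, got {num_shards}")
    return num_shards
```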
8 changes: 4 additions & 4 deletions README.md
@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
- [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
- Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
- [Quality filtering](docs/user-guide/QualityFiltering.rst)
- Multilingual heuristic-based filtering
- Multilingual heuristic-based filtering
- Classifier-based filtering via [fastText](https://fasttext.cc/)
- [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
- Both exact and fuzzy deduplication are accelerated using cuDF and Dask.
@@ -79,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s

## Module Ablation and Compute Performance

The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator
@@ -89,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
<img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
</p>

In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.

Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):

@@ -128,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require

As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled

## NVIDIA Product Security

For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
4 changes: 2 additions & 2 deletions config/arxiv_builder.yaml
@@ -1,11 +1,11 @@
download_module: nemo_curator.download.arxiv.ArxivDownloader
download_params: {}
iterator_module: nemo_curator.download.arxiv.ArxivIterator
iterator_params:
iterator_params:
log_frequency: 1000
extract_module: nemo_curator.download.arxiv.ArxivExtractor
extract_params: {}
format:
text: str
id: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion config/cc_warc_builder.yaml
@@ -9,4 +9,4 @@ format:
language: str
url: str
warc_id: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion config/heuristic_filter_code.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter Python code data.
# This particular cascade of filters is intended to filter Python code data.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
# Change this based on the language of the data
18 changes: 9 additions & 9 deletions config/heuristic_filter_en.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter English language data.
# This particular cascade of filters is intended to filter English language data.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
- name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter
@@ -14,16 +14,16 @@ filters:
params:
max_number_to_text_ratio: 0.15
- name: nemo_curator.filters.heuristic_filter.UrlsFilter
params:
params:
max_url_to_text_ratio: 0.2
- name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
params:
params:
max_white_space_ratio: 0.25
- name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
params:
params:
max_parentheses_ratio: 0.1
- name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
params:
params:
remove_if_at_top_or_bottom: True
max_boilerplate_string_ratio: 0.4
- name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -46,18 +46,18 @@ filters:
params:
max_num_sentences_without_endmark_ratio: 0.85
- name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
params:
params:
min_words_with_alphabets: 0.8
- name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
params:
min_num_common_words: 2
stop_at_false: True
- name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
params:
max_mean_word_length: 10
max_mean_word_length: 10
min_mean_word_length: 3
- name: nemo_curator.filters.heuristic_filter.LongWordFilter
params:
params:
max_word_length: 1000
- name: nemo_curator.filters.heuristic_filter.EllipsisFilter
params:
@@ -102,4 +102,4 @@ filters:
max_repeating_duplicate_ngram_ratio: 0.10
- name: nemo_curator.filters.heuristic_filter.BulletsFilter
params:
max_bullet_lines_ratio: 0.9
max_bullet_lines_ratio: 0.9
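
The config comments above state that filters are applied top to bottom, in the order they are listed. The sketch below shows that ordering rule with two hypothetical stand-in filters. The function names and their logic are assumptions made for illustration and are not NeMo Curator's implementations; only the parameter names and thresholds are taken from the config.

```python
from typing import Callable, Iterable, List, Tuple


def white_space_ratio_ok(text: str, max_white_space_ratio: float = 0.25) -> bool:
    """Hypothetical stand-in: keep documents that are not mostly whitespace."""
    if not text:
        return False
    ratio = sum(ch.isspace() for ch in text) / len(text)
    return ratio <= max_white_space_ratio


def mean_word_length_ok(
    text: str, min_mean_word_length: float = 3, max_mean_word_length: float = 10
) -> bool:
    """Hypothetical stand-in: keep documents with a plausible mean word length."""
    words = text.split()
    if not words:
        return False
    mean = sum(len(word) for word in words) / len(words)
    return min_mean_word_length <= mean <= max_mean_word_length


def apply_cascade(
    docs: Iterable[str], filters: List[Tuple[str, Callable[[str], bool]]]
) -> List[str]:
    """Apply each filter in list order; later filters only see surviving docs."""
    kept = list(docs)
    for _name, keep in filters:
        kept = [doc for doc in kept if keep(doc)]
    return kept


# Same order as the two corresponding entries in the config file.
CASCADE = [
    ("WhiteSpaceFilter", white_space_ratio_ok),
    ("MeanWordLengthFilter", mean_word_length_ok),
]
```

Because each filter only sees the documents that survived the previous one, placing cheap filters first reduces the work done by the filters that follow.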
18 changes: 9 additions & 9 deletions config/heuristic_filter_non-en.yaml
@@ -1,7 +1,7 @@
input_field: text
filters:
# The filters below define a chain of heuristic filters to be applied to each document in a corpus.
# This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
# This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
# The filter listed at the top will be applied first, and the following filters will be applied in
# the order they appear in this file. Each filter can be removed and re-ordered as desired.
- name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
@@ -11,16 +11,16 @@ filters:
params:
max_number_to_text_ratio: 0.15
- name: nemo_curator.filters.heuristic_filter.UrlsFilter
params:
params:
max_url_to_text_ratio: 0.2
- name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
params:
params:
max_white_space_ratio: 0.25
- name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
params:
params:
max_parentheses_ratio: 0.1
- name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
params:
params:
remove_if_at_top_or_bottom: True
max_boilerplate_string_ratio: 0.4
- name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -39,17 +39,17 @@ filters:
params:
min_words: 50
max_words: 100000
# NOTE: This filter tends to remove many documents and will need to
# NOTE: This filter tends to remove many documents and will need to
# be tuned per language
- name: nemo_curator.filters.heuristic_filter.PunctuationFilter
params:
max_num_sentences_without_endmark_ratio: 0.85
- name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
params:
max_mean_word_length: 10
max_mean_word_length: 10
min_mean_word_length: 3
- name: nemo_curator.filters.heuristic_filter.LongWordFilter
params:
params:
max_word_length: 1000
- name: nemo_curator.filters.heuristic_filter.EllipsisFilter
params:
@@ -94,4 +94,4 @@ filters:
max_repeating_duplicate_ngram_ratio: 0.10
- name: nemo_curator.filters.heuristic_filter.BulletsFilter
params:
max_bullet_lines_ratio: 0.9
max_bullet_lines_ratio: 0.9
2 changes: 1 addition & 1 deletion config/lm_tasks.yaml
@@ -1,6 +1,6 @@
tasks:
# The Python modules below define language model downstream evaluation
# task data. If one of the below tasks is specified, N-grams will
# task data. If one of the below tasks is specified, N-grams will
# be constructed from the documents that make up the task data
# using the script prepare_task_data.
# find_matching_ngrams will then search for these N-grams
2 changes: 1 addition & 1 deletion config/pii_config.yaml
@@ -13,4 +13,4 @@ pii_config:
#type: 'hash'
#hash_type: 'sha256'

#type: 'redact'
#type: 'redact'
2 changes: 1 addition & 1 deletion config/wikipedia_builder.yaml
@@ -12,4 +12,4 @@ format:
id: str
url: str
language: str
source_id: str
source_id: str
2 changes: 1 addition & 1 deletion docs/user-guide/CPUvsGPU.rst
@@ -95,4 +95,4 @@ Every SLURM cluster is different, so make sure you understand how your SLURM clu
``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.

Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
4 changes: 2 additions & 2 deletions docs/user-guide/DistributedDataClassification.rst
@@ -8,7 +8,7 @@ Background

When preparing text data to be used in training a large language model (LLM), it is useful to classify
text documents in various ways, to enhance the LLM's performance by making it able to produce more
contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
help a user run inference with pre-trained models on large amounts of text documents. We achieve
this by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to
accelerate the classification task in a distributed way. In other words, because the classification of
@@ -68,4 +68,4 @@ The key differences is that it operates on the GPU instead of the CPU.
Therefore, the Dask cluster must be started as a GPU one.
And, ``DomainClassifier`` requires ``DocumentDataset`` to be on the GPU (i.e., have ``backend=cudf``).
It is easy to extend ``DistributedDataClassifier`` to your own model.
Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
8 changes: 4 additions & 4 deletions docs/user-guide/DocumentDataset.rst
@@ -48,7 +48,7 @@ You could read, filter the dataset, and write it using the following methods
text_field="text",
score_field="word_count",
)
long_books = filter_step(books)
long_books.to_json("long_books/", write_to_filename=True)
@@ -106,7 +106,7 @@ Consider a modified version of the code above:
text_field="text",
score_field="word_count",
)
long_books = filter_step(books)
long_books.to_json("long_books/", write_to_filename=True)
@@ -130,10 +130,10 @@ In these cases, we recommend processing the input dataset in batches using a sim
text_field="text",
score_field="word_count",
)
long_books = filter_step(books)
long_books.to_json("long_books/", write_to_filename=True)
This will read in 64 shards at a time, process them, and write them back to disk.
Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
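
The resume-and-batch behaviour described above (skip files already present in the output directory, then process the rest 64 shards at a time) can be sketched with the standard library alone. The helpers `remaining_files` and `batched` below are hypothetical and are not NeMo Curator's `get_remaining_files` or its batched variant. The sketch also assumes output files keep the names of their input files, as `write_to_filename=True` suggests.

```python
import os
from typing import Iterator, List


def remaining_files(input_dir: str, output_dir: str) -> List[str]:
    """Paths of input files whose names are not yet in the output directory."""
    done = set(os.listdir(output_dir)) if os.path.isdir(output_dir) else set()
    return sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name not in done
    )


def batched(paths: List[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]
```

A driver loop would then read, filter, and write one batch at a time, so an interrupted run can be restarted without reprocessing the shards that were already written.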