Add READMEs to examples/ and nemo_curator/scripts directories (#332)
* save progress

Signed-off-by: Sarah Yurick <[email protected]>

* add remaining docs

Signed-off-by: Sarah Yurick <[email protected]>

* add titles and table

Signed-off-by: Sarah Yurick <[email protected]>

* remove trailing whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* add --help instructions

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick authored Dec 3, 2024
1 parent bc724ec commit d1f52f6
Showing 18 changed files with 208 additions and 22 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -37,7 +37,7 @@ There should be at least one example per module in the curator.
They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality.
Most should be designed for a user to get up and running on their local machines, but distributed examples are welcomed if it makes sense.
Python scripts should be the primary way to showcase your module.
-Though, SLURM scripts or other cluster scripts should be included if there are special steps needed to run the module.
+Though, Slurm scripts or other cluster scripts should be included if there are special steps needed to run the module.

The documentation should complement each example by going through the motivation behind why a user would use each module.
It should include both an explanation of the module, and how it's used in its corresponding example.
12 changes: 6 additions & 6 deletions docs/user-guide/cpuvsgpu.rst
@@ -35,7 +35,7 @@ All of the ``examples/`` use it to set up a Dask cluster.
It is possible to run entirely CPU-based workflows on a GPU cluster, though the process count (and therefore the number of parallel tasks) will be limited by the number of GPUs on your machine.

* ``scheduler_address`` and ``scheduler_file`` are used for connecting to an existing Dask cluster.
-Supplying one of these is essential if you are running a Dask cluster on SLURM or Kubernetes.
+Supplying one of these is essential if you are running a Dask cluster on Slurm or Kubernetes.
All other arguments are ignored if either of these is passed, as the cluster configuration will be done when you create the scheduler and workers on your cluster.

* The remaining arguments can be modified `here <https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/distributed_utils.py>`_.
@@ -82,15 +82,15 @@ Even if you start a GPU dask cluster, you can't operate on datasets that use a `
The ``DocumentDataset`` must either have been originally read in with a ``cudf`` backend, or it must be transferred during the script.

-----------------------------------------
-Dask with SLURM
+Dask with Slurm
-----------------------------------------

-We provide an example SLURM script pipeline in ``examples/slurm``.
+We provide an example Slurm script pipeline in ``examples/slurm``.
This pipeline has a script ``start-slurm.sh`` that provides configuration options similar to what ``get_client`` provides.
-Every SLURM cluster is different, so make sure you understand how your SLURM cluster works so the scripts can be easily adapted.
-``start-slurm.sh`` calls ``container-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.
+Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
+``start-slurm.sh`` calls ``container-entrypoint.sh``, which sets up a Dask scheduler and workers across the cluster.

-Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` script to run on multiple nodes.
You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.

-----------------------------------------
6 changes: 3 additions & 3 deletions docs/user-guide/kubernetescurator.rst
@@ -139,7 +139,7 @@ use ``kubectl cp``, but ``exec`` has fewer surprises regarding compressed files:
Create a Dask Cluster
---------------------
-Use the ``create_dask_cluster.py`` to create a CPU or GPU dask cluster.
+Use the ``create_dask_cluster.py`` to create a CPU or GPU Dask cluster.
.. note::
If you are creating another Dask cluster with the same ``--name <name>``, first delete it via::
@@ -289,7 +289,7 @@ container, we will need to build a custom image with your code installed:
# Fill in <private-registry>/<username>/<password>
kubectl create secret docker-registry my-private-registry --docker-server=<private-registry> --docker-username=<username> --docker-password=<password>
-And with this new secret, you create your new dask cluster:
+And with this new secret, you create your new Dask cluster:
.. code-block:: bash
@@ -360,7 +360,7 @@ At this point you can tail the logs and look for ``Finished!`` in ``/nemo-worksp
Deleting Cluster
----------------
-After you have finished using the created dask cluster, you can delete it to release the resources:
+After you have finished using the created Dask cluster, you can delete it to release the resources:
.. code-block:: bash
Original file line number Diff line number Diff line change
@@ -26,7 +26,7 @@ The tool utilizes `Dask <https://dask.org>`_ to parallelize tasks and hence it c
used to scale up to terabytes of data easily. Although Dask can be deployed on various
distributed compute environments such as HPC clusters, Kubernetes and other cloud
offerings such as AWS EKS, Google cloud etc, the current implementation only supports
-Dask on HPC clusters that use SLURM as the resource manager.
+Dask on HPC clusters that use Slurm as the resource manager.

-----------------------------------------
Usage
@@ -92,7 +92,7 @@ The PII redaction module can also be invoked via ``script/find_pii_and_deidentif

``python nemo_curator/scripts/find_pii_and_deidentify.py``

-To launch the script from within a SLURM environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used.
+To launch the script from within a Slurm environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used.


############################
25 changes: 25 additions & 0 deletions examples/README.md
@@ -0,0 +1,25 @@
# NeMo Curator Python API examples

This directory contains multiple Python scripts with examples of how to use various NeMo Curator classes and functions.
The goal of these examples is to give you an overview of the many ways your text data can be curated.
These include:

| Python Script | Description |
|---------------------------------------|---------------------------------------------------------------------------------------------------------------|
| blend_and_shuffle.py | Combine multiple datasets into one with different amounts of each dataset, then randomly permute the dataset. |
| classifier_filtering.py | Train a fastText classifier, then use it to filter high and low quality data. |
| download_arxiv.py | Download arXiv tar files and extract them. |
| download_common_crawl.py | Download Common Crawl WARC snapshots and extract them. |
| download_wikipedia.py | Download the latest Wikipedia dumps and extract them. |
| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the Unicode in it. |
| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
| translation_example.py | Create and use an `IndicTranslation` model for language translation. |

Before running any of these scripts, we strongly recommend running `python <script name>.py --help` to check which arguments are required or relevant.

The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities.
21 changes: 21 additions & 0 deletions examples/classifiers/README.md
@@ -0,0 +1,21 @@
## Text Classification

The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:

- Domain Classifier
- Quality Classifier
- AEGIS Safety Models
- FineWeb Educational Content Classifier

For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).

Each of these scripts provides a simple example of what your own Python scripts might look like.

At a high level, you will:

1. Create a Dask client by using the `get_client` function
2. Use `DocumentDataset.read_json` (or `DocumentDataset.read_parquet`) to read your data
3. Initialize and call the classifier on your data
4. Write your results to the desired output type with `to_json` or `to_parquet`

Before running any of these scripts, we strongly recommend running `python <script name>.py --help` to check which arguments are required or relevant.
5 changes: 5 additions & 0 deletions examples/k8s/README.md
@@ -0,0 +1,5 @@
## Kubernetes

The `create_dask_cluster.py` script can be used to create a CPU or GPU Dask cluster.

See [Running NeMo Curator on Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/kubernetescurator.html) for more information.
5 changes: 5 additions & 0 deletions examples/nemo_run/README.md
@@ -0,0 +1,5 @@
## NeMo-Run

The `launch_slurm.py` script shows an example of how to run a Slurm job via Python APIs.

See the [Dask with Slurm](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html?highlight=slurm#dask-with-slurm) and [NeMo-Run Quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html?highlight=slurm#execute-on-a-slurm-cluster) pages for more information.
4 changes: 2 additions & 2 deletions examples/nemo_run/launch_slurm.py
@@ -21,7 +21,7 @@
@run.factory
def nemo_curator_slurm_executor() -> SlurmExecutor:
"""
-Configure the following function with the details of your SLURM cluster
+Configure the following function with the details of your Slurm cluster
"""
return SlurmExecutor(
job_name_prefix="nemo-curator",
@@ -35,7 +35,7 @@ def nemo_curator_slurm_executor() -> SlurmExecutor:


def main():
-# Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
+# Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the Slurm cluster
container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
# The NeMo Curator command to run
# This command can be substituted with any NeMo Curator command
9 changes: 9 additions & 0 deletions examples/slurm/README.md
@@ -0,0 +1,9 @@
# Dask with Slurm

This directory provides an example Slurm script pipeline.
This pipeline has a script `start-slurm.sh` that provides configuration options similar to what `get_client` provides.
Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
`start-slurm.sh` calls `container-entrypoint.sh`, which sets up a Dask scheduler and workers across the cluster.

Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the `start-slurm.sh` script to run on multiple nodes.
You can adapt your scripts easily too by simply following the pattern of adding `get_client` with `add_distributed_args`.
2 changes: 1 addition & 1 deletion examples/slurm/start-slurm.sh
@@ -23,7 +23,7 @@
# Begin easy customization
# =================================================================

-# Base directory for all SLURM job logs and files
+# Base directory for all Slurm job logs and files
# Does not affect directories referenced in your script
export BASE_JOB_DIR=`pwd`/nemo-curator-jobs
export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID
2 changes: 1 addition & 1 deletion nemo_curator/modules/dataset_ops.py
@@ -117,7 +117,7 @@ def blend_datasets(
target_size: int, datasets: List[DocumentDataset], sampling_weights: List[float]
) -> DocumentDataset:
"""
-Combined multiple datasets into one with different amounts of each dataset
+Combines multiple datasets into one with different amounts of each dataset.
Args:
target_size: The number of documents the resulting dataset should have.
The actual size of the dataset may be slightly larger if the normalized weights do not allow
6 changes: 3 additions & 3 deletions nemo_curator/nemo_run/slurm.py
@@ -23,7 +23,7 @@
@dataclass
class SlurmJobConfig:
"""
-Configuration for running a NeMo Curator script on a SLURM cluster using
+Configuration for running a NeMo Curator script on a Slurm cluster using
NeMo Run
Args:
@@ -74,13 +74,13 @@ def to_script(self, add_scheduler_file: bool = True, add_device: bool = True):
add_scheduler_file: Automatically appends a '--scheduler-file' argument to the
script_command where the value is job_dir/logs/scheduler.json. All
scripts included in NeMo Curator accept and require this argument to scale
-properly on SLURM clusters.
+properly on Slurm clusters.
add_device: Automatically appends a '--device' argument to the script_command
where the value is the member variable of device. All scripts included in
NeMo Curator accept and require this argument.
Returns:
A NeMo Run Script that will initialize a Dask cluster, and run the specified command.
-It is designed to be executed on a SLURM cluster
+It is designed to be executed on a Slurm cluster
"""
env_vars = self._build_env_vars()

29 changes: 29 additions & 0 deletions nemo_curator/scripts/README.md
@@ -0,0 +1,29 @@
# NeMo Curator CLI Scripts

The following Python scripts are designed to be executed from the command line (terminal) only.

Here, we list all of the Python scripts and their terminal commands:

| Python Command | CLI Command |
|------------------------------------------|--------------------------------|
| python add_id.py | add_id |
| python blend_datasets.py | blend_datasets |
| python download_and_extract.py | download_and_extract |
| python filter_documents.py | filter_documents |
| python find_exact_duplicates.py | gpu_exact_dups |
| python find_matching_ngrams.py | find_matching_ngrams |
| python find_pii_and_deidentify.py | deidentify |
| python get_common_crawl_urls.py | get_common_crawl_urls |
| python get_wikipedia_urls.py | get_wikipedia_urls |
| python make_data_shards.py | make_data_shards |
| python prepare_fasttext_training_data.py | prepare_fasttext_training_data |
| python prepare_task_data.py | prepare_task_data |
| python remove_matching_ngrams.py | remove_matching_ngrams |
| python separate_by_metadata.py | separate_by_metadata |
| python text_cleaning.py | text_cleaning |
| python train_fasttext.py | train_fasttext |
| python verify_classification_results.py | verify_classification_results |

For more information about the arguments needed for each script, you can use `add_id --help`, etc.
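
The mapping in the table above can be captured in a small lookup, which may help when translating between the two invocation styles. The following is a minimal sketch and not part of NeMo Curator: the names `SCRIPT_TO_CLI`, `help_command`, and `renamed_commands` are hypothetical, and the entries are copied from the table.

```python
from pathlib import Path
from typing import Dict, List

# Script file name -> CLI command, copied from the table above.
SCRIPT_TO_CLI: Dict[str, str] = {
    "add_id.py": "add_id",
    "blend_datasets.py": "blend_datasets",
    "download_and_extract.py": "download_and_extract",
    "filter_documents.py": "filter_documents",
    "find_exact_duplicates.py": "gpu_exact_dups",
    "find_matching_ngrams.py": "find_matching_ngrams",
    "find_pii_and_deidentify.py": "deidentify",
    "get_common_crawl_urls.py": "get_common_crawl_urls",
    "get_wikipedia_urls.py": "get_wikipedia_urls",
    "make_data_shards.py": "make_data_shards",
    "prepare_fasttext_training_data.py": "prepare_fasttext_training_data",
    "prepare_task_data.py": "prepare_task_data",
    "remove_matching_ngrams.py": "remove_matching_ngrams",
    "separate_by_metadata.py": "separate_by_metadata",
    "text_cleaning.py": "text_cleaning",
    "train_fasttext.py": "train_fasttext",
    "verify_classification_results.py": "verify_classification_results",
}


def help_command(script_name: str) -> List[str]:
    """Return the argument list that prints the help text for a script's CLI command."""
    return [SCRIPT_TO_CLI[script_name], "--help"]


def renamed_commands() -> Dict[str, str]:
    """Return the scripts whose CLI command differs from the file name without `.py`."""
    return {
        script: cli
        for script, cli in SCRIPT_TO_CLI.items()
        if Path(script).stem != cli
    }


if __name__ == "__main__":
    for script, cli in renamed_commands().items():
        print(f"{script} is installed as `{cli}`")
```

Only two scripts in the table are exposed under a name that differs from the file name, so those are the ones worth double-checking before calling a command.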

More scripts can be found in the `classifiers`, `fuzzy_deduplication`, and `semdedup` subdirectories.
92 changes: 92 additions & 0 deletions nemo_curator/scripts/classifiers/README.md
@@ -0,0 +1,92 @@
## Text Classification

The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:

- Domain Classifier
- Quality Classifier
- AEGIS Safety Models
- FineWeb Educational Content Classifier

For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).

### Usage

#### Domain classifier inference

```bash
# same as `python domain_classifier_inference.py`
domain_classifier_inference \
--input-data-dir /path/to/data/directory \
--output-data-dir /path/to/output/directory \
--input-file-type "jsonl" \
--input-file-extension "jsonl" \
--output-file-type "jsonl" \
--input-text-field "text" \
--batch-size 64 \
--autocast \
--max-chars 2000 \
--device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information.

#### Quality classifier inference

```bash
# same as `python quality_classifier_inference.py`
quality_classifier_inference \
--input-data-dir /path/to/data/directory \
--output-data-dir /path/to/output/directory \
--input-file-type "jsonl" \
--input-file-extension "jsonl" \
--output-file-type "jsonl" \
--input-text-field "text" \
--batch-size 64 \
--autocast \
--max-chars 2000 \
--device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information.

#### AEGIS classifier inference

```bash
# same as `python aegis_classifier_inference.py`
aegis_classifier_inference \
--input-data-dir /path/to/data/directory \
--output-data-dir /path/to/output/directory \
--input-file-type "jsonl" \
--input-file-extension "jsonl" \
--output-file-type "jsonl" \
--input-text-field "text" \
--batch-size 64 \
--max-chars 6000 \
--device "gpu" \
--aegis-variant "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0" \
--token "hf_1234"
```

- `--aegis-variant` can be `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0`, `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0`, or a path to your own PEFT of LlamaGuard 2.
- `--token` is your Hugging Face token, which is used when downloading the base Llama Guard model.

Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information.

#### FineWeb-Edu classifier inference

```bash
# same as `python fineweb_edu_classifier_inference.py`
fineweb_edu_classifier_inference \
--input-data-dir /path/to/data/directory \
--output-data-dir /path/to/output/directory \
--input-file-type "jsonl" \
--input-file-extension "jsonl" \
--output-file-type "jsonl" \
--input-text-field "text" \
--batch-size 64 \
--autocast \
--max-chars 2000 \
--device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information.
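
Because the four commands above share most of their flags, the invocation can also be assembled programmatically. The sketch below is illustrative only: `build_classifier_command` is a hypothetical helper, not a NeMo Curator API, and it uses only the flags and values that appear in the examples above.

```python
import shlex
from typing import List, Optional


def build_classifier_command(
    entry_point: str,
    input_dir: str,
    output_dir: str,
    *,
    batch_size: int = 64,
    max_chars: int = 2000,
    autocast: bool = True,
    device: str = "gpu",
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble the argument list for one of the classifier entry points."""
    command = [
        entry_point,
        "--input-data-dir", input_dir,
        "--output-data-dir", output_dir,
        "--input-file-type", "jsonl",
        "--input-file-extension", "jsonl",
        "--output-file-type", "jsonl",
        "--input-text-field", "text",
        "--batch-size", str(batch_size),
        "--max-chars", str(max_chars),
        "--device", device,
    ]
    # --autocast is a switch without a value; the AEGIS example above omits it.
    if autocast:
        command.append("--autocast")
    # Script-specific flags, such as --aegis-variant and --token, go last.
    if extra_args:
        command.extend(extra_args)
    return command


if __name__ == "__main__":
    cmd = build_classifier_command(
        "domain_classifier_inference",
        "/path/to/data/directory",
        "/path/to/output/directory",
    )
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(cmd))
```

Once the entry points are installed, the returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.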