Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files #348
Comments
Thanks for opening the issue! Quick question, though. Looking at your code, I see these import statements:
from nemo.collections.nlp.data.data_curator import DataCurator
from nemo.collections.nlp.data.data_curator.filters import (
    DeduplicationFilter,
    QualityFilter,
    ContentFilter,
    ToxicityFilter,
)
from nemo.collections.nlp.data.data_curator.processors import TextNormalizationProcessor
Why are you using these import statements?
Hello Ryan, I have a JSONL file with the following structure:
json
{
I’d like to integrate NVIDIA NeMo into my code to improve data quality. Could you provide guidance on the best way to accomplish this?
The best procedure depends a lot on what you are trying to do and where the data came from. Are you trying to pretrain, fine-tune, build a RAG system, or do something else entirely? Is this data scraped from the web, or does it come from a source that has already processed it in some way?
In general, you might want to try heuristic filtering, or classifier filtering with a fastText model, to remove low-quality data. Then you can run fuzzy deduplication to identify similar documents (note: you'll need an NVIDIA GPU for this step). However, if your dataset is small (<= 100,000 documents), you may not benefit from filtering at all.
I would also appreciate it if you could tell me where you got these import statements from and whether they work at all:
from nemo.collections.nlp.data.data_curator import DataCurator
from nemo.collections.nlp.data.data_curator.filters import (
    DeduplicationFilter,
    QualityFilter,
    ContentFilter,
    ToxicityFilter,
)
from nemo.collections.nlp.data.data_curator.processors import TextNormalizationProcessor
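To make the two ideas above concrete, here is a minimal standard-library sketch of a word-count heuristic filter followed by a naive near-duplicate check based on Jaccard similarity of word shingles. It is not the NeMo Curator API and not its GPU fuzzy-deduplication implementation; the field name `text`, the thresholds, and all function names are assumptions chosen for illustration, and the pairwise comparison only suits small datasets.

```python
import re


def word_count_ok(text, min_words=5, max_words=100_000):
    """Heuristic filter: keep documents whose word count is within range."""
    n = len(text.split())
    return min_words <= n <= max_words


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_and_dedup(docs, text_field="text", threshold=0.8):
    """Drop documents that fail the heuristic or closely match a kept one."""
    kept, kept_shingles = [], []
    for doc in docs:
        text = doc.get(text_field, "")
        if not word_count_ok(text):
            continue
        s = shingles(text)
        # Naive O(n^2) comparison against every document kept so far.
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

Calling `filter_and_dedup(records)` on a list of parsed JSONL records returns the surviving records in their original order. For large corpora, a MinHash/LSH approach is the usual replacement for the pairwise loop.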
I can also recommend some tutorials that show how to use features of NeMo Curator. The tutorials folder has them all, but here are a few highlights that focus on pre-training.
Bug Description
The NeMo Curator DataCurator class does not seem to work as expected when running a data curation script in the NVIDIA NeMo Docker container (nvcr.io/nvidia/nemo:24.07). The script loads a JSONL file, applies filters, and writes curated results to a new JSONL file, but fails during the curation process.
Steps to Reproduce the Bug
Pull the NVIDIA NeMo container:
bash
sudo docker pull nvcr.io/nvidia/nemo:24.07
Run the container with volume mounts:
bash
sudo docker run -it --rm \
  -v /home/aghadghadi/script:/workspace \
  -v /home/aghadghadi/test:/workspace/data \
  nvcr.io/nvidia/nemo:24.07
Inside the container, execute the Python script using the following code:
python
import json

from nemo.collections.nlp.data.data_curator import DataCurator
from nemo.collections.nlp.data.data_curator.filters import (
    DeduplicationFilter,
    QualityFilter,
    ContentFilter,
    ToxicityFilter,
)
from nemo.collections.nlp.data.data_curator.processors import TextNormalizationProcessor

# Create the curator and configure the filters
curator = DataCurator()
deduplication_filter = DeduplicationFilter()
quality_filter = QualityFilter(min_quality_score=0.8)
content_filter = ContentFilter(keywords=["relevant_topic"], exclude_keywords=["unwanted_topic"])
toxicity_filter = ToxicityFilter(max_toxicity_score=0.2)

# Register the filters in the order they should run
curator.add_filter(deduplication_filter)
curator.add_filter(quality_filter)
curator.add_filter(content_filter)
curator.add_filter(toxicity_filter)

# Register the text normalization step
normalization_processor = TextNormalizationProcessor(lowercase=True, remove_punctuation=True)
curator.add_processor(normalization_processor)

input_path = '/workspace/data/falcon.jsonl'
output_path = '/workspace/data/falcon_sampled.jsonl'

# Curate the input file one JSON line at a time
curated_data = []
with open(input_path, 'r', encoding='utf-8') as f:
    for line in f:
        data_entry = json.loads(line)  # Parse each line as JSON
        data = curator.curate_data(data_entry)
        if data is not None:
            curated_data.append(json.dumps(data))

# Write the surviving entries, one JSON object per line
with open(output_path, 'w', encoding='utf-8') as f:
    for entry in curated_data:
        f.write(entry + "\n")

print("Data curation complete. Curated data saved to", output_path)
Expected Behavior
The script should:
Load each line of the JSONL file, apply defined filters, and normalize text.
Write the curated data back to a new JSONL file (falcon_sampled.jsonl) without issues.
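For reference, the expected behavior described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the NeMo Curator API: the `text` field name, the `keep` predicate, and the function names are hypothetical, and normalization here only lowercases and strips ASCII punctuation.

```python
import json
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


def curate_jsonl(input_path, output_path, keep, text_field="text"):
    """Stream a JSONL file, keep records accepted by `keep`, normalize, write.

    Returns the number of records written.
    """
    written = 0
    with open(input_path, "r", encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if not keep(record):
                continue
            record[text_field] = normalize(record.get(text_field, ""))
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Writing each record as soon as it is processed, instead of collecting everything in a list first, keeps memory use flat for large input files.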
Actual Behavior
The script fails during the curation process, with errors related to loading or filtering the JSON data lines.
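Since the report does not include a traceback, one way to rule out the input file as the cause is to check every line independently. The sketch below is a hypothetical helper (the name `find_bad_lines` and the `max_report` parameter are assumptions, not part of any library) that reports the line numbers that fail to parse or are not JSON objects.

```python
import json


def find_bad_lines(path, max_report=10):
    """Return up to `max_report` (line_number, reason) pairs for invalid lines."""
    bad = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored rather than reported
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                bad.append((lineno, exc.msg))
            else:
                if not isinstance(obj, dict):
                    bad.append((lineno, "not a JSON object"))
            if len(bad) >= max_report:
                break
    return bad
```

An empty result means every non-blank line is a well-formed JSON object, which would point the investigation at the curation calls rather than at parsing.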
Environment Overview
Environment location: Docker
Docker run command:
bash
sudo docker run -it --rm \
  -v /home/aghadghadi/script:/workspace \
  -v /home/aghadghadi/test:/workspace/data \
  nvcr.io/nvidia/nemo:24.07
NeMo Curator Installation: Pre-installed in Docker image nvcr.io/nvidia/nemo:24.07
Environment Details
The base OS, Dask, and Python versions are those shipped in the NVIDIA Docker image; none were modified.
Additional Context
The JSONL file is large and contains structured JSON lines.
Mounted volumes are set correctly, and file paths are accessible within the container.
Questions or Suggestions:
Please confirm compatibility with the DataCurator API in this container image.
Could there be an issue with JSON parsing or filter application within curate_data?
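A quick way to answer the compatibility question from inside the container is to check whether the module paths used in the script can be located at all. The sketch below uses only `importlib` from the standard library; the helper name `module_available` is hypothetical, and listing `nemo_curator` as a candidate package name is an assumption to verify against the container, not a confirmed fact about this image.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if `dotted_name` can be located, without importing it fully.

    Note: for a dotted name, find_spec imports the parent packages, so a
    missing parent raises ModuleNotFoundError, which is treated as "absent".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    candidates = (
        "nemo.collections.nlp.data.data_curator",  # path used in the script
        "nemo_curator",  # assumed package name, to be verified
    )
    for name in candidates:
        print(f"{name}: {'found' if module_available(name) else 'not found'}")
```

If the first path is reported as not found, the failure is an import problem and not an issue with JSON parsing or filter application.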