# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.
## When to use NeMo Curator

Use NeMo Curator when you are:

- Preparing LLM training data from web scrapes (Common Crawl)
- Running large-scale deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster
Performance:

- 16× faster fuzzy deduplication (8 TB RedPajama v2)
- 40% lower TCO vs CPU alternatives
- Near-linear scaling across GPU nodes
Use alternatives instead:

- datatrove: CPU-based, open-source data processing
- dolma: Allen AI's data toolkit
- Ray Data: general ML data processing (no curation focus)
## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```
### Basic text curation pipeline

```python
import pandas as pd

from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset.from_pandas(df)

# Quality filtering: keep documents with more than five words
def quality_score(doc):
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```
## Data curation pipeline

### Stage 1: Quality filtering

Apply any of the 30+ built-in heuristic filters (a sketch of a custom filter follows this block):

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```
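The built-in heuristics share a score-then-keep interface. A minimal custom filter might look like the sketch below; it assumes the `DocumentFilter` base class with `score_document`/`keep_document` methods, so verify the exact names against your installed version. `MeanWordLengthFilter` and its bounds are hypothetical, for illustration only.

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import DocumentFilter

class MeanWordLengthFilter(DocumentFilter):
    """Hypothetical heuristic: drop documents whose mean word length
    falls outside a plausible natural-language range."""

    def __init__(self, min_mean: float = 3.0, max_mean: float = 10.0):
        super().__init__()
        self._min = min_mean
        self._max = max_mean

    def score_document(self, text: str) -> float:
        # Score each document by its mean word length
        words = text.split()
        return sum(len(w) for w in words) / len(words) if words else 0.0

    def keep_document(self, score: float) -> bool:
        # Keep only documents whose score lies in the allowed range
        return self._min <= score <= self._max

# Wrap the filter in ScoreFilter to apply it to a DocumentDataset
dataset = ScoreFilter(MeanWordLengthFilter(), text_field="text")(dataset)
```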
### Stage 2: Deduplication

Exact deduplication:

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```
Fuzzy deduplication (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash signature size
    num_buckets=20,
    hash_method="md5",
)
deduped = fuzzy_dedup(dataset)
```
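To see what `num_hashes` and `num_buckets` trade off, it helps to look at the standard MinHash/LSH collision curve. The sketch below assumes the 260 hashes split into 20 bands of 13 rows (b × r = num_hashes; that mapping is an assumption, not read from NeMo Curator's source): a pair with Jaccard similarity s is bucketed together in at least one band with probability 1 − (1 − s¹³)²⁰.

```python
# Probability that a pair with Jaccard similarity s shares a bucket in at
# least one LSH band, assuming b bands of r rows with b * r = num_hashes.
# Assumption: num_hashes=260, num_buckets=20 corresponds to b=20, r=13.
def lsh_match_probability(s: float, b: int = 20, r: int = 13) -> float:
    return 1.0 - (1.0 - s**r) ** b

for s in (0.5, 0.7, 0.8, 0.9):
    print(f"Jaccard {s:.1f} -> caught with p = {lsh_match_probability(s):.3f}")

# The steep region of the curve sits near s ~= (1/b)**(1/r) ~= 0.79, so pairs
# above ~0.8 similarity are almost always flagged, dissimilar pairs rarely.
```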
Semantic deduplication:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)
```
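Independent of NeMo Curator's wrapper, the mechanics behind a cosine-similarity threshold can be shown with sentence-transformers directly. This is a toy illustration of the idea, not the library's implementation:

```python
from sentence_transformers import SentenceTransformer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown fox jumped over a lazy dog.",  # near-duplicate
    "Stock markets closed higher on Friday.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# normalize_embeddings=True makes the dot product equal cosine similarity
emb = model.encode(texts, normalize_embeddings=True)
sim = emb @ emb.T  # pairwise cosine similarities

# Greedy pass: drop any document too similar to one already kept
threshold = 0.8
kept = []
for i in range(len(texts)):
    if all(sim[i, j] < threshold for j in kept):
        kept.append(i)
print([texts[i] for i in kept])  # the near-duplicate is removed
```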
### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)
redacted = Modify(pii_redactor)(dataset)
```
### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```
## GPU acceleration

### GPU vs CPU performance

| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8 TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1 TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |
### Multi-GPU scaling

```python
from nemo_curator import get_client

# Initialize a Dask GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Processing now runs across 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```
## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```
### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scene boundaries
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips between 2 and 10 seconds
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```
### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by word error rate (WER)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```
## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),
    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),
    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
    # 4. PII redaction
    PIIRedactor(),
    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute each stage in order
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```
### Distributed processing

```python
from dask_cuda import LocalCUDACluster

from nemo_curator import get_client

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process a large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```
## Performance benchmarks

Fuzzy deduplication (8 TB RedPajama v2):

- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×

Exact deduplication (1 TB):

- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×

Quality filtering (100 GB):

- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×
## Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

- Cost: $3.60/hour × 10 = $36.00/hour
- Time for 8 TB: 120 hours
- Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8 TB: 7.5 hours
- Total: $491.55

Savings: ~89% reduction ($3,828 saved)
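The arithmetic above is easy to re-derive; a quick sanity check, using the hourly rates quoted in this section (AWS pricing changes over time, so treat the numbers as illustrative):

```python
# Re-derive the cost comparison above (rates as quoted in this section).
cpu_total = 3.60 * 10 * 120   # $/hr per node x 10 nodes x 120 hours = $4,320.00
gpu_total = 32.77 * 2 * 7.5   # = $491.55
savings = cpu_total - gpu_total       # ~= $3,828.45
reduction = savings / cpu_total       # ~= 0.886 -> ~89%
print(f"saved ${savings:,.2f} ({reduction:.0%} reduction)")
```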
## Supported data formats

- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal data
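A typical round trip through these formats, assuming the `read_json`/`to_parquet` helpers on `DocumentDataset` (verify the exact signatures against your installed version):

```python
from nemo_curator.datasets import DocumentDataset

# Read JSONL shards and write the curated output as Parquet.
# Assumption: read_json accepts a glob of files; check your installed
# nemo-curator version for the exact signature.
dataset = DocumentDataset.read_json("raw_data/*.jsonl")
# ... curation stages go here ...
dataset.to_parquet("curated_data/")
```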
## Use cases

Production deployments:

- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile
## References

- Filtering Guide - 30+ quality filters and heuristics
- Deduplication Guide - exact, fuzzy, and semantic methods
## Resources

- GitHub: https://github.com/NVIDIA/NeMo-Curator (⭐ 500+)
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0