Dedupe — Data Deduplication Reference
Quick-reference skill for deduplication strategies, algorithms, and data quality patterns.
When to Use
- Removing duplicate rows from datasets or databases
- Deduplicating files in storage systems
- Implementing fuzzy matching for near-duplicate detection
- Choosing between exact and probabilistic dedup methods
- Building ETL pipelines with deduplication stages
Commands
intro
scripts/script.sh intro
Overview of deduplication — types, strategies, and tradeoffs.
exact
scripts/script.sh exact
Exact deduplication — hash-based, key-based, and sorting approaches.
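
As a rough illustration of the hash-based and key-based approaches, the sketch below fingerprints each row and keeps the first occurrence. It assumes rows are plain dicts; `row_fingerprint` and `dedupe_exact` are hypothetical helper names, not part of `scripts/script.sh`.

```python
import hashlib
import json


def row_fingerprint(row, keys=None):
    """Return a SHA-256 hex digest for a dict row, optionally limited to `keys`."""
    selected = row if keys is None else {k: row.get(k) for k in keys}
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(selected, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe_exact(rows, keys=None):
    """Keep the first row for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = row_fingerprint(row, keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Illustrative data (not from any real dataset).
rows = [
    {"id": 1, "email": "a@example.com"},
    {"email": "a@example.com", "id": 1},  # same row, different key order
    {"id": 2, "email": "a@example.com"},
]
whole_row = dedupe_exact(rows)                # hash-based on the full row
by_email = dedupe_exact(rows, keys=["email"])  # key-based on one column
```

Storing fingerprints rather than whole rows keeps the `seen` set small, at the cost of hashing every row once.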
fuzzy
scripts/script.sh fuzzy
Fuzzy deduplication — similarity measures, blocking, and record linkage.
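
A minimal sketch of blocking plus a similarity measure, using only the standard library. The blocking rule (first character of the normalized name), the 0.85 threshold, and the function names are assumptions chosen for illustration; real record linkage usually needs richer blocking keys and tuned thresholds.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def block_key(name):
    """Hypothetical blocking rule: first character of the normalized name."""
    return normalize(name)[:1]


def find_near_duplicates(names, threshold=0.85):
    """Return (i, j, score) for pairs in the same block scoring >= threshold."""
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        blocks[block_key(name)].append(idx)

    pairs = []
    for members in blocks.values():
        # Only compare records inside a block, not across the whole dataset.
        for i, j in combinations(members, 2):
            score = SequenceMatcher(
                None, normalize(names[i]), normalize(names[j])
            ).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 3)))
    return pairs


# Illustrative data.
names = ["Jon Smith", "John Smith", "Jane Doe", "jon  smith"]
matches = find_near_duplicates(names)
```

Blocking trades recall for speed: pairs that land in different blocks (for example a typo in the first letter) are never compared.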
files
scripts/script.sh files
File-level deduplication — fdupes, jdupes, rdfind, and storage dedup.
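
The general idea behind file-level duplicate finders can be sketched as: group files by size, then hash only the files whose size collides. This is a simplified illustration under that assumption, not a reimplementation of fdupes, jdupes, or rdfind; `find_duplicate_files` is a hypothetical name.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicate_files(root):
    """Return lists of paths under `root` that share identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so one file is not reported as its own duplicate.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_content[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_content.values() if len(paths) > 1]
```

The function only reports groups; deciding which copy to delete or hard-link is left to the caller.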
algorithms
scripts/script.sh algorithms
Dedup algorithms — bloom filters, HyperLogLog, MinHash, SimHash.
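
Of the structures listed above, a Bloom filter is the simplest to sketch: it answers "possibly seen" or "definitely not seen" in fixed memory. The version below is a minimal illustration that uses the standard sizing formulas and derives its bit positions from one SHA-256 digest via double hashing; the class and its parameters are assumptions, not an API of this skill.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        ))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        # Double hashing: position_i = h1 + i * h2 (mod m)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Streaming dedup: emit a record only if the filter has not seen it.
seen = BloomFilter(expected_items=1000)
stream = ["a", "b", "a", "c", "b"]
first_seen = []
for record in stream:
    if record not in seen:
        seen.add(record)
        first_seen.append(record)
```

A false positive here means a genuinely new record is dropped as a duplicate, so this fits cases where occasional loss is acceptable in exchange for bounded memory.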
sql
scripts/script.sh sql
SQL deduplication patterns — ROW_NUMBER, DISTINCT, GROUP BY strategies.
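
The ROW_NUMBER pattern can be tried end to end with Python's built-in `sqlite3` module. This sketch assumes a SQLite build with window-function support (3.25 or later) and uses a hypothetical `customers` table; it keeps the most recently updated row per email and deletes the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
INSERT INTO customers (id, email, updated_at) VALUES
  (1, 'a@example.com', '2024-01-01'),
  (2, 'a@example.com', '2024-03-01'),
  (3, 'b@example.com', '2024-02-01');
""")

# Rank rows within each email, newest first, and delete everything ranked > 1.
conn.execute("""
DELETE FROM customers
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY email
                   ORDER BY updated_at DESC, id DESC
               ) AS rn
        FROM customers
    )
    WHERE rn > 1
)
""")
conn.commit()

remaining = conn.execute(
    "SELECT id, email FROM customers ORDER BY id"
).fetchall()
```

Unlike `SELECT DISTINCT` or `GROUP BY`, this pattern lets the `ORDER BY` inside the window decide which duplicate survives.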
cli
scripts/script.sh cli
Command-line dedup tools — sort, uniq, awk, and stream processing.
checklist
scripts/script.sh checklist
Deduplication quality checklist and validation steps.
help
scripts/script.sh help
version
scripts/script.sh version
Configuration
| Variable | Description |
|---|---|
| DEDUPE_DIR | Data directory (default: ~/.dedupe/) |
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com