dedupe

Deduplication reference covering exact matching, fuzzy matching, hash-based dedup, Bloom filters, and data quality. Use it when removing duplicate records, files, or data entries.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "dedupe" with this command: npx skills add xueyetianya/dedupe

Dedupe — Data Deduplication Reference

Quick-reference skill for deduplication strategies, algorithms, and data quality patterns.

When to Use

Removing duplicate rows from datasets or databases
Deduplicating files in storage systems
Implementing fuzzy matching for near-duplicate detection
Choosing between exact and probabilistic dedup methods
Building ETL pipelines with deduplication stages

Commands

`intro`

scripts/script.sh intro

Overview of deduplication — types, strategies, and tradeoffs.

`exact`

scripts/script.sh exact

Exact deduplication — hash-based, key-based, and sorting approaches.
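
As an illustration of the hash-based and key-based approaches named above (not the output of `scripts/script.sh exact`), here is a minimal Python sketch. The helper names `record_fingerprint` and `dedupe_exact`, the sample rows, and the choice to trim and lowercase key fields before hashing are all assumptions made for the example.

```python
import hashlib

def record_fingerprint(record, key_fields):
    """Hash the chosen key fields into one fixed-size fingerprint."""
    # Assumption: trimming and lowercasing is acceptable for these keys.
    parts = [str(record.get(field, "")).strip().lower() for field in key_fields]
    # Join with a separator unlikely to appear in data, so ("ab", "c")
    # and ("a", "bc") do not produce the same fingerprint.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def dedupe_exact(records, key_fields):
    """Keep the first record seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record, key_fields)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique

# Hypothetical sample data.
rows = [
    {"email": "ann@example.com", "name": "Ann"},
    {"email": "ANN@example.com ", "name": "Ann B."},
    {"email": "bob@example.com", "name": "Bob"},
]
unique_rows = dedupe_exact(rows, ["email"])
```

Keeping a set of fingerprints rather than whole records bounds memory per entry, at the cost of choosing up front which fields define identity.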

`fuzzy`

scripts/script.sh fuzzy

Fuzzy deduplication — similarity measures, blocking, and record linkage.
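
To make the similarity and blocking ideas concrete, the following Python sketch compares names only within a block and flags pairs above a threshold. It is an illustration under assumptions, not what the script prints: the first-letter blocking key, the 0.85 threshold, and the sample names are hypothetical, and `difflib.SequenceMatcher` stands in for whichever similarity measure a real pipeline would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_near_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity meets the threshold."""
    # Blocking: only compare names that share a cheap key, which avoids
    # the full pairwise comparison. The first letter is a hypothetical key.
    blocks = defaultdict(list)
    for name in names:
        blocks[name[:1].lower()].append(name)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical sample data.
names = ["Jonathan Smith", "Jonathon Smith", "Jane Doe", "Bob Stone"]
pairs = find_near_duplicates(names)
```

A coarse blocking key trades recall for speed: two records that differ in the key itself (for example a typo in the first letter) are never compared.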

`files`

scripts/script.sh files

File-level deduplication — fdupes, jdupes, rdfind, and storage dedup.
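
The general idea behind file-level duplicate finders can be sketched in Python: group files by size first, then hash only the files that share a size. This is a simplified illustration and does not claim to reproduce the exact behavior of fdupes, jdupes, or rdfind; the function names and the temporary demo files are assumptions.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_files(root):
    """Return groups of paths under root that have identical content hashes."""
    # Pass 1: files with a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the candidates that share a size.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]

# Demo on throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"diff")]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(data)
    groups = find_duplicate_files(tmp)
    duplicate_names = [[os.path.basename(p) for p in group] for group in groups]
```

The sketch only reports groups. Deleting or hard-linking duplicates is a separate, destructive step that is worth reviewing by hand.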

`algorithms`

scripts/script.sh algorithms

Dedup algorithms: Bloom filters (approximate set membership), HyperLogLog (distinct-count estimation), MinHash and SimHash (near-duplicate similarity).
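
Of the algorithms listed, the Bloom filter is the one most directly used to drop repeats from a stream. The Python sketch below is a minimal, unoptimized version for illustration. The class and function names, the 1024-bit size, and the use of salted SHA-256 as the hash family are assumptions, and real deployments size the filter from the expected item count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Approximate set: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function
        # with the hash index (an assumption made for simplicity).
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def dedupe_stream(items, bloom):
    """Yield items not seen before, according to the filter."""
    for item in items:
        if item in bloom:
            # Probably a repeat. A false positive here drops a new item.
            continue
        bloom.add(item)
        yield item

bloom = BloomFilter()
result = list(dedupe_stream(["a", "b", "a", "c", "b"], bloom))
```

Because false positives discard genuinely new items, this approach suits cases where occasionally losing a record is acceptable in exchange for constant memory.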

`sql`

scripts/script.sh sql

SQL deduplication patterns — ROW_NUMBER, DISTINCT, GROUP BY strategies.
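
The ROW_NUMBER pattern can be tried end to end with Python's built-in `sqlite3` module. In this sketch the `customers` table, its columns, and the rule "keep the most recently updated row per email" are hypothetical, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, email, updated_at) VALUES (?, ?, ?)",
    [
        (1, "ann@example.com", "2024-01-01"),
        (2, "ann@example.com", "2024-03-01"),
        (3, "bob@example.com", "2024-02-01"),
    ],
)

# Number the rows within each email, newest first, then delete
# everything except the first row of each partition.
conn.execute(
    """
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY updated_at DESC, id DESC
                   ) AS rn
            FROM customers
        )
        WHERE rn > 1
    )
    """
)

remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
conn.close()
```

Unlike `SELECT DISTINCT` or `GROUP BY`, which collapse rows in a query result, this pattern lets the `ORDER BY` inside the window decide which physical row survives.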

`cli`

scripts/script.sh cli

Command-line dedup tools — sort, uniq, awk, and stream processing.
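
A frequent source of confusion with these tools is that `uniq` collapses only adjacent repeats, which is why it is usually paired with `sort`, while the awk idiom `awk '!seen[$0]++'` keeps the first occurrence of each line in input order. The Python sketch below mirrors the two behaviors for comparison; the function names and sample lines are assumptions.

```python
from itertools import groupby

def uniq_adjacent(lines):
    """Like `uniq`: collapse runs of identical adjacent lines only."""
    return [key for key, _run in groupby(lines)]

def first_seen(lines):
    """Like awk '!seen[$0]++': keep first occurrences, preserve order."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept

# Hypothetical input lines.
lines = ["b", "a", "b", "b", "a"]

adjacent_only = uniq_adjacent(lines)          # non-adjacent repeats survive
sorted_unique = uniq_adjacent(sorted(lines))  # the `sort | uniq` combination
order_preserving = first_seen(lines)          # input order is kept
```

Sorting first loses the original order but needs no table of seen lines. The order-preserving variant holds every distinct line in memory.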

`checklist`

scripts/script.sh checklist

Deduplication quality checklist and validation steps.
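
The checklist itself is not reproduced on this page. As a sketch of the kind of validation such a step might include, the Python function below compares a dataset before and after deduplication. The function name, the report fields, and the sample records are all assumptions.

```python
from collections import Counter

def validate_dedup(before, after, key):
    """Compare record lists before and after dedup, using key(record)."""
    before_keys = Counter(key(record) for record in before)
    after_keys = Counter(key(record) for record in after)
    return {
        # How many rows the dedup step actually removed.
        "rows_removed": len(before) - len(after),
        # How many rows it should have removed if one row per key is kept.
        "expected_removed": len(before) - len(before_keys),
        # Keys that still appear more than once after dedup.
        "remaining_duplicates": sorted(k for k, n in after_keys.items() if n > 1),
        # Keys that disappeared entirely, which indicates over-deletion.
        "lost_keys": sorted(set(before_keys) - set(after_keys)),
    }

# Hypothetical sample data.
before = [{"id": 1}, {"id": 1}, {"id": 2}]
after = [{"id": 1}, {"id": 2}]
report = validate_dedup(before, after, key=lambda record: record["id"])
```

A clean run has `rows_removed` equal to `expected_removed` and both lists empty.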

`help`

scripts/script.sh help

`version`

scripts/script.sh version

Configuration

| Variable | Description |
| --- | --- |
| `DEDUPE_DIR` | Data directory (default: `~/.dedupe/`) |

Powered by BytesAgain | bytesagain.com | hello@bytesagain.com

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
