dedupe

Deduplication reference — exact matching, fuzzy matching, hash-based dedup, bloom filters, and data quality. Use when removing duplicate records, files, or data entries.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "dedupe" with this command: npx skills add xueyetianya/dedupe

Dedupe — Data Deduplication Reference

Quick-reference skill for deduplication strategies, algorithms, and data quality patterns.

When to Use

  • Removing duplicate rows from datasets or databases
  • Deduplicating files in storage systems
  • Implementing fuzzy matching for near-duplicate detection
  • Choosing between exact and probabilistic dedup methods
  • Building ETL pipelines with deduplication stages

Commands

intro

scripts/script.sh intro

Overview of deduplication — types, strategies, and tradeoffs.

exact

scripts/script.sh exact

Exact deduplication — hash-based, key-based, and sorting approaches.
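
The hash-based and key-based approaches named above can be sketched in a few lines. This is a minimal illustration, not the output of `scripts/script.sh exact`; the function names `record_fingerprint` and `dedupe_exact` and the sample rows are hypothetical.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a stable SHA-256 digest of a record's full content."""
    # sort_keys makes the serialization independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_exact(records, key=None):
    """Keep the first occurrence of each record, preserving input order.

    With `key`, two records are duplicates when key(record) matches
    (key-based dedup); without it, the whole record is hashed (hash-based).
    """
    seen = set()
    unique = []
    for record in records:
        marker = key(record) if key is not None else record_fingerprint(record)
        if marker not in seen:
            seen.add(marker)
            unique.append(record)
    return unique


# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"email": "a@example.com", "id": 1},  # same content, different key order
    {"id": 2, "email": "a@example.com"},  # same email, different id
]
hash_unique = dedupe_exact(rows)                             # whole-record match
key_unique = dedupe_exact(rows, key=lambda r: r["email"])    # match on email only
```

Hashing the whole record treats the third row as distinct, while keying on `email` collapses all three rows into one. Which behavior is right depends on what counts as "the same record" in the dataset.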

fuzzy

scripts/script.sh fuzzy

Fuzzy deduplication — similarity measures, blocking, and record linkage.
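
As a rough sketch of a similarity measure combined with blocking: the example below uses the standard library's `difflib.SequenceMatcher` as the similarity measure and the first character of the normalized name as the blocking key. Both choices, the 0.85 threshold, and the function names are assumptions made for illustration, not what the skill itself prescribes.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_pairs(names, threshold=0.85):
    """Return (a, b, score) for pairs in the same block scoring >= threshold."""
    # Blocking: only compare names that share a first character, which
    # avoids scoring every possible pair in the dataset.
    blocks = defaultdict(list)
    for name in names:
        norm = normalize(name)
        if norm:
            blocks[norm[0]].append(name)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs


# Hypothetical sample data.
names = ["Jon Smith", "John Smith", "jon  smith", "Jane Doe", "Alice Wong"]
matches = candidate_pairs(names)
```

A blocking key this coarse misses duplicates whose first character differs (for example a typo in the first letter), so real record-linkage pipelines usually combine several blocking keys.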

files

scripts/script.sh files

File-level deduplication — fdupes, jdupes, rdfind, and storage dedup.
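
The general idea behind content-based file dedup can be sketched as "group by size, then by content hash". This is an illustrative sketch only and makes no claim about how fdupes, jdupes, or rdfind are implemented; `find_duplicate_files` and `file_digest` are hypothetical names.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's content, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Return lists of paths under `root` that have identical content."""
    # Stage 1: files of different sizes cannot be identical, so group by
    # size first and skip hashing for sizes that occur only once.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Stage 2: within each size group, group by content hash.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

The function only reports duplicate groups and never deletes anything. Deciding which copy to keep is left to the caller.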

algorithms

scripts/script.sh algorithms

Dedup algorithms — bloom filters, HyperLogLog, MinHash, SimHash.
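
Of the algorithms listed, a Bloom filter is the one most directly used as a "have I seen this before?" check. The sketch below is a minimal pure-Python version, assuming SHA-256 with double hashing to derive the bit positions and the usual sizing formulas; the class and function names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.size = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so it is never zero
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedupe_stream(items, expected_items=1000):
    """Yield items not (probably) seen before, using constant memory."""
    seen = BloomFilter(expected_items)
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item
```

Because a Bloom filter can report a false positive, `dedupe_stream` may occasionally drop an item that was in fact new. That trade is acceptable when memory matters more than keeping every unique item; otherwise an exact set is the safer choice.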

sql

scripts/script.sh sql

SQL deduplication patterns — ROW_NUMBER, DISTINCT, GROUP BY strategies.
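
The ROW_NUMBER pattern can be tried end to end with Python's built-in `sqlite3` module. This sketch assumes a SQLite build with window-function support (3.25 or later); the `customers` table, its columns, and the "keep the newest row per email" rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO customers (id, email, updated_at) VALUES
        (1, 'a@example.com', '2024-01-01'),
        (2, 'a@example.com', '2024-03-01'),
        (3, 'b@example.com', '2024-02-01');
""")

# Number the rows within each email, newest first. Row 1 is the keeper;
# every row with rn > 1 is a duplicate and gets deleted.
conn.execute("""
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY updated_at DESC, id DESC
                   ) AS rn
            FROM customers
        )
        WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

Unlike `SELECT DISTINCT` or `GROUP BY`, this pattern lets the `ORDER BY` inside the window decide which duplicate survives. The `id DESC` tie-breaker keeps the result deterministic when timestamps are equal.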

cli

scripts/script.sh cli

Command-line dedup tools — sort, uniq, awk, and stream processing.
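
The behavioral differences between the common command-line idioms are easy to miss: `uniq` only collapses adjacent duplicates, `sort -u` removes all duplicates but reorders the input, and the `awk '!seen[$0]++'` idiom removes all duplicates while keeping first-seen order. The Python analogues below are an illustrative sketch of those three behaviors, with hypothetical function names.

```python
from itertools import groupby


def unique_lines(lines):
    """Order-preserving dedup, comparable to awk '!seen[$0]++'."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def collapse_adjacent(lines):
    """Collapse runs of identical adjacent lines, comparable to uniq."""
    for line, _run in groupby(lines):
        yield line


def sorted_unique(lines):
    """Sort and drop duplicates, comparable to sort -u."""
    return sorted(set(lines))


# Hypothetical sample input.
data = ["b", "a", "b", "b", "a"]
first_seen = list(unique_lines(data))
adjacent_only = list(collapse_adjacent(data))
sorted_only = sorted_unique(data)
```

`unique_lines` and `collapse_adjacent` are generators, so they work on a stream such as `sys.stdin` without loading it all at once, although `unique_lines` still keeps every distinct line in memory.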

checklist

scripts/script.sh checklist

Deduplication quality checklist and validation steps.
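
The skill's actual checklist is not reproduced on this page. As an assumption about what such validation typically covers, the sketch below compares a dataset before and after dedup: row counts, duplicates that survived, and keys that disappeared entirely. The function name and report fields are hypothetical.

```python
from collections import Counter


def validate_dedup(before, after, key):
    """Summarize a dedup run so obvious mistakes are visible at a glance."""
    before_keys = Counter(key(r) for r in before)
    after_keys = Counter(key(r) for r in after)
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_removed": len(before) - len(after),
        # Keys that still occur more than once: dedup was incomplete.
        "remaining_duplicates": sorted(k for k, n in after_keys.items() if n > 1),
        # Keys present before but gone afterwards: dedup removed too much.
        "lost_keys": sorted(set(before_keys) - set(after_keys)),
        # Keys that appear only afterwards: the output is not a subset of the input.
        "unexpected_keys": sorted(set(after_keys) - set(before_keys)),
    }


# Hypothetical sample data: one duplicate of "a" removed, nothing lost.
before = [{"id": "a"}, {"id": "a"}, {"id": "b"}]
after = [{"id": "a"}, {"id": "b"}]
report = validate_dedup(before, after, key=lambda r: r["id"])
```

A clean run has empty `remaining_duplicates`, `lost_keys`, and `unexpected_keys`, and a `rows_removed` count that matches the number of duplicates expected.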

help

scripts/script.sh help

version

scripts/script.sh version

Configuration

Variable      Description
DEDUPE_DIR    Data directory (default: ~/.dedupe/)

Powered by BytesAgain | bytesagain.com | hello@bytesagain.com

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
