deduplication

Canonical selection with reputation scoring and hash-based grouping for multi-source data.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "deduplication" with this command: npx skills add dadbodgeoff/drift/dadbodgeoff-drift-deduplication

Event Deduplication

Canonical selection with reputation scoring and hash-based grouping for multi-source data.

When to Use This Skill

  • Aggregating data from multiple sources (news, events, products)

  • Same content appears from different outlets/sources

  • Need to pick the "best" version from duplicates

  • Tracking deduplication metrics for optimization

Core Concepts

Simple URL deduplication isn't enough. Production needs:

  • Grouping by semantic similarity (same story, different outlets)

  • Canonical selection (pick the "best" version)

  • Reputation scoring (prefer authoritative sources)

  • Both ID-based and content-based deduplication

Two modes:

  • ID-based: When sources have unique IDs, keep the "best" version when IDs collide

  • Content-based: Group by semantic similarity, select canonical from each group

Implementation

TypeScript

import { createHash } from 'crypto';

interface DeduplicationResult<T> { items: T[]; originalCount: number; dedupedCount: number; reductionPercent: number; duplicateGroups?: number; }

// ============================================ // ID-Based Deduplication // ============================================

function deduplicateById<T extends { id: string }>( items: T[], preferFn: (existing: T, candidate: T) => T ): DeduplicationResult<T> { const seen = new Map<string, T>();

for (const item of items) { const existing = seen.get(item.id); if (existing) { seen.set(item.id, preferFn(existing, item)); } else { seen.set(item.id, item); } }

const dedupedItems = Array.from(seen.values()); const reductionPercent = items.length > 0 ? Math.round((1 - dedupedItems.length / items.length) * 100) : 0;

return { items: dedupedItems, originalCount: items.length, dedupedCount: dedupedItems.length, reductionPercent, }; }

// ============================================ // Content-Based Deduplication // ============================================

interface Article { title: string; url: string; domain: string; publishedAt: string; tone?: number; }

/**

  • Generate deduplication key from content
  • Groups by: normalized title + source country + date */ function generateDedupKey(article: Article): string { const normalizedTitle = article.title .toLowerCase() .replace(/[^\w\s]/g, '') .trim() .slice(0, 50);

const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown';

return ${normalizedTitle}|${dateStr}; }

/**

  • Generate unique ID from URL */ function generateEventId(url: string): string { return createHash('md5').update(url).digest('hex').slice(0, 12); }

/**

  • Source reputation scoring */ function getReputationScore(domain: string): number { // Tier 1: Wire services and major international const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk', 'aljazeera.com', 'france24.com', 'dw.com']; if (tier1.some(r => domain.includes(r))) return 100;

// Tier 2: Major newspapers const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com', 'ft.com', 'economist.com', 'wsj.com']; if (tier2.some(r => domain.includes(r))) return 75;

// Tier 3: Regional/national const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com']; if (tier3.some(r => domain.includes(r))) return 50;

return 10; }

/**

  • Select canonical article from duplicate group */ function selectCanonical<T extends Article>( group: { item: T; source: string }[] ): { item: T; source: string } { return group.reduce((best, current) => { const bestScore = getReputationScore(best.item.domain) + Math.abs(best.item.tone || 0); const currentScore = getReputationScore(current.item.domain) + Math.abs(current.item.tone || 0);

    return currentScore > bestScore ? current : best; }); }

/**

  • Deduplicate articles from multiple sources */ function deduplicateArticles<T extends Article>( sourceResults: { sourceName: string; articles: T[] }[] ): DeduplicationResult<T & { source: string }> { const groups = new Map<string, { item: T; source: string }[]>(); let totalArticles = 0;

// Group articles by dedup key for (const { sourceName, articles } of sourceResults) { for (const article of articles) { totalArticles++; const key = generateDedupKey(article);

  if (!groups.has(key)) {
    groups.set(key, []);
  }
  groups.get(key)!.push({ item: article, source: sourceName });
}

}

// Select canonical article from each group const items: (T & { source: string })[] = [];

for (const group of groups.values()) { const canonical = selectCanonical(group); items.push({ ...canonical.item, source: canonical.source }); }

const reductionPercent = totalArticles > 0 ? Math.round((1 - items.length / totalArticles) * 100) : 0;

console.log([Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction));

return { items, originalCount: totalArticles, dedupedCount: items.length, reductionPercent, duplicateGroups: groups.size, }; }

Usage Examples

ID-Based Deduplication

const events = await fetchEvents();

const result = deduplicateById(events, (existing, candidate) => { // Prefer events with coordinates if (!existing.lat && candidate.lat) return candidate; // Prefer higher sentiment magnitude if (Math.abs(candidate.sentiment) > Math.abs(existing.sentiment)) { return candidate; } return existing; });

console.log(Reduced ${result.reductionPercent}% duplicates);

Multi-Source Aggregation

const results = await Promise.all([ fetchFromSourceA(), fetchFromSourceB(), fetchFromSourceC(), ]);

const { items, reductionPercent } = deduplicateArticles([ { sourceName: 'source-a', articles: results[0] }, { sourceName: 'source-b', articles: results[1] }, { sourceName: 'source-c', articles: results[2] }, ]);

// items now contains canonical articles with source attribution

Best Practices

  • Semantic grouping - Group by normalized content, not just URL

  • Reputation scoring - Prefer authoritative sources as canonical

  • Best version selection - When IDs collide, keep version with most data

  • Reduction tracking - Log how much deduplication helped

  • Source attribution - Track which source the canonical came from

Common Mistakes

  • Simple URL deduplication (misses same story from different outlets)

  • Random selection from duplicates (lose quality signal)

  • No normalization (case/punctuation differences create false negatives)

  • Not tracking reduction metrics (can't optimize)

  • Hardcoded source lists (make configurable)

Related Patterns

  • batch-processing - Process deduplicated items efficiently

  • validation-quarantine - Validate before deduplication

  • checkpoint-resume - Track which files have been deduplicated

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

oauth-social-login

No summary provided by upstream source.

Repository SourceNeeds Review
General

sse-streaming

No summary provided by upstream source.

Repository SourceNeeds Review
General

multi-tenancy

No summary provided by upstream source.

Repository SourceNeeds Review