code-clone-assistant

Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).

Install skill "code-clone-assistant" with this command: npx skills add terrylica/cc-skills/terrylica-cc-skills-code-clone-assistant

Tools

  • PMD CPD v7.17.0+: Exact duplicate detection

  • Semgrep v1.140.0+: Pattern-based detection

Tested: October 2025 - 30 violations detected across 3 sample files

Coverage: ~3x more violations than using either tool alone

When to Use This Skill

Use this skill when:

  • Finding duplicate code in a codebase

  • Detecting DRY violations

  • Refactoring similar code patterns

  • Identifying copy-paste code

Why Two Tools?

PMD CPD and Semgrep detect different clone types:

| Aspect | PMD CPD | Semgrep |
| --- | --- | --- |
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within/across files (Pro only) |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |

Result: Using both tools together finds ~3x more DRY violations than either tool alone.

Clone Types

| Type | Description | PMD CPD | Semgrep |
| --- | --- | --- | --- |
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ --ignore-* | |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
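To make the Type-1 versus Type-2 distinction concrete, here is a minimal Python sketch of token-based comparison. It is not how PMD CPD is implemented: it uses Python's standard `tokenize` module as a stand-in, and the name `token_signature` and its `ignore_identifiers` parameter are hypothetical, loosely mirroring CPD's `--ignore-identifiers` option.

```python
from __future__ import annotations

import io
import keyword
import tokenize

# Tokens that carry no meaning for clone comparison.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Layout tokens are kept by kind only, so indentation width is ignored.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def token_signature(source: str, ignore_identifiers: bool = False) -> list[str]:
    """Reduce Python source to a token list that ignores formatting.

    With ignore_identifiers=True, every non-keyword name becomes "ID",
    so code that differs only in naming produces the same signature.
    """
    signature = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type in _LAYOUT:
            signature.append(tokenize.tok_name[tok.type])
        elif (
            ignore_identifiers
            and tok.type == tokenize.NAME
            and not keyword.iskeyword(tok.string)
        ):
            signature.append("ID")
        else:
            signature.append(tok.string)
    return signature


original = "def total(xs):\n    return sum(xs)\n"
reformatted = "def total( xs ):\n  return sum( xs )  # same logic\n"
renamed = "def count(items):\n    return sum(items)\n"
extended = "def count(items):\n    print(items)\n    return sum(items)\n"

# Type-1: identical once whitespace and comments are ignored.
type1 = token_signature(original) == token_signature(reformatted)
# Type-2: identical only after identifiers are normalized.
type2_strict = token_signature(original) == token_signature(renamed)
type2_loose = token_signature(original, True) == token_signature(renamed, True)
# Type-3: an added statement breaks exact token equality.
type3 = token_signature(original, True) == token_signature(extended, True)
```

Exact token equality stops working as soon as a statement is added or removed, which is where pattern-based rules become the more practical tool.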

Quick Start Workflow

Step 1: Detect exact duplicates (PMD CPD)

pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

Step 2: Detect pattern violations (Semgrep)

semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

Step 3: Analyze combined results (Claude Code)

Parse both outputs, prioritize by severity

Step 4: Refactor (Claude Code with user approval)

Extract shared functions, consolidate patterns, verify tests
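Step 3 can be sketched in Python for the Semgrep side. The example below reads the SARIF file written in Step 2 and orders findings by severity. It relies only on the standard SARIF 2.1.0 layout (`runs`, `results`, `level`, `locations`). The function name `load_semgrep_findings`, the ranking table, and the fallback to `warning` for results without an explicit level are assumptions made for illustration.

```python
from __future__ import annotations

import json

# Lower rank sorts first. Unknown levels sort last.
LEVEL_RANK = {"error": 0, "warning": 1, "note": 2, "none": 3}


def load_semgrep_findings(sarif_text: str) -> list[dict]:
    """Flatten SARIF results into dicts sorted by severity, file, and line."""
    sarif = json.loads(sarif_text)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append(
                {
                    "rule": result.get("ruleId", "unknown"),
                    # Assumption: treat a missing level as "warning".
                    "level": result.get("level", "warning"),
                    "file": physical.get("artifactLocation", {}).get("uri", "?"),
                    "line": physical.get("region", {}).get("startLine", 0),
                    "message": result.get("message", {}).get("text", ""),
                }
            )
    findings.sort(
        key=lambda f: (LEVEL_RANK.get(f["level"], 99), f["file"], f["line"])
    )
    return findings


# Hand-written sample input, not real tool output.
sample = json.dumps(
    {
        "version": "2.1.0",
        "runs": [
            {
                "results": [
                    {
                        "ruleId": "duplicate-validation",
                        "level": "note",
                        "message": {"text": "Similar validation logic"},
                        "locations": [
                            {
                                "physicalLocation": {
                                    "artifactLocation": {"uri": "b.py"},
                                    "region": {"startLine": 10},
                                }
                            }
                        ],
                    },
                    {
                        "ruleId": "duplicate-error-handling",
                        "level": "error",
                        "message": {"text": "Repeated try/except block"},
                        "locations": [
                            {
                                "physicalLocation": {
                                    "artifactLocation": {"uri": "a.py"},
                                    "region": {"startLine": 3},
                                }
                            }
                        ],
                    },
                ]
            }
        ],
    }
)

findings = load_semgrep_findings(sample)
```

In practice the input would be the contents of `semgrep-results.sarif`. The PMD CPD markdown report would need its own parser, which is not shown here.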

Accepted Exceptions (Known Intentional Duplication)

Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.

When Duplication Is Acceptable

| Pattern | Why Acceptable | Example |
| --- | --- | --- |
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each gen{NNN}/ is independent |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use sed placeholder replacement (PLACEHOLDER), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |

How to Report Accepted Exceptions

When clone detection finds duplication that matches an accepted exception pattern:

  • Report it — always show the user what was found (lines, tokens, files)

  • Flag as accepted — explicitly state it matches a known exception pattern

  • Explain why — cite the specific reason refactoring is not recommended

  • Do NOT recommend refactoring — this is the key difference from actionable findings

Example output format:

Code Clone Analysis Results

PMD CPD Findings:

Clone 1: 115 lines (575 tokens) - base_bars → signals CTEs
  gen610_template.sql:33 ↔ gen710_template.sql:38
  Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
  Reason: Each generation is immutable. Shared CTEs would break experiment provenance and reproducibility.

Clone 2: 36 lines (478 tokens) - metrics aggregation
  gen610_template.sql:207 ↔ gen710_template.sql:244
  Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0

Accepted Exceptions: 2
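A report in the format above could be generated along the following lines. This is a rough Python sketch: the `Clone` dataclass, its field names, and the `Status: ACTIONABLE` line for findings without an exception are assumptions for illustration, not part of the skill.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Clone:
    label: str                     # short description of the duplicated code
    lines: int
    tokens: int
    first: str                     # "file:line" of the first occurrence
    second: str                    # "file:line" of the second occurrence
    exception: str | None = None   # matched exception pattern, if any
    reason: str | None = None      # why refactoring is not recommended


def render_report(clones: list[Clone]) -> str:
    """Render findings, always reporting accepted exceptions but flagging them."""
    out = ["Code Clone Analysis Results", "", "PMD CPD Findings:"]
    for index, clone in enumerate(clones, start=1):
        out.append("")
        out.append(
            f"Clone {index}: {clone.lines} lines ({clone.tokens} tokens) - {clone.label}"
        )
        out.append(f"  {clone.first} ↔ {clone.second}")
        if clone.exception:
            out.append(f"  Status: ACCEPTED EXCEPTION ({clone.exception})")
            if clone.reason:
                out.append(f"  Reason: {clone.reason}")
        else:
            # Hypothetical label for findings that should be refactored.
            out.append("  Status: ACTIONABLE")
    accepted = sum(1 for clone in clones if clone.exception)
    out += [
        "",
        f"Actionable Findings: {len(clones) - accepted}",
        f"Accepted Exceptions: {accepted}",
    ]
    return "\n".join(out)


report = render_report(
    [
        Clone(
            label="metrics aggregation",
            lines=36,
            tokens=478,
            first="gen610_template.sql:207",
            second="gen710_template.sql:244",
            exception="SQL template without include mechanism",
        ),
        Clone(
            label="argument parsing",
            lines=12,
            tokens=90,
            first="cli_a.py:5",
            second="cli_b.py:8",
        ),
    ]
)
```

The second `Clone` is invented sample data, included only to show how an actionable finding would be counted.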

Project-Level Exception Configuration

Projects can declare accepted exception patterns in their CLAUDE.md:

Code Clone Exceptions

  • sql/gen*_template.sql — generation-per-directory experiments (immutable)
  • scripts/gen*/ — copy-and-adapt sweep scripts (no shared infrastructure)
  • tests/fixtures/ — intentional duplication for test isolation

When this section exists in a project's CLAUDE.md, the code-clone-assistant should check it before classifying findings.
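One way to perform that check is sketched below in Python. It assumes the section is a heading followed by bullets of the form "pattern, dash, reason" as in the example above, and it treats a pattern ending in `/` as matching everything under that directory. The function names `parse_exceptions` and `find_exception` are hypothetical, and the glob semantics come from Python's `fnmatch`, which may differ from what a given project intends.

```python
from __future__ import annotations

import re
from fnmatch import fnmatchcase

# Bullet marker, optional backticks around the pattern, a dash, then the reason.
# \u2022 is the bullet character and \u2014 is the long dash used in the example.
_BULLET = re.compile(
    r"^\s*[-*\u2022]\s+`?(?P<pattern>[^`\s]+)`?\s+(?:\u2014|--|-)\s+(?P<reason>.+)$"
)


def parse_exceptions(
    claude_md: str, heading: str = "Code Clone Exceptions"
) -> list[tuple[str, str]]:
    """Return (glob pattern, reason) pairs listed under the given heading."""
    rules = []
    in_section = False
    for line in claude_md.splitlines():
        stripped = line.strip()
        if stripped.lstrip("#").strip() == heading:
            in_section = True
            continue
        if not in_section:
            continue
        if stripped.startswith("#"):
            break  # the next heading ends the section
        match = _BULLET.match(line)
        if match:
            rules.append((match["pattern"], match["reason"].strip()))
    return rules


def find_exception(path: str, rules: list[tuple[str, str]]) -> str | None:
    """Return the reason of the first rule matching path, or None."""
    for pattern, reason in rules:
        # Assumption: a trailing slash means "anything inside this directory".
        glob = pattern + "*" if pattern.endswith("/") else pattern
        if fnmatchcase(path, glob):
            return reason
    return None


claude_md = (
    "# Project notes\n"
    "\n"
    "## Code Clone Exceptions\n"
    "\n"
    "- `sql/gen*_template.sql` \u2014 generation-per-directory experiments (immutable)\n"
    "- scripts/gen*/ \u2014 copy-and-adapt sweep scripts (no shared infrastructure)\n"
    "- tests/fixtures/ \u2014 intentional duplication for test isolation\n"
    "\n"
    "## Unrelated section\n"
    "- docs/ \u2014 not a clone exception\n"
)

rules = parse_exceptions(claude_md)
sql_reason = find_exception("sql/gen610_template.sql", rules)
script_reason = find_exception("scripts/gen610/sweep.sh", rules)
app_reason = find_exception("src/app.py", rules)
```

A finding would be classified as an accepted exception only when both files in the clone pair match a rule. That policy is a design choice left to the caller.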

Reference Documentation

For detailed information, see:

  • Detection Commands - PMD CPD and Semgrep command details

  • Complete Workflow - Detection, analysis, and presentation phases

  • Refactoring Strategies - Approaches for addressing violations

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| PMD CPD not found | Not installed or not in PATH | brew install pmd or download from PMD releases |
| Semgrep timeout | Large codebase scan | Use --exclude to limit scope |
| No duplicates detected | minimum-tokens too high | Lower --minimum-tokens value (try 15) |
| Too many false positives | minimum-tokens too low | Increase --minimum-tokens (try 30+) |
| Language not recognized | Wrong -l flag | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set PMD_JAVA_OPTS=-Xmx4g |
| Missing clone rules file | Custom rules not created | Create clone-rules.yaml or use default config |
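The "not found" case in the table above can be caught before a scan starts. The following Python sketch checks whether the two executables are on PATH using the standard library's `shutil.which`. The function name `missing_tools` is hypothetical, and the executable names `pmd` and `semgrep` are taken from the commands in the Quick Start Workflow.

```python
from __future__ import annotations

import shutil


def missing_tools(tools: tuple[str, ...] = ("pmd", "semgrep")) -> list[str]:
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# A name that should not exist on any system, used to show the failure path.
absent = missing_tools(("surely-not-an-installed-tool-xyz-123",))

if __name__ == "__main__":
    for tool in missing_tools():
        print(f"{tool} not found on PATH; see the Troubleshooting table")
```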
