smart-code-search

Search code and docs by meaning, not keywords. Powered by ColGREP/NextPlaid.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "smart-code-search" with this command: npx skills add brettmhammond/smart-code-search

Smart Code Search

Search code and docs by meaning, not just strings.

Powered by ColGREP and NextPlaid from LightOn, the engine behind the #1 ranked code retrieval model on MTEB and the #1 retriever on BrowseComp-Plus, a deep-research benchmark built on OpenAI's BrowseComp.

grep finds strings. This finds intent. Ask "payment capture logic" and get results from files that never contain those exact words — because it understands what your code does, not just what it says.

Why This Exists

Every developer has been here: you know what you're looking for but not where it lives. You chain 4 different grep -r attempts, guess filenames, scroll through directory trees. Coding agents are even worse — they grep, miss things, hallucinate file paths, waste tokens exploring blind.

ColGREP fixes this with multi-vector semantic search. It parses your code with Tree-sitter, embeds each function/method/class with token-level vectors, and ranks results by meaning. The model is 17M parameters, runs on CPU, and returns results in under a second.

The Numbers

| Metric | Value |
| --- | --- |
| MTEB Code Leaderboard | #1 (LateOn-Code) |
| BrowseComp-Plus | 87.59% accuracy, beating all models up to 8B params (blog) |
| vs grep in coding agents | 70% win rate head-to-head |
| Model size | 17M params, 54× smaller than competing 8B models |
| Search latency | 200–900ms on CPU |
| API cost | $0. Forever. Runs 100% local |
| Privacy | Code never leaves your machine |

Install

brew install lightonai/tap/colgrep

Verify: colgrep --version

Quick Start

1. Index Your Project

cd /path/to/project
colgrep init

That's it. ColGREP parses every file with Tree-sitter, builds multi-vector embeddings on CPU, and stores the index in .colgrep/. Takes 30–60 seconds for ~1000 files. After this, the index auto-updates on every search — changed files are detected and re-indexed automatically.

2. Search

colgrep "natural language description of what you want"

Results are ranked by semantic relevance score; a higher score means a closer match.

Examples:

colgrep "authentication middleware token validation"
colgrep "database migration rollback strategy"
colgrep "React form validation with error display"
colgrep "webhook retry logic with exponential backoff"

3. Combine Regex + Semantics

Filter files by regex pattern first, then rank semantically:

colgrep -e "async.*await" "error handling patterns"
colgrep -e "def test_" "payment capture edge cases"
colgrep -e "\.tsx$" "patient dashboard layout"

Search Options

colgrep "query"              # Default output: file:lines (score: X.XX)
colgrep "query" --json       # JSON output for piping to other tools
colgrep "query" -n 5         # Top 5 results only
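For scripting around the default output, a small parser may be handy. The sketch below assumes each result line looks like `path:START-END (score: X.XX)`, which is inferred only from the `file:lines (score: X.XX)` description above; the real output may differ, and the names `Hit`, `parse_hit`, and `top_hits` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    path: str
    lines: str
    score: float


# Assumed line shape: "<path>:<start>[-<end>] (score: <number>)".
_HIT_RE = re.compile(
    r"^(?P<path>.+):(?P<lines>\d+(?:-\d+)?)\s+\(score:\s*(?P<score>-?\d+(?:\.\d+)?)\)$"
)


def parse_hit(line: str) -> Optional[Hit]:
    """Parse one output line; return None for lines that do not match."""
    match = _HIT_RE.match(line.strip())
    if match is None:
        return None
    return Hit(match["path"], match["lines"], float(match["score"]))


def top_hits(output: str, min_score: float = 0.0) -> List[Hit]:
    """Return parsed hits at or above min_score, best score first."""
    hits = [hit for hit in map(parse_hit, output.splitlines()) if hit is not None]
    kept = [hit for hit in hits if hit.score >= min_score]
    return sorted(kept, key=lambda hit: hit.score, reverse=True)
```

If a stable machine-readable format is needed, the `--json` flag is likely the safer choice than parsing the human-readable lines.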

When to Use This vs grep

| You know... | Use |
| --- | --- |
| The exact string or function name | grep -r "functionName" |
| The concept but not the words | colgrep "what it does" |
| A pattern + a concept | colgrep -e "pattern" "meaning" |
| Where something is implemented | colgrep "description of behavior" |
| How a feature works across files | colgrep "feature workflow" |

Coding Agent Integration

ColGREP provides built-in integration with popular coding agents. After installing, restart your agent to enable semantic search:

  • Claude Code: colgrep --install-claude-code
  • OpenCode: colgrep --install-opencode
  • Codex: colgrep --install-codex

These commands register ColGREP as a search tool within the agent. The agent will automatically use semantic search when navigating indexed projects.

Multi-Project Setup

Index each project independently. Search from the project directory:

cd ~/code/api && colgrep init
cd ~/code/frontend && colgrep init
cd ~/code/infrastructure && colgrep init
cd ~/docs && colgrep init

# Search each independently
cd ~/code/api && colgrep "payment processing service"
cd ~/code/frontend && colgrep "checkout form validation"

Works great for monorepos, microservices, documentation vaults, and any directory with text/code files.
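Running the same query across several indexed projects can be scripted. The following is a minimal sketch that shells out to `colgrep "query" -n N` from each project directory; the helper names (`colgrep_argv`, `run_in`, `search_projects`) are hypothetical, and it assumes `colgrep` is on your PATH and each directory has already been indexed.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def colgrep_argv(query: str, top_k: int = 5) -> List[str]:
    """Build the argument list for: colgrep "query" -n top_k."""
    return ["colgrep", query, "-n", str(top_k)]


def run_in(project: Path, argv: List[str]) -> str:
    """Run a command inside a project directory and return its stdout."""
    done = subprocess.run(argv, cwd=project, capture_output=True, text=True, check=False)
    return done.stdout


def search_projects(
    projects: Iterable[str],
    query: str,
    top_k: int = 5,
    runner: Callable[[Path, List[str]], str] = run_in,
) -> Dict[str, str]:
    """Search each project independently; map project path to raw output."""
    return {
        str(project): runner(Path(project).expanduser(), colgrep_argv(query, top_k))
        for project in projects
    }
```

A call such as `search_projects(["~/code/api", "~/code/frontend"], "checkout form validation")` would return one raw output string per project. The `runner` parameter exists so the subprocess call can be swapped out in tests.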

How It Works

ColGREP uses ColBERT late-interaction retrieval, a fundamentally different approach from traditional single-vector embeddings:

  1. Tree-sitter parses your code into structured units (functions, methods, classes, signatures)
  2. LateOn-Code-edge (17M params) creates multiple token-level embeddings per code unit — not one lossy summary vector
  3. NextPlaid stores these in a quantized, memory-mapped Rust index
  4. At search time, query tokens interact with document tokens for fine-grained relevance scoring

This is why a 17M model beats 8B models — late interaction preserves token-level semantics that single-vector approaches compress away. Read the full technical story: The Bloated Retriever Era Is Over
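The late-interaction step can be sketched with the MaxSim operator used by ColBERT-style models: each query token is matched to its most similar document token, and the per-token maxima are summed. The code below is an illustration only; it uses plain Python lists and cosine similarity as assumptions, and it ignores the quantization and indexing that NextPlaid performs.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def maxsim(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_vecs:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)
```

Because every query token contributes its own best match, the score grows with the number of query tokens that find a close counterpart in the code unit. A single-vector model would instead compare one pooled query vector against one pooled document vector.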

Interpreting Scores

  • 6.0+ — Near-exact conceptual match. The code does exactly what you described.
  • 5.0–6.0 — Strong semantic match. Highly relevant code.
  • 4.0–5.0 — Good match. Related code worth reviewing.
  • 3.0–4.0 — Weak match. May or may not be relevant.
  • Below 3.0 — Likely noise. Ignore these results.
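When post-processing results in a script, the tiers above can be encoded as a small helper. This is a sketch: the function name and labels are hypothetical, and treating each lower bound as inclusive is an assumption, since the ranges above share their boundary values.

```python
def interpret_score(score: float) -> str:
    """Map a relevance score to the tier names used in the list above."""
    if score >= 6.0:
        return "near-exact match"
    if score >= 5.0:
        return "strong match"
    if score >= 4.0:
        return "good match"
    if score >= 3.0:
        return "weak match"
    return "likely noise"
```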

Troubleshooting

"Index is being updated by another process": another colgrep instance is updating the index. The current search uses the existing index, so the message is safe to ignore.

Re-index from scratch:

rm -rf .colgrep/ && colgrep init

Add to .gitignore:

.colgrep/

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
