Semtools: Semantic Search

Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.

Purpose

The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Traditional text search (ripgrep) matches literal strings and regular expressions, and structural search (ast-grep) matches syntax patterns; semtools instead compares embeddings, so it can match text by meaning even when no keywords overlap.

Key capabilities:

Semantic Search: Find code/text by meaning, not just keywords
Workspace Management: Index large codebases for fast repeated searches
Document Parsing: Convert PDFs, DOCX, PPTX to searchable text (requires API key)

Semtools excels at discovery - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.

When to Use This Skill

Use the semtools skill when you need meaning-based search:

Semantic Code Discovery:

Finding code that implements a concept ("error handling", "data validation")
Discovering similar functionality across different modules
Locating examples of a pattern when you don't know exact names
Understanding what code does without reading everything

Documentation & Knowledge:

Searching documentation by concept, not keywords
Finding related discussions in comments or docs
Discovering similar issues or solutions
Analyzing technical documents (PDFs, reports)

Use Cases:

"Find all authentication-related code" (without knowing function names)
"Show me error handling patterns" (regardless of specific error types)
"Find code similar to this implementation" (semantic similarity)
"Search research papers for 'distributed consensus'" (document search)

Choose semtools over file-search (ripgrep/ast-grep) when:

You know the concept but not the keywords
Exact string matching misses relevant results
You want semantically similar code, not exact matches
Searching across languages or mixed content

Still use file-search when:

You know exact keywords, function names, or patterns
You need structural code matching (ast-grep)
Speed is critical (ripgrep is faster for exact matches)
You're searching for specific symbols or references

Available Commands

Semtools provides three CLI commands you can use via execute_command:

search: Semantic search across code and text files
workspace: Manage workspaces for caching embeddings
parse: Convert documents (PDF, DOCX, PPTX) to searchable text

The search and workspace commands work out of the box in your execution environment. The parse command additionally requires the LLAMA_CLOUD_API_KEY environment variable to be set.

Core Operations

Semantic Search (search)

Find files and code sections by semantic meaning:

Basic semantic search

search "authentication logic" src/

Search with more context (5 lines before/after)

search "error handling" --n-lines 5 src/

Get more results (default: 3)

search "database queries" --top-k 10 src/

Control the distance threshold (0.0-1.0, lower = stricter)

search "API endpoints" --max-distance 0.4 src/

Parameters:

--n-lines N : Show N lines of context around matches (default: 3)
--top-k K : Return top K most similar matches (default: 3)
--max-distance D : Maximum embedding distance (0.0-1.0, default: 0.3)
-i : Case-insensitive matching
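
When the command is issued from a script rather than typed by hand, the flags above can be assembled into an argument list first. The following is a minimal Python sketch, not part of Semtools: build_search_command is a hypothetical helper, and it emits only the flags listed in this section. Options left unset are omitted so the tool's own defaults apply.

```python
import shlex
from typing import List, Optional, Sequence


def build_search_command(
    query: str,
    paths: Sequence[str],
    n_lines: Optional[int] = None,
    top_k: Optional[int] = None,
    max_distance: Optional[float] = None,
    ignore_case: bool = False,
) -> List[str]:
    """Build an argv list for `search` (hypothetical helper, not a Semtools API)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if max_distance is not None and not 0.0 <= max_distance <= 1.0:
        raise ValueError("max_distance must be between 0.0 and 1.0")

    argv = ["search", query]
    if n_lines is not None:
        argv += ["--n-lines", str(n_lines)]
    if top_k is not None:
        argv += ["--top-k", str(top_k)]
    if max_distance is not None:
        argv += ["--max-distance", str(max_distance)]
    if ignore_case:
        argv.append("-i")
    argv += list(paths)
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line that could be passed to execute_command.
    print(shlex.join(build_search_command("error handling", ["src/"], n_lines=5)))
```

Building a list instead of a string keeps multi-word queries intact as a single argument, which avoids quoting mistakes.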

Output format:

Match 1 (similarity: 0.12) File: src/auth/handlers.py Lines: 42-47

def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None

Match 2 (similarity: 0.18) File: src/middleware/auth.py ...
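
If the results need to be post-processed, the match headers can be turned into structured records. The sketch below assumes the header layout shown in the sample above (Match N, a score in parentheses, File, Lines); the real output may differ between versions, so treat the regular expression as an assumption to verify. MatchHeader and parse_match_headers are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class MatchHeader:
    rank: int
    score: float
    path: str
    start_line: int
    end_line: int


# Assumed header layout, taken from the sample output in this section.
# Fields may be separated by spaces or by newlines.
HEADER_RE = re.compile(
    r"Match\s+(\d+)\s+\(similarity:\s*([0-9.]+)\)\s+"
    r"File:\s*(\S+)\s+Lines:\s*(\d+)-(\d+)"
)


def parse_match_headers(output: str) -> List[MatchHeader]:
    """Extract every complete match header found in the search output."""
    return [
        MatchHeader(int(rank), float(score), path, int(start), int(end))
        for rank, score, path, start, end in HEADER_RE.findall(output)
    ]
```

Headers that are truncated, such as one without a Lines field, are skipped rather than guessed at.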

Workspace Management (workspace)

For large codebases, create workspaces to cache embeddings and enable fast repeated searches:

Create/activate workspace

workspace use my-project

Set workspace via environment variable

export SEMTOOLS_WORKSPACE=my-project

Index files in workspace (workspace auto-detected from env var)

search "query" src/

Check workspace status

workspace status

Clean up old workspaces

workspace prune

Benefits:

Fast repeated searches: Embeddings cached, no re-computation
Large codebases: IVF_PQ indexing for scalability
Session persistence: Maintain context across multiple searches

When to use workspaces:

Searching the same codebase multiple times
Very large projects (1000+ files)
Interactive exploration sessions
CI/CD pipelines with repeated searches
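
To see why cached embeddings make repeated searches cheaper, consider a toy cache keyed by a hash of the text. This Python sketch illustrates the general idea only; it is not how Semtools stores its workspace, and EmbeddingCache and toy_embed are hypothetical stand-ins for a real index and embedding model.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Toy in-memory cache keyed by a SHA-256 hash of the embedded text."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._store: Dict[str, Vector] = {}
        self.misses = 0  # number of times the embedding had to be computed

    def get(self, text: str) -> Vector:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: counts of the letters a-z."""
    lowered = text.lower()
    return [float(lowered.count(chr(ord("a") + i))) for i in range(26)]


if __name__ == "__main__":
    cache = EmbeddingCache(toy_embed)
    cache.get("def retry(): pass")
    cache.get("def retry(): pass")  # served from the cache
    print("embeddings computed:", cache.misses)
```

Because the key depends on content, unchanged text is embedded once and edited text is embedded again.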

Document Parsing (parse) ⚠️ Requires API Key

Convert documents to searchable markdown (requires LlamaParse API key):

Parse PDFs to markdown

parse research_papers/*.pdf

Parse Word documents

parse reports/*.docx

Parse presentations

parse slides/*.pptx

Parse and pipe to search

parse docs/*.pdf | xargs search "neural networks"

Supported formats:

PDF (.pdf)
Word (.docx)
PowerPoint (.pptx)

Configuration:

Via environment variable

export LLAMA_CLOUD_API_KEY="llx-..."

Via config file

cat > ~/.parse_config.json << EOF
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF

Important: Document parsing is optional. Semantic search works without it.
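
The config file can also be generated from a script, which avoids hand-editing JSON. This is a minimal Python sketch that assumes only the three keys shown in the config example above; write_parse_config is a hypothetical helper, and the API key is passed in by the caller rather than hard-coded.

```python
import json
from pathlib import Path


def write_parse_config(
    path: Path,
    api_key: str,
    max_concurrent_requests: int = 10,
    timeout_seconds: int = 3600,
) -> dict:
    """Write a parse config with the keys shown in this section and return it."""
    config = {
        "api_key": api_key,
        "max_concurrent_requests": max_concurrent_requests,
        "timeout_seconds": timeout_seconds,
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A call such as write_parse_config(Path.home() / ".parse_config.json", api_key) would target the location used in the example above.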

Workflow Patterns

Pattern 1: Concept Discovery

When you know what you're looking for conceptually but not by name:

Step 1: Broad semantic search

search "rate limiting implementation" src/

Step 2: Review results, refine query

search "throttle requests per user" src/ --top-k 10

Step 3: Use ripgrep for exact follow-up

rg "RateLimiter" --type py src/

Pattern 2: Similar Code Finder

When you want to find code similar to a reference implementation:

Step 1: Extract key concepts from reference code

[Read example_auth.py and identify key concepts]

Step 2: Search for similar implementations

search "user authentication with JWT tokens" src/

Step 3: Compare implementations

[Review semantic matches to find similar approaches]

Pattern 3: Documentation Search

When researching concepts in documentation or comments:

Search code comments semantically

search "thread safety guarantees" src/ --n-lines 10

Search markdown documentation

search "deployment best practices" docs/

Combined search

search "performance optimization" src/ docs/ --top-k 20

Pattern 4: Cross-Language Search

When searching for concepts across different languages:

Semantic search works across languages

search "connection pooling" src/

May find:

- Java: "ConnectionPool manager"

- Python: "database connection reuse"

- Go: "pool of persistent connections"

All semantically related despite different terminology

Pattern 5: Document Analysis (with API key)

When analyzing PDFs or documents:

Step 1: Parse documents to markdown

parse research/*.pdf | xargs cat > papers.md

Step 2: Search converted content

search "transformer architecture" papers.md

Step 3: Combine with code search

search "attention mechanism implementation" src/

Integration with file-search

Semtools and file-search (ripgrep/ast-grep) are complementary tools. Use them together for comprehensive search:

Search Strategy Matrix

| You Know         | Use First | Then Use           | Why                                       |
|------------------|-----------|--------------------|-------------------------------------------|
| Exact keywords   | ripgrep   | search             | Fast exact match, then find similar       |
| Concept only     | search    | ripgrep            | Find relevant code, then search specifics |
| Function name    | ripgrep   | search             | Find definition, then find similar usage  |
| Code pattern     | ast-grep  | search             | Find structure, then find similar logic   |
| Approximate idea | search    | ripgrep + ast-grep | Discover, then drill down                 |

Layered Search Approach

Layer 1: Semantic discovery (what's related?)

search "user session management" --top-k 10

Layer 2: Exact text search (what's the implementation?)

rg "SessionManager|session_store" --type py

Layer 3: Structural search (how is it used?)

sg --pattern 'session.$METHOD($$$)' --lang python

Layer 4: Reference tracking (where is it called?)

[Use serena skill for symbol-level tracking]

Best Practices

Start Broad, Then Narrow

Use semantic search for discovery, then narrow with exact search:

GOOD: Broad semantic discovery first

search "authentication" src/ --top-k 10

[Review results to learn terminology]

rg "authenticate|verify_credentials" --type py src/

AVOID: Starting too narrow and missing variations

rg "authenticate" --type py # Misses "verify_credentials", "check_auth", etc.

Adjust Similarity Threshold

Tune --max-distance based on results:

Too many irrelevant results? Decrease distance (more strict)

search "query" --max-distance 0.2

Missing relevant results? Increase distance (more lenient)

search "query" --max-distance 0.5

Default (0.3) works well for most cases

search "query"
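
The effect of the threshold and of --top-k can be pictured as a filter followed by a sort and a cut. The Python sketch below assumes that matches at or below the threshold are kept and that the closest matches come first; the file names and distances are made up for illustration, and select_matches is a hypothetical helper.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (label, distance to the query)


def select_matches(
    candidates: List[Candidate],
    max_distance: float = 0.3,
    top_k: int = 3,
) -> List[Candidate]:
    """Keep candidates within max_distance, closest first, at most top_k."""
    kept = [c for c in candidates if c[1] <= max_distance]
    kept.sort(key=lambda c: c[1])
    return kept[:top_k]


if __name__ == "__main__":
    # Illustrative distances only, not real Semtools output.
    candidates = [
        ("auth.py", 0.12),
        ("session.py", 0.27),
        ("utils.py", 0.38),
        ("README.md", 0.55),
    ]
    for threshold in (0.2, 0.3, 0.5):
        print(threshold, select_matches(candidates, max_distance=threshold, top_k=10))
```

Lowering the threshold removes the weaker candidates first, while --top-k only limits how many of the remaining ones are shown.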

Use Workspaces for Repeated Searches

For interactive exploration, always use workspaces:

GOOD: Create workspace once, search many times

export SEMTOOLS_WORKSPACE=my-analysis
search "concept1" src/
search "concept2" src/
search "concept3" src/

INEFFICIENT: Re-compute embeddings every time

search "concept1" src/
search "concept2" src/

Combine with Context Tools

Get more context around semantic matches:

Find semantically similar code

search "retry logic" src/ --n-lines 2

Get more context with ripgrep

rg -C 10 "retry" src/specific_file.py

Or read the full file

cat src/specific_file.py

Phrase Queries Conceptually

Write queries as concepts, not exact keywords:

GOOD: Conceptual queries

search "handling network timeouts"
search "user input validation"
search "concurrent data access"

LESS EFFECTIVE: Exact keyword queries (use ripgrep instead)

search "timeout"   # Use: rg "timeout"
search "validate"  # Use: rg "validate"

Understanding Semantic Distance

Semtools uses embedding vectors to measure semantic similarity:

Distance 0.0: Identical meaning
Distance 0.1-0.2: Very similar (synonyms, paraphrases)
Distance 0.2-0.3: Related concepts (default threshold)
Distance 0.3-0.4: Loosely related
Distance 0.5+: Weakly related or unrelated
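
The bands above can be made concrete with a small calculation. This Python sketch assumes the distance is cosine distance (1 minus cosine similarity), which the text above does not state, and it treats the unlisted 0.4-0.5 range as loosely related. cosine_distance and describe_distance are hypothetical helpers.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def describe_distance(distance: float) -> str:
    """Map a distance onto the bands listed in this section."""
    if distance <= 0.2:
        return "very similar"
    if distance <= 0.3:
        return "related"
    if distance < 0.5:  # assumption: the 0.4-0.5 gap counts as loosely related
        return "loosely related"
    return "weakly related or unrelated"


if __name__ == "__main__":
    d = cosine_distance([1.0, 1.0], [1.0, 0.0])
    print(round(d, 3), describe_distance(d))
```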

Practical guidelines:

Strict matching (only close matches)

--max-distance 0.2

Balanced matching (default, recommended)

--max-distance 0.3

Lenient matching (exploratory search)

--max-distance 0.4

Very lenient (may include false positives)

--max-distance 0.5

Local vs. Cloud Embeddings

Semantic Search (Local):

Uses local embeddings (model2vec, potion-multilingual-128M)
No API calls or cloud dependencies
Fast, private, no cost
Works offline

Document Parsing (Cloud):

Uses LlamaParse API (cloud-based)
Requires API key and internet connection
Processes PDFs, DOCX, PPTX
Usage-based pricing (check LlamaIndex pricing)

Privacy consideration: Semantic search is 100% local. Only document parsing sends data to LlamaParse API.

Performance Considerations

Speed Characteristics

Without workspace:

First search: ~2-5 seconds (embedding computation)
Subsequent searches: ~2-5 seconds each (re-compute embeddings)

With workspace (cached embeddings):

First search: ~2-5 seconds (builds index)
Subsequent searches: ~0.1-0.5 seconds (cached)
Large codebases: IVF_PQ indexing for scalability

Comparison:

ripgrep: 0.01-0.1 seconds (fastest, exact match)
ast-grep: 0.1-0.5 seconds (fast, structural)
semtools (cached): 0.1-0.5 seconds (fast, semantic)
semtools (uncached): 2-5 seconds (slower, semantic)
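
A quick calculation shows how the workspace cost amortizes over a session. The Python sketch below uses the midpoints of the ranges above (3.5 seconds for an uncached search, 0.3 seconds for a cached one) purely as illustrative inputs; total_search_time is a hypothetical helper, and real timings depend on the codebase and hardware.

```python
def total_search_time(n_searches: int, first_s: float, repeat_s: float) -> float:
    """Total seconds for n searches: one first search plus n-1 repeats."""
    if n_searches <= 0:
        return 0.0
    return first_s + (n_searches - 1) * repeat_s


if __name__ == "__main__":
    n = 10
    # Without a workspace every search pays the full embedding cost.
    uncached = total_search_time(n, first_s=3.5, repeat_s=3.5)
    # With a workspace only the first search builds the index.
    cached = total_search_time(n, first_s=3.5, repeat_s=0.3)
    print(f"uncached: {uncached:.1f}s, cached: {cached:.1f}s")
```

With those midpoints, ten searches come to roughly 35 seconds without a workspace and roughly 6.2 seconds with one, so the workspace pays off from the second search onward.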

Optimization Tips

1. Use workspaces for repeated searches

export SEMTOOLS_WORKSPACE=my-project

2. Limit search scope to relevant directories

search "query" src/specific_module/

3. Use --top-k to control result count

search "query" --top-k 5

4. Pipe to head for quick preview

search "query" | head -50

Unix Pipeline Integration

Semtools is designed for Unix-style composition:

Find and parse PDFs, then search

find docs/ -name "*.pdf" | xargs parse | xargs search "topic"

Search and filter with grep

search "authentication" src/ | grep -i "jwt"

Count matches

search "error handling" src/ | grep "Match" | wc -l

Combine with other tools

search "API" src/ | grep -o 'File: [^ ]*' | cut -d' ' -f2 | sort -u | xargs rg -l "REST"

Limitations

When NOT to Use Semtools

Exact keyword search: Use ripgrep for known keywords

WRONG TOOL: Semantic search for exact function name

search "authenticate_user"

RIGHT TOOL: Use ripgrep for exact matches

rg "authenticate_user" --type py

Structural code patterns: Use ast-grep for syntax matching

WRONG TOOL: Semantic search for code structure

search "class with constructor"

RIGHT TOOL: Use ast-grep for structure

sg --pattern 'class $NAME { constructor($$$) { $$$ } }'

Symbol references: Use serena for LSP-based tracking

WRONG TOOL: Semantic search for all usages

search "MyClass usage"

RIGHT TOOL: Use serena for precise references

serena find_referencing_symbols --name 'MyClass'

Small codebases: Overhead not worth it for <100 files

ripgrep is faster and simpler for small projects

Known Edge Cases

Ambiguous queries: Vague concepts return broad results
Technical jargon: Domain-specific terms may have lower accuracy
Short code snippets: Limited context reduces embedding quality
Mixed languages: The model is multilingual, but embeddings are tuned for English, so non-English text may match less accurately
Generated code: Repetitive patterns may cluster together

Troubleshooting

No Semantic Matches Found

If semantic search returns zero results:

Verify files exist: Use ripgrep to confirm content

rg "concept" src/

Increase the distance threshold: Be more lenient

search "query" --max-distance 0.5

Rephrase query: Try different terminology

search "user authentication"
search "verify user credentials"
search "login validation"

Check file types: Ensure searching correct extensions

search "query" src/*.py # Target specific types

Too Many Irrelevant Results

If semantic search returns too much noise:

Decrease the distance threshold: Be more strict

search "query" --max-distance 0.2

Limit result count: Review top matches only

search "query" --top-k 3

Narrow directory scope: Search specific paths

search "query" src/specific_module/

Refine query: Add more specific concepts

Vague

search "data"

Specific

search "data validation with regex patterns"

Document Parsing Fails

If parse fails:

Verify API key is set:

echo $LLAMA_CLOUD_API_KEY

Check file format: Ensure supported format (PDF, DOCX, PPTX)

file document.pdf # Verify file type

Check file size: Large files may timeout

du -h document.pdf # Check size

Review parse config: Adjust timeouts if needed

cat ~/.parse_config.json

Workspace Issues

If workspace commands fail:

Check workspace status

workspace status

Prune corrupted workspaces

workspace prune

Recreate workspace

rm -rf ~/.semtools/workspaces/my-workspace
export SEMTOOLS_WORKSPACE=my-workspace

Resources

Semtools GitHub
LlamaParse Documentation
model2vec Embeddings
Semantic Search Concepts
