ChromaDB Integration Skills
Purpose: This skill teaches agents how to integrate ChromaDB for semantic search, persistent storage, and pattern matching across ANY domain - research, code, trading, legal, documentation, and more.
Critical Use Case: When agents need to work with large datasets (1000+ items), perform semantic search, maintain persistent knowledge, or learn from historical patterns, ChromaDB eliminates token limits and enables powerful vector-based retrieval.
Used By: All agent types - researchers, developers, traders, legal analysts, documentation writers, QA testers, etc.
When to Use ChromaDB Integration
Use ChromaDB when:
-
Large Datasets: Working with 1000+ items (documents, code files, bugs, trades, contracts, etc.)
-
Semantic Search: Finding items by meaning, not just keywords
-
Persistent Memory: Knowledge needs to survive across sessions, days, months
-
Pattern Matching: Identifying similar historical cases/patterns for decision-making
-
Cross-Session Learning: Building institutional knowledge over time
-
Token Limits: Data too large to fit in context window (100K+ tokens)
-
Aggregation: Combining results from multiple queries/sources
Core ChromaDB Concepts
Collections
Definition: Named vector databases storing documents with embeddings and metadata
Naming Strategy:
-
Domain-based: {domain}{purpose}{identifier}
-
Examples:
-
Research: research_prior_art_blockchain_2024 , research_literature_ml_transformers
-
Code: codebase_api_endpoints , codebase_bug_patterns_auth
-
Trading: backtest_results_sma_strategy , market_conditions_spy_2024
-
Legal: case_law_patent_eligibility , contracts_saas_clauses
-
Documentation: api_docs_v2 , architecture_decisions_2024
Documents
Definition: Text content to be searched semantically
Best Practices:
-
Chunk Size: 200-500 words optimal (too small = context loss, too large = poor granularity)
-
Content Format: Title + summary + key details (e.g., "Patent US10123456 - Blockchain Authentication. Abstract: A method for..." ))
-
Deduplication: Use unique IDs to prevent duplicate storage
Metadata
Definition: Structured data for filtering, not semantic search
Strategy:
{ // Temporal filters "date": "2024-11-14", "year": 2024, "month": 11,
// Categorical filters "type": "bug_report", "category": "authentication", "severity": "high",
// Numeric filters "citations": 42, "price": 150.25, "performance_score": 0.87,
// Source tracking "source": "github_issue", "author": "kim-asplund", "url": "https://..." }
Embeddings
Definition: Vector representations enabling semantic similarity
How It Works:
-
ChromaDB automatically generates embeddings from document text
-
Similar meanings → similar vectors → close in vector space
-
Distance metrics (cosine, euclidean) measure similarity
Universal ChromaDB Workflow
Phase 1: Collection Design
// Step 1: Design collection strategy based on agent type const collectionStrategy = { research_agent: "One collection per research topic/question", code_agent: "Collections by codebase module/feature", trading_agent: "Collections by strategy/timeframe/symbol", legal_agent: "Collections by practice area/jurisdiction", documentation_agent: "Collections by project/version" };
// Step 2: Create collection with descriptive metadata mcp__chroma__create_collection({ collection_name: "{domain}{purpose}{identifier}", embedding_function_name: "default", // Uses sentence transformers metadata: { created_date: "2024-11-14", domain: "research|code|trading|legal|docs", purpose: "Descriptive purpose", total_items: 0, // Will update last_updated: "2024-11-14" } });
Phase 2: Data Ingestion
// Step 1: Batch data collection (minimize API calls) const items = collectAllItems(); // From API, files, database, etc.
// Step 2: Transform to ChromaDB format const documents = items.map(item => formatDocument(item)); const ids = items.map(item => item.id || generateUniqueId()); const metadatas = items.map(item => extractMetadata(item));
// Step 3: Batch insert (ChromaDB handles chunking automatically) mcp__chroma__add_documents({ collection_name: collectionName, documents: documents, ids: ids, metadatas: metadatas });
// Step 4: Update collection metadata mcp__chroma__modify_collection({ collection_name: collectionName, new_metadata: { ...existingMetadata, total_items: items.length, last_updated: new Date().toISOString() } });
Phase 3: Semantic Search
// Step 1: Formulate semantic query (natural language works!) const query = "authentication failures in production environment";
// Step 2: Execute semantic search with filters const results = mcp__chroma__query_documents({ collection_name: collectionName, query_texts: [query], n_results: 20, where: { "$and": [ { "environment": "production" }, { "severity": { "$in": ["high", "critical"] } }, { "date": { "$gte": "2024-01-01" } } ] }, include: ["documents", "metadatas", "distances"] });
// Step 3: Filter by semantic similarity (distance threshold) const highlyRelevant = results.ids[0].filter((id, idx) => results.distances[0][idx] < 0.3 // Adjust threshold based on use case );
// Step 4: Retrieve full details if needed const fullDetails = mcp__chroma__get_documents({ collection_name: collectionName, ids: highlyRelevant, include: ["documents", "metadatas"] });
Phase 4: Pattern Matching
// Cross-collection pattern detection const allCollections = mcp__chroma__list_collections(); const relevantCollections = allCollections.filter(c => c.startsWith(collectionPrefix) );
const patterns = []; for (const collection of relevantCollections) { const matches = mcp__chroma__query_documents({ collection_name: collection, query_texts: [patternQuery], n_results: 10, where: { "outcome": "success" } // Only successful cases });
if (matches.ids[0].length > 0) { patterns.push({ collection: collection, matches: matches, success_rate: calculateSuccessRate(matches) }); } }
// Identify best pattern const bestPattern = patterns.sort((a, b) => b.success_rate - a.success_rate )[0];
Use Case Templates
Template 1: Research Agent - Literature Review
Problem: Store 1000+ research papers, find semantically similar work
// Collection: research_literature_{topic} const papers = fetchPapersFromAPI("machine learning transformers");
mcp__chroma__create_collection({ collection_name: "research_literature_ml_transformers", metadata: { topic: "ML Transformers", papers_count: 0 } });
// Store papers with rich metadata
papers.forEach(paper => {
mcp__chroma__add_documents({
collection_name: "research_literature_ml_transformers",
documents: [${paper.title}. ${paper.abstract}],
ids: [paper.doi || paper.id],
metadatas: [{
title: paper.title,
authors: paper.authors.join(", "),
year: paper.year,
citations: paper.citation_count,
venue: paper.venue,
url: paper.url
}]
});
});
// Semantic search: "Find papers about attention mechanisms for vision" const relevant = mcp__chroma__query_documents({ collection_name: "research_literature_ml_transformers", query_texts: ["attention mechanisms computer vision"], n_results: 20, where: { "year": { "$gte": 2020 }, "citations": { "$gte": 50 } } });
Benefits: No token limits, semantic discovery, citation filtering, persistent library
Template 2: Code Agent - Bug Pattern Recognition
Problem: Store bug reports, identify similar issues, suggest solutions
// Collection: codebase_bug_patterns_{module} const bugs = fetchAllGitHubIssues("is:issue label:bug");
mcp__chroma__create_collection({ collection_name: "codebase_bug_patterns_auth", metadata: { module: "authentication", total_bugs: 0 } });
// Store bugs with solutions
bugs.forEach(bug => {
mcp__chroma__add_documents({
collection_name: "codebase_bug_patterns_auth",
documents: [Bug #${bug.number}: ${bug.title}. ${bug.body}],
ids: [bug_${bug.number}],
metadatas: [{
number: bug.number,
title: bug.title,
severity: bug.labels.find(l => l.startsWith("severity:"))?.split(":")[1],
status: bug.state,
solution: bug.resolution || "No solution yet",
created_at: bug.created_at,
resolved_at: bug.closed_at,
url: bug.html_url
}]
});
});
// New bug arrives - find similar historical bugs const newBugDescription = "User login fails with 401 error after password reset";
const similarBugs = mcp__chroma__query_documents({ collection_name: "codebase_bug_patterns_auth", query_texts: [newBugDescription], n_results: 10, where: { "status": "closed", "solution": { "$ne": "No solution yet" } } });
// Extract solution from most similar resolved bug const suggestedSolution = similarBugs.metadatas[0][0].solution;
Benefits: Instant bug pattern matching, solution reuse, similar issue detection
Template 3: Trading Agent - Backtest Results Database
Problem: Store 10,000+ backtest results, identify optimal parameter patterns
// Collection: backtest_results_{strategy_name} const backtests = runParameterSweep(strategyCode, parameterRanges);
mcp__chroma__create_collection({ collection_name: "backtest_results_sma_crossover", metadata: { strategy: "SMA Crossover", total_backtests: 0 } });
// Store each backtest with parameters + results
backtests.forEach(backtest => {
const description = SMA Crossover strategy with fast=${backtest.params.fast_period}, slow=${backtest.params.slow_period}, stop_loss=${backtest.params.stop_loss}. Market conditions: ${backtest.market_regime}, volatility=${backtest.avg_volatility}. ;
mcp__chroma__add_documents({
collection_name: "backtest_results_sma_crossover",
documents: [description],
ids: [backtest_${backtest.id}],
metadatas: [{
fast_period: backtest.params.fast_period,
slow_period: backtest.params.slow_period,
stop_loss: backtest.params.stop_loss,
sharpe_ratio: backtest.sharpe_ratio,
max_drawdown: backtest.max_drawdown,
win_rate: backtest.win_rate,
total_return: backtest.total_return,
market_regime: backtest.market_regime,
symbol: backtest.symbol,
timeframe: backtest.timeframe,
start_date: backtest.start_date,
end_date: backtest.end_date
}]
});
});
// Find optimal parameters for current market conditions
const currentMarket = analyzeCurrentMarket();
const marketDescription = Market regime: ${currentMarket.regime}, volatility: ${currentMarket.volatility}, trend strength: ${currentMarket.trend_strength};
const optimalBacktests = mcp__chroma__query_documents({ collection_name: "backtest_results_sma_crossover", query_texts: [marketDescription], n_results: 20, where: { "$and": [ { "sharpe_ratio": { "$gte": 1.5 } }, { "max_drawdown": { "$lte": -0.15 } }, { "symbol": currentMarket.symbol } ] } });
// Extract best parameter set const bestParams = optimalBacktests.metadatas[0][0];
Benefits: Parameter optimization, market regime matching, performance pattern discovery
Template 4: Documentation Agent - Style Guide Enforcement
Problem: Store API documentation examples, ensure consistent style
// Collection: api_docs_{project_version} const existingDocs = parseAllApiDocs("./docs/api/");
mcp__chroma__create_collection({ collection_name: "api_docs_v2", metadata: { version: "2.0", total_endpoints: 0 } });
// Store documentation with style metadata existingDocs.forEach(doc => { mcp__chroma__add_documents({ collection_name: "api_docs_v2", documents: [doc.fullContent], ids: [doc.endpoint], metadatas: [{ endpoint: doc.endpoint, method: doc.method, category: doc.category, style_score: doc.styleScore, // Computed during ingestion has_examples: doc.examples.length > 0, has_error_codes: doc.errorCodes.length > 0, last_updated: doc.lastModified }] }); });
// New endpoint documented - find similar endpoints for style consistency const newEndpoint = "POST /api/v2/users/{id}/preferences";
const similarEndpoints = mcp__chroma__query_documents({
collection_name: "api_docs_v2",
query_texts: [${newEndpoint} user preferences update],
n_results: 5,
where: {
"$and": [
{ "method": "POST" },
{ "style_score": { "$gte": 0.9 } },
{ "has_examples": true }
]
}
});
// Use similar endpoint as template const template = similarEndpoints.documents[0][0];
Benefits: Style consistency, template discovery, automated quality checks
Template 5: QA Testing Agent - Test Pattern Library
Problem: Store test cases, identify gaps, suggest new tests
// Collection: test_cases_{module} const existingTests = parseTestFiles("./tests/");
mcp__chroma__create_collection({ collection_name: "test_cases_authentication", metadata: { module: "authentication", total_tests: 0 } });
// Store test cases with coverage metadata
existingTests.forEach(test => {
mcp__chroma__add_documents({
collection_name: "test_cases_authentication",
documents: [${test.description}. Covers: ${test.coveredScenarios.join(", ")}],
ids: [test.id],
metadatas: [{
test_type: test.type, // "unit", "integration", "e2e"
file_path: test.filePath,
line_number: test.lineNumber,
last_run: test.lastRun,
status: test.lastStatus,
execution_time_ms: test.executionTime,
assertions: test.assertionCount
}]
});
});
// New feature added - identify missing test coverage const newFeature = "Password reset with 2FA verification";
const existingCoverage = mcp__chroma__query_documents({ collection_name: "test_cases_authentication", query_texts: [newFeature], n_results: 10 });
// If distance > 0.5, probably not covered const isCovered = existingCoverage.distances[0][0] < 0.5;
if (!isCovered) { // Suggest test cases based on similar features const similarFeatures = mcp__chroma__query_documents({ collection_name: "test_cases_authentication", query_texts: ["password reset", "2FA verification"], n_results: 5 });
// Use similar tests as templates const testTemplates = similarFeatures.documents[0]; }
Benefits: Coverage gap detection, test template discovery, pattern-based test generation
Advanced Patterns
Pattern 1: Multi-Collection Aggregation
Use Case: Search across multiple related collections simultaneously
// Example: Search all research topics for a cross-cutting concept const researchCollections = mcp__chroma__list_collections(); const topicCollections = researchCollections.filter(c => c.startsWith("research_literature_") );
const crossTopicResults = []; for (const collection of topicCollections) { const results = mcp__chroma__query_documents({ collection_name: collection, query_texts: ["transfer learning"], n_results: 10 });
crossTopicResults.push({ topic: collection.replace("research_literature_", ""), papers: results }); }
// Aggregate and rank by relevance across topics const allPapers = crossTopicResults.flatMap(r => r.papers.ids[0].map((id, idx) => ({ id: id, topic: r.topic, distance: r.papers.distances[0][idx], metadata: r.papers.metadatas[0][idx] })) );
const rankedPapers = allPapers.sort((a, b) => a.distance - b.distance);
Pattern 2: Hierarchical Collections
Use Case: Parent-child relationship between collections
// Parent: codebase_architecture_decisions // Children: codebase_architecture_decisions_{year}
// Create parent collection with aggregated data mcp__chroma__create_collection({ collection_name: "codebase_architecture_decisions", metadata: { type: "parent", child_collections: [] } });
// Create child collections by year
[2022, 2023, 2024].forEach(year => {
mcp__chroma__create_collection({
collection_name: codebase_architecture_decisions_${year},
metadata: { type: "child", parent: "codebase_architecture_decisions", year }
});
});
// Query strategy: Try child first (faster), fallback to parent
const queryYear = 2024;
let results = mcp__chroma__query_documents({
collection_name: codebase_architecture_decisions_${queryYear},
query_texts: [query],
n_results: 10
});
if (results.ids[0].length < 5) { // Not enough results in child, query parent results = mcp__chroma__query_documents({ collection_name: "codebase_architecture_decisions", query_texts: [query], n_results: 10 }); }
Pattern 3: Temporal Decay
Use Case: Prioritize recent items while keeping historical context
// Store items with temporal metadata mcp__chroma__add_documents({ collection_name: collectionName, documents: documents, ids: ids, metadatas: metadatas.map(m => ({ ...m, timestamp: Date.now(), age_days: 0 // Will be updated })) });
// Query with temporal boost const results = mcp__chroma__query_documents({ collection_name: collectionName, query_texts: [query], n_results: 50 // Get more results for re-ranking });
// Re-rank with temporal decay const now = Date.now(); const rankedResults = results.ids[0].map((id, idx) => { const ageDays = (now - results.metadatas[0][idx].timestamp) / (1000 * 60 * 60 * 24); const decayFactor = Math.exp(-ageDays / 30); // Half-life ~30 days const semanticScore = 1 - results.distances[0][idx]; const combinedScore = semanticScore * 0.7 + decayFactor * 0.3;
return { id: id, semantic_score: semanticScore, decay_factor: decayFactor, combined_score: combinedScore, metadata: results.metadatas[0][idx] }; }).sort((a, b) => b.combined_score - a.combined_score);
Performance Optimization
Batching Strategy
// BAD: One document at a time (slow) for (const item of items) { mcp__chroma__add_documents({ collection_name: collectionName, documents: [item.document], ids: [item.id], metadatas: [item.metadata] }); }
// GOOD: Batch insert (100x faster) const BATCH_SIZE = 100; for (let i = 0; i < items.length; i += BATCH_SIZE) { const batch = items.slice(i, i + BATCH_SIZE); mcp__chroma__add_documents({ collection_name: collectionName, documents: batch.map(item => item.document), ids: batch.map(item => item.id), metadatas: batch.map(item => item.metadata) }); }
Caching Strategy
// Check collection exists before creating const existingCollections = mcp__chroma__list_collections(); if (!existingCollections.includes(collectionName)) { mcp__chroma__create_collection({ collection_name: collectionName }); }
// Check document exists before adding const existing = mcp__chroma__get_documents({ collection_name: collectionName, ids: [documentId] });
if (!existing.ids || existing.ids.length === 0) { // Document doesn't exist, add it mcp__chroma__add_documents({ ... }); } else { // Document exists, update instead mcp__chroma__update_documents({ ... }); }
Query Optimization
// Use metadata filters to reduce search space const results = mcp__chroma__query_documents({ collection_name: collectionName, query_texts: [query], n_results: 20, where: { // Pre-filter with metadata (faster than post-filtering semantic results) "date": { "$gte": "2024-01-01" }, "category": { "$in": ["high_priority", "critical"] } } });
// Only include what you need const minimalResults = mcp__chroma__query_documents({ collection_name: collectionName, query_texts: [query], n_results: 10, include: ["metadatas", "distances"] // Exclude documents if not needed });
Success Criteria
ChromaDB integration is SUCCESSFUL when:
-
✅ Collections Created: Meaningful naming, appropriate metadata
-
✅ Data Ingested: Batched efficiently, deduplicated
-
✅ Semantic Search Works: Returns relevant results (distance < 0.4)
-
✅ Metadata Filters Applied: Correctly scopes search space
-
✅ Performance Optimized: Batching, caching, minimal queries
-
✅ Cross-Collection Queries: When appropriate for use case
-
✅ Persistent Knowledge: Data survives across sessions
-
✅ Pattern Matching: Identifies similar historical cases
-
✅ Token Limits Eliminated: Handles 1000+ items without context overflow
Skill Version: 1.0 Created: 2025-11-14 Purpose: Teach universal ChromaDB integration patterns for all agent types Target Quality: 65/70 Dependencies: ChromaDB MCP (mcp__chroma__*) Universal: Works for research, code, trading, legal, documentation, QA, and all other domains
🔴 Error Handling & Resilience (Priority 1)
Critical for Production: Prevent data loss, handle failures gracefully
Retry with Exponential Backoff
async function retryWithBackoff(operation, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await operation(); } catch (error) { if (attempt === maxRetries - 1) throw error; const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s await sleep(delay); } } }
Get-or-Create Pattern
function getOrCreateCollection(collectionName, metadata = {}) { const collections = mcp__chroma__list_collections();
if (collections.includes(collectionName)) { return { created: false, collection_name: collectionName }; }
mcp__chroma__create_collection({ collection_name: collectionName, embedding_function_name: "default", metadata: metadata });
return { created: true, collection_name: collectionName }; }
Document Validation
function validateDocument(document, id, metadata) {
if (!document || typeof document !== 'string') {
throw new Error(Document must be non-empty string);
}
if (!id || id.includes(' ')) {
throw new Error(ID must be non-empty string without spaces);
}
if (metadata && typeof metadata !== 'object') {
throw new Error(Metadata must be object);
}
}
See full patterns: Load chromadb-error-handling sub-skill
🧪 Testing Patterns (Priority 1)
Essential for Quality: Ensure ChromaDB integrations work correctly
Unit Tests (Mock ChromaDB)
// Mock ChromaDB for fast unit tests class MockChromaDB { constructor() { this.collections = {}; }
create_collection({ collection_name, metadata }) { this.collections[collection_name] = { documents: [], ids: [], metadatas: [], metadata }; }
query_documents({ collection_name, query_texts, n_results }) { const collection = this.collections[collection_name]; return { ids: [collection.ids.slice(0, n_results)], distances: [collection.ids.slice(0, n_results).map(() => 0.2)] }; } }
Integration Tests (Real ChromaDB)
describe('Semantic Search Integration', () => { test('returns relevant documents', async () => { await chromaClient.add({ collection_name: testCollection, documents: [ 'Machine learning uses neural networks', 'Python is a programming language' ], ids: ['doc1', 'doc2'] });
const results = await chromaClient.query({
collection_name: testCollection,
query_texts: ['neural networks deep learning'],
n_results: 2
});
expect(results.ids[0]).toContain('doc1');
expect(results.distances[0][0]).toBeLessThan(0.4);
}); });
See full patterns: Load chromadb-testing-patterns sub-skill
🔐 Security & Privacy (Priority 2)
Critical for Compliance: Protect PII, sanitize data
PII Redaction
function redactPII(text) { let redacted = text;
// Email redaction redacted = redacted.replace( /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Z|a-z]{2,}\b/g, '[EMAIL_REDACTED]' );
// Phone redaction redacted = redacted.replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE_REDACTED]');
// SSN redaction redacted = redacted.replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN_REDACTED]');
return redacted; }
Secret Sanitization
function redactSecrets(text) { let redacted = text;
// GitHub tokens redacted = redacted.replace(/ghp_[a-zA-Z0-9]{36}/g, '[GITHUB_TOKEN]');
// AWS keys redacted = redacted.replace(/AKIA[0-9A-Z]{16}/g, '[AWS_KEY]');
// API keys redacted = redacted.replace( /api[-]?key['"]?\s*[:=]\s*['"]?([a-zA-Z0-9-]{20,})/gi, 'api_key: [REDACTED]' );
return redacted; }
Access Control
const collectionPermissions = { 'research_confidential': ['research_team', 'admin'], 'customer_pii': ['support_team', 'admin'] };
function checkAccess(collectionName, userRole) { const allowedRoles = collectionPermissions[collectionName] || ['admin'];
if (!allowedRoles.includes(userRole)) {
throw new Error(Access denied for role '${userRole}');
}
}
See full patterns: Load chromadb-security-patterns sub-skill
🗄️ Data Lifecycle Management (Priority 2)
Sustain Production: Version collections, archive old data
Schema Versioning
// Semantic versioning for collections
function createVersionedCollection(domain, purpose, version = 'v1') {
const collectionName = ${domain}_${purpose}_${version};
mcp__chroma__create_collection({ collection_name: collectionName, metadata: { version: version, created_at: new Date().toISOString(), schema_version: '1.0', retention_days: 730, // 2 years lifecycle_stage: 'active' } });
return collectionName; }
Schema Migration
async function migrateCollectionSchema(oldCollection, newVersion) {
const newCollection = ${oldCollection}_v${newVersion};
// Create new collection await mcp__chroma__create_collection({ collection_name: newCollection, metadata: { migrated_from: oldCollection, version: newVersion } });
// Copy all documents with transformed metadata const allDocs = await mcp__chroma__get_documents({ collection_name: oldCollection, limit: 100000 });
const transformedMetadatas = allDocs.metadatas.map(transformMetadata);
await mcp__chroma__add_documents({ collection_name: newCollection, documents: allDocs.documents, ids: allDocs.ids, metadatas: transformedMetadatas });
// Mark old as deprecated await mcp__chroma__modify_collection({ collection_name: oldCollection, new_metadata: { lifecycle_stage: 'deprecated', replacement: newCollection } }); }
Retention Enforcement
async function enforceRetentionPolicies() { const collections = await mcp__chroma__list_collections();
for (const collectionName of collections) { const info = await mcp__chroma__get_collection_info({ collection_name }); const retentionDays = info.metadata.retention_days || 730; const ageDays = calculateAgeDays(info.metadata.created_at);
if (ageDays > retentionDays) {
await archiveCollection(collectionName); // Backup first
await mcp__chroma__delete_collection({ collection_name: collectionName });
}
} }
See full patterns: Load chromadb-lifecycle-management sub-skill
🐛 Debugging & Troubleshooting (Priority 1)
Common Issues
Issue: No results returned
-
Cause: Distance threshold too strict, wrong collection
-
Fix: Increase threshold (0.3 → 0.5), verify collection name
-
Debug: Check results.distances[0] values
Issue: Poor semantic matches
-
Cause: Document chunking too large/small
-
Fix: Optimal chunk size 200-500 words
-
Debug: Review document length, split long documents
Issue: Slow queries
-
Cause: Large collection without metadata filters
-
Fix: Add metadata pre-filters (where clause)
-
Debug: Check collection size, add filters
Distance Threshold Guide
Distance Similarity Use Case
< 0.2 Almost exact Duplicate detection
0.2-0.3 Very similar High precision search
0.3-0.5 Moderately similar Balanced search
0.5-0.7 Weakly similar Broad exploration
0.7 Different topics Not relevant
Antipatterns to Avoid
❌ Storing entire files as single document
-
Loses granularity, poor search relevance
-
✅ Fix: Chunk into 200-500 word sections
❌ No metadata filters on large collections
-
Slow queries, high latency
-
✅ Fix: Always filter by date, category, type
❌ Not deduplicating documents
-
Wasted storage, duplicate results
-
✅ Fix: Check existence before adding
❌ Ignoring connection failures
-
Data loss, silent failures
-
✅ Fix: Implement retry logic, fallback
📚 Sub-Skill Reference
Load targeted sub-skills for deep dives:
-
chromadb-error-handling: Retry patterns, validation, circuit breakers (~150 lines)
-
chromadb-testing-patterns: Unit/integration tests, mocking, fixtures (~120 lines)
-
chromadb-security-patterns: PII redaction, access control, GDPR compliance (~90 lines)
-
chromadb-lifecycle-management: Versioning, migration, archival, retention (~100 lines)
Usage: Skill({ skill: "chromadb-error-handling" })
Updated Success Criteria
ChromaDB integration is PRODUCTION-READY when:
Core Functionality (Original):
-
✅ Collections created with meaningful naming
-
✅ Data ingested efficiently (batching)
-
✅ Semantic search returns relevant results
-
✅ Metadata filters applied correctly
-
✅ Performance optimized
Production Readiness (New):
-
✅ Error Handling: Retry logic, validation, graceful degradation
-
✅ Testing: Unit tests (mocked), integration tests (real ChromaDB)
-
✅ Security: PII redacted, secrets sanitized, access control
-
✅ Lifecycle: Versioning strategy, retention policies, archival
Quality Score: 65/70 → 85/100 (with all enhancements)
Skill Version: 2.0 Updated: 2025-11-14 Enhancements: Error handling, testing, security, lifecycle management Quality Score: 85/100 (Production-Ready) Dependencies: ChromaDB MCP (mcp__chroma__*) Sub-Skills: 4 modular sub-skills for targeted loading