Knowledge Base Builder Skill
Overview
This skill creates searchable knowledge bases from document collections using SQLite FTS5 full-text search indexing. It handles PDF extraction, text chunking, metadata cataloging, and search interface creation.
Quick Start
import sqlite3
conn = sqlite3.connect("knowledge.db", timeout=30) cursor = conn.cursor()
Create FTS5 search table
cursor.execute(''' CREATE VIRTUAL TABLE IF NOT EXISTS search_index USING fts5(content, filename) ''')
Add content
cursor.execute('INSERT INTO search_index VALUES (?, ?)', ("Sample document text...", "doc.pdf"))
Search
cursor.execute("SELECT * FROM search_index WHERE search_index MATCH 'sample'") print(cursor.fetchall())
When to Use
-
Building searchable technical standards libraries
-
Creating research paper databases
-
Indexing corporate document repositories
-
Setting up knowledge management systems
-
Converting file-based document collections to queryable databases
Architecture
Document Collection | v +------------------+ | 1. Inventory | Scan files, extract metadata +--------+---------+ v +------------------+ | 2. Extract | PDF -> text, chunk by pages +--------+---------+ v +------------------+ | 3. Index | SQLite FTS5 full-text search +--------+---------+ v +------------------+ | 4. Search CLI | Query interface with filtering +------------------+
Implementation Steps
Step 1: Create Database Schema
import sqlite3
def create_knowledge_base(db_path): conn = sqlite3.connect(db_path, timeout=30) cursor = conn.cursor()
# Documents table
cursor.execute('''
CREATE TABLE IF NOT EXISTS documents (
id INTEGER PRIMARY KEY,
filename TEXT NOT NULL,
filepath TEXT NOT NULL,
category TEXT,
file_size INTEGER,
page_count INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
''')
# Text chunks table
cursor.execute('''
CREATE TABLE IF NOT EXISTS chunks (
id INTEGER PRIMARY KEY,
doc_id INTEGER,
page_num INTEGER,
chunk_text TEXT,
char_count INTEGER,
FOREIGN KEY (doc_id) REFERENCES documents(id)
)
''')
# FTS5 search index
cursor.execute('''
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts
USING fts5(chunk_text, content='chunks', content_rowid='id')
''')
# Triggers for FTS sync
cursor.execute('''
CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(rowid, chunk_text) VALUES (new.id, new.chunk_text);
END
''')
conn.commit()
return conn
Step 2: Document Inventory
from pathlib import Path import os
def build_inventory(root_path, extensions=['.pdf', '.docx', '.txt']): """Scan directory and catalog all documents.""" documents = []
for filepath in Path(root_path).rglob('*'):
if filepath.suffix.lower() in extensions:
stat = filepath.stat()
documents.append({
'filename': filepath.name,
'filepath': str(filepath),
'category': categorize_document(filepath),
'file_size': stat.st_size,
})
return documents
def categorize_document(filepath): """Auto-categorize based on path or filename patterns.""" name = filepath.name.upper()
patterns = {
'API': 'API',
'ISO': 'ISO',
'DNV': 'DNV',
'ASME': 'ASME',
'NORSOK': 'NORSOK',
}
for pattern, category in patterns.items():
if pattern in name:
return category
return 'Unknown'
Step 3: PDF Text Extraction
import fitz # PyMuPDF
def extract_pdf_text(filepath, chunk_size=2000): """Extract text from PDF, chunked by approximate size.""" doc = fitz.open(filepath) chunks = []
for page_num, page in enumerate(doc, 1):
text = page.get_text()
if text.strip():
# Split into manageable chunks
for i in range(0, len(text), chunk_size):
chunk = text[i:i + chunk_size]
chunks.append({
'page_num': page_num,
'text': chunk,
'char_count': len(chunk)
})
doc.close()
return chunks
Step 4: Search Interface
def search_knowledge_base(db_path, query, limit=20): """Full-text search with ranking.""" conn = sqlite3.connect(db_path, timeout=30) cursor = conn.cursor()
cursor.execute('''
SELECT
d.filename,
d.category,
c.page_num,
snippet(chunks_fts, 0, '>>> ', ' <<<', '...', 32) as snippet,
bm25(chunks_fts) as score
FROM chunks_fts
JOIN chunks c ON chunks_fts.rowid = c.id
JOIN documents d ON c.doc_id = d.id
WHERE chunks_fts MATCH ?
ORDER BY score
LIMIT ?
''', (query, limit))
return cursor.fetchall()
CLI Template
#!/bin/bash
kb - Knowledge Base Search CLI
DB_PATH="${KB_DATABASE:-./knowledge.db}"
search() { sqlite3 "$DB_PATH" " SELECT d.filename, c.page_num, snippet(chunks_fts, 0, '>>>', '<<<', '...', 20) FROM chunks_fts JOIN chunks c ON chunks_fts.rowid = c.id JOIN documents d ON c.doc_id = d.id WHERE chunks_fts MATCH '$1' LIMIT 20 " }
case "$1" in search) search "$2" ;; status) sqlite3 "$DB_PATH" "SELECT COUNT(*) FROM documents" ;; *) echo "Usage: kb {search|status} [query]" ;; esac
Execution Checklist
-
Scan and inventory target document collection
-
Create SQLite database with FTS5 support
-
Extract text from all documents
-
Chunk text appropriately (1000-2000 chars)
-
Build FTS5 search index
-
Test search with sample queries
-
Validate search results accuracy
-
Create CLI or API interface
Error Handling
Common Errors
Error: sqlite3.OperationalError (database is locked)
-
Cause: Concurrent access without proper timeout
-
Solution: Use timeout=30 when connecting
Error: FTS5 not available
-
Cause: SQLite compiled without FTS5 support
-
Solution: Upgrade SQLite or use FTS4 fallback
Error: Empty search results
-
Cause: FTS index not synced with data
-
Solution: Rebuild FTS index with INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')
Error: Memory issues with large collections
-
Cause: Loading all chunks at once
-
Solution: Process in batches, commit every 500 documents
Metrics
Metric Typical Value
Indexing speed ~1000 documents/minute
Search latency <50ms for 100K chunks
Storage overhead ~10-20% over raw text
FTS5 index size ~30% of text size
Memory usage ~100MB per 50K documents
Best Practices
-
Use SQLite timeout - Add timeout=30 for concurrent access
-
Chunk appropriately - 1000-2000 chars optimal for search
-
Index progressively - Process in batches for large collections
-
Background processing - Use service scripts for long extractions
-
Category detection - Auto-categorize from filename/path patterns
Example Usage
Build knowledge base
python inventory.py /path/to/documents python extract.py --db knowledge.db python index.py --db knowledge.db
Search
./kb search "fatigue analysis" ./kb search "API AND riser"
Related Skills
-
semantic-search-setup
-
Add vector embeddings for AI search
-
rag-system-builder
-
Build AI Q&A on top of knowledge base
-
pdf-text-extractor
-
Detailed PDF extraction options
Dependencies
pip install PyMuPDF
System tools:
- SQLite 3.9+ (for FTS5 support)
Version History
-
1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
-
1.0.0 (2024-10-15): Initial release with SQLite FTS5 full-text search, PDF extraction, CLI