Knowledge Base Builder Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this command and send it to your AI assistant to install the skill:

npx skills add vamseeachanta/workspace-hub/vamseeachanta-workspace-hub-knowledge-base-builder

Overview

This skill creates searchable knowledge bases from document collections using SQLite FTS5 full-text search indexing. It handles PDF extraction, text chunking, metadata cataloging, and search interface creation.

Quick Start

import sqlite3

conn = sqlite3.connect("knowledge.db", timeout=30)
cursor = conn.cursor()

# Create FTS5 search table
cursor.execute('''
    CREATE VIRTUAL TABLE IF NOT EXISTS search_index
    USING fts5(content, filename)
''')

# Add content
cursor.execute('INSERT INTO search_index VALUES (?, ?)',
               ("Sample document text...", "doc.pdf"))

# Search
cursor.execute("SELECT * FROM search_index WHERE search_index MATCH 'sample'")
print(cursor.fetchall())

When to Use

  • Building searchable technical standards libraries

  • Creating research paper databases

  • Indexing corporate document repositories

  • Setting up knowledge management systems

  • Converting file-based document collections to queryable databases

Architecture

Document Collection
        |
        v
+------------------+
| 1. Inventory     |  Scan files, extract metadata
+--------+---------+
         |
         v
+------------------+
| 2. Extract       |  PDF -> text, chunk by pages
+--------+---------+
         |
         v
+------------------+
| 3. Index         |  SQLite FTS5 full-text search
+--------+---------+
         |
         v
+------------------+
| 4. Search CLI    |  Query interface with filtering
+------------------+
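The four stages can be exercised end to end in miniature. The following is a self-contained sketch, not part of the skill's scripts: the names `build_mini_kb` and `kb_fts` are illustrative, and the PDF-extraction stage is replaced by pre-extracted (filename, page_num, text) records.

```python
import sqlite3

def build_mini_kb(records):
    """Miniature of stages 1-3: catalog documents, then index their text.
    `records` yields (filename, page_num, text) tuples, standing in for
    the PDF-extraction stage."""
    conn = sqlite3.connect(":memory:", timeout=30)
    cur = conn.cursor()
    cur.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, filename TEXT)")
    # UNINDEXED columns are stored with each row but not tokenized
    cur.execute("CREATE VIRTUAL TABLE kb_fts USING fts5("
                "chunk_text, doc_id UNINDEXED, page_num UNINDEXED)")
    for filename, page_num, text in records:
        cur.execute("INSERT INTO docs (filename) VALUES (?)", (filename,))
        cur.execute("INSERT INTO kb_fts (chunk_text, doc_id, page_num) "
                    "VALUES (?, ?, ?)", (text, cur.lastrowid, page_num))
    conn.commit()
    return conn

kb = build_mini_kb([
    ("api-rp-2rd.pdf", 3, "Riser fatigue analysis methodology"),
    ("iso-13628.pdf", 1, "Subsea production system design"),
])
hits = kb.execute("SELECT doc_id, page_num FROM kb_fts "
                  "WHERE kb_fts MATCH 'fatigue'").fetchall()
print(hits)  # -> [(1, 3)]
```

The full schema in the steps below splits this into separate documents, chunks, and FTS tables.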

Implementation Steps

Step 1: Create Database Schema

import sqlite3

def create_knowledge_base(db_path):
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Documents table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            filename TEXT NOT NULL,
            filepath TEXT NOT NULL,
            category TEXT,
            file_size INTEGER,
            page_count INTEGER,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    # Text chunks table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            doc_id INTEGER,
            page_num INTEGER,
            chunk_text TEXT,
            char_count INTEGER,
            FOREIGN KEY (doc_id) REFERENCES documents(id)
        )
    ''')

    # FTS5 search index (external content, backed by the chunks table)
    cursor.execute('''
        CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts
        USING fts5(chunk_text, content='chunks', content_rowid='id')
    ''')

    # Trigger to keep the FTS index in sync on insert
    cursor.execute('''
        CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN
            INSERT INTO chunks_fts(rowid, chunk_text) VALUES (new.id, new.chunk_text);
        END
    ''')

    conn.commit()
    return conn
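The trigger above only syncs inserts. If chunks are ever updated or deleted (for example when re-indexing a revised document), external-content FTS5 tables also need delete and update triggers that use FTS5's special 'delete' command. A self-contained sketch against the same table names:

```python
import sqlite3

# Minimal copy of the chunks/chunks_fts pair, plus delete/update sync triggers.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, chunk_text TEXT)")
cur.execute("CREATE VIRTUAL TABLE chunks_fts "
            "USING fts5(chunk_text, content='chunks', content_rowid='id')")
cur.executescript('''
    CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
        INSERT INTO chunks_fts(rowid, chunk_text) VALUES (new.id, new.chunk_text);
    END;
    CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
        -- FTS5 'delete' command removes the old row from the index
        INSERT INTO chunks_fts(chunks_fts, rowid, chunk_text)
        VALUES ('delete', old.id, old.chunk_text);
    END;
    CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
        INSERT INTO chunks_fts(chunks_fts, rowid, chunk_text)
        VALUES ('delete', old.id, old.chunk_text);
        INSERT INTO chunks_fts(rowid, chunk_text) VALUES (new.id, new.chunk_text);
    END;
''')
cur.execute("INSERT INTO chunks (chunk_text) VALUES ('riser fatigue')")
cur.execute("UPDATE chunks SET chunk_text = 'pipeline design' WHERE id = 1")
conn.commit()
print(conn.execute("SELECT rowid FROM chunks_fts "
                   "WHERE chunks_fts MATCH 'pipeline'").fetchall())  # -> [(1,)]
```

Without the update trigger, the old text would still match and the new text would not.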

Step 2: Document Inventory

from pathlib import Path

def build_inventory(root_path, extensions=('.pdf', '.docx', '.txt')):
    """Scan directory and catalog all documents."""
    documents = []

    for filepath in Path(root_path).rglob('*'):
        if filepath.suffix.lower() in extensions:
            stat = filepath.stat()
            documents.append({
                'filename': filepath.name,
                'filepath': str(filepath),
                'category': categorize_document(filepath),
                'file_size': stat.st_size,
            })

    return documents

def categorize_document(filepath):
    """Auto-categorize based on path or filename patterns."""
    name = filepath.name.upper()

    patterns = {
        'API': 'API',
        'ISO': 'ISO',
        'DNV': 'DNV',
        'ASME': 'ASME',
        'NORSOK': 'NORSOK',
    }

    for pattern, category in patterns.items():
        if pattern in name:
            return category

    return 'Unknown'
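A quick self-contained check of the inventory scan against a throwaway directory. `scan_inventory` below is a condensed variant of `build_inventory` above (category detection omitted so the example runs on its own):

```python
import tempfile
from pathlib import Path

def scan_inventory(root_path, extensions=('.pdf', '.txt')):
    """Condensed build_inventory: recursively collect matching files."""
    docs = []
    for fp in Path(root_path).rglob('*'):
        if fp.suffix.lower() in extensions:
            docs.append({'filename': fp.name,
                         'filepath': str(fp),
                         'file_size': fp.stat().st_size})
    return docs

with tempfile.TemporaryDirectory() as root:
    (Path(root) / 'API-spec.PDF').write_text('x')   # matched (case-insensitive)
    (Path(root) / 'notes.md').write_text('y')       # skipped: not in extensions
    names = sorted(d['filename'] for d in scan_inventory(root))
print(names)  # -> ['API-spec.PDF']
```

Note that matching on `filepath.suffix.lower()` makes the extension filter case-insensitive, so `.PDF` files are still cataloged.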

Step 3: PDF Text Extraction

import fitz  # PyMuPDF

def extract_pdf_text(filepath, chunk_size=2000):
    """Extract text from PDF, chunked by approximate size."""
    doc = fitz.open(filepath)
    chunks = []

    for page_num, page in enumerate(doc, 1):
        text = page.get_text()
        if text.strip():
            # Split into manageable chunks
            for i in range(0, len(text), chunk_size):
                chunk = text[i:i + chunk_size]
                chunks.append({
                    'page_num': page_num,
                    'text': chunk,
                    'char_count': len(chunk)
                })

    doc.close()
    return chunks

Step 4: Search Interface

def search_knowledge_base(db_path, query, limit=20):
    """Full-text search with ranking."""
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    cursor.execute('''
        SELECT
            d.filename,
            d.category,
            c.page_num,
            snippet(chunks_fts, 0, '>>> ', ' <<<', '...', 32) AS snippet,
            bm25(chunks_fts) AS score
        FROM chunks_fts
        JOIN chunks c ON chunks_fts.rowid = c.id
        JOIN documents d ON c.doc_id = d.id
        WHERE chunks_fts MATCH ?
        ORDER BY score
        LIMIT ?
    ''', (query, limit))

    results = cursor.fetchall()
    conn.close()
    return results
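To sanity-check the query, here is a self-contained run against an in-memory database with one indexed chunk (the filename, category, and text are made up). Note that lower bm25 scores mean better matches, so `ORDER BY score` ascending is correct:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE documents "
            "(id INTEGER PRIMARY KEY, filename TEXT, category TEXT)")
cur.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, doc_id INTEGER, "
            "page_num INTEGER, chunk_text TEXT)")
cur.execute("CREATE VIRTUAL TABLE chunks_fts "
            "USING fts5(chunk_text, content='chunks', content_rowid='id')")
cur.execute("INSERT INTO documents VALUES (1, 'api-rp-2rd.pdf', 'API')")
cur.execute("INSERT INTO chunks VALUES "
            "(1, 1, 7, 'Fatigue analysis of the riser system')")
cur.execute("INSERT INTO chunks_fts(rowid, chunk_text) "
            "SELECT id, chunk_text FROM chunks")

rows = cur.execute('''
    SELECT d.filename, d.category, c.page_num,
           snippet(chunks_fts, 0, '>>> ', ' <<<', '...', 32),
           bm25(chunks_fts) AS score
    FROM chunks_fts
    JOIN chunks c ON chunks_fts.rowid = c.id
    JOIN documents d ON c.doc_id = d.id
    WHERE chunks_fts MATCH 'fatigue'
    ORDER BY score
''').fetchall()
print(rows[0][:3])  # -> ('api-rp-2rd.pdf', 'API', 7)
```

The snippet column highlights the matched token, e.g. `>>> Fatigue <<< analysis of the riser system`.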

CLI Template

#!/bin/bash
# kb - Knowledge Base Search CLI

DB_PATH="${KB_DATABASE:-./knowledge.db}"

search() {
    sqlite3 "$DB_PATH" "
        SELECT d.filename, c.page_num,
               snippet(chunks_fts, 0, '>>>', '<<<', '...', 20)
        FROM chunks_fts
        JOIN chunks c ON chunks_fts.rowid = c.id
        JOIN documents d ON c.doc_id = d.id
        WHERE chunks_fts MATCH '$1'
        LIMIT 20
    "
}

case "$1" in
    search) search "$2" ;;
    status) sqlite3 "$DB_PATH" "SELECT COUNT(*) FROM documents" ;;
    *) echo "Usage: kb {search|status} [query]" ;;
esac

Execution Checklist

  • Scan and inventory target document collection

  • Create SQLite database with FTS5 support

  • Extract text from all documents

  • Chunk text appropriately (1000-2000 chars)

  • Build FTS5 search index

  • Test search with sample queries

  • Validate search results accuracy

  • Create CLI or API interface

Error Handling

Common Errors

Error: sqlite3.OperationalError (database is locked)

  • Cause: Concurrent access without proper timeout

  • Solution: Use timeout=30 when connecting

Error: FTS5 not available

  • Cause: SQLite compiled without FTS5 support

  • Solution: Upgrade SQLite or use FTS4 fallback
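You can detect this case up front by probing for FTS5 before building the schema. A minimal sketch that attempts to create a throwaway virtual table in memory:

```python
import sqlite3

def has_fts5():
    """Return True if this SQLite build can create FTS5 virtual tables."""
    try:
        probe = sqlite3.connect(":memory:")
        probe.execute("CREATE VIRTUAL TABLE t USING fts5(x)")
        probe.close()
        return True
    except sqlite3.OperationalError:
        return False

print(has_fts5())
```

If this returns False, fall back to FTS4 or install a SQLite build (3.9+) compiled with FTS5.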

Error: Empty search results

  • Cause: FTS index not synced with data

  • Solution: Rebuild FTS index with INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')

Error: Memory issues with large collections

  • Cause: Loading all chunks at once

  • Solution: Process in batches, commit every 500 documents
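The batching advice can be sketched as follows. `index_in_batches` is an illustrative name, the schema is cut down to the chunk columns, and FTS sync is assumed to happen via the insert trigger from Step 1:

```python
import sqlite3

def index_in_batches(db_path, rows, commit_every=500):
    """Insert chunk rows, committing every `commit_every` rows so the
    open transaction stays small and a crash loses at most one batch."""
    conn = sqlite3.connect(db_path, timeout=30)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, "
                "doc_id INTEGER, page_num INTEGER, chunk_text TEXT)")
    for n, (doc_id, page_num, text) in enumerate(rows, 1):
        cur.execute("INSERT INTO chunks (doc_id, page_num, chunk_text) "
                    "VALUES (?, ?, ?)", (doc_id, page_num, text))
        if n % commit_every == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch
    return conn

# Accepts a generator, so chunks never need to be held in memory at once.
conn = index_in_batches(":memory:", ((1, i, f"text {i}") for i in range(1, 1201)))
count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(count)  # -> 1200
```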

Metrics

Metric              Typical Value
------------------  -------------------------
Indexing speed      ~1000 documents/minute
Search latency      <50 ms for 100K chunks
Storage overhead    ~10-20% over raw text
FTS5 index size     ~30% of text size
Memory usage        ~100 MB per 50K documents

Best Practices

  • Use SQLite timeout - Add timeout=30 for concurrent access

  • Chunk appropriately - 1000-2000 chars optimal for search

  • Index progressively - Process in batches for large collections

  • Background processing - Use service scripts for long extractions

  • Category detection - Auto-categorize from filename/path patterns

Example Usage

Build knowledge base

python inventory.py /path/to/documents
python extract.py --db knowledge.db
python index.py --db knowledge.db

Search

./kb search "fatigue analysis"
./kb search "API AND riser"

Related Skills

  • semantic-search-setup - Add vector embeddings for AI search

  • rag-system-builder - Build AI Q&A on top of knowledge base

  • pdf-text-extractor - Detailed PDF extraction options

Dependencies

pip install PyMuPDF

System tools:

  • SQLite 3.9+ (for FTS5 support)

Version History

  • 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills

  • 1.0.0 (2024-10-15): Initial release with SQLite FTS5 full-text search, PDF extraction, CLI
