Library RAG: Semantic Search
Semantic search over your local library of markdown-converted papers using sentence-transformers embeddings and ChromaDB.
Prerequisites
- uv installed (standard in this project)
- Papers ingested via ingest.py (which converts to markdown, organizes files, and adds metadata to references.bib)
- references.bib with md_path fields linking citation keys to markdown files
Important: Only files registered in references.bib are indexed. Loose markdown files in library/markdown/ without a bib entry will be flagged as "unlinked" during indexing. Run ingest.py to register them.
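As a rough illustration of what "unlinked" means here, the following sketch compares the md_path values found in a bib file against a list of markdown files. The field syntax, the regex, and the `find_unlinked` helper are assumptions made for illustration; the actual check inside rag.py may work differently.

```python
import re
from pathlib import PurePosixPath

# Assumed field syntax: md_path = {library/markdown/<name>.md}
MD_PATH_RE = re.compile(r'md_path\s*=\s*[{"]([^}"]+)[}"]')


def find_unlinked(bib_text: str, markdown_files: list) -> list:
    """Return markdown files that no md_path field in the bib text points to."""
    linked = {PurePosixPath(p.strip()).name for p in MD_PATH_RE.findall(bib_text)}
    return sorted(f for f in markdown_files if PurePosixPath(f).name not in linked)


# Hypothetical bib entry and file listing, for illustration only.
sample_bib = """
@article{Smith2020_Cultural,
  title = {Placeholder title},
  md_path = {library/markdown/Smith2020_Cultural.md}
}
"""
files = [
    "library/markdown/Smith2020_Cultural.md",
    "library/markdown/loose_notes.md",
]
unlinked = find_unlinked(sample_bib, files)
print(unlinked)
```

Any file reported this way would need to go through ingest.py before it can be indexed.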
Quick Start
Index your library (first time or after adding papers)
uv run plugins/sociology-skillset/scripts/rag.py index
Search by meaning
uv run plugins/sociology-skillset/scripts/rag.py search "cultural capital and educational attainment"
Script Location
All commands use:
uv run plugins/sociology-skillset/scripts/rag.py <command>
Dependencies (sentence-transformers, chromadb) are auto-installed by uv on first run via PEP 723 inline metadata. No manual installation needed.
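For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata block looks like and how a tool can extract it. The header contents are hypothetical (the real block at the top of rag.py may pin versions or list more packages); the regex is the one given in the PEP 723 reference implementation.

```python
import re

# Hypothetical script header; the actual header in rag.py may differ.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "sentence-transformers",
#     "chromadb",
# ]
# ///

print("script body starts here")
'''

# Pattern from the PEP 723 reference implementation.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_inline_metadata(source: str) -> str:
    """Return the TOML text of the 'script' metadata block, or '' if absent."""
    for match in BLOCK_RE.finditer(source):
        if match.group("type") == "script":
            # Strip the leading "# " (or a bare "#") from each comment line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in match.group("content").splitlines(keepends=True)
            )
    return ""


toml_text = read_inline_metadata(EXAMPLE_SCRIPT)
print(toml_text)
```

uv reads a block like this when it runs the script and builds an isolated environment containing the listed packages.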
Commands
Index
Build or update the vector index from library/markdown/ files.
Index all markdown files (incremental — skips unchanged files)
uv run plugins/sociology-skillset/scripts/rag.py index
Index specific citation keys only
uv run plugins/sociology-skillset/scripts/rag.py index --keys Smith2020_Cultural Jones2019_Institutional
The index is stored at library/.rag-index/. The first run downloads the all-MiniLM-L6-v2 embedding model (~80 MB, cached by sentence-transformers).
Run this after adding new papers to keep the index current.
Search
Semantic search across all indexed documents. Returns JSON lines ranked by similarity.
uv run plugins/sociology-skillset/scripts/rag.py search "social movements and collective identity"
uv run plugins/sociology-skillset/scripts/rag.py search "interview methodology" --top-k 5
uv run plugins/sociology-skillset/scripts/rag.py search "Bourdieu field theory" --min-score 0.3
Each result includes: chunk_id, citation_key, section_title, score, text (truncated), plus title, author, year from references.bib.
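Because the output is one JSON object per line, it can be consumed with a few lines of Python. The sketch below uses the field names listed above; the sample records and the `parse_results` helper are hypothetical and only illustrate the shape of the data, assuming a higher score means a closer match.

```python
import json

# Hypothetical output lines; real values come from `rag.py search`.
sample_output = "\n".join(
    json.dumps(record)
    for record in [
        {"chunk_id": "abc123def456", "citation_key": "Smith2020_Cultural",
         "section_title": "Introduction", "score": 0.62,
         "text": "Placeholder passage A", "title": "Placeholder title A",
         "author": "Smith", "year": "2020"},
        {"chunk_id": "fed654cba321", "citation_key": "Jones2019_Institutional",
         "section_title": "Discussion", "score": 0.41,
         "text": "Placeholder passage B", "title": "Placeholder title B",
         "author": "Jones", "year": "2019"},
        {"chunk_id": "0123456789ab", "citation_key": "Jones2019_Institutional",
         "section_title": "Methods", "score": 0.18,
         "text": "Placeholder passage C", "title": "Placeholder title B",
         "author": "Jones", "year": "2019"},
    ]
)


def parse_results(output: str, min_score: float = 0.0, top_k=None) -> list:
    """Parse JSON lines, keep hits at or above min_score, best first."""
    results = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record["score"] >= min_score:
            results.append(record)
    results.sort(key=lambda r: r["score"], reverse=True)
    return results if top_k is None else results[:top_k]


hits = parse_results(sample_output, min_score=0.3)
for hit in hits:
    print(hit["citation_key"], hit["section_title"], hit["score"])
```

In practice the same filtering is available through the --top-k and --min-score flags; parsing on the client side is mainly useful for grouping or post-processing hits.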
Similar
Find passages similar to a given chunk (from search results).
uv run plugins/sociology-skillset/scripts/rag.py similar <chunk_id>
uv run plugins/sociology-skillset/scripts/rag.py similar abc123def456 --top-k 5
Use this to explore thematic connections: find a relevant passage via search, then use similar to discover related content across other papers.
Context
Show the full context around a chunk — the target chunk plus surrounding chunks from the same document.
uv run plugins/sociology-skillset/scripts/rag.py context <chunk_id>
uv run plugins/sociology-skillset/scripts/rag.py context abc123def456 --window 3
Returns the target chunk and neighboring chunks (default: 2 on each side), so you can read the passage in its original context.
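The windowing can be pictured as a slice over a document's chunks in reading order. This is a minimal sketch that assumes chunks are stored in document order; the `context_window` helper and the chunk IDs are hypothetical, not taken from rag.py.

```python
def context_window(chunk_ids: list, target_id: str, window: int = 2) -> list:
    """Return the target chunk plus up to `window` neighbors on each side."""
    pos = chunk_ids.index(target_id)  # raises ValueError if the chunk is unknown
    start = max(0, pos - window)      # clamp at the start of the document
    return chunk_ids[start:pos + window + 1]


# Hypothetical chunk IDs for one document, in reading order.
doc_chunks = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]

middle = context_window(doc_chunks, "c3")
first = context_window(doc_chunks, "c0")
wide = context_window(doc_chunks, "c5", window=3)
print(middle, first, wide)
```

Near the start or end of a document the window is simply shorter on the side where no further chunks exist.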
Status
Show index statistics: number of documents, chunks, and last modified time.
uv run plugins/sociology-skillset/scripts/rag.py status
List
List all indexed documents with chunk counts.
uv run plugins/sociology-skillset/scripts/rag.py list
Remove
Remove a document from the index by citation key.
uv run plugins/sociology-skillset/scripts/rag.py remove Smith2020_Cultural
Typical Workflows
First-time setup
- Ensure papers are in library/markdown/ (run ingest.py for each PDF/EPUB)
- Run uv run plugins/sociology-skillset/scripts/rag.py index to build the index
- Search with uv run plugins/sociology-skillset/scripts/rag.py search "your topic"
Adding new papers
- Ingest the paper: uv run plugins/sociology-skillset/scripts/ingest.py --file paper.pdf
- Update the index: uv run plugins/sociology-skillset/scripts/rag.py index
Adding a PDF for a paper already in references.bib
- Ingest with update: uv run plugins/sociology-skillset/scripts/ingest.py --file paper.pdf --citekey ExistingKey2022 --update
- Update the index: uv run plugins/sociology-skillset/scripts/rag.py index
Deep exploration
- Search for a topic: search "concept or question"
- Read context of a promising hit: context <chunk_id>
- Find similar passages across other papers: similar <chunk_id>
- Read the full paper if needed: open the source_file path from results
When to Use RAG vs. Grep
| Need | Tool |
| --- | --- |
| Conceptual/semantic search (find passages about a concept even if they don't use the exact words) | rag.py search |
| Exact keyword/phrase search (find specific terms, author names, method names) | grep library/markdown/ |
| Metadata search (by author, year, journal) | grep references.bib |
Both approaches complement each other. Use semantic search for exploratory discovery and grep for precise retrieval.
Technical Details
- Embedding model: all-MiniLM-L6-v2 (384 dimensions, same as old Zotero RAG)
- Vector store: ChromaDB with file-based persistence at library/.rag-index/
- Chunking: Split by ## headers (section-level); fallback to ~512-token fixed chunks for headerless documents
- Incremental indexing: Content hashes stored in metadata; unchanged files are skipped on re-index
- Output format: JSON lines for easy parsing by Claude or other tools
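The chunking and incremental-indexing items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in rag.py: whitespace-separated words stand in for tokens, SHA-256 stands in for whatever hash the script uses, and `chunk_markdown`, `content_hash`, and `needs_reindex` are hypothetical names.

```python
import hashlib
import re

# Matches level-2 headers only ("## Title"), not "###" or deeper.
HEADER_RE = re.compile(r"(?m)^## +(.+)$")


def chunk_markdown(text: str, max_words: int = 512) -> list:
    """Split on '## ' headers; fall back to fixed-size word windows."""
    matches = list(HEADER_RE.finditer(text))
    if not matches:
        # Headerless document: fixed windows (words approximate tokens here).
        words = text.split()
        return [("", " ".join(words[i:i + max_words]))
                for i in range(0, len(words), max_words)]
    chunks = []
    preamble = text[:matches[0].start()].strip()
    if preamble:
        chunks.append(("", preamble))  # text before the first header
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append((match.group(1).strip(), text[match.end():end].strip()))
    return chunks


def content_hash(text: str) -> str:
    """Hash file content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(key: str, text: str, stored_hashes: dict) -> bool:
    """True if the file is new or its content changed since the last index."""
    return stored_hashes.get(key) != content_hash(text)


doc = "Intro text\n\n## Methods\nWe interviewed.\n\n## Results\nFindings here.\n"
sections = chunk_markdown(doc)
windows = chunk_markdown("word " * 10, max_words=4)
stored = {"Smith2020_Cultural": content_hash(doc)}
print(sections)
print(needs_reindex("Smith2020_Cultural", doc, stored))
```

Storing one hash per file keeps re-indexing cheap: only files whose content changed need to be re-chunked and re-embedded.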