Library RAG: Semantic Search

Semantic search over your local library of markdown-converted papers using sentence-transformers embeddings and ChromaDB.

Prerequisites

uv installed (standard in this project)
Papers ingested via ingest.py (which converts to markdown, organizes files, and adds metadata to references.bib)
references.bib with md_path fields linking citation keys to markdown files

Important: Only files registered in references.bib are indexed. Loose markdown files in library/markdown/ without a bib entry will be flagged as "unlinked" during indexing. Run ingest.py to register them.
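The "unlinked" check described above can be sketched roughly as follows. This is a minimal illustration and not the code in rag.py: the function names are hypothetical, and it assumes md_path values are written as md_path = {path} or md_path = "path" in references.bib.

```python
import re
from pathlib import Path

# Assumed field syntax: md_path = {library/markdown/Key.md} (braces or quotes).
MD_PATH_FIELD = re.compile(r'md_path\s*=\s*[{"]([^}"]+)[}"]')


def linked_markdown_paths(bib_text: str) -> set[str]:
    """Collect every md_path value found in the BibTeX source."""
    return {Path(p.strip()).as_posix() for p in MD_PATH_FIELD.findall(bib_text)}


def find_unlinked(markdown_files: list[str], bib_text: str) -> list[str]:
    """Return markdown files that no bib entry points to, in sorted order."""
    linked = linked_markdown_paths(bib_text)
    return sorted(f for f in markdown_files if Path(f).as_posix() not in linked)
```

Any path returned by find_unlinked would correspond to a file that needs to be registered with ingest.py before it can be indexed.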

Quick Start

Index your library (first time or after adding papers)

uv run plugins/sociology-skillset/scripts/rag.py index

Search by meaning

uv run plugins/sociology-skillset/scripts/rag.py search "cultural capital and educational attainment"

Script Location

All commands use:

uv run plugins/sociology-skillset/scripts/rag.py <command>

Dependencies (sentence-transformers, chromadb) are auto-installed by uv on first run via PEP 723 inline metadata. No manual installation needed.
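For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata block of this kind typically looks like and how a tool can locate it. The header contents are an assumption (the actual block in rag.py, including any version pins and the requires-python value, is not shown in this document); the regular expression is the reference pattern published in PEP 723, and read_script_metadata is a hypothetical helper name.

```python
import re
from typing import Optional

# Hypothetical script header; the real one in rag.py may differ.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "sentence-transformers",
#     "chromadb",
# ]
# ///
print("uv resolves the dependencies above before this line runs")
'''

# Reference pattern from PEP 723 for locating inline metadata blocks.
BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(source: str) -> Optional[str]:
    """Return the TOML text of the 'script' block, or None if there is none."""
    for match in BLOCK.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or bare "#") comment prefix from each line.
            return "".join(l[2:] if l.startswith("# ") else l[1:] for l in lines)
    return None
```

The returned text is plain TOML, which is what uv reads to build an isolated environment for the script.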

Commands

Index

Build or update the vector index from library/markdown/ files.

Index all markdown files (incremental — skips unchanged files)

uv run plugins/sociology-skillset/scripts/rag.py index

Index specific citation keys only

uv run plugins/sociology-skillset/scripts/rag.py index --keys Smith2020_Cultural Jones2019_Institutional

The index is stored at library/.rag-index/. First run downloads the all-MiniLM-L6-v2 embedding model (~80MB, cached by sentence-transformers).

Run this after adding new papers to keep the index current.
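The incremental behavior (skipping unchanged files) can be pictured with a short sketch. It assumes SHA-256 as the hash and a simple mapping from citation key to stored hash; the actual hash function and storage layout used by rag.py are not specified here, and plan_reindex is a hypothetical name.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash a document's text (SHA-256 is an assumption, not confirmed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return citation keys whose content is new or changed since the last index."""
    return sorted(
        key
        for key, text in documents.items()
        if stored_hashes.get(key) != content_hash(text)
    )
```

Only the keys returned by plan_reindex would need to be re-embedded, which is why re-running index after adding a single paper is cheap.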

Search

Semantic search across all indexed documents. Returns JSON lines ranked by similarity.

uv run plugins/sociology-skillset/scripts/rag.py search "social movements and collective identity"
uv run plugins/sociology-skillset/scripts/rag.py search "interview methodology" --top-k 5
uv run plugins/sociology-skillset/scripts/rag.py search "Bourdieu field theory" --min-score 0.3

Each result includes: chunk_id, citation_key, section_title, score, text (truncated), plus title, author, year from references.bib.
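Because the output is JSON lines, it can be post-processed with a few lines of Python. The sketch below parses the output and keeps the best-scoring hit per paper. The sample lines are invented for illustration (only the field names come from the list above), and both function names are hypothetical.

```python
import json

# Hypothetical output lines; field names follow the list above, values are invented.
SAMPLE_OUTPUT = """\
{"chunk_id": "abc123def456", "citation_key": "Smith2020_Cultural", "section_title": "Findings", "score": 0.62, "text": "...", "title": "Example title", "author": "Smith", "year": "2020"}
{"chunk_id": "0f9e8d7c6b5a", "citation_key": "Jones2019_Institutional", "section_title": "Discussion", "score": 0.48, "text": "...", "title": "Another example", "author": "Jones", "year": "2019"}
{"chunk_id": "112233445566", "citation_key": "Smith2020_Cultural", "section_title": "Methods", "score": 0.35, "text": "...", "title": "Example title", "author": "Smith", "year": "2020"}
"""


def parse_results(jsonl: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl.splitlines() if line.strip()]


def best_hit_per_paper(results: list[dict]) -> dict[str, dict]:
    """Keep only the highest-scoring chunk for each citation key."""
    best: dict[str, dict] = {}
    for hit in results:
        key = hit["citation_key"]
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return best
```

Collapsing hits by citation_key is useful when a single long paper would otherwise dominate the top of the result list.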

Similar

Find passages similar to a given chunk (from search results).

uv run plugins/sociology-skillset/scripts/rag.py similar <chunk_id>
uv run plugins/sociology-skillset/scripts/rag.py similar abc123def456 --top-k 5

Use this to explore thematic connections: find a relevant passage via search , then use similar to discover related content across other papers.
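Conceptually, a "similar" lookup ranks every other chunk by how close its embedding is to the embedding of the given chunk. The sketch below illustrates the idea with cosine similarity on toy 3-dimensional vectors. It is an assumption that the script scores with cosine similarity (the real embeddings have 384 dimensions and the lookup is delegated to ChromaDB); most_similar is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(chunk_id: str, vectors: dict[str, list[float]], top_k: int = 5):
    """Rank all other chunks by similarity to the given chunk, best first."""
    query = vectors[chunk_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != chunk_id  # never return the query chunk itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With real embeddings, chunks from different papers that discuss the same idea in different words would tend to land near each other, which is what makes this useful for finding thematic connections.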

Context

Show the full context around a chunk: the target chunk plus surrounding chunks from the same document.

uv run plugins/sociology-skillset/scripts/rag.py context <chunk_id>
uv run plugins/sociology-skillset/scripts/rag.py context abc123def456 --window 3

Returns the target chunk and neighboring chunks (default: 2 on each side), so you can read the passage in its original context.
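The window logic amounts to slicing the document's ordered chunk list around the target and clamping at the document boundaries. A minimal sketch, assuming chunks are stored in reading order (context_window is a hypothetical name, not a function in rag.py):

```python
def context_window(chunk_ids: list[str], target: str, window: int = 2) -> list[str]:
    """Return the target chunk with up to `window` neighbors on each side."""
    position = chunk_ids.index(target)
    start = max(0, position - window)  # clamp at the start of the document
    end = min(len(chunk_ids), position + window + 1)  # clamp at the end
    return chunk_ids[start:end]
```

Near the start or end of a document the result is simply shorter, since there are fewer neighbors on one side.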

Status

Show index statistics: number of documents, chunks, and last modified time.

uv run plugins/sociology-skillset/scripts/rag.py status

List

List all indexed documents with chunk counts.

uv run plugins/sociology-skillset/scripts/rag.py list

Remove

Remove a document from the index by citation key.

uv run plugins/sociology-skillset/scripts/rag.py remove Smith2020_Cultural

Typical Workflows

First-time setup

Ensure papers are in library/markdown/ (run ingest.py for each PDF/EPUB)
Build the index: uv run plugins/sociology-skillset/scripts/rag.py index
Search: uv run plugins/sociology-skillset/scripts/rag.py search "your topic"

Adding new papers

Ingest the paper: uv run plugins/sociology-skillset/scripts/ingest.py --file paper.pdf
Update the index: uv run plugins/sociology-skillset/scripts/rag.py index

Adding a PDF for a paper already in references.bib

Ingest with update: uv run plugins/sociology-skillset/scripts/ingest.py --file paper.pdf --citekey ExistingKey2022 --update
Update the index: uv run plugins/sociology-skillset/scripts/rag.py index

Deep exploration

Search for a topic: search "concept or question"
Read context of a promising hit: context <chunk_id>
Find similar passages across other papers: similar <chunk_id>
Read the full paper if needed: open the source_file path from results
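The exploration steps above can be chained from a script. The sketch below is one possible way to do it, not part of the skill itself: explore and run_rag are hypothetical helpers, it assumes the first search result is the best hit (results are ranked by similarity), and it keeps the context and similar output as raw text. The runner is injectable so the flow can be exercised without a built index.

```python
import json
import subprocess
from typing import Callable

RAG = ["uv", "run", "plugins/sociology-skillset/scripts/rag.py"]


def run_rag(args: list[str]) -> str:
    """Run rag.py with the given arguments and return its stdout."""
    done = subprocess.run(RAG + args, capture_output=True, text=True, check=True)
    return done.stdout


def explore(query: str, run: Callable[[list[str]], str] = run_rag) -> dict:
    """Search, then fetch context and similar passages for the top hit."""
    output = run(["search", query, "--top-k", "5"])
    hits = [json.loads(line) for line in output.splitlines() if line.strip()]
    if not hits:
        return {"top_hit": None, "context": "", "similar": ""}
    chunk_id = hits[0]["chunk_id"]
    return {
        "top_hit": hits[0],
        "context": run(["context", chunk_id]),
        "similar": run(["similar", chunk_id]),
    }
```

Calling explore("cultural capital") from the project root would run the three commands in sequence; passing a different run callable makes it easy to log or stub the calls.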

When to Use RAG vs. Grep

Need | Tool
Conceptual/semantic search (find passages about a concept even if they don't use the exact words) | rag.py search
Exact keyword/phrase search (find specific terms, author names, method names) | grep library/markdown/
Metadata search (by author, year, journal) | grep references.bib

Both approaches complement each other. Use semantic search for exploratory discovery and grep for precise retrieval.

Technical Details

Embedding model: all-MiniLM-L6-v2 (384 dimensions, same as old Zotero RAG)
Vector store: ChromaDB with file-based persistence at library/.rag-index/
Chunking: Split by ## headers (section-level); fallback to ~512-token fixed chunks for headerless documents
Incremental indexing: Content hashes stored in metadata; unchanged files are skipped on re-index
Output format: JSON lines for easy parsing by Claude or other tools
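The chunking strategy listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in rag.py: chunk_markdown is a hypothetical name, whitespace-separated words stand in for tokens (a real tokenizer would count differently), and text before the first ## header is kept as its own untitled chunk.

```python
import re

# Matches level-2 headers only; "### Subsection" lines stay inside their section.
HEADER = re.compile(r"^## +(.+)$", re.MULTILINE)


def chunk_markdown(text: str, max_words: int = 512) -> list[dict]:
    """Split on '## ' headers; fall back to fixed-size word windows."""
    parts = HEADER.split(text)
    if len(parts) == 1:
        # No section headers: cut into fixed-size windows of words.
        words = text.split()
        return [
            {"section_title": None, "text": " ".join(words[i : i + max_words])}
            for i in range(0, len(words), max_words)
        ]
    chunks = []
    preamble = parts[0].strip()
    if preamble:
        chunks.append({"section_title": None, "text": preamble})
    # After the preamble, split() alternates: title, body, title, body, ...
    for title, body in zip(parts[1::2], parts[2::2]):
        chunks.append({"section_title": title.strip(), "text": body.strip()})
    return chunks
```

Section-level chunks keep each hit tied to a meaningful section_title, while the fixed-size fallback ensures headerless documents are still searchable.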

