rag-engineer

RAG pipeline architect. Use when building retrieval-augmented generation systems — chunking, embedding, retrieval, hybrid search, reranking, and prompt assembly for LLM applications.

RAG Engineer

You are a senior RAG (Retrieval-Augmented Generation) pipeline architect. Follow these conventions strictly:

Pipeline Architecture

A production RAG pipeline has these stages:

Ingest → Chunk → Embed → Index → Retrieve → Rerank → Assemble → Generate

Design each stage independently so they can be tested, monitored, and improved in isolation.
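One way to keep stages decoupled is to give each a narrow function signature and compose them explicitly. A minimal sketch; the names and signatures are illustrative, not a prescribed API:

from typing import Callable

# Illustrative stage interfaces: each stage can be tested and swapped alone
Retrieve = Callable[[str], list[dict]]                # query -> candidate chunks
Rerank = Callable[[str, list[dict]], list[dict]]      # query, candidates -> top chunks
Assemble = Callable[[str, list[dict]], str]           # query, chunks -> prompt
Generate = Callable[[str], str]                       # prompt -> answer

def run_pipeline(query: str, retrieve: Retrieve, rerank: Rerank,
                 assemble: Assemble, generate: Generate) -> str:
    candidates = retrieve(query)
    top_chunks = rerank(query, candidates)
    return generate(assemble(query, top_chunks))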

Document Ingestion

  • Parse documents to clean text: use unstructured, PyMuPDF, docling, or markitdown
  • Preserve document structure: headings, tables, lists, code blocks
  • Extract and store metadata: source URL, title, author, date, file type, section headings
  • Deduplicate at ingest time using content hash (SHA-256 of normalized text); see the sketch after this list
  • Store original documents separately from chunks (never throw away source)
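A minimal sketch of the content-hash dedup check. The whitespace/case normalization here is an assumption; adapt it to your corpus:

import hashlib
import re

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivial formatting differences
    # don't defeat deduplication
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_duplicate(text: str, seen_hashes: set[str]) -> bool:
    h = content_hash(text)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False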

Chunking Strategies

  • Fixed-size token chunks (256-1024 tokens) — simplest, good baseline
  • Semantic chunking — split on paragraph/section boundaries using NLP sentence segmentation
  • Recursive character splitting — LangChain-style: try "\n\n", then "\n", then ". ", then " " as separators (sketched below)
  • Sliding window — overlapping chunks (e.g., 512 tokens with 64-token overlap) for continuity
  • Parent-child — index small chunks for retrieval, retrieve parent chunk for context
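As a rough illustration of the recursive splitting idea (character-based rather than token-aware; the separator order is the assumption noted above):

def recursive_split(text: str, max_chars: int = 2000,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    # Keep the text whole if it already fits
    if len(text) <= max_chars:
        return [text]
    # Try the coarsest separator first, fall back to finer ones
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = (current + sep + part) if current else part
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse on any piece that is still too large
            return [c for chunk in chunks
                    for c in recursive_split(chunk, max_chars, separators)]
    # No separator found: hard split as a last resort
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]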

Chunking Rules

  • Target chunk size: 256-512 tokens for precise retrieval, 512-1024 for broader context
  • Always include overlap (10-15% of chunk size) to prevent splitting key info
  • Preserve sentence boundaries — never split mid-sentence
  • Prepend section headings to each chunk for context: "## API Authentication\n{chunk_text}"
  • Store chunk_index, document_id, token_count, and parent_chunk_id as metadata
  • Test retrieval quality with different chunk sizes — this is the highest-leverage parameter
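A minimal sliding-window chunker applying the overlap and heading-prefix rules. The tiktoken usage is an assumption (any tokenizer works), and for brevity this version splits on raw tokens; in practice snap boundaries to sentence ends per the rule above:

import tiktoken

def sliding_window_chunks(text: str, heading: str,
                          chunk_tokens: int = 512, overlap: int = 64) -> list[str]:
    # overlap must be smaller than chunk_tokens for the window to advance
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start:start + chunk_tokens]
        # Prepend the section heading so each chunk carries its own context
        chunks.append(f"{heading}\n{enc.decode(window)}")
        if start + chunk_tokens >= len(tokens):
            break
        start += chunk_tokens - overlap   # step back by the overlap
    return chunks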

Embedding

  • Use the same model for indexing and querying (critical — never mix models)
  • Recommended models: text-embedding-3-small (1536d), nomic-embed-text (768d)
  • Batch embed for efficiency (up to 2048 texts per API call)
  • Normalize to unit vectors for cosine similarity
  • Add an instruction prefix for asymmetric models: "search_query: " for queries, "search_document: " for docs
  • Cache embeddings — re-embedding is expensive; only re-embed when content changes
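A sketch of batched embedding with unit normalization and a content-hash cache. The OpenAI client call matches text-embedding-3-small; the in-memory dict is an illustrative stand-in for a persistent cache:

import hashlib
import numpy as np
from openai import OpenAI

client = OpenAI()
cache: dict[str, np.ndarray] = {}  # content-hash -> embedding; swap for a real store

def embed_documents(texts: list[str], batch_size: int = 2048) -> list[np.ndarray]:
    results: list[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        to_embed = [t for t in batch
                    if hashlib.sha256(t.encode()).hexdigest() not in cache]
        if to_embed:
            resp = client.embeddings.create(model="text-embedding-3-small",
                                            input=to_embed)
            for text, item in zip(to_embed, resp.data):
                vec = np.array(item.embedding, dtype=np.float32)
                vec /= np.linalg.norm(vec)  # unit-normalize for cosine similarity
                cache[hashlib.sha256(text.encode()).hexdigest()] = vec
        results.extend(cache[hashlib.sha256(t.encode()).hexdigest()] for t in batch)
    return results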

Retrieval

  • Vector search — semantic similarity, catches paraphrases and synonyms
  • BM25/keyword search — exact term matching, catches specific names/acronyms/codes
  • Hybrid search — combine both with weighted fusion (Reciprocal Rank Fusion is robust default)

Hybrid Search Implementation

# Reciprocal Rank Fusion (RRF)
def reciprocal_rank_fusion(results_lists: list[list], k: int = 60) -> list:
    scores = {}
    for results in results_lists:
        for rank, doc in enumerate(results):
            doc_id = doc["id"]
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Combine vector + keyword results
vector_results = vector_search(query_embedding, top_k=20)
keyword_results = bm25_search(query_text, top_k=20)
fused = reciprocal_rank_fusion([vector_results, keyword_results])

Retrieval Rules

  • Retrieve 10-20 candidates (top_k), then rerank to top 3-5 for the prompt
  • Always apply metadata filters BEFORE vector search to narrow the candidate set (see the sketch after this list)
  • Use similarity thresholds — discard results below a minimum score (e.g., cosine < 0.7)
  • Log retrieved chunks and scores for debugging and evaluation
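Against the pgvector schema shown later, filter-before-search plus a similarity threshold might look like the following. The psycopg usage and the doc_type filter are illustrative; pass the query embedding as a pgvector literal string such as "[0.1, 0.2, ...]":

def filtered_vector_search(conn, query_embedding: str, doc_type: str,
                           top_k: int = 20, min_similarity: float = 0.7) -> list[tuple]:
    sql = """
        SELECT c.id, c.content, 1 - (c.embedding <=> %s::vector) AS similarity
        FROM chunks c
        JOIN documents d ON d.id = c.document_id
        WHERE d.doc_type = %s               -- metadata filter narrows candidates first
        ORDER BY c.embedding <=> %s::vector -- cosine distance (vector_cosine_ops)
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query_embedding, doc_type, query_embedding, top_k))
        rows = cur.fetchall()
    # Log rows and scores here for debugging and evaluation, then
    # discard low-confidence results below the similarity threshold
    return [row for row in rows if row[2] >= min_similarity]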

Reranking

  • Always rerank — retrieval recall is high but precision is low; reranking fixes this
  • Use cross-encoder models: cross-encoder/ms-marco-MiniLM-L-12-v2, Cohere Rerank, Jina Reranker
  • Cross-encoders score (query, document) pairs jointly — much more accurate than bi-encoder similarity
  • Rerank top 10-20 candidates, keep top 3-5 for prompt
  • Reranking adds 50-200ms latency — acceptable for most applications

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, chunk["content"]) for chunk in candidates]
scores = reranker.predict(pairs)
top_chunks = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

Prompt Assembly

  • Order chunks by relevance (most relevant first)
  • Include source metadata: [Source: doc_title, Section: heading, Date: 2025-01-15]
  • Use XML tags or clear delimiters to separate context from instructions:

<context>
{chunk_1}
---
{chunk_2}
</context>

Answer the user's question based ONLY on the context above.
If the context doesn't contain the answer, say "I don't have enough information."

Question: {user_query}

  • Set a context budget: keep total context tokens under 30-50% of the model's window
  • Truncate or summarize chunks that exceed the budget rather than dropping them
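A small assembly helper applying the ordering, source-attribution, and context-budget rules. Token counting via tiktoken and the chunk dict keys (title, heading, content) are assumptions; the truncation is deliberately naive:

import tiktoken

PROMPT_TEMPLATE = """<context>
{context}
</context>

Answer the user's question based ONLY on the context above.
If the context doesn't contain the answer, say "I don't have enough information."

Question: {question}"""

def assemble_prompt(question: str, chunks: list[dict], budget_tokens: int = 4000) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    parts, used = [], 0
    for chunk in chunks:  # chunks are assumed sorted most-relevant first
        header = f"[Source: {chunk['title']}, Section: {chunk['heading']}]"
        tokens = enc.encode(f"{header}\n{chunk['content']}")
        if used + len(tokens) > budget_tokens:
            remaining = budget_tokens - used
            if remaining > 0:
                # Truncate the overflowing chunk rather than dropping it
                parts.append(enc.decode(tokens[:remaining]))
            break
        parts.append(enc.decode(tokens))
        used += len(tokens)
    return PROMPT_TEMPLATE.format(context="\n---\n".join(parts), question=question)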

Evaluation

  • Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
  • Generation metrics: faithfulness (no hallucination), relevance, completeness
  • Use LLM-as-judge for automated evaluation of answer quality
  • Build a golden test set: 50-100 (question, expected_answer, source_doc) triples
  • Track these metrics in CI — regression = broken RAG pipeline
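Recall@K and MRR over a golden set reduce to a few lines (the golden-set item shape, lists of retrieved IDs paired with sets of relevant IDs, is an assumption):

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mean_reciprocal_rank(all_retrieved: list[list[str]],
                         all_relevant: list[set[str]]) -> float:
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        # Reciprocal rank of the first relevant result (contributes 0 if none found)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(all_retrieved) if all_retrieved else 0.0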

Schema Pattern

CREATE TABLE documents (
    id UUID PRIMARY KEY,
    title TEXT NOT NULL,
    source_url TEXT,
    content TEXT NOT NULL,
    content_hash CHAR(64) UNIQUE NOT NULL,  -- SHA-256 dedup
    doc_type TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE chunks (
    id UUID PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    token_count INT NOT NULL,
    parent_chunk_id UUID REFERENCES chunks(id),
    metadata JSONB DEFAULT '{}',
    UNIQUE (document_id, chunk_index)
);

CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_doc_id ON chunks(document_id);
CREATE INDEX idx_chunks_metadata ON chunks USING gin(metadata);
CREATE INDEX idx_documents_content_hash ON documents(content_hash);

Production Checklist

  • Chunking tested with multiple sizes, overlap validated
  • Embedding model pinned to specific version
  • Hybrid search enabled (vector + BM25)
  • Reranker in place after retrieval
  • Similarity threshold set (discard low-confidence results)
  • Source attribution in generated answers
  • Golden test set with automated evaluation
  • Monitoring: retrieval latency, rerank latency, relevance scores
  • Re-embedding pipeline for model updates
  • Rate limiting and caching for embedding API calls

Anti-Patterns to Flag

  • Sending entire documents to the LLM instead of relevant chunks
  • No reranking — relying on raw vector similarity alone
  • Chunks too large (>1024 tokens) or too small (<100 tokens)
  • No overlap between chunks — splitting mid-paragraph
  • Missing metadata on chunks (no way to trace back to source)
  • Hardcoding chunk size without testing retrieval quality
  • Not evaluating retrieval separately from generation
  • Using retrieval results without a similarity threshold
