
RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

When to Use This Skill

  • Building Q&A systems over proprietary documents

  • Creating chatbots with current, factual information

  • Implementing semantic search with natural language queries

  • Reducing hallucinations with grounded responses

  • Enabling LLMs to access domain-specific knowledge

  • Building documentation assistants

  • Creating research tools with source citation

Core Components

  1. Vector Databases

Purpose: Store and retrieve document embeddings efficiently

Options:

  • Pinecone: Managed, scalable, serverless

  • Weaviate: Open-source, hybrid search, GraphQL

  • Milvus: High performance, on-premise

  • Chroma: Lightweight, easy to use, local development

  • Qdrant: Fast, filtered search, Rust-based

  • pgvector: PostgreSQL extension, SQL integration
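Whichever store you choose, the workflow is the same: embed document chunks, persist the vectors, then query by embedding the question. A minimal sketch of that flow, assuming the LangChain Chroma and Voyage AI integrations used later in this skill (any store above exposes the same add/search interface):

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_voyageai import VoyageAIEmbeddings

# Assumed components; swap in any embedding model and store from the lists in this skill
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = Chroma(collection_name="demo", embedding_function=embeddings)

# Index: embed chunks and store the vectors
vectorstore.add_documents([
    Document(page_content="RAG grounds LLM answers in retrieved context.", metadata={"source": "notes.md"}),
    Document(page_content="Vector databases store embeddings for similarity search.", metadata={"source": "notes.md"}),
])

# Query: embed the question and return the nearest chunks
hits = vectorstore.similarity_search("How does RAG reduce hallucinations?", k=2)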

  2. Embeddings

Purpose: Convert text to numerical vectors for similarity search

Models (2026):

Model                     Dimensions   Best For
voyage-3-large            1024         Claude apps (Anthropic recommended)
voyage-code-3             1024         Code search
text-embedding-3-large    3072         OpenAI apps, high accuracy
text-embedding-3-small    1536         OpenAI apps, cost-effective
bge-large-en-v1.5         1024         Open source, local deployment
multilingual-e5-large     1024         Multi-language support
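To make the table concrete, a short sketch that embeds a query and two candidate passages and compares them with cosine similarity (shown with voyage-3-large via LangChain's Voyage AI integration; every model above supports the same embed-and-compare pattern, only the dimensions differ):

from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(model="voyage-3-large")

# Each text becomes a 1024-dimensional vector
query_vec = embeddings.embed_query("How do I rotate an API key?")
doc_vecs = embeddings.embed_documents([
    "API keys can be rotated from the security settings page.",
    "Our office is closed on public holidays.",
])

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norms

# The on-topic passage should score noticeably higher than the unrelated one
scores = [cosine(query_vec, vec) for vec in doc_vecs]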

  3. Retrieval Strategies

Approaches:

  • Dense Retrieval: Semantic similarity via embeddings

  • Sparse Retrieval: Keyword matching (BM25, TF-IDF)

  • Hybrid Search: Combine dense + sparse with weighted fusion

  • Multi-Query: Generate multiple query variations

  • HyDE: Generate hypothetical documents for better retrieval

  4. Reranking

Purpose: Improve retrieval quality by reordering results

Methods:

  • Cross-Encoders: BERT-based reranking (ms-marco-MiniLM)

  • Cohere Rerank: API-based reranking

  • Maximal Marginal Relevance (MMR): Diversity + relevance

  • LLM-based: Use LLM to score relevance
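Cross-encoder and Cohere reranking are shown under Retrieval Optimization below; the LLM-based variant is not, so here is a minimal sketch of it. It assumes the same ChatAnthropic model used in the Quick Start, and llm_rerank is a hypothetical helper name:

from langchain_anthropic import ChatAnthropic
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(model="claude-sonnet-4-6")  # assumed judge model

score_prompt = ChatPromptTemplate.from_template(
    "Rate how relevant the passage is to the question on a scale of 0-10. "
    "Reply with only the number.\n\nQuestion: {question}\n\nPassage: {passage}"
)

async def llm_rerank(question: str, docs: list[Document], top_k: int = 5) -> list[Document]:
    """Score each candidate with the LLM, then keep the highest-rated passages."""
    scores = []
    for doc in docs:
        response = await llm.ainvoke(
            score_prompt.format_messages(question=question, passage=doc.page_content)
        )
        try:
            scores.append(float(response.content.strip()))
        except ValueError:
            scores.append(0.0)  # unparseable reply counts as irrelevant
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]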

Quick Start with LangGraph

from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict, Annotated

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-6")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

Context:
{context}

Question: {question}

Answer:"""
)

async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text, question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}

# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])
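The Quick Start assumes the Pinecone index already contains your documents. A minimal indexing sketch, assuming a local folder of Markdown files; the DirectoryLoader and the ./docs path are placeholders for whatever loader your corpus needs:

from langchain_community.document_loaders import DirectoryLoader

# Load and chunk the source documents (loader and path are illustrative)
documents = DirectoryLoader("./docs", glob="**/*.md").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Embed and upsert the chunks into the same index the retriever reads from
await vectorstore.aadd_documents(chunks)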

Advanced RAG Patterns

Pattern 1: Hybrid Search with RRF

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7],  # 30% keyword, 70% semantic
)
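EnsembleRetriever performs the fusion internally. For intuition, a standalone sketch of weighted Reciprocal Rank Fusion as it is commonly defined (each retriever contributes weight / (rank + k) per document; k = 60 is the conventional constant, not a parameter exposed in the snippet above):

def weighted_rrf(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; documents ranked highly by several retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it fuses to the top: ["b", "a", "d", "c"]
fused = weighted_rrf([["a", "b", "c"], ["b", "d", "a"]], weights=[0.3, 0.7])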

Pattern 2: Multi-Query Retrieval

from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm,
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")

Pattern 3: Contextual Compression

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")

Pattern 4: Parent Document Retriever

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")

Pattern 5: HyDE (Hypothetical Document Embeddings)

from langchain_core.prompts import ChatPromptTemplate

class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

Question: {question}

Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}

# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

hyde_rag = builder.compile()

Document Chunking Strategies

Recursive Character Text Splitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Try in order
)

chunks = splitter.split_documents(documents)

Token-Based Splitting

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base",  # OpenAI tiktoken encoding
)

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

Markdown Header Splitter

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)

Vector Store Configurations

Pinecone (Serverless)

import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)

Weaviate

import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings,
)

Chroma (Local Development)

from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

pgvector (PostgreSQL)

from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)

Retrieval Optimization

  1. Metadata Filtering

from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),
        "date": datetime.now().isoformat(),
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5,
)

  2. Maximal Marginal Relevance (MMR)

# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,       # Fetch 20, return top 5 diverse
    lambda_mult=0.5,  # 0 = max diversity, 1 = max relevance
)

  3. Reranking with Cross-Encoder

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]

  4. Cohere Rerank

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

Prompt Engineering for RAG

Contextual Prompt with Citations

rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.

If you cannot answer based on the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty

Answer (with citations):"""
)
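This prompt asks for [1], [2] citations, which only works if the context string itself is numbered. A small hypothetical helper that numbers each chunk and keeps the mapping back to its source:

from langchain_core.documents import Document

def format_numbered_context(docs: list[Document]) -> tuple[str, dict[int, str]]:
    """Prefix each chunk with [n] so the model's citations map back to sources."""
    blocks, sources = [], {}
    for i, doc in enumerate(docs, start=1):
        blocks.append(f"[{i}] {doc.page_content}")
        sources[i] = doc.metadata.get("source", "unknown")
    return "\n\n".join(blocks), sources

# In the generate node, pass the numbered string as {context} and keep the map
# to resolve the [n] citations in the model's answer afterwards.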

Structured Output for RAG

from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on context")
    confidence: float = Field(description="Confidence score 0-1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")

# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
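A hypothetical generate node that uses the structured model, reusing the RAGState and rag_prompt from the Quick Start; the confidence and sources fields depend on the model honoring the schema:

async def generate_structured(state: RAGState) -> RAGState:
    """Like generate(), but produces a validated RAGResponse instead of raw text."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(context=context_text, question=state["question"])
    response: RAGResponse = await structured_llm.ainvoke(messages)
    return {"answer": response.answer}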

Evaluation Metrics

from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question

async def evaluate_rag_system(
    rag_chain, test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])

        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer"),
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
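evaluate_answer_quality is referenced above but never defined. A minimal LLM-as-judge sketch, assuming structured output from the same llm used earlier; the 0-1 scores are subjective model judgments, not calibrated metrics:

from pydantic import BaseModel, Field

class AnswerQuality(BaseModel):
    relevance: float = Field(description="Does the answer address the question? 0-1")
    faithfulness: float = Field(description="Is every claim supported by the context? 0-1")
    context_relevance: float = Field(description="Is the retrieved context on-topic? 0-1")

judge_prompt = ChatPromptTemplate.from_template(
    """Rate the answer on relevance, faithfulness, and context relevance (each 0-1).

Question: {question}

Context:
{context}

Answer: {answer}

Reference answer (may be empty): {expected}"""
)

async def evaluate_answer_quality(question, answer, context, expected=None) -> dict:
    """LLM-as-judge scoring used by evaluate_rag_system above."""
    judge = llm.with_structured_output(AnswerQuality)
    context_text = "\n\n".join(doc.page_content for doc in context)
    result = await judge.ainvoke(judge_prompt.format_messages(
        question=question, answer=answer, context=context_text, expected=expected or ""
    ))
    return result.model_dump()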
