RAG Infrastructure
Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.
When to Use This Skill
Use this skill when:
- Building a knowledge base Q&A system over internal documents
- Implementing semantic search over large document collections
- Reducing LLM hallucinations with retrieved context
- Setting up embedding pipelines and vector store infrastructure
- Deploying hybrid search (dense + sparse/BM25)
Prerequisites
- Python 3.10+ with pip
- A vector database (Qdrant, Weaviate, Pinecone, or pgvector)
- An embedding model (OpenAI, Cohere, or local via sentence-transformers)
- An LLM endpoint (OpenAI API or self-hosted vLLM)
- Docker for local vector DB deployment
Architecture Overview
Documents → Chunker → Embedder → Vector Store
                                      ↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer
Embedding Pipeline
import uuid

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Local embedding model (no API cost); bge-large-en-v1.5 outputs 1024-dimensional vectors
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Connect to Qdrant
client = QdrantClient(url="http://localhost:6333")

# Create the collection only if it is missing, so the script can be re-run
if not client.collection_exists("knowledge-base"):
    client.create_collection(
        collection_name="knowledge-base",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

def ingest_documents(docs: list[dict]):
    """Chunk, embed, and upsert documents.

    Each doc needs "text", "source", and "title" keys.
    chunk_text is defined under Chunking Strategies.
    """
    points = []
    for doc in docs:
        chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
        if not chunks:
            continue  # skip empty documents
        embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=str(uuid.uuid4()),  # random ID: re-ingesting the same doc adds duplicates
                vector=embedding.tolist(),
                payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
            ))
    if points:
        client.upsert(collection_name="knowledge-base", points=points)
    print(f"Ingested {len(points)} chunks")
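The ingest function above sends every point in a single upsert call, which may become a problem for large corpora. A minimal sketch of a batching helper follows, using only the standard library. The name `batched` and the batch size of 256 in the usage comment are illustrative assumptions, not part of the original pipeline.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical usage inside ingest_documents:
# for batch in batched(points, 256):
#     client.upsert(collection_name="knowledge-base", points=batch)
```

Smaller batches keep each request bounded in size and make it easier to retry a failed batch without redoing the whole ingest.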
Chunking Strategies
# Splitters live in the langchain-text-splitters package;
# older LangChain releases exposed them under langchain.text_splitter.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Recursive character splitter: a solid general-purpose default.

    chunk_size and overlap are measured in characters here, not tokens.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# For Markdown, split on headers so each chunk keeps its section context
headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
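To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sliding-window sketch in plain Python. It is a simplified stand-in, not what `RecursiveCharacterTextSplitter` does internally (the LangChain splitter prefers to break on separators such as paragraph boundaries); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk repeats the last `overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.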
Hybrid Search (Dense + Sparse)
from qdrant_client import models
from fastembed import SparseTextEmbedding

# Qdrant hybrid collection (named dense vector + named sparse vector)
client.create_collection(
    collection_name="hybrid-kb",
    vectors_config={"dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

# SPLADE is a learned sparse model, not BM25; swap in a BM25 sparse model if you need BM25 scoring
sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

# Points in "hybrid-kb" must be upserted with both the "dense" and "sparse" named vectors;
# ingest_documents above writes a single unnamed vector to "knowledge-base" instead.

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    dense_vec = model.encode(query).tolist()
    sparse_vec = next(iter(sparse_model.embed([query])))

    results = client.query_points(
        collection_name="hybrid-kb",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=20),
            models.Prefetch(
                query=models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist(),
                ),
                using="sparse",
                limit=20,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
        limit=top_k,
    )
    return [{"text": p.payload["text"], "score": p.score} for p in results.points]
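The hybrid query relies on Reciprocal Rank Fusion (RRF) to merge the dense and sparse result lists. The sketch below shows the basic idea in plain Python: each document scores the sum of 1 / (k + rank) over every list it appears in. The function name and the constant k = 60 (the value commonly used in the RRF literature) are assumptions for illustration; Qdrant performs fusion server-side and its constant and tie-breaking may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit of each list contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["a", "b", "c"]   # hypothetical IDs from the dense prefetch
sparse_hits = ["b", "c", "d"]  # hypothetical IDs from the sparse prefetch
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and sparse dot products live on different scales. Documents found by both retrievers ("b" and "c" here) rise above documents found by only one.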
Reranking
import cohere
from sentence_transformers import CrossEncoder

# Placeholder key; load it from an environment variable or secret store in practice
co = cohere.Client("your-api-key")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rerank retrieved chunks for relevance.

    The ~20-30% quality gain often quoted for reranking is a rough rule of thumb;
    measure it on your own data.
    """
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]

# Alternative: local reranker (no API cost)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, candidate) pair jointly, then keep the highest-scoring texts
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
RAG Query Pipeline
from openai import OpenAI

# OpenAI-compatible endpoint, for example a self-hosted vLLM server
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

def rag_query(user_question: str) -> str:
    # 1. Retrieve
    candidates = hybrid_search(user_question, top_k=20)
    texts = [c["text"] for c in candidates]
    if not texts:
        return "No relevant context was found for this question."

    # 2. Rerank
    top_chunks = local_rerank(user_question, texts, top_n=5)

    # 3. Generate
    context = "\n\n---\n\n".join(top_chunks)
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": (
                "Answer the question using only the provided context. "
                "If the answer isn't in the context, say so.\n\nContext:\n" + context
            )},
            {"role": "user", "content": user_question},
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return response.choices[0].message.content
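The pipeline above joins raw chunk texts, which leaves the LLM with nothing to cite. One possible refinement is sketched below: number each chunk and label it with its source before building the context string. The helper `build_context`, the `max_chars` budget, and the assumption that each chunk dict carries a `source` key (which `hybrid_search` as written does not return) are all illustrative, not part of the original code.

```python
def build_context(chunks: list[dict], max_chars: int = 6000) -> str:
    """Number each chunk and tag it with its source so the answer can cite [n].

    Stops adding chunks once the rough character budget would be exceeded
    (separators are not counted, so the budget is approximate).
    """
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[{number}] (source: {chunk['source']})\n{chunk['text']}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n---\n\n".join(parts)


example = [
    {"source": "a.md", "text": "Alpha"},
    {"source": "b.md", "text": "Beta"},
]
print(build_context(example))
```

With numbered blocks in the context, the system prompt can additionally ask the model to reference the bracketed numbers, which makes answers easier to verify against the stored source metadata.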
Docker Compose: Full RAG Stack
services:
  qdrant:
    image: qdrant/qdrant:latest  # pin a specific tag for production
    volumes:
      - qdrant-data:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

  ingestion-worker:
    build: ./ingestion
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
    depends_on: [qdrant, redis]
    restart: unless-stopped

  rag-api:
    build: ./api
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      # No vllm service is defined in this file; the host must be reachable on the same network
      - LLM_BASE_URL=http://vllm:8000/v1
    depends_on: [qdrant]
    restart: unless-stopped

volumes:
  qdrant-data:
  redis-data:
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Poor retrieval quality | Chunk size too large | Try 256–512 tokens; overlap 10–15% |
| LLM ignores retrieved context | Context too long | Rerank and keep top 3–5 chunks |
| Slow ingestion | Sequential embedding | Use batch_size=64 and async upserts |
| Stale documents | No re-ingestion pipeline | Track doc_hash; re-embed on change |
| High embedding costs | All chunks re-embedded | Cache embeddings with hash-based dedup |
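The last two fixes in the table (tracking a document hash and deduplicating by hash) can be sketched with the standard library alone. All names here (`content_hash`, `chunk_point_id`, `docs_needing_embedding`, and the `seen_hashes` mapping) are hypothetical; in a real deployment the mapping would live in a persistent store rather than in memory.

```python
import hashlib
import uuid


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_point_id(source: str, chunk_index: int) -> str:
    """Deterministic point ID, so re-ingesting a source overwrites its old points."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{chunk_index}"))


def docs_needing_embedding(docs: list[dict], seen_hashes: dict[str, str]) -> list[dict]:
    """Return only new or changed docs and record their hashes in seen_hashes.

    For simplicity the hash is recorded immediately; a production pipeline
    would record it only after the embed and upsert succeed.
    """
    changed = []
    for doc in docs:
        digest = content_hash(doc["text"])
        if seen_hashes.get(doc["source"]) != digest:
            changed.append(doc)
            seen_hashes[doc["source"]] = digest
    return changed
```

Deterministic IDs replace the random `uuid4` values used during ingestion, so an unchanged chunk position maps to the same point on every run. If a document shrinks, points for chunk indexes that no longer exist would still need to be deleted separately.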
Best Practices
- Use BAAI/bge-large-en-v1.5 or nomic-embed-text for strong free embeddings.
- Always rerank before passing context to the LLM: 5 precise chunks beat 20 noisy ones.
- Store source metadata (URL, page, section) in vector payloads for citations.
- Use namespace/tenant isolation in the vector store for multi-tenant RAG.
- Evaluate with RAGAS metrics: faithfulness, answer relevancy, context precision.
Related Skills
- vector-database-ops: Qdrant/Weaviate management
- vllm-server: Self-hosted LLM endpoint
- ollama-stack: Local LLM for development
- ai-pipeline-orchestration: Ingestion pipelines