RAG Infrastructure

Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.

When to Use This Skill

Use this skill when:

Building a knowledge base Q&A system over internal documents
Implementing semantic search over large document collections
Reducing LLM hallucinations with retrieved context
Setting up embedding pipelines and vector store infrastructure
Deploying hybrid search (dense + sparse/BM25)

Prerequisites

Python 3.10+ with pip
A vector database (Qdrant, Weaviate, Pinecone, or pgvector)
An embedding model (OpenAI, Cohere, or local via sentence-transformers)
An LLM endpoint (OpenAI API or self-hosted vLLM)
Docker for local vector DB deployment

Architecture Overview

Documents → Chunker → Embedder → Vector Store
                                 ↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer

Embedding Pipeline

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Local embedding model (no API cost)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Connect to Qdrant
client = QdrantClient("http://localhost:6333")

# Create collection (size=1024 matches the bge-large-en-v1.5 embedding dimension)
client.create_collection(
    collection_name="knowledge-base",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

def ingest_documents(docs: list[dict]):
    """Chunk, embed, and upsert documents."""
    points = []
    for doc in docs:
        # chunk_text is defined under Chunking Strategies
        chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
        embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding.tolist(),
                payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
            ))
    client.upsert(collection_name="knowledge-base", points=points)
    print(f"Ingested {len(points)} chunks")
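ingest_documents collects every point and sends them in a single upsert call, which can produce a very large request for a big corpus. A minimal sketch of splitting the list into batches follows; `batched` is a hypothetical helper (not part of qdrant-client), and the batch size of 64 is only an illustrative default.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 64) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Intended use inside ingest_documents (assumes `client` and `points` from the pipeline):
#   for batch in batched(points, 64):
#       client.upsert(collection_name="knowledge-base", points=batch)
print([list(b) for b in batched(list(range(5)), batch_size=2)])
```

Smaller batches keep each request bounded and make a failed upsert cheaper to retry.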

Chunking Strategies

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Recursive character splitter: best general-purpose strategy."""
    # chunk_size and overlap are measured in characters by default, not tokens
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# For code/markdown: use a language-aware splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
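To make the interaction between chunk size and overlap concrete, here is a dependency-free sketch of fixed-size chunking with overlap. It is deliberately simpler than RecursiveCharacterTextSplitter, which also tries to break on separators; `sliding_window_chunks` is a hypothetical helper, and sizes are in characters.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1))
```

Each window starts `chunk_size - overlap` characters after the previous one, so the last `overlap` characters of one chunk are repeated at the start of the next.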

Hybrid Search (Dense + Sparse)

from qdrant_client.models import (
    Fusion, FusionQuery, Prefetch, SparseVector, SparseVectorParams,
)
from fastembed import SparseTextEmbedding

# Qdrant hybrid collection (dense + sparse)
client.create_collection(
    collection_name="hybrid-kb",
    vectors_config={"dense": VectorParams(size=1024, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams()},
)

# SPLADE is a learned sparse model; here it fills the keyword-matching role that BM25 would play
sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    dense_vec = model.encode(query).tolist()
    sparse_vec = list(sparse_model.embed(query))[0]

    results = client.query_points(
        collection_name="hybrid-kb",
        prefetch=[
            Prefetch(query=dense_vec, using="dense", limit=20),
            Prefetch(
                query=SparseVector(indices=sparse_vec.indices.tolist(),
                                   values=sparse_vec.values.tolist()),
                using="sparse", limit=20,
            ),
        ],
        query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
        limit=top_k,
    )
    return [{"text": p.payload["text"], "score": p.score} for p in results.points]
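For intuition about the Reciprocal Rank Fusion step, the sketch below fuses two ranked ID lists in plain Python. It illustrates the general RRF formula, where each list contributes 1 / (k + rank) per document; it is not Qdrant's implementation. The function name, the sample IDs, and k=60 (a commonly used constant) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_k: int = 10
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


dense_hits = ["a", "b", "c"]   # hypothetical IDs from the dense prefetch
sparse_hits = ["c", "a", "d"]  # hypothetical IDs from the sparse prefetch
for doc_id, score in reciprocal_rank_fusion([dense_hits, sparse_hits]):
    print(doc_id, round(score, 4))
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main reason RRF is a convenient default for hybrid search.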

Reranking

import cohere

co = cohere.Client("your-api-key")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rerank retrieved chunks for relevance (improves RAG quality ~20-30%)."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]

# Alternative: local reranker (no API cost)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

RAG Query Pipeline

from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

def rag_query(user_question: str) -> str:
    # 1. Retrieve
    candidates = hybrid_search(user_question, top_k=20)
    texts = [c["text"] for c in candidates]

    # 2. Rerank
    top_chunks = local_rerank(user_question, texts, top_n=5)

    # 3. Generate
    context = "\n\n---\n\n".join(top_chunks)
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": (
                "Answer the question using only the provided context. "
                "If the answer isn't in the context, say so.\n\nContext:\n" + context
            )},
            {"role": "user", "content": user_question},
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return response.choices[0].message.content
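If answers should cite their sources, the context string can carry a numbered source label per chunk. The sketch below is one possible way to do that, under assumptions: `build_context` is a hypothetical helper, each chunk is a dict with `text` and `source` keys (the fields stored in the ingestion payload; hybrid_search as written returns only `text` and `score`), and the character budget is an arbitrary illustrative value.

```python
SEPARATOR = "\n\n---\n\n"


def build_context(chunks: list[dict], max_chars: int = 6000) -> str:
    """Join chunks into a numbered context block without exceeding max_chars."""
    parts: list[str] = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        cost = len(entry) + (len(SEPARATOR) if parts else 0)
        if used + cost > max_chars:
            break  # drop the remaining lower-ranked chunks instead of truncating one
        parts.append(entry)
        used += cost
    return SEPARATOR.join(parts)


sample = [
    {"text": "Qdrant listens on port 6333.", "source": "ops.md"},
    {"text": "Redis is used as a queue.", "source": "arch.md"},
]
print(build_context(sample))
```

The system prompt can then ask the model to reference the bracketed numbers, and the budget keeps the prompt short enough that the retrieved context is less likely to be ignored.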

Docker Compose: Full RAG Stack

services:
  qdrant:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant-data:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

  ingestion-worker:
    build: ./ingestion
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
    depends_on: [qdrant, redis]
    restart: unless-stopped

  rag-api:
    build: ./api
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      # The vllm host is not defined in this file; it must be reachable on the same network
      - LLM_BASE_URL=http://vllm:8000/v1
    depends_on: [qdrant]
    restart: unless-stopped

volumes:
  qdrant-data:
  redis-data:

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| Poor retrieval quality | Chunk size too large | Try 256–512 tokens; overlap 10–15% |
| LLM ignores retrieved context | Context too long | Rerank and keep top 3–5 chunks |
| Slow ingestion | Sequential embedding | Use batch_size=64 and async upserts |
| Stale documents | No re-ingestion pipeline | Track doc_hash; re-embed on change |
| High embedding costs | All chunks re-embedded | Cache embeddings with hash-based dedup |
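The last two fixes in the table, tracking a content hash and skipping chunks that were already embedded, can be sketched with the standard library alone. All names here are hypothetical, and the in-memory `seen_hashes` set stands in for whatever persistent store (for example Redis) a real pipeline would use.

```python
import hashlib
import uuid


def content_hash(text: str) -> str:
    """Stable SHA-256 hex digest of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def point_id_for(source: str, text: str) -> str:
    """Deterministic UUID: re-upserting an unchanged chunk overwrites instead of duplicating."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{content_hash(text)}"))


def chunks_to_embed(chunks: list[str], seen_hashes: set[str]) -> list[str]:
    """Return only chunks whose hash is new, recording each new hash in seen_hashes."""
    fresh = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(chunk)
    return fresh


seen: set[str] = set()
print(chunks_to_embed(["alpha", "beta", "alpha"], seen))  # first run embeds each unique chunk
print(chunks_to_embed(["beta", "gamma"], seen))           # later run embeds only new content
```

Using `point_id_for` in place of a random `uuid.uuid4()` makes ingestion idempotent: the same source and text always map to the same point ID.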

Best Practices

Use BAAI/bge-large-en-v1.5 or nomic-embed-text for strong free embeddings.
Always rerank before passing to LLM — 5 precise chunks beat 20 noisy ones.
Store source metadata (URL, page, section) in vector payloads for citations.
Use namespace/tenant isolation in the vector store for multi-tenant RAG.
Evaluate with RAGAS metrics: faithfulness, answer relevancy, context precision.

Related Skills

vector-database-ops - Qdrant/Weaviate management
vllm-server - Self-hosted LLM endpoint
ollama-stack - Local LLM for development
ai-pipeline-orchestration - Ingestion pipelines
