rag-infrastructure

Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with: `npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-rag-infrastructure`

RAG Infrastructure

When to Use This Skill

Use this skill when:

  • Building a knowledge base Q&A system over internal documents

  • Implementing semantic search over large document collections

  • Reducing LLM hallucinations with retrieved context

  • Setting up embedding pipelines and vector store infrastructure

  • Deploying hybrid search (dense + sparse/BM25)

Prerequisites

  • Python 3.10+ with pip

  • A vector database (Qdrant, Weaviate, Pinecone, or pgvector)

  • An embedding model (OpenAI, Cohere, or local via sentence-transformers)

  • An LLM endpoint (OpenAI API or self-hosted vLLM)

  • Docker for local vector DB deployment

Architecture Overview

```
Documents → Chunker → Embedder → Vector Store
                                      ↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer
```

Embedding Pipeline

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Local embedding model (no API cost)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Connect to Qdrant
client = QdrantClient("http://localhost:6333")

# Create collection (1024 dims matches bge-large-en-v1.5)
client.create_collection(
    collection_name="knowledge-base",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

def ingest_documents(docs: list[dict]):
    """Chunk, embed, and upsert documents."""
    points = []
    for doc in docs:
        chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
        embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding.tolist(),
                payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
            ))
    client.upsert(collection_name="knowledge-base", points=points)
    print(f"Ingested {len(points)} chunks")
```
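One caveat with the pipeline above: every chunk gets a random `uuid.uuid4()`, so re-ingesting the same document duplicates its points. A small refinement (a sketch, not part of the upstream skill; `chunk_point_id` is a hypothetical helper) derives the point ID from a content hash, making `upsert` idempotent:

```python
import hashlib
import uuid

def chunk_point_id(source: str, chunk: str) -> str:
    # Hash source + chunk content so the same input always yields the same ID;
    # re-running ingestion then overwrites points instead of duplicating them.
    digest = hashlib.sha256(f"{source}\x00{chunk}".encode("utf-8")).hexdigest()
    # Qdrant accepts UUIDs as point IDs, so fold the hash into a stable UUIDv5.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))
```

Passing `id=chunk_point_id(doc["source"], chunk)` in place of `uuid.uuid4()` also helps with stale-document re-ingestion.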

Chunking Strategies

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Recursive character splitter — best general-purpose strategy."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# For code/markdown — use a language-aware splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
```
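To make the `chunk_size`/`overlap` semantics concrete without any dependencies, a bare sliding-window chunker might look like this (`simple_chunks` is an illustrative sketch; the recursive splitter above is smarter because it prefers natural boundaries like paragraphs and sentences):

```python
def simple_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous,
    # so consecutive chunks share `overlap` characters of context.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```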

Hybrid Search (Dense + Sparse)

```python
from qdrant_client.models import (
    SparseVector, SparseVectorParams, Prefetch, FusionQuery, Fusion,
)
from fastembed import SparseTextEmbedding

# Qdrant hybrid collection: dense vectors + learned sparse (SPLADE) vectors
client.create_collection(
    collection_name="hybrid-kb",
    vectors_config={"dense": VectorParams(size=1024, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams()},
)

sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    dense_vec = model.encode(query).tolist()
    sparse_vec = list(sparse_model.embed(query))[0]

    results = client.query_points(
        collection_name="hybrid-kb",
        prefetch=[
            Prefetch(query=dense_vec, using="dense", limit=20),
            Prefetch(
                query=SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist(),
                ),
                using="sparse",
                limit=20,
            ),
        ],
        query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
        limit=top_k,
    )
    return [{"text": p.payload["text"], "score": p.score} for p in results.points]
```
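Qdrant performs the RRF fusion server-side, but the formula is simple enough to sketch in plain Python for intuition (`rrf_fuse` is a hypothetical helper, not part of the skill): each result list contributes `1 / (k + rank)` per document, so items ranked well in both lists win.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion with 1-based ranks; k=60 is the
    # conventional damping constant from the original RRF paper.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```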

Reranking

```python
import cohere

co = cohere.Client("your-api-key")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rerank retrieved chunks for relevance (improves RAG quality ~20-30%)."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]
```

Alternative: local reranker (no API cost)

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```

RAG Query Pipeline

```python
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

def rag_query(user_question: str) -> str:
    # 1. Retrieve
    candidates = hybrid_search(user_question, top_k=20)
    texts = [c["text"] for c in candidates]

    # 2. Rerank
    top_chunks = local_rerank(user_question, texts, top_n=5)

    # 3. Generate
    context = "\n\n---\n\n".join(top_chunks)
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": (
                "Answer the question using only the provided context. "
                "If the answer isn't in the context, say so.\n\nContext:\n" + context
            )},
            {"role": "user", "content": user_question},
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return response.choices[0].message.content
```
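The pipeline above concatenates chunks anonymously, but the ingestion payloads already carry `source` and `title`. A small sketch of a context builder (`build_context` is a hypothetical helper, not part of the skill) numbers each chunk and surfaces its source so the model can answer with bracketed citations:

```python
def build_context(chunks: list[dict]) -> str:
    # Number each chunk and expose its source so the LLM can emit
    # citations like [1] that map back to real documents.
    blocks = [
        f"[{i}] ({c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    return "\n\n---\n\n".join(blocks)
```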

Docker Compose: Full RAG Stack

```yaml
services:
  qdrant:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant-data:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

  ingestion-worker:
    build: ./ingestion
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
    depends_on: [qdrant, redis]
    restart: unless-stopped

  rag-api:
    build: ./api
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      - LLM_BASE_URL=http://vllm:8000/v1
    depends_on: [qdrant]
    restart: unless-stopped

volumes:
  qdrant-data:
  redis-data:
```

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| Poor retrieval quality | Chunk size too large | Try 256–512 tokens; overlap 10–15% |
| LLM ignores retrieved context | Context too long | Rerank and keep top 3–5 chunks |
| Slow ingestion | Sequential embedding | Use `batch_size=64` and async upserts |
| Stale documents | No re-ingestion pipeline | Track `doc_hash`; re-embed on change |
| High embedding costs | All chunks re-embedded | Cache embeddings with hash-based dedup |
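The hash-based dedup fix can be sketched as an in-memory cache keyed by content hash; a production variant might back the dict with the Redis service from the compose stack. `EmbeddingCache` and its wiring are illustrative, not part of the upstream skill:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged chunks are never re-embedded."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # e.g. lambda t: model.encode(t).tolist()
        self._store: dict[str, list[float]] = {}
        self.hits = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self.embed_fn(text)   # cache miss: compute once
        else:
            self.hits += 1
        return self._store[key]
```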

Best Practices

  • Use BAAI/bge-large-en-v1.5 or nomic-embed-text for strong free embeddings.

  • Always rerank before passing context to the LLM — 5 precise chunks beat 20 noisy ones.

  • Store source metadata (URL, page, section) in vector payloads for citations.

  • Use namespace/tenant isolation in the vector store for multi-tenant RAG.

  • Evaluate with RAGAS metrics: faithfulness, answer relevancy, context precision.

Related Skills

  • vector-database-ops - Qdrant/Weaviate management

  • vllm-server - Self-hosted LLM endpoint

  • ollama-stack - Local LLM for development

  • ai-pipeline-orchestration - Ingestion pipelines

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
