reranking-patterns

Reranking Patterns

Improve search precision by re-scoring retrieved documents with more powerful models.

Overview

Reranking is useful for:
Improving precision after initial retrieval
Cases where bi-encoder embeddings miss semantic nuance
Combining multiple relevance signals
Production RAG systems requiring high accuracy

Why Rerank?

Initial retrieval (bi-encoder) prioritizes speed over accuracy:

Bi-encoder: Embeds query and docs separately → fast but approximate
Cross-encoder/LLM: Processes query+doc together → slow but accurate

Solution: Retrieve many candidates with the fast model (e.g., top 50), rerank them with the accurate model, and keep only the best few (e.g., top 10)
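The two-stage idea above can be sketched without any model at all. In the minimal example below, `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical names introduced for illustration; the two scoring functions are toy stand-ins for a bi-encoder and a cross-encoder, not real models.

```python
from typing import Callable

Doc = dict  # Assumption: each document is a dict with at least a "content" key


def retrieve_then_rerank(
    query: str,
    corpus: list[Doc],
    cheap_score: Callable[[str, Doc], float],
    precise_score: Callable[[str, Doc], float],
    retrieve_k: int = 50,
    final_k: int = 10,
) -> list[Doc]:
    """Stage 1 keeps the retrieve_k best docs by the cheap score.
    Stage 2 re-scores only those candidates with the expensive scorer."""
    candidates = sorted(
        corpus, key=lambda d: cheap_score(query, d), reverse=True
    )[:retrieve_k]
    rescored = [{**d, "score": precise_score(query, d)} for d in candidates]
    rescored.sort(key=lambda d: d["score"], reverse=True)
    return rescored[:final_k]


# Toy stand-in for a bi-encoder: counts shared words, ignores word order.
def word_overlap(query: str, doc: Doc) -> float:
    query_words = set(query.lower().split())
    doc_words = set(doc["content"].lower().split())
    return float(len(query_words & doc_words))


# Toy stand-in for a cross-encoder: looks at query and doc together.
def phrase_match(query: str, doc: Doc) -> float:
    return 1.0 if query.lower() in doc["content"].lower() else 0.0
```

The expensive scorer runs on `retrieve_k` documents only, so its cost stays fixed as the corpus grows.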

Pattern 1: Cross-Encoder Reranking

from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        documents: list[dict],
        top_k: int = 10,
    ) -> list[dict]:
        """Rerank documents using cross-encoder."""

        # Create query-document pairs
        pairs = [(query, doc["content"]) for doc in documents]

        # Score all pairs
        scores = self.model.predict(pairs)

        # Sort by score
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        # Return top-k with updated scores
        return [
            {**doc, "score": float(score)}
            for doc, score in scored_docs[:top_k]
        ]
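Cross-encoder scores are not guaranteed to lie in [0, 1]; depending on the model and its activation function they can be raw logits. If the scores will later be mixed with other signals (as in Pattern 4), one option is to rescale them per candidate list first. The helper below is a minimal sketch; `min_max_normalize` is a hypothetical name, not part of sentence-transformers.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] within a single candidate list."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tied: no ordering information, so use a neutral value.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```

Min-max scaling is relative to each candidate list, so the normalized values are comparable within one query but not across queries.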

Pattern 2: LLM Reranking (Batch)

from openai import AsyncOpenAI

async def llm_rerank(
    query: str,
    documents: list[dict],
    llm: AsyncOpenAI,
    top_k: int = 10,
) -> list[dict]:
    """Rerank using LLM relevance scoring."""

    # Build prompt with all candidates
    docs_text = "\n\n".join([
        f"[Doc {i+1}]\n{doc['content'][:300]}..."
        for i, doc in enumerate(documents)
    ])

    response = await llm.chat.completions.create(
        model="gpt-5.2-mini",  # Fast, cheap
        messages=[
            {"role": "system", "content": """
Rate each document's relevance to the query (0.0-1.0).
Output one score per line, in order:
0.95
0.72
0.45
..."""},
            {"role": "user", "content": f"Query: {query}\n\nDocuments:\n{docs_text}"}
        ],
        temperature=0,
    )

    # Parse scores
    scores = parse_scores(response.choices[0].message.content, len(documents))

    # Sort and return
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [
        {**doc, "score": score}
        for doc, score in scored_docs[:top_k]
    ]

def parse_scores(response: str, expected_count: int) -> list[float]:
    """Parse LLM response into scores."""
    scores = []
    for line in response.strip().split("\n"):
        if not line.strip():
            continue  # Skip blank lines so scores stay aligned with documents
        try:
            score = float(line.strip())
            scores.append(max(0.0, min(1.0, score)))
        except ValueError:
            scores.append(0.5)  # Default on parse error

    # Pad if needed
    while len(scores) < expected_count:
        scores.append(0.5)

    return scores[:expected_count]
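LLMs do not always follow the "one bare number per line" format; they may prefix each score with a label such as `[Doc 1]:` or `2.`. A more tolerant parser might look like the sketch below. `parse_scores_tolerant` is a hypothetical variant, and taking the last number on each line is a heuristic that assumes any label comes before the score.

```python
import re

# Optional sign, optional integer part, optional decimal point, digits.
_NUMBER = re.compile(r"-?\d*\.?\d+")


def parse_scores_tolerant(
    response: str,
    expected_count: int,
    default: float = 0.5,
) -> list[float]:
    """Use the last number on each non-empty line as that document's score."""
    scores: list[float] = []
    for line in response.strip().splitlines():
        if not line.strip():
            continue  # Blank lines carry no score
        numbers = _NUMBER.findall(line)
        value = float(numbers[-1]) if numbers else default
        scores.append(max(0.0, min(1.0, value)))  # Clamp to [0, 1]

    # Pad with the default if the model returned too few lines
    scores += [default] * (expected_count - len(scores))
    return scores[:expected_count]
```

Lines with no number at all fall back to the neutral default, mirroring the behavior of `parse_scores`.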

Pattern 3: Cohere Rerank API

import cohere

class CohereReranker:
    def __init__(self, api_key: str):
        self.client = cohere.Client(api_key)

    def rerank(
        self,
        query: str,
        documents: list[dict],
        top_k: int = 10,
    ) -> list[dict]:
        """Rerank using Cohere's rerank API."""

        results = self.client.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=[doc["content"] for doc in documents],
            top_n=top_k,
        )

        # Results arrive sorted by relevance; r.index points back into documents
        return [
            {**documents[r.index], "score": r.relevance_score}
            for r in results.results
        ]

Pattern 4: Combined Scoring

Combine multiple signals with weighted average:

from dataclasses import dataclass

@dataclass
class ReRankScore:
    doc_id: str
    base_score: float     # Original retrieval score
    llm_score: float      # LLM relevance score
    recency_score: float  # Metadata-based (e.g., freshness)
    final_score: float

def combined_rerank(
    documents: list[dict],
    llm_scores: dict[str, float],
    alpha: float = 0.3,  # Base weight
    beta: float = 0.5,   # LLM weight
    gamma: float = 0.2,  # Recency weight
) -> list[dict]:
    """Combine multiple scoring signals."""

    scored = []
    for doc in documents:
        base = doc.get("score", 0.5)
        llm = llm_scores.get(doc["id"], 0.5)
        # calculate_recency_score: helper expected to map a timestamp to [0, 1]
        recency = calculate_recency_score(doc.get("created_at"))

        # Weights sum to 1.0, so the result stays in [0, 1] when all inputs do
        final = (alpha * base) + (beta * llm) + (gamma * recency)

        scored.append({
            **doc,
            "score": final,
            "score_components": {
                "base": base,
                "llm": llm,
                "recency": recency,
            }
        })

    scored.sort(key=lambda x: x["score"], reverse=True)
    return scored
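`combined_rerank` calls `calculate_recency_score`, which this document does not define. One plausible shape for it is exponential decay with a half-life, sketched below. The 30-day half-life, the `now` parameter, the neutral 0.5 default, and the assumption that `created_at` is a `datetime` (with naive values treated as UTC) are all illustrative choices, not part of the original.

```python
from datetime import datetime, timezone
from typing import Optional


def _as_utc(ts: datetime) -> datetime:
    """Assumption: naive timestamps are already in UTC."""
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def calculate_recency_score(
    created_at: Optional[datetime],
    half_life_days: float = 30.0,
    now: Optional[datetime] = None,
) -> float:
    """Exponential decay: 1.0 for a brand-new doc, 0.5 after one half-life."""
    if created_at is None:
        return 0.5  # Neutral default, matching the other signals' fallbacks
    current = _as_utc(now) if now else datetime.now(timezone.utc)
    age_seconds = (current - _as_utc(created_at)).total_seconds()
    age_days = max(0.0, age_seconds / 86400)  # Future dates count as age 0
    return 0.5 ** (age_days / half_life_days)
```

With the default weights, a 30-day-old document with `base = 0.8` and `llm = 0.9` would score 0.3 × 0.8 + 0.5 × 0.9 + 0.2 × 0.5 = 0.79.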

Complete Reranking Service

import asyncio

class ReRankingService:
    def __init__(
        self,
        llm: AsyncOpenAI,
        timeout_seconds: float = 5.0,
    ):
        self.llm = llm
        self.timeout = timeout_seconds

    async def rerank(
        self,
        query: str,
        documents: list[dict],
        top_k: int = 10,
    ) -> list[dict]:
        """Rerank with timeout and fallback."""

        # Nothing to cut, so skip the LLM call
        if len(documents) <= top_k:
            return documents

        try:
            # asyncio.timeout requires Python 3.11+
            async with asyncio.timeout(self.timeout):
                return await llm_rerank(
                    query, documents, self.llm, top_k
                )
        except TimeoutError:
            # Fallback: return by original score
            return sorted(
                documents,
                key=lambda x: x.get("score", 0),
                reverse=True
            )[:top_k]
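`asyncio.timeout` is only available from Python 3.11. On older versions, `asyncio.wait_for` gives a similar deadline-and-fallback behavior. The sketch below shows that variant; `rerank_with_fallback` and the deliberately slow `slow_reranker` stub are hypothetical names used so the fallback path can be exercised without an LLM.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: a reranker is any async callable (query, documents, top_k) -> documents
Reranker = Callable[[str, list[dict], int], Awaitable[list[dict]]]


async def rerank_with_fallback(
    reranker: Reranker,
    query: str,
    documents: list[dict],
    top_k: int = 10,
    timeout_seconds: float = 5.0,
) -> list[dict]:
    """Run the reranker under a deadline; fall back to the base ranking."""
    try:
        return await asyncio.wait_for(
            reranker(query, documents, top_k), timeout_seconds
        )
    except asyncio.TimeoutError:
        # Reranker was cancelled; sort by the original retrieval score instead
        return sorted(
            documents, key=lambda d: d.get("score", 0), reverse=True
        )[:top_k]


async def slow_reranker(query: str, documents: list[dict], top_k: int) -> list[dict]:
    """Stand-in for a slow LLM call (illustrative only)."""
    await asyncio.sleep(1.0)
    return documents[:top_k]
```

Calling `asyncio.run(rerank_with_fallback(slow_reranker, "q", docs, top_k=2, timeout_seconds=0.05))` exercises the fallback branch, because the stub sleeps longer than the deadline.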

Model Selection Guide

| Model | Latency | Cost | Quality |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | ~50ms | Free | Good |
| BAAI/bge-reranker-large | ~100ms | Free | Better |
| cohere rerank-english-v3.0 | ~200ms | $1/1K | Best |
| gpt-5.2-mini (LLM) | ~500ms | $0.15/1M | Great |

Best Practices

Retrieve more, rerank less: Retrieve 50-100, rerank to 10
Truncate content: 200-400 chars per doc for LLM reranking
Set timeouts: Always fall back to base ranking
Cache scores: Same query+doc pair = same score
Batch when possible: One LLM call for all docs
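The "cache scores" practice above could be implemented with a small in-memory cache keyed by a hash of the query and document text. `ScoreCache` and its `misses` counter are hypothetical and shown only as a sketch; a production system might use an external store with expiry instead.

```python
import hashlib
from typing import Callable


class ScoreCache:
    """In-memory cache keyed by a hash of the query and document text."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}
        self.misses = 0  # Counts how often the scorer actually ran

    @staticmethod
    def _key(query: str, content: str) -> str:
        # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        raw = f"{query}\x00{content}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(
        self,
        query: str,
        content: str,
        scorer: Callable[[str, str], float],
    ) -> float:
        key = self._key(query, content)
        if key not in self._scores:
            self.misses += 1
            self._scores[key] = scorer(query, content)
        return self._scores[key]
```

This fits per-pair scorers such as a cross-encoder best. With batched LLM scoring, a document's score may depend on the other documents in the prompt, so cached values should be treated as approximate there.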

Related Skills

rag-retrieval: Core RAG pipeline that reranking enhances
contextual-retrieval: Contextual embeddings combined with reranking for best results
embeddings: Bi-encoder embeddings for initial retrieval before reranking
llm-evaluation: Evaluation patterns for measuring reranking quality

Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Retrieve/rerank ratio | Retrieve 50-100, rerank to 10 | Balance coverage and precision |
| Default reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 | Good quality, free, fast (~50ms) |
| LLM reranking | Batch all docs in one call | Reduces latency vs per-doc calls |
| Timeout handling | Fall back to base ranking | Graceful degradation on slow reranking |

References

Cohere Rerank
Sentence Transformers Cross-Encoders
BGE Reranker
