multimodal-rag

Build retrieval-augmented generation systems that handle images, text, and mixed content.

Install skill "multimodal-rag" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-multimodal-rag

Multimodal RAG

Overview

  • Image + text retrieval (product search, documentation)

  • Cross-modal search (text query -> image results)

  • Multimodal document processing (PDFs with charts)

  • Visual question answering with context

  • Image similarity and deduplication

  • Hybrid search pipelines

Architecture Approaches

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |
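
To make the hybrid row concrete: at ranking time, a joint-embedding (CLIP) similarity can be fused with a caption-based text similarity. A minimal sketch, assuming both scores are cosine similarities and using an illustrative, untuned weight:

def fuse_hybrid_scores(candidates: list[dict], alpha: float = 0.5) -> list[dict]:
    """Rank candidates carrying a CLIP score and a caption score (illustrative)."""
    for c in candidates:
        c["score"] = alpha * c["clip_score"] + (1 - alpha) * c["caption_score"]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)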

Embedding Models

| Model | Context | Modalities | Best For |
|---|---|---|---|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |

CLIP-Based Image Embeddings

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for a text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)

    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)

    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

# Cross-modal search: text -> images
import numpy as np

def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using a text query."""
    query_embedding = embed_text(query)

    # Compute cosine similarities (embeddings are already normalized)
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]

    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]

Voyage Multimodal-3 (Long Context)

import voyageai

client = voyageai.Client()

import base64

def embed_multimodal_voyage(
    texts: list[str] = None,
    images: list[str] = None  # File paths or URLs
) -> list[list[float]]:
    """Embed text and/or images with 32K token context."""
    inputs = []

    if texts:
        inputs.extend([{"type": "text", "content": t} for t in texts])

    if images:
        for img_path in images:
            with open(img_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
                inputs.append({
                    "type": "image",
                    "content": f"data:image/png;base64,{b64}"
                })

    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3"
    )

    return response.embeddings
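
A short usage sketch; the file path and strings are illustrative, and the input format above should be checked against your voyageai SDK version:

# Embed a mixed text+image document and a text-only query (illustrative inputs)
doc_embs = embed_multimodal_voyage(
    texts=["Quarterly revenue grew 12%, as shown in the chart."],
    images=["reports/q3_revenue_chart.png"]
)
query_emb = embed_multimodal_voyage(texts=["revenue growth chart"])[0]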

Hybrid RAG Pipeline

from typing import Optional
import numpy as np

class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: dict = None
    ):
        """Index a document with text and/or image."""
        embeddings = []

        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))

        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))

            # Option 2: Generate caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))

        # Store with shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {})
                }
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate a text caption for image indexing."""
        # Use GPT-4o or Claude for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context."
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10
    ) -> list[dict]:
        """Hybrid retrieval with an optional image query."""
        results = []

        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k
        )
        results.extend(text_results)

        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k
            )
            results.extend(img_results)

        # Dedupe by doc_id, keeping the highest score
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r

        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
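
A usage sketch for the class above. The vector store and vision model are placeholders: any async client exposing upsert/search and a wrapper exposing analyze(image, prompt) will do; the names and data here are hypothetical.

import asyncio

async def demo(vector_db, vision_model, text_model):
    rag = MultimodalRAG(vector_db, vision_model, text_model)

    # Index a product page with both text and an image
    await rag.index_document(
        doc_id="prod-42",
        text="Red trail-running shoe with a waterproof membrane.",
        image_path="images/prod-42.jpg",
        metadata={"category": "footwear"}
    )

    # Hybrid retrieval from a text query (optionally pass query_image=...)
    results = await rag.retrieve("waterproof red running shoes", top_k=5)
    for r in results:
        print(r["metadata"]["doc_id"], r["score"])

# asyncio.run(demo(my_vector_db, my_vision_model, my_text_model))  # clients built elsewhere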

Claude Code PDF Handling (CC 2.1.30+)

For large PDFs, use the pages parameter to process in batches:

Process large PDF in page-range batches for embedding

async def process_large_pdf_for_rag(pdf_path: str, pages_per_batch: int = 10):
    """Process a large PDF by page ranges before embedding."""
    import subprocess

    # Get total page count
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True, text=True
    )
    total_pages = int([l for l in result.stdout.split('\n')
                       if 'Pages:' in l][0].split(':')[1].strip())

    chunks = []
    for start in range(1, total_pages + 1, pages_per_batch):
        end = min(start + pages_per_batch - 1, total_pages)

        # Read page range (CC 2.1.30 pages parameter)
        # Read(file_path=pdf_path, pages=f"{start}-{end}")

        # Extract and embed content from this range
        page_chunks = extract_chunks_from_range(pdf_path, start, end)
        chunks.extend(page_chunks)

    return chunks
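
extract_chunks_from_range is referenced above but not defined in this skill. One possible implementation (an assumption, not part of the upstream code) pulls plain text per page with PyMuPDF over the same 1-indexed range:

def extract_chunks_from_range(pdf_path: str, start: int, end: int) -> list[str]:
    """Extract one plain-text chunk per page in a 1-indexed range (illustrative)."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    chunks = []
    for page_num in range(start - 1, min(end, doc.page_count)):
        text = doc[page_num].get_text().strip()
        if text:
            chunks.append(text)
    doc.close()
    return chunks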

Limits

  • Max 20 pages per Read request

  • Max 20MB file size

  • Process large documents in batches for embedding

Multimodal Document Chunking

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None

def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk a PDF while preserving images and tables."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    chunks = []

    for page_num, page in enumerate(doc):
        # Extract text and image blocks in reading order
        blocks = page.get_text("dict")["blocks"]
        current_text = ""

        for block in blocks:
            if block["type"] == 0:  # Text block
                block_text = "".join(
                    span["text"]
                    for line in block["lines"]
                    for span in line["spans"]
                )
                current_text += block_text + "\n"
            else:  # Image block
                # Save the current text chunk before the image
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num
                    ))
                    current_text = ""

                # Extract and save the embedded image bytes
                img_path = f"/tmp/page{page_num}_img{block['number']}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])

                # Generate a caption so the image is searchable by text
                caption = generate_image_caption(img_path)

                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path
                ))

        # Final text chunk for the page
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num
            ))

    return chunks
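
The chunker above calls generate_image_caption, which is not defined in this skill. A minimal sketch using the Anthropic Messages API; the model name reuses the one from the generation example below, and the media-type guess is an assumption:

import base64
import mimetypes
import anthropic

def generate_image_caption(image_path: str) -> str:
    """Caption an image for search indexing (illustrative sketch)."""
    media_type = mimetypes.guess_type(image_path)[0] or "image/png"  # assumed fallback
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    caption_client = anthropic.Anthropic()
    response = caption_client.messages.create(
        model="claude-opus-4-6",  # model name taken from the generation example below
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": media_type, "data": b64}},
                {"type": "text",
                 "text": "Describe this image in detail for search indexing. "
                         "Include objects, text, colors, and context."}
            ]
        }]
    )
    return response.content[0].text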

Vector Database Setup (Milvus)

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

def setup_multimodal_collection():
    """Create a Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")

    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64)
    ]

    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)

    # Create an HNSW index for cosine-similarity vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256}
    }
    collection.create_index("embedding", index_params)

    return collection
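
A brief usage sketch for inserting and querying the collection; the record values and search parameters are illustrative, and embed_text refers to the CLIP helper defined earlier (768-dimensional output):

collection = setup_multimodal_collection()

# Insert one chunk; column order matches the schema fields above
collection.insert([
    ["doc1_text"],                                          # id
    [embed_text("red waterproof trail-running shoe")],      # embedding (768-dim)
    ["doc1"],                                               # doc_id
    ["text"],                                               # chunk_type
    ["Red trail-running shoe with waterproof membrane."],   # content
    [""],                                                   # image_url
    [0],                                                    # page
])
collection.flush()
collection.load()

# Cosine search over the HNSW index
hits = collection.search(
    data=[embed_text("waterproof running shoes")],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5,
    output_fields=["doc_id", "chunk_type", "content"]
)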

Multimodal Generation

async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-6"
) -> str:
    """Generate a response using multimodal context."""
    content = []

    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data
                }
            })

    # Add text context
    text_context = "\n\n".join([
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    ])

    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context: {text_context}

Question: {query}

Provide a detailed answer based on the context and images provided."""
    })

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )

    return response.content[0].text
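
generate_with_context relies on encode_image_base64, which is not shown above. A minimal sketch, assuming the media type can be guessed from the file extension:

import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Return (base64_data, media_type) for an image file (illustrative)."""
    media_type = mimetypes.guess_type(image_path)[0] or "image/png"  # assumed fallback
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return data, media_type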

Key Decisions

| Decision | Recommendation |
|---|---|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multimodal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid search |

Common Mistakes

  • Embedding images without captions (limits text search)

  • Not deduplicating by document ID

  • Missing image URL storage (can't display results)

  • Using only image OR text embeddings (use both)

  • Ignoring chunk boundaries (splitting mid-paragraph)

  • Not validating image retrieval quality

Related Skills

  • vision-language-models: Image analysis

  • embeddings: Text embedding patterns

  • rag-retrieval: Text RAG patterns

  • contextual-retrieval: Hybrid BM25+vector

Capability Details

image-embeddings

Keywords: CLIP, image embedding, visual features, SigLIP

Solves:

  • Convert images to vector representations

  • Enable image similarity search

  • Cross-modal retrieval

cross-modal-search

Keywords: text to image, image to text, cross-modal

Solves:

  • Find images from text queries

  • Find text from image queries

  • Bridge modalities

multimodal-chunking

Keywords: chunk PDF, split document, extract images

Solves:

  • Process documents with mixed content

  • Preserve image-text relationships

  • Handle tables and charts

hybrid-retrieval

Keywords: hybrid search, fusion, multi-embedding

Solves:

  • Combine text and image search

  • Improve retrieval accuracy

  • Handle diverse queries
