Vision Language Models
Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.
Overview
- Image captioning and description generation
- Visual question answering (VQA)
- Document/chart/diagram analysis with OCR
- Multi-image comparison and reasoning
- Bounding box detection and region analysis
- Video frame analysis
Model Comparison (January )
| Model | Context | Strengths | Vision Input |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning, multimodal | Up to 10 images |
| Claude Opus 4.6 | 1M | Best coding, sustained agent tasks, adaptive thinking | Up to 100 images |
| Gemini 2.5 Pro | 1M+ | Longest context, video analysis | 3,600 images max |
| Gemini 3 Pro | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| Grok 4 | 2M | Real-time X integration, DeepSearch | Images + upcoming video |
Image Input Methods
Base64 Encoding (All Providers)
```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode a local image to base64 and return it with its MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"
    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")
    return base64_data, mime_type
```
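`mimetypes.guess_type` relies on the file extension, so extensionless or misnamed files silently fall back to `image/png`. A hypothetical fallback (the function name and format list below are illustrative, not part of any SDK) is to sniff the magic bytes instead:

```python
def sniff_image_mime(data: bytes) -> str:
    """Guess an image MIME type from magic bytes (illustrative subset of formats)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return "image/png"  # same default as encode_image_base64
```

This also doubles as the format validation step recommended under Common Mistakes below.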
OpenAI GPT-5/4o Vision
```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze an image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)
    response = client.chat.completions.create(
        model="gpt-5.2",  # or "gpt-4.1" for cost optimization
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}",
                    "detail": "high"  # "low", "high", or "auto"
                }}
            ]
        }],
        max_tokens=4096  # required for vision
    )
    return response.choices[0].message.content
```
Claude 4.5 Vision (Anthropic)
```python
import anthropic

client = anthropic.Anthropic()

def analyze_image_claude(image_path: str, prompt: str) -> str:
    """Analyze an image using Claude Opus 4.6 or Sonnet 4.5."""
    base64_data, media_type = encode_image_base64(image_path)
    response = client.messages.create(
        model="claude-opus-4-6",  # or "claude-sonnet-4-5"
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
```
Gemini 2.5/3 Vision (Google)
```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def analyze_image_gemini(image_path: str, prompt: str) -> str:
    """Analyze an image using Gemini 2.5 Pro or Gemini 3."""
    model = genai.GenerativeModel("gemini-2.5-pro")  # or "gemini-3-pro"
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    return response.text

# For video analysis (where Gemini excels):
def analyze_video_gemini(video_path: str, prompt: str) -> str:
    """Analyze a video using Gemini's native video support."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    video_file = genai.upload_file(video_path)
    response = model.generate_content([prompt, video_file])
    return response.text
```
Grok 4 Vision (xAI)
```python
from openai import OpenAI  # Grok uses an OpenAI-compatible API

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def analyze_image_grok(image_path: str, prompt: str) -> str:
    """Analyze an image using Grok 4 with real-time capabilities."""
    base64_data, mime_type = encode_image_base64(image_path)
    response = client.chat.completions.create(
        model="grok-4",  # or "grok-2-vision-1212"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content
```
Multi-Image Analysis
```python
def compare_images(images: list[str], prompt: str) -> str:
    """Compare multiple images in one request (Claude supports up to 100)."""
    content = []
    for img_path in images:
        base64_data, media_type = encode_image_base64(img_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }
        })
    content.append({"type": "text", "text": prompt})
    # Uses the anthropic.Anthropic() client from the Claude section above
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
Object Detection (Gemini 2.5+)
```python
import json

def detect_objects_gemini(image_path: str) -> list[dict]:
    """Detect objects with bounding boxes using Gemini 2.5+."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    image = Image.open(image_path)
    response = model.generate_content([
        "Detect all objects in this image. Return bounding boxes "
        'as JSON with format: {"objects": [{"label": ..., "box": [x1, y1, x2, y2]}]}',
        image
    ])
    # Models often wrap JSON in markdown fences; strip them before parsing
    text = response.text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text).get("objects", [])
```
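Gemini commonly reports box coordinates normalized to a 0-1000 grid rather than pixels. Assuming the `[x1, y1, x2, y2]` order requested in the prompt above, a small sketch to map a normalized box back to pixel coordinates:

```python
def scale_box(box: list[float], width: int, height: int) -> list[int]:
    """Scale a [x1, y1, x2, y2] box from a 0-1000 normalized grid to pixels."""
    x1, y1, x2, y2 = box
    return [
        int(x1 / 1000 * width),   # x values scale with image width
        int(y1 / 1000 * height),  # y values scale with image height
        int(x2 / 1000 * width),
        int(y2 / 1000 * height),
    ]
```

Verify the coordinate convention against the model's actual output before relying on it; some responses use `[ymin, xmin, ymax, xmax]` order instead.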
Token Cost Optimization
| Provider | Detail Level | Cost Impact |
|---|---|---|
| OpenAI | low (65 tokens) | Use for classification |
| OpenAI | high (129+ tokens/tile) | Use for OCR/charts |
| Gemini | 258 tokens base | Scales with resolution |
| Claude | Per-image pricing | Batch for efficiency |
```python
# Cost-optimized simple classification
response = client.chat.completions.create(
    model="gpt-5.2-mini",  # cheaper for simple tasks
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a person? Reply: yes/no"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low"  # minimal tokens
            }}
        ]
    }]
)
```
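The OpenAI rows of the cost table can be folded into a rough pre-flight estimator. The 512px tiling and exact per-tile figures vary by model, so treat this as an approximation under the table's numbers, not billing-grade math:

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate OpenAI vision token cost using the figures from the table above."""
    if detail == "low":
        return 65  # flat cost regardless of resolution
    # High detail: base cost plus a per-tile cost for each 512px tile
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 65 + 129 * tiles
```

This is useful for deciding between `"low"` and `"high"` before sending a batch.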
Image Size Limits

| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | Grok 5 expands this |
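To stay under these caps (and the 2048px resize guidance in Common Mistakes below), compute the target dimensions before resizing. A minimal sketch of the scaling math, independent of any imaging library:

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return dimensions scaled so the longest side is at most max_side."""
    # Never upscale: cap the scale factor at 1.0
    scale = min(1.0, max_side / max(width, height))
    return int(width * scale), int(height * scale)
```

The resulting dimensions can then be passed to e.g. Pillow's `Image.resize`.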
Key Decisions
| Decision | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
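The table above can be encoded as a simple router. The model identifiers below mirror the ones used elsewhere in this document; check them against current provider docs before use:

```python
# Maps a use case from the Key Decisions table to a default model identifier
RECOMMENDED_MODELS = {
    "high_accuracy": "claude-opus-4-6",
    "long_documents": "gemini-2.5-pro",
    "cost_efficiency": "gemini-2.5-flash",
    "realtime_x_data": "grok-4",
    "video": "gemini-2.5-pro",
}

def pick_model(need: str) -> str:
    """Return the recommended model for a use case, defaulting to GPT-5.2."""
    return RECOMMENDED_MODELS.get(need, "gpt-5.2")
```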
Common Mistakes
- Not setting max_tokens (responses get truncated)
- Sending oversized images (resize to 2048px max)
- Using high detail for yes/no questions
- Not validating image format before encoding
- Ignoring rate limits on vision endpoints
- Using deprecated models (GPT-4V is retired)
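For the rate-limit point, a generic retry-with-exponential-backoff wrapper is a reasonable default. This is a sketch; in real code, catch the SDK-specific error (e.g. `openai.RateLimitError`) rather than bare `Exception`:

```python
import time

def with_backoff(call, retries: int = 3, base_delay: float = 1.0):
    """Retry a zero-argument callable, doubling the delay between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

Usage: `with_backoff(lambda: analyze_image_openai(path, prompt))`.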
Limitations
- Cannot identify specific people (privacy restriction)
- May hallucinate on low-quality or rotated images (<200px)
- GPT-5: may struggle with precise spatial reasoning on edge cases
- No real-time video (use frame extraction, except Gemini)
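For providers without native video input, the usual workaround is to sample frames and send them as a multi-image request. Picking evenly spaced timestamps is the pure-math part; the actual frame extraction would use a tool such as ffmpeg:

```python
def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Evenly spaced timestamps (in seconds) for sampling n_frames from a video."""
    if n_frames <= 1:
        return [0.0]
    # Include both the first and last frame of the clip
    step = duration_s / (n_frames - 1)
    return [round(i * step, 3) for i in range(n_frames)]
```

Keep the frame count within the per-request image limits listed above (e.g. 10 for OpenAI).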
Related Skills
- audio-language-models: Audio/speech processing
- multimodal-rag: Image + text retrieval
- llm-streaming: Streaming vision responses
Capability Details
image-captioning
Keywords: caption, describe, image description, alt text, accessibility
Solves:
- Generate descriptive captions for images
- Create accessibility alt text
- Extract visual content summary
visual-qa
Keywords: VQA, visual question, image question, analyze image
Solves:
- Answer questions about image content
- Extract specific information from visuals
- Reason about image elements
document-vision
Keywords: document, PDF, chart, diagram, OCR, extract, table
Solves:
- Extract text from documents and charts
- Analyze diagrams and flowcharts
- Process forms and tables with structure
Claude Code PDF Handling (CC 2.1.30+)
Read Tool Pages Parameter
For large PDFs (>10 pages), use the `pages` parameter to read specific ranges:
```python
# Read the first 5 pages of a large PDF
Read(file_path="/path/to/document.pdf", pages="1-5")

# Read a specific page
Read(file_path="/path/to/document.pdf", pages="10")

# Read a range in the middle
Read(file_path="/path/to/document.pdf", pages="15-25")
```
Large PDF Strategy
For documents >100 pages, process incrementally:
```python
# 1. Initial scan: read the first pages for structure
Read(file_path=pdf_path, pages="1-5")

# 2. Identify key sections from the TOC/headers

# 3. Read the relevant sections
Read(file_path=pdf_path, pages="45-55")  # e.g., the "Implementation" section

# 4. Process remaining sections as needed
Read(file_path=pdf_path, pages="80-90")  # e.g., the "Appendix" section
```
Limits
| Constraint | Value |
|---|---|
| Max pages per request | 20 |
| Max file size | 20MB |
| Large PDF threshold | 10 pages (returns a lightweight reference if @-mentioned) |
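Given the 20-page cap, a helper that splits a large range into valid `pages` strings (the format mirrors the `Read` examples above) keeps incremental processing simple:

```python
def page_chunks(start: int, end: int, max_pages: int = 20) -> list[str]:
    """Split an inclusive page range into pages-parameter strings of <= max_pages."""
    chunks = []
    page = start
    while page <= end:
        last = min(page + max_pages - 1, end)
        # Single pages render as "10"; ranges as "1-20"
        chunks.append(f"{page}-{last}" if last > page else str(page))
        page = last + 1
    return chunks
```

For example, a 45-page document yields three requests: `"1-20"`, `"21-40"`, `"41-45"`.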
multi-image-analysis
Keywords: compare images, multiple images, image comparison, batch
Solves:
- Compare visual elements across images
- Track changes between versions
- Analyze image sequences
object-detection
Keywords: bounding box, detect objects, locate, segmentation
Solves:
- Detect and locate objects in images
- Generate bounding box coordinates
- Segment image regions (Gemini 2.5+)