Vision Language Models
Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.
Overview
- Image captioning and description generation
- Visual question answering (VQA)
- Document/chart/diagram analysis with OCR
- Multi-image comparison and reasoning
- Bounding box detection and region analysis
- Video frame analysis
Model Comparison (January )
| Model | Context | Strengths | Vision Input |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning, multimodal | Up to 10 images |
| Claude Opus 4.6 | 1M | Best coding, sustained agent tasks, adaptive thinking | Up to 100 images |
| Gemini 2.5 Pro | 1M+ | Longest context, video analysis | 3,600 images max |
| Gemini 3 Pro | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| Grok 4 | 2M | Real-time X integration, DeepSearch | Images + upcoming video |
Image Input Methods
Base64 Encoding (All Providers)
```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode a local image to base64 and return it with its MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"
    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")
    return base64_data, mime_type
```
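`mimetypes.guess_type` relies on the file extension, so extensionless or misnamed files silently fall back to `image/png`. A hypothetical fallback (the function name and format list below are illustrative, not part of any SDK) is to sniff the magic bytes instead:

```python
def sniff_image_mime(data: bytes) -> str:
    """Guess an image MIME type from magic bytes (illustrative subset of formats)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return "image/png"  # same default as encode_image_base64
```

This also doubles as the format validation step recommended under Common Mistakes below.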
OpenAI GPT-5/4o Vision
```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze an image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)
    response = client.chat.completions.create(
        model="gpt-5.2",  # or "gpt-4.1" for cost optimization
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}",
                    "detail": "high"  # "low", "high", or "auto"
                }}
            ]
        }],
        max_tokens=4096  # required for vision
    )
    return response.choices[0].message.content
```
Claude 4.5 Vision (Anthropic)
```python
import anthropic

client = anthropic.Anthropic()

def analyze_image_claude(image_path: str, prompt: str) -> str:
    """Analyze an image using Claude Opus 4.6 or Sonnet 4.5."""
    base64_data, media_type = encode_image_base64(image_path)
    response = client.messages.create(
        model="claude-opus-4-6",  # or "claude-sonnet-4-5"
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
```
Gemini 2.5/3 Vision (Google)
```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def analyze_image_gemini(image_path: str, prompt: str) -> str:
    """Analyze an image using Gemini 2.5 Pro or Gemini 3."""
    model = genai.GenerativeModel("gemini-2.5-pro")  # or "gemini-3-pro"
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    return response.text

# For video analysis (where Gemini excels):
def analyze_video_gemini(video_path: str, prompt: str) -> str:
    """Analyze a video using Gemini's native video support."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    video_file = genai.upload_file(video_path)
    response = model.generate_content([prompt, video_file])
    return response.text
```
Grok 4 Vision (xAI)
```python
from openai import OpenAI  # Grok uses an OpenAI-compatible API

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def analyze_image_grok(image_path: str, prompt: str) -> str:
    """Analyze an image using Grok 4 with real-time capabilities."""
    base64_data, mime_type = encode_image_base64(image_path)
    response = client.chat.completions.create(
        model="grok-4",  # or "grok-2-vision-1212"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content
```
Multi-Image Analysis
```python
def compare_images(images: list[str], prompt: str) -> str:
    """Compare multiple images in one request (Claude supports up to 100)."""
    content = []
    for img_path in images:
        base64_data, media_type = encode_image_base64(img_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }
        })
    content.append({"type": "text", "text": prompt})
    # Uses the anthropic.Anthropic() client from the Claude section above
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
Object Detection (Gemini 2.5+)
```python
import json

def detect_objects_gemini(image_path: str) -> list[dict]:
    """Detect objects with bounding boxes using Gemini 2.5+."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    image = Image.open(image_path)
    response = model.generate_content([
        "Detect all objects in this image. Return bounding boxes "
        'as JSON with format: {"objects": [{"label": ..., "box": [x1, y1, x2, y2]}]}',
        image
    ])
    # Models often wrap JSON in markdown fences; strip them before parsing
    text = response.text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text).get("objects", [])
```
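Gemini commonly reports box coordinates normalized to a 0-1000 grid rather than pixels. Assuming the `[x1, y1, x2, y2]` order requested in the prompt above, a small sketch to map a normalized box back to pixel coordinates:

```python
def scale_box(box: list[float], width: int, height: int) -> list[int]:
    """Scale a [x1, y1, x2, y2] box from a 0-1000 normalized grid to pixels."""
    x1, y1, x2, y2 = box
    return [
        int(x1 / 1000 * width),   # x values scale with image width
        int(y1 / 1000 * height),  # y values scale with image height
        int(x2 / 1000 * width),
        int(y2 / 1000 * height),
    ]
```

Verify the coordinate convention against the model's actual output before relying on it; some responses use `[ymin, xmin, ymax, xmax]` order instead.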
Token Cost Optimization
| Provider | Detail Level | Cost Impact |
|---|---|---|
| OpenAI | low (65 tokens) | Use for classification |
| OpenAI | high (129+ tokens/tile) | Use for OCR/charts |
| Gemini | 258 tokens base | Scales with resolution |
| Claude | Per-image pricing | Batch for efficiency |
```python
# Cost-optimized simple classification
response = client.chat.completions.create(
    model="gpt-5.2-mini",  # cheaper for simple tasks
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a person? Reply: yes/no"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low"  # minimal tokens
            }}
        ]
    }]
)
```
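The OpenAI rows of the cost table can be folded into a rough pre-flight estimator. The 512px tiling and exact per-tile figures vary by model, so treat this as an approximation under the table's numbers, not billing-grade math:

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate OpenAI vision token cost using the figures from the table above."""
    if detail == "low":
        return 65  # flat cost regardless of resolution
    # High detail: base cost plus a per-tile cost for each 512px tile
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 65 + 129 * tiles
```

This is useful for deciding between `"low"` and `"high"` before sending a batch.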
Image Size Limits

| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | Grok 5 expands this |
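To stay under these caps (and the 2048px resize guidance in Common Mistakes below), compute the target dimensions before resizing. A minimal sketch of the scaling math, independent of any imaging library:

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return dimensions scaled so the longest side is at most max_side."""
    # Never upscale: cap the scale factor at 1.0
    scale = min(1.0, max_side / max(width, height))
    return int(width * scale), int(height * scale)
```

The resulting dimensions can then be passed to e.g. Pillow's `Image.resize`.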
Key Decisions
| Decision | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
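The table above can be encoded as a simple router. The model identifiers below mirror the ones used elsewhere in this document; check them against current provider docs before use:

```python
# Maps a use case from the Key Decisions table to a default model identifier
RECOMMENDED_MODELS = {
    "high_accuracy": "claude-opus-4-6",
    "long_documents": "gemini-2.5-pro",
    "cost_efficiency": "gemini-2.5-flash",
    "realtime_x_data": "grok-4",
    "video": "gemini-2.5-pro",
}

def pick_model(need: str) -> str:
    """Return the recommended model for a use case, defaulting to GPT-5.2."""
    return RECOMMENDED_MODELS.get(need, "gpt-5.2")
```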
Common Mistakes
- Not setting max_tokens (responses get truncated)
- Sending oversized images (resize to 2048px max)
- Using high detail for yes/no questions
- Not validating image format before encoding
- Ignoring rate limits on vision endpoints
- Using deprecated models (GPT-4V is retired)
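For the rate-limit point, a generic retry-with-exponential-backoff wrapper is a reasonable default. This is a sketch; in real code, catch the SDK-specific error (e.g. `openai.RateLimitError`) rather than bare `Exception`:

```python
import time

def with_backoff(call, retries: int = 3, base_delay: float = 1.0):
    """Retry a zero-argument callable, doubling the delay between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

Usage: `with_backoff(lambda: analyze_image_openai(path, prompt))`.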
Limitations
- Cannot identify specific people (privacy restriction)
- May hallucinate on low-quality or rotated images (<200px)
- GPT-5: may struggle with precise spatial reasoning on edge cases
- No real-time video (use frame extraction, except Gemini)
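For providers without native video input, the usual workaround is to sample frames and send them as a multi-image request. Picking evenly spaced timestamps is the pure-math part; the actual frame extraction would use a tool such as ffmpeg:

```python
def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Evenly spaced timestamps (in seconds) for sampling n_frames from a video."""
    if n_frames <= 1:
        return [0.0]
    # Include both the first and last frame of the clip
    step = duration_s / (n_frames - 1)
    return [round(i * step, 3) for i in range(n_frames)]
```

Keep the frame count within the per-request image limits listed above (e.g. 10 for OpenAI).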
Related Skills
- audio-language-models: Audio/speech processing
- multimodal-rag: Image + text retrieval
- llm-streaming: Streaming vision responses
Capability Details
image-captioning
Keywords: caption, describe, image description, alt text, accessibility
Solves:
- Generate descriptive captions for images
- Create accessibility alt text
- Extract visual content summary
visual-qa
Keywords: VQA, visual question, image question, analyze image
Solves:
- Answer questions about image content
- Extract specific information from visuals
- Reason about image elements
document-vision
Keywords: document, PDF, chart, diagram, OCR, extract, table
Solves:
- Extract text from documents and charts
- Analyze diagrams and flowcharts
- Process forms and tables with structure
Claude Code PDF Handling (CC 2.1.30+)
Read Tool Pages Parameter
For large PDFs (>10 pages), use the `pages` parameter to read specific ranges:
```python
# Read the first 5 pages of a large PDF
Read(file_path="/path/to/document.pdf", pages="1-5")

# Read a specific page
Read(file_path="/path/to/document.pdf", pages="10")

# Read a range in the middle
Read(file_path="/path/to/document.pdf", pages="15-25")
```
Large PDF Strategy
For documents >100 pages, process incrementally:
```python
# 1. Initial scan: read the first pages for structure
Read(file_path=pdf_path, pages="1-5")

# 2. Identify key sections from the TOC/headers

# 3. Read the relevant sections
Read(file_path=pdf_path, pages="45-55")  # e.g., the "Implementation" section

# 4. Process remaining sections as needed
Read(file_path=pdf_path, pages="80-90")  # e.g., the "Appendix" section
```
Limits
| Constraint | Value |
|---|---|
| Max pages per request | 20 |
| Max file size | 20MB |
| Large PDF threshold | 10 pages (returns a lightweight reference if @-mentioned) |
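Given the 20-page cap, a helper that splits a large range into valid `pages` strings (the format mirrors the `Read` examples above) keeps incremental processing simple:

```python
def page_chunks(start: int, end: int, max_pages: int = 20) -> list[str]:
    """Split an inclusive page range into pages-parameter strings of <= max_pages."""
    chunks = []
    page = start
    while page <= end:
        last = min(page + max_pages - 1, end)
        # Single pages render as "10"; ranges as "1-20"
        chunks.append(f"{page}-{last}" if last > page else str(page))
        page = last + 1
    return chunks
```

For example, a 45-page document yields three requests: `"1-20"`, `"21-40"`, `"41-45"`.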
multi-image-analysis
Keywords: compare images, multiple images, image comparison, batch
Solves:
- Compare visual elements across images
- Track changes between versions
- Analyze image sequences
object-detection
Keywords: bounding box, detect objects, locate, segmentation
Solves:
- Detect and locate objects in images
- Generate bounding box coordinates
- Segment image regions (Gemini 2.5+)