Vision & Multimodal Skill

Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.

When to Use This Skill

Image analysis and description
Document/PDF processing
Screenshot analysis
OCR-like text extraction
Visual comparison
Chart and diagram interpretation

Supported Formats

Format Status Best For

JPEG ✓ Photos, natural scenes

PNG ✓ Screenshots, UI, text

GIF ✓ Animated (first frame)

WebP ✓ Modern, compressed

PDF ✓ Documents (via Files API)

Image Size Guidelines

Minimum: 200 pixels (smaller = reduced accuracy)
Optimal: 1000x1000 pixels
Maximum: 8000x8000 pixels
Token cost: ~(width × height) / 1000
Tip: Resize to 1568px max dimension for 30-50% token savings

Core Patterns

Pattern 1: Single Image Analysis

import anthropic import base64

client = anthropic.Anthropic()

Load and encode image

with open("image.jpg", "rb") as f: image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": image_data } }, { "type": "text", "text": "Describe this image in detail." } ] }] )

Pattern 2: Image from URL

import httpx

Fetch and encode from URL

image_url = "https://example.com/image.jpg" response = httpx.get(image_url) image_data = base64.standard_b64encode(response.content).decode("utf-8")

Then use same pattern as above

Pattern 3: Multiple Images

Compare multiple images (up to 100 per request)

messages = [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}}, {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}}, {"type": "text", "text": "Compare these two images and list the differences."} ] }]

Pattern 4: Few-Shot with Images

Teach by example

messages = [ # Example 1 {"role": "user", "content": [ {"type": "image", "source": {...}}, {"type": "text", "text": "Classify this image."} ]}, {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},

# Example 2
{"role": "user", "content": [
    {"type": "image", "source": {...}},
    {"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

# Target image
{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
    {"type": "text", "text": "Classify this image."}
]}

]

Pattern 5: PDF Processing

Using Files API (beta)

with open("document.pdf", "rb") as f: pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, messages=[{ "role": "user", "content": [ { "type": "document", "source": { "type": "base64", "media_type": "application/pdf", "data": pdf_data } }, {"type": "text", "text": "Summarize this document."} ] }] )

Prompt Engineering for Vision

Strategy 1: Role Assignment

prompt = """You have perfect vision and exceptional attention to detail, making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:

All components
Data flow between components
Potential bottlenecks"""

Strategy 2: Step-by-Step Thinking

prompt = """Before answering, analyze the image systematically:

What is the overall subject?
What are the key elements?
How do elements relate to each other?
What details stand out? </thinking>

Then provide your answer based on this analysis."""

Strategy 3: Structured Output

prompt = """Extract information from this receipt and return as JSON:

{ "vendor": "", "date": "", "items": [{"name": "", "price": 0}], "total": 0 }"""

Image Optimization

from PIL import Image import io

def optimize_for_claude(image_path, max_dimension=1568): """Resize image to reduce token usage by 30-50%""" with Image.open(image_path) as img: # Calculate new dimensions ratio = min(max_dimension / img.width, max_dimension / img.height) if ratio < 1: new_size = (int(img.width * ratio), int(img.height * ratio)) img = img.resize(new_size, Image.LANCZOS)

    # Convert to bytes
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

Common Use Cases

Text Extraction (OCR-like)

prompt = """Extract all text from this image. Preserve the original formatting and structure as much as possible. If text is unclear, indicate with [unclear]."""

Table Extraction

prompt = """Extract the table data from this image. Return as a markdown table with proper headers and alignment."""

Chart Analysis

prompt = """Analyze this chart:

What type of chart is this?
What are the axes/labels?
What are the key data points?
What trends or patterns are visible?"""

Best Practices

DO:

Use high-quality images (≥1000px)
Resize large images to save tokens
Provide context about what to look for
Use few-shot examples for consistent output

DON'T:

Send images smaller than 200px
Expect perfect OCR for handwriting
Send very large images (>8000px)
Ignore token costs for multiple images

Limitations

Cannot identify specific individuals
May struggle with very small text
Animated GIFs: only first frame analyzed
Some specialized symbols may be misread

vision-multimodal

Safety Notice

Copy this and send it to your AI assistant to learn

Load and encode image

Fetch and encode from URL

Then use same pattern as above

Compare multiple images (up to 100 per request)

Teach by example

Using Files API (beta)

Source Transparency

Related Skills

design-system

kanban

gcp