CLIP - Contrastive Language-Image Pre-Training
OpenAI's model that learns visual concepts from natural language supervision, matching images against text descriptions.
When to use CLIP
Use when:
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)
Metrics:
- 25,300+ GitHub stars
- Trained on 400M image-text pairs
- Matches supervised ResNet-50 accuracy on ImageNet (zero-shot)
- MIT License
Use alternatives instead:
- BLIP-2: Better captioning
- LLaVA: Vision-language chat
- Segment Anything: Image segmentation
Quick start
Installation
```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```
Zero-shot classification
```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity (scaled logits from the model's forward pass)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
Available models
```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```
| Model | Parameters | Speed | Quality |
|---|---|---|---|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
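To see every checkpoint name accepted by `clip.load()` in your installed version, the package exposes `clip.available_models()`; the example output below is indicative and depends on the release:

```python
import clip

# Returns the model names accepted by clip.load()
print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']
```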
Image-text similarity
```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize (required for cosine similarity)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() assumes a single image and a single text prompt)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```
Semantic image search
```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```
Content moderation
```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
Batch processing
```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```
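As a small usage follow-up, the row-wise argmax of this matrix gives the best-matching label for each image (built on the variables defined above):

```python
# Best-matching text label for each of the 10 images
best = similarities.argmax(dim=-1)
for i, idx in enumerate(best):
    print(f"img{i}.jpg -> {texts[idx.item()]}")
```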
Integration with vector databases
```python
# Store CLIP embeddings in Chroma (a FAISS variant is sketched below)
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings (normalized, from the semantic search example)
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path],
    )

# Query with text (normalize so scores match the stored embeddings)
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

results = collection.query(
    query_embeddings=[text_embedding[0].cpu().numpy().tolist()],
    n_results=5,
)
```
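For FAISS, a minimal sketch of the same index-and-query flow is shown below. It assumes the normalized `image_embeddings`, `text_embedding`, and `image_paths` from the sections above, so inner product equals cosine similarity; nothing here is part of the CLIP API itself.

```python
import faiss
import numpy as np

# Build an inner-product index over the normalized image embeddings
dim = image_embeddings.shape[1]  # 512 for ViT-B/32
index = faiss.IndexFlatIP(dim)
index.add(image_embeddings.cpu().numpy().astype(np.float32))

# Search with the normalized text embedding (shape: 1 x dim)
query_vec = text_embedding.cpu().numpy().astype(np.float32)
scores, indices = index.search(query_vec, 3)

for idx, score in zip(indices[0], scores[0]):
    print(f"{image_paths[idx]}: {score:.3f}")
```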
Best practices
- Use ViT-B/32 for most cases - Good balance of speed and quality
- Normalize embeddings - Required for cosine similarity
- Batch processing - Encoding many images or texts at once is far more efficient than one-by-one calls
- Cache embeddings - They are expensive to recompute; see the sketch after this list
- Use descriptive labels - Prompts like "a photo of a dog" give better zero-shot accuracy than bare words like "dog"
- GPU recommended - 10-50× faster than CPU
- Preprocess images - Always use the preprocess function returned by clip.load
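A minimal caching sketch for the point above, assuming the `model`, `preprocess`, and `device` from the Quick start are already loaded; the helper name and cache directory are illustrative, not part of CLIP:

```python
from pathlib import Path

def cached_image_embedding(img_path, cache_dir="clip_cache"):
    """Encode an image once, then reuse the saved, normalized embedding."""
    cache_file = Path(cache_dir) / (Path(img_path).stem + ".pt")
    if cache_file.exists():
        return torch.load(cache_file)

    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    torch.save(embedding.cpu(), cache_file)
    return embedding.cpu()
```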
Performance
| Operation | CPU | GPU (V100) |
|---|---|---|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |
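These figures are indicative and vary with hardware, model, and batch size; a rough sketch for measuring image-encoding latency on your own machine:

```python
import time

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Warm-up pass so model/CUDA initialization is not counted
with torch.no_grad():
    model.encode_image(image)

if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model.encode_image(image)
if device == "cuda":
    torch.cuda.synchronize()
print(f"Image encoding: {(time.perf_counter() - start) * 1000:.1f} ms")
```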
Limitations
- Not for fine-grained tasks - Best for broad categories, not subtle distinctions
- Requires descriptive text - Vague labels perform poorly
- Biased on web data - Inherits biases from its web-scraped training pairs
- No bounding boxes - Classifies whole images only, no object localization
- Limited spatial understanding - Weak at object positions and counting
Resources
- GitHub: https://github.com/openai/CLIP ⭐ 25,300+
- Colab: https://colab.research.google.com/github/openai/clip/
- License: MIT