Multimodal Models
Pre-trained models for vision, audio, and cross-modal tasks.
Model Overview
| Model | Modality | Task |
| --- | --- | --- |
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |
CLIP (Vision-Language)
Zero-shot image classification without training on specific labels.
CLIP Use Cases
| Task | How |
| --- | --- |
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching a text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |
CLIP Models
| Model | Parameters | Trade-off |
| --- | --- | --- |
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |
CLIP Concepts
| Concept | Description |
| --- | --- |
| Dual encoder | Separate encoders for image and text |
| Contrastive learning | Trained to match image-text pairs |
| Normalization | Always normalize embeddings before similarity |
| Descriptive labels | Better labels = better zero-shot accuracy |
Key concept: CLIP embeds images and text in the same space, so classification reduces to finding the nearest text embedding.
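That lookup can be sketched in plain Python: normalize each vector so the dot product equals cosine similarity, then pick the nearest text label. The embeddings below are toy numbers standing in for real CLIP encoder outputs:

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product = cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify(image_emb, label_embs):
    """Return the label whose text embedding is nearest to the image embedding."""
    image_emb = normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(image_emb, normalize(emb)))
        for label, emb in label_embs.items()
    }
    return max(scores, key=scores.get)

# Toy embeddings standing in for CLIP encoder outputs.
labels = {
    "a photo of a dog": [0.9, 0.1, 0.2],
    "a photo of a cat": [0.1, 0.9, 0.3],
}
print(classify([0.8, 0.2, 0.1], labels))  # nearest label wins
```

Real CLIP embeddings are 512–768 dimensional, but the normalize-then-dot-product step is exactly this.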
CLIP Limitations
- Not for fine-grained classification
- No spatial understanding (whole image only)
- May reflect training data biases
Whisper (Speech Recognition)
Robust multilingual transcription supporting 99 languages.
Whisper Use Cases
| Task | Configuration |
| --- | --- |
| Transcription | Default `transcribe` task |
| Translation to English | `task="translate"` |
| Subtitles | Output format SRT/VTT |
| Word timestamps | `word_timestamps=True` |
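Whisper's `transcribe()` returns segments with `start`, `end`, and `text` fields; turning them into SRT subtitles is a few lines of stdlib code. The sample segment below is a hardcoded stand-in for real output:

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm form SRT expects."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render Whisper-style segments (dicts with start/end/text) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Stand-in for result["segments"] from whisper's transcribe().
segments = [{"start": 0.0, "end": 2.5, "text": " Hello there."}]
print(to_srt(segments))
```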
Whisper Models
| Model | Size | Speed | Recommendation |
| --- | --- | --- | --- |
| turbo | 809M | Fast | Recommended |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |
Whisper Concepts
| Concept | Description |
| --- | --- |
| Language detection | Auto-detects, or specify for speed |
| Initial prompt | Improves accuracy on technical terms |
| Timestamps | Segment-level or word-level |
| faster-whisper | ~4× faster alternative implementation |
Key concept: Specify language when known—auto-detection adds latency.
Whisper Limitations
- May hallucinate on silence/noise
- No speaker diarization (who said what)
- Accuracy degrades on >30 min audio
- Not suitable for real-time captioning
Stable Diffusion (Image Generation)
Text-to-image generation with various control methods.
SD Use Cases
| Task | Pipeline |
| --- | --- |
| Text-to-image | `DiffusionPipeline` |
| Style transfer | Image2Image |
| Fill regions | Inpainting |
| Guided generation | ControlNet |
| Custom styles | LoRA adapters |
SD Models
| Model | Resolution | Quality |
| --- | --- | --- |
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |
Key Parameters
| Parameter | Effect | Typical Value |
| --- | --- | --- |
| `num_inference_steps` | Quality vs. speed | 20–50 |
| `guidance_scale` | Prompt adherence | 7–12 |
| `negative_prompt` | Avoid artifacts | `"blurry, low quality"` |
| `strength` (img2img) | How much to change | 0.5–0.8 |
| `seed` | Reproducibility | Fixed number |
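Behind `guidance_scale` is classifier-free guidance: the model predicts noise once without the prompt and once with it, then extrapolates toward the prompted prediction. A scalar sketch with made-up numbers (real predictions are tensors):

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: push the prediction toward the
    prompt-conditioned direction, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

# Made-up noise predictions; higher scale = stronger prompt adherence.
uncond = [0.10, 0.40]
cond = [0.30, 0.20]
print(apply_guidance(uncond, cond, 7.5))
```

A scale of 1.0 reproduces the conditional prediction unchanged; typical values of 7–12 amplify the prompt direction, which is why very high values can over-saturate images.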
Control Methods
| Method | Input | Use Case |
| --- | --- | --- |
| ControlNet | Edge/depth/pose | Structural guidance |
| LoRA | Trained weights | Custom styles |
| Img2Img | Source image | Style transfer |
| Inpainting | Image + mask | Fill regions |
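For Img2Img, `strength` controls how far into the noise schedule the source image is pushed; diffusers-style pipelines then run only that fraction of the denoising steps, so 0.0 returns roughly the input and 1.0 ignores it. A sketch of the arithmetic (function name is illustrative, not a library API):

```python
def img2img_steps(num_inference_steps, strength):
    """How many denoising steps actually run for a given strength,
    and the schedule index where denoising starts."""
    steps = min(int(num_inference_steps * strength), num_inference_steps)
    start_at = max(num_inference_steps - steps, 0)  # first step index to run
    return steps, start_at

print(img2img_steps(50, 0.6))  # → (30, 20)
```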
Memory Optimization
| Technique | Effect |
| --- | --- |
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |
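Attention slicing trades speed for memory by computing the attention matrix a few query rows at a time instead of all at once, so the full matrix is never materialized. A miniature pure-Python sketch (real pipelines do this on GPU tensors):

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_sliced(Q, K, V, slice_size):
    """Compute softmax(Q K^T) V one block of query rows at a time, so
    only slice_size rows of the attention matrix exist at once."""
    out = []
    for start in range(0, len(Q), slice_size):
        block = Q[start:start + slice_size]
        scores = [softmax([sum(a * b for a, b in zip(q, k)) for k in K])
                  for q in block]
        out.extend([[sum(w * v[j] for w, v in zip(row, V))
                     for j in range(len(V[0]))] for row in scores])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# Slicing changes peak memory, not the result.
print(attention_sliced(Q, K, V, slice_size=1))
```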
Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.
SD Limitations
- GPU strongly recommended (CPU very slow)
- Large VRAM requirements for SDXL
- May generate anatomical errors
- Prompt engineering matters
Common Patterns
Embedding and Similarity
All three models use embeddings:
- CLIP: Image/text embeddings for similarity
- Whisper: Audio embeddings for transcription
- SD: Text embeddings for image conditioning
GPU Acceleration
| Model | VRAM Needed |
| --- | --- |
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |
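A rough way to use these figures: check each model's footprint against available VRAM, keeping some headroom for activations and the OS. The numbers mirror the table and are approximate; `fits` and its headroom default are illustrative, not a library API:

```python
# Approximate VRAM needs from the table above (GB); real usage varies
# with resolution, batch size, and driver stack.
VRAM_GB = {
    "CLIP ViT-B/32": 2,
    "Whisper turbo": 6,
    "SD 1.5": 6,
    "SDXL": 10,
}

def fits(model, available_gb, headroom_gb=1.0):
    """Check a model against available VRAM, leaving some headroom."""
    return VRAM_GB[model] + headroom_gb <= available_gb

print([m for m in VRAM_GB if fits(m, 8.0)])  # models usable on an 8 GB card
```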
Best Practices
| Practice | Why |
| --- | --- |
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |
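Caching CLIP embeddings can be as simple as memoizing on the input string. `embed_text` below is a hypothetical stand-in for a real encoder call; the point is that repeated queries hit the cache instead of re-running the model:

```python
from functools import lru_cache

calls = 0  # counts how often the "encoder" actually runs

@lru_cache(maxsize=None)
def embed_text(text):
    """Stand-in for an expensive CLIP text-encoder call."""
    global calls
    calls += 1
    # Fake, hashable embedding; a real one would come from the model.
    return tuple(float(ord(c)) for c in text)

embed_text("a photo of a dog")
embed_text("a photo of a dog")  # served from cache; encoder ran once
print(calls)  # → 1
```

For persistence across runs, the same idea extends to keying a disk store (e.g. a pickle or a vector database) by the input text or an image hash.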
Resources
- CLIP: https://github.com/openai/CLIP
- Whisper: https://github.com/openai/whisper
- Diffusers: https://huggingface.co/docs/diffusers