# Photo Content Recognition & Curation Expert

Expert in photo content analysis and intelligent curation. Combines classical computer vision with modern deep learning for comprehensive photo analysis.
## When to Use This Skill

✅ Use for:

- Face recognition and clustering (identifying important people)
- Animal/pet detection and clustering
- Near-duplicate detection using perceptual hashing (DINOHash, pHash, dHash)
- Burst photo selection (finding the best frame from 10-50 shots)
- Screenshot vs. photo classification
- Meme/download filtering
- NSFW content detection
- Quick indexing for large photo libraries (10K+ photos)
- Aesthetic quality scoring (NIMA)

❌ NOT for:

- GPS-based location clustering → event-detection-temporal-intelligence-expert
- Color palette extraction → color-theory-palette-harmony-expert
- Semantic image-text matching → clip-aware-embeddings
- Video analysis or frame extraction
## Quick Decision Tree

```
What do you need to recognize/filter?
│
├─ Duplicate photos? ─────────────────────────────── Perceptual Hashing
│  ├─ Exact duplicates? ──────────────────────────── dHash (fastest)
│  ├─ Brightness/contrast changes? ───────────────── pHash (DCT-based)
│  ├─ Heavy crops/compression? ───────────────────── DINOHash (2025 SOTA)
│  └─ Production system? ─────────────────────────── Hybrid (pHash → DINOHash)
│
├─ People in photos? ─────────────────────────────── Face Clustering
│  ├─ Known thresholds? ──────────────────────────── Apple-style Agglomerative
│  └─ Unknown data distribution? ─────────────────── HDBSCAN
│
├─ Pets/Animals? ─────────────────────────────────── Pet Recognition
│  ├─ Detection? ─────────────────────────────────── YOLOv8
│  └─ Individual clustering? ─────────────────────── CLIP + HDBSCAN
│
├─ Best from burst? ──────────────────────────────── Burst Selection
│  └─ Score: sharpness + face quality + aesthetics
│
└─ Filter junk? ──────────────────────────────────── Content Detection
   ├─ Screenshots? ───────────────────────────────── Multi-signal classifier
   └─ NSFW? ──────────────────────────────────────── Safety classifier
```
## Core Concepts

### 1. Perceptual Hashing for Near-Duplicate Detection

Problem: Camera bursts, re-saved images, and minor edits create near-duplicates.

Solution: Perceptual hashes generate similar values for visually similar images.

Method Comparison:

| Method   | Speed   | Robustness | Best For                    |
|----------|---------|------------|-----------------------------|
| dHash    | Fastest | Low        | Exact duplicates            |
| pHash    | Fast    | Medium     | Brightness/contrast changes |
| DINOHash | Slower  | High       | Heavy crops, compression    |
| Hybrid   | Medium  | Very high  | Production systems          |
Hybrid Pipeline (2025 Best Practice):

1. Stage 1: Fast pHash filtering (eliminates obvious non-duplicates)
2. Stage 2: DINOHash refinement (accurate detection)
3. Stage 3: Optional Siamese ViT verification

Hamming Distance Thresholds:

- Conservative: ≤5 differing bits = duplicates
- Aggressive: ≤10 differing bits = duplicates

→ Deep dive: references/perceptual-hashing.md
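The threshold check above can be sketched in a few lines. This is a minimal illustration, assuming 64-bit perceptual hashes (e.g. pHash output) represented as Python ints; the hash values below are made up for demonstration.

```python
# Classify a pair of 64-bit perceptual hashes by Hamming distance.

def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def is_duplicate(h1: int, h2: int, threshold: int = 5) -> bool:
    """Conservative default (<=5 bits); pass threshold=10 for aggressive."""
    return hamming(h1, h2) <= threshold

h_a = 0xF0F0F0F0F0F0F0F0   # hypothetical hash of the original photo
h_b = h_a ^ 0b111          # 3 bits flipped: minor edit, still a duplicate
h_c = h_a ^ 0xFFFF         # 16 bits flipped: a different image

print(is_duplicate(h_a, h_b))  # True
print(is_duplicate(h_a, h_c))  # False
```

In a hybrid pipeline, this cheap check is the Stage 1 filter; only pairs that pass (or fall near the threshold) proceed to DINOHash refinement.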
### 2. Face Recognition & Clustering

Goal: Group photos by person without user labeling.

Apple Photos Strategy (2021-2025):

1. Extract face + upper-body embeddings (FaceNet, 512-dim)
2. Two-pass agglomerative clustering:
   - Conservative first pass (threshold = 0.4, high precision)
   - HAC second pass (threshold = 0.6, increases recall)
3. Incremental updates for new photos

HDBSCAN Alternative:

- No threshold tuning required
- Robust to noise
- Better for unknown data distributions

Parameters:

| Setting          | Agglomerative | HDBSCAN  |
|------------------|---------------|----------|
| Pass 1 threshold | 0.4 (cosine)  | n/a      |
| Pass 2 threshold | 0.6 (cosine)  | n/a      |
| Min cluster size | n/a           | 3 photos |
| Metric           | cosine        | cosine   |

→ Deep dive: references/face-clustering.md
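A minimal sketch of the conservative first pass, assuming L2-normalized 512-dim face embeddings. Greedy assignment to the nearest cluster centroid stands in for full hierarchical agglomerative clustering; the synthetic "two people" data is fabricated for demonstration.

```python
# Threshold-based clustering of face embeddings by cosine distance.
import numpy as np

def cluster(embeddings: np.ndarray, threshold: float) -> list:
    """Assign each embedding to the first cluster whose seed is within
    `threshold` cosine distance; otherwise start a new cluster."""
    seeds, labels = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)           # L2-normalize
        dists = [1.0 - float(e @ s) for s in seeds]
        if dists and min(dists) <= threshold:
            labels.append(int(np.argmin(dists)))
        else:
            seeds.append(e)
            labels.append(len(seeds) - 1)
    return labels

# Two synthetic "people": tight noise around two random base vectors.
rng = np.random.default_rng(0)
person_a = rng.normal(0, 0.01, (5, 512)) + rng.normal(size=512)
person_b = rng.normal(0, 0.01, (5, 512)) + rng.normal(size=512)
faces = np.vstack([person_a, person_b])

labels = cluster(faces, threshold=0.4)  # conservative first pass
print(labels)                           # person A frames, then person B frames
```

The second pass would rerun with threshold 0.6 over cluster centroids to merge fragments that the conservative pass split.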
### 3. Burst Photo Selection

Problem: Burst mode creates 10-50 nearly identical photos.

Multi-Criteria Scoring:

| Criterion    | Weight | Measurement                        |
|--------------|--------|------------------------------------|
| Sharpness    | 30%    | Laplacian variance                 |
| Face quality | 35%    | Eyes open, smiling, face sharpness |
| Aesthetics   | 20%    | NIMA score                         |
| Position     | 10%    | Middle-frame bonus                 |
| Exposure     | 5%     | Histogram clipping check           |

Burst Detection: photos taken within 0.5 seconds of each other.

→ Deep dive: references/content-detection.md
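The weighted scoring can be sketched as below. This is a minimal illustration: the per-criterion scores are assumed to be pre-computed and normalized to [0, 1], and the example burst values are invented.

```python
# Weighted multi-criteria scoring for burst frame selection.
WEIGHTS = {
    "sharpness": 0.30,
    "face_quality": 0.35,
    "aesthetics": 0.20,
    "position": 0.10,
    "exposure": 0.05,
}

def burst_score(scores: dict) -> float:
    """Weighted sum of normalized criterion scores (missing -> 0)."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def select_best(frames: list) -> int:
    """Index of the highest-scoring frame in a burst."""
    return max(range(len(frames)), key=lambda i: burst_score(frames[i]))

burst = [
    {"sharpness": 0.4, "face_quality": 0.5, "aesthetics": 0.6, "position": 0.5, "exposure": 1.0},
    {"sharpness": 0.9, "face_quality": 0.8, "aesthetics": 0.7, "position": 1.0, "exposure": 1.0},
    {"sharpness": 0.7, "face_quality": 0.6, "aesthetics": 0.5, "position": 0.5, "exposure": 0.9},
]
print(select_best(burst))  # 1 — wins on sharpness and face quality
```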
### 4. Screenshot Detection

Multi-Signal Approach:

| Signal             | Confidence | Description                   |
|--------------------|------------|-------------------------------|
| UI elements        | 0.85       | Status bars, buttons detected |
| Perfect rectangles | 0.75       | ≥5 UI buttons (90° angles)    |
| High text          | 0.70       | ≥25% text coverage (OCR)      |
| No camera EXIF     | 0.60       | Missing Make/Model/Lens       |
| Device aspect      | 0.60       | Exact phone screen ratio      |
| Perfect sharpness  | 0.50       | >2000 Laplacian variance      |

Decision: confidence > 0.6 = screenshot

→ Deep dive: references/content-detection.md
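A minimal sketch of the decision rule. How multiple fired signals combine is not specified above; this sketch assumes the strongest fired signal determines the confidence, which is one reasonable choice.

```python
# Multi-signal screenshot classification: strongest fired signal wins.
SIGNALS = {
    "ui_elements": 0.85,
    "perfect_rectangles": 0.75,
    "high_text": 0.70,
    "no_camera_exif": 0.60,
    "device_aspect": 0.60,
    "perfect_sharpness": 0.50,
}

def is_screenshot(fired: set, threshold: float = 0.6) -> bool:
    """True when the confidence of the strongest fired signal exceeds
    the decision threshold."""
    confidence = max((SIGNALS[s] for s in fired), default=0.0)
    return confidence > threshold

print(is_screenshot({"no_camera_exif", "high_text"}))  # True  (0.70 > 0.6)
print(is_screenshot({"perfect_sharpness"}))            # False (0.50)
```

Note that a single 0.60 signal (e.g. missing EXIF alone) does not pass a strict `> 0.6` check; it needs corroboration from a stronger signal.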
### 5. Quick Indexing Pipeline

Goal: Index 10K+ photos efficiently with caching.

Features Extracted:

- Perceptual hashes (de-duplication)
- Face embeddings (people clustering)
- CLIP embeddings (semantic search)
- Color palettes
- Aesthetic scores

Performance (10K photos, M1 MacBook Pro):

| Operation          | Time        |
|--------------------|-------------|
| Perceptual hashing | 2 min       |
| CLIP embeddings    | 3 min (GPU) |
| Face detection     | 4 min       |
| Color palettes     | 1 min       |
| Aesthetic scoring  | 2 min (GPU) |
| Clustering + dedup | 1 min       |
| Total (first run)  | ~13 min     |
| Incremental        | <1 min      |

→ Deep dive: references/photo-indexing.md
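The sub-minute incremental runs depend on caching. One way to sketch it, assuming a simple JSON cache keyed by file path (the cache format and filenames here are illustrative, not the pipeline's actual scheme): skip any file whose size and mtime are unchanged since the last index.

```python
# Incremental indexing: only re-process files whose signature changed.
import json
import os
import tempfile

def signature(path: str) -> list:
    """Cheap change detector: (size, mtime) of the file."""
    st = os.stat(path)
    return [st.st_size, st.st_mtime]

def files_to_index(paths, cache_file):
    """Return paths needing (re-)indexing and refresh the cache."""
    try:
        with open(cache_file) as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}
    stale = [p for p in paths if cache.get(p) != signature(p)]
    for p in stale:
        cache[p] = signature(p)
    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return stale

# Demo with a throwaway file.
tmp = tempfile.mkdtemp()
photo = os.path.join(tmp, "a.jpg")
with open(photo, "wb") as f:
    f.write(b"fake image bytes")
cache_path = os.path.join(tmp, "cache.json")

print(files_to_index([photo], cache_path))  # first run: file is new
print(files_to_index([photo], cache_path))  # second run: [] (cache hit)
```

Expensive features (embeddings, hashes) are then computed only for the stale paths.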
## Common Anti-Patterns

### Anti-Pattern: Euclidean Distance for Face Embeddings

What it looks like:

```python
distance = np.linalg.norm(embedding1 - embedding2)  # WRONG
```

Why it's wrong: Face embeddings are L2-normalized; cosine distance is the correct metric.

What to do instead:

```python
from scipy.spatial.distance import cosine

distance = cosine(embedding1, embedding2)  # correct
```

### Anti-Pattern: Fixed Clustering Thresholds

What it looks like: Using the same distance threshold for all face clusters.

Why it's wrong: Different people have different intra-class variance (e.g. twins vs. photos spanning diverse ages).

What to do instead: Use HDBSCAN for automatic threshold discovery, or two-pass clustering with a conservative pass followed by a relaxed pass.

### Anti-Pattern: Raw Pixel Comparison for Duplicates

What it looks like:

```python
is_duplicate = np.allclose(img1, img2)  # WRONG
```

Why it's wrong: Re-saved JPEGs, crops, and brightness changes create pixel differences between visually identical images.

What to do instead: Perceptual hashing (pHash or DINOHash) compared with Hamming distance.

### Anti-Pattern: Sequential Face Detection

What it looks like: Processing faces one photo at a time without batching.

Why it's wrong: GPU underutilization; roughly 10x slower than batched inference.

What to do instead: Batch-process images (e.g. batch_size=32) with GPU acceleration.

### Anti-Pattern: No Confidence Filtering

What it looks like:

```python
for face in all_detected_faces:
    cluster(face)  # no filtering
```

Why it's wrong: Low-confidence detections create noise clusters (hands, objects).

What to do instead: Filter by detection confidence (threshold 0.9 for faces).

### Anti-Pattern: Forcing Every Photo into Clusters

What it looks like: Assigning noise points to the nearest cluster.

Why it's wrong: Solo appearances shouldn't pollute person clusters.

What to do instead: HDBSCAN/DBSCAN naturally identify noise (label = -1). Keep noise separate.
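Keeping noise separate can be sketched as a simple partition over the label array (the labels and photo names below are illustrative; `-1` is the noise label DBSCAN/HDBSCAN emit):

```python
# Partition clustering output so noise is kept apart, never force-assigned.
from collections import defaultdict

def split_clusters(labels, items):
    """Group items by cluster label; route label -1 to a separate noise list."""
    clusters, noise = defaultdict(list), []
    for label, item in zip(labels, items):
        (noise if label == -1 else clusters[label]).append(item)
    return dict(clusters), noise

labels = [0, 0, 1, -1, 1, -1]
photos = ["a", "b", "c", "d", "e", "f"]
clusters, noise = split_clusters(labels, photos)
print(clusters)  # {0: ['a', 'b'], 1: ['c', 'e']}
print(noise)     # ['d', 'f']
```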
## Quick Start

```python
from photo_curation import PhotoCurationPipeline

pipeline = PhotoCurationPipeline()

# Index photo library
index = pipeline.index_library('/path/to/photos')

# De-duplicate
duplicates = index.find_duplicates()
print(f"Found {len(duplicates)} duplicate groups")

# Cluster faces
face_clusters = index.cluster_faces()
print(f"Found {len(face_clusters)} people")

# Select best from bursts
best_photos = pipeline.select_best_from_bursts(index)

# Filter screenshots
real_photos = pipeline.filter_screenshots(index)

# Curate for collage
collage_photos = pipeline.curate_for_collage(index, target_count=100)
```
## Python Dependencies

```
torch
transformers
facenet-pytorch
ultralytics
hdbscan
opencv-python
scipy
numpy
scikit-learn
pillow
pytesseract
```
## Integration Points

- event-detection-temporal-intelligence-expert: provides temporal event clustering for event-aware curation
- color-theory-palette-harmony-expert: extracts color palettes for visual diversity
- collage-layout-expert: receives curated photos for assembly
- clip-aware-embeddings: provides CLIP embeddings for semantic search and DeepDBSCAN
## References

- DINOHash (2025): "Adversarially Fine-Tuned DINOv2 Features for Perceptual Hashing"
- Apple Photos (2021): "Recognizing People in Photos Through Private On-Device ML"
- HDBSCAN: "Hierarchical Density-Based Spatial Clustering" (2013-2025)
- Perceptual Hashing: dHash (Neal Krawetz), DCT-based pHash

Version: 2.0.0 | Last Updated: November 2025