ml-cv-specialist

Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.

ML/CV Specialist

Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.

When to Use

  • Selecting ML models for specific use cases
  • Designing training and inference pipelines
  • Optimizing ML system performance and cost
  • Evaluating build vs. API for ML capabilities
  • Planning data pipelines for ML workloads

ML System Design Framework

Model Selection Decision Tree

Use Case Identified
    │
    ├─► Text/Language Tasks
    │   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
    │   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
    │   ├─► Embeddings → OpenAI Ada, sentence-transformers
    │   └─► Search/RAG → Vector DB + Embeddings + LLM
    │
    ├─► Computer Vision Tasks
    │   ├─► Classification → ResNet, EfficientNet, ViT
    │   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
    │   ├─► Segmentation → SAM, Mask R-CNN, U-Net
    │   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
    │   └─► Face Recognition → InsightFace, DeepFace
    │
    ├─► Audio Tasks
    │   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
    │   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
    │   └─► Audio Classification → PANNs, AudioSet models
    │
    └─► Structured Data
        ├─► Tabular → XGBoost, LightGBM, CatBoost
        ├─► Time Series → Prophet, ARIMA, Transformer-based
        └─► Recommendations → Two-tower, matrix factorization

API vs. Self-Hosted Decision

When to Use APIs

| Factor | API Preferred | Self-Hosted Preferred |
|---|---|---|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |

Cost Comparison Framework

## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700

## Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001
- 100K requests/month: $100 + $500 engineering time

## Break-even Analysis
- < 50K requests: API likely cheaper
- > 50K requests: Self-hosted may be cheaper
- Factor in: engineering time, ops burden, model quality
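
These numbers are easy to sanity-check in code. A break-even sketch using the illustrative prices and throughput above:

# Break-even sketch using the illustrative numbers above.
API_COST_PER_REQ = 0.03 * 0.5 + 0.06 * 0.2     # 500 input + 200 output tokens = $0.027
GPU_HOURLY = 3.0                                # assumed A100 40GB on-demand rate
REQS_PER_HOUR = 3_000                           # assumed ~50 requests/minute

def monthly_cost(requests: int) -> tuple[float, float]:
    api = requests * API_COST_PER_REQ
    self_hosted = requests / REQS_PER_HOUR * GPU_HOURLY  # compute only, no eng/ops
    return api, self_hosted

for n in (10_000, 50_000, 100_000):
    api, hosted = monthly_cost(n)
    print(f"{n:>7,} req/mo: API ${api:,.0f} vs self-hosted ${hosted:,.0f} + eng/ops")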

Training Pipeline Architecture

Standard ML Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    DATA LAYER                                │
├─────────────────────────────────────────────────────────────┤
│  Data Sources → ETL → Feature Store → Training Data         │
│  (S3, DBs)     (Airflow)  (Feast)     (Versioned)          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  TRAINING LAYER                              │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking → Training Jobs → Model Registry       │
│  (MLflow, W&B)         (SageMaker)    (MLflow, S3)         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  SERVING LAYER                               │
├─────────────────────────────────────────────────────────────┤
│  Model Server → Load Balancer → Monitoring                  │
│  (TorchServe)   (K8s/ELB)      (Prometheus)                │
└─────────────────────────────────────────────────────────────┘

Component Selection Guide

| Component | Options | Recommendation |
|---|---|---|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |
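
As a concrete example of the experiment-tracking row, a minimal MLflow run might look like this (experiment name, parameters, and metric are illustrative):

import mlflow

mlflow.set_experiment("detector-baseline")      # illustrative experiment name
with mlflow.start_run():
    mlflow.log_params({"lr": 1e-4, "batch_size": 32, "epochs": 20})
    # ... training loop ...
    mlflow.log_metric("val_mAP", 0.62)          # illustrative final metric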

Inference Architecture Patterns

Pattern 1: Synchronous API

Best for: Low-latency requirements, simple integration

Client → API Gateway → Model Server → Response
                           │
                      Load Balancer
                           │
                    ┌──────┴──────┐
                    │             │
                Model Pod    Model Pod

Latency targets:

  • P50: < 100ms
  • P95: < 300ms
  • P99: < 500ms
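
A minimal synchronous endpoint sketch; FastAPI is used purely for illustration (the pattern is server-agnostic), and the model loader is a stub:

from fastapi import FastAPI
from pydantic import BaseModel

def load_model():
    # Stand-in for loading real weights once at startup (never per request).
    class Stub:
        def predict(self, batch):
            return [sum(x) for x in batch]      # placeholder inference
    return Stub()

app = FastAPI()
model = load_model()

class PredictRequest(BaseModel):
    inputs: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    return {"prediction": model.predict([req.inputs])[0]}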

Pattern 2: Asynchronous Processing

Best for: Long-running inference, batch processing

Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                          │
                                     S3/Redis

Use when:

  • Inference > 5 seconds
  • Batch processing required
  • Variable load patterns
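
A worker-loop sketch for this pattern, assuming boto3 and SQS (the queue URL and `run_inference` are placeholders):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-jobs"  # placeholder

def run_inference(payload):
    ...  # placeholder for the actual model call

while True:
    # Long-poll so idle workers don't busy-spin.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        result = run_inference(json.loads(msg["Body"]))
        # Write `result` to the result store (S3/Redis, omitted), then ack
        # so the message is not redelivered.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])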

Pattern 3: Edge Inference

Best for: Privacy, offline capability, ultra-low latency

┌─────────────────────────────────────────┐
│              EDGE DEVICE                 │
│  ┌─────────┐    ┌─────────────────────┐ │
│  │ Camera  │───▶│ Optimized Model     │ │
│  └─────────┘    │ (ONNX, TFLite)      │ │
│                 └─────────────────────┘ │
│                          │              │
│                     Local Result        │
└─────────────────────────────────────────┘
                           │
                    Sync to Cloud
                    (non-blocking)

Model optimization for edge (export sketch after this list):

  • Quantization (INT8): 4x smaller, 2-3x faster
  • Pruning: 50-90% sparsity possible
  • Distillation: Smaller model, similar accuracy
  • ONNX/TFLite: Optimized runtime
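
A sketch of the export step, converting a PyTorch model to ONNX and applying dynamic INT8 quantization (model choice and file paths are illustrative):

import torch
import torchvision.models as models
from onnxruntime.quantization import quantize_dynamic, QuantType

# Export a small classifier to ONNX (illustrative model and paths).
model = models.mobilenet_v3_small(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"])

# Dynamic INT8 quantization: weights stored as INT8, roughly 4x smaller.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)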

Computer Vision Pipeline Design

Real-Time Video Processing

Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
     │              │                   │            │           │
   RTSP/         1-30 FPS           Resize,      Batch or    NMS, tracking,
   WebRTC                           normalize    single       annotation

Performance optimization (frame-skipping sketch after this list):

  • Process every Nth frame (skip frames)
  • Resize to model input size early
  • Batch frames when latency allows
  • Use GPU preprocessing (NVIDIA DALI)
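
A frame-skipping sketch with OpenCV (the stream URL, skip interval, and input size are illustrative):

import cv2

STREAM = "rtsp://camera.local/stream"   # illustrative source; a file path also works
N = 5                                   # process every 5th frame

cap = cv2.VideoCapture(STREAM)
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % N:
        continue                        # skip frames to stay within compute budget
    small = cv2.resize(frame, (640, 640))  # resize early, before any heavy work
    # ... run detection on `small` ...
cap.release()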

Object Detection System

## Pipeline Components

1. **Input Processing**
   - Video decode: FFmpeg, OpenCV
   - Frame buffer: Ring buffer for temporal context
   - Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)

2. **Detection**
   - Model: YOLOv8 (speed), DETR (accuracy)
   - Batch size: 1-8 depending on latency requirements
   - Confidence threshold: 0.5-0.7 typical

3. **Post-processing**
   - NMS (Non-Maximum Suppression)
   - Tracking: SORT, DeepSORT, ByteTrack
   - Smoothing: Kalman filter for stable boxes

4. **Output**
   - Annotations: Bounding boxes, labels, confidence
   - Events: Trigger on detection (webhook, queue)
   - Storage: Frame + metadata to S3/DB
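
A minimal detect-and-threshold sketch for steps 2-4 using the `ultralytics` package (weights file, input frame, and threshold are illustrative):

import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # nano variant: fastest, least accurate
frame = cv2.imread("frame.jpg")         # illustrative input frame

# conf is the confidence threshold; NMS runs inside predict().
results = model.predict(frame, conf=0.5)
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(int(box.cls), float(box.conf), (x1, y1, x2, y2))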

LLM Integration Patterns

RAG (Retrieval-Augmented Generation)

User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                              │
                         Vector DB
                       (Pinecone, Weaviate,
                        Chroma, pgvector)
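
A minimal end-to-end sketch of this flow using Chroma for local development (corpus and the final LLM call are placeholders):

import chromadb

client = chromadb.Client()              # in-memory; for local dev only
docs = client.create_collection("docs")
docs.add(ids=["1", "2"],
         documents=["Invoices are due within 30 days.",   # placeholder corpus
                    "Refunds require an original receipt."])

query = "When are invoices due?"
hits = docs.query(query_texts=[query], n_results=1)
context = "\n".join(hits["documents"][0])

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# response = llm.complete(prompt)       # placeholder for the actual LLM call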

Vector DB Selection:

| Database | Best For | Limitations |
|---|---|---|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |

LLM Serving Architecture

┌─────────────────────────────────────────────────────────────┐
│                    API GATEWAY                               │
│  Rate limiting, auth, request routing                       │
└─────────────────────────────────────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
              ▼             ▼             ▼
         ┌────────┐   ┌────────┐   ┌────────┐
         │ GPT-4  │   │ Claude │   │ Local  │
         │  API   │   │  API   │   │ Llama  │
         └────────┘   └────────┘   └────────┘
                            │
                    Model Router
              (cost/latency/capability)

Multi-model strategy:

  • Simple queries → Cheaper model (GPT-3.5, Haiku)
  • Complex reasoning → Expensive model (GPT-4, Opus)
  • Sensitive data → Self-hosted (Llama, Mistral)
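
A routing sketch implementing these rules (the complexity heuristic and model names are illustrative; production routers usually use a trained classifier):

def route(query: str, contains_pii: bool) -> str:
    # Heuristic stand-in for a real complexity classifier.
    if contains_pii:
        return "self-hosted/llama"      # sensitive data never leaves the VPC
    if len(query.split()) > 100 or "step by step" in query.lower():
        return "gpt-4"                  # complex-reasoning tier
    return "gpt-3.5-turbo"              # cheap default tier

print(route("Summarize this paragraph.", contains_pii=False))  # -> gpt-3.5-turbo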

Performance Optimization

GPU Memory Optimization

| Technique | Memory Reduction | Speed Impact |
|---|---|---|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing | 60-80% | 20-30% slower |
| Model Sharding | Distributed | Communication overhead |
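
For example, FP16 inference via autocast in PyTorch (requires a CUDA device; the model is illustrative):

import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

# Autocast runs matmuls/convolutions in FP16 while keeping FP32 weights.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    logits = model(x)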

Batching Strategies

# Dynamic batching sketch: collect requests, flush when the batch is
# full or when this request has waited max_wait_ms.
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model
        self.queue = []                      # pending (request, future) pairs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000   # seconds
        self.lock = asyncio.Lock()

    async def add_request(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))

        if len(self.queue) >= self.max_batch:
            await self._flush()              # batch full: flush immediately
        else:
            await asyncio.sleep(self.max_wait)
            if not future.done():            # not yet flushed by a full batch
                await self._flush()
        return await future                  # resolves with this request's result

    async def _flush(self):
        async with self.lock:
            if not self.queue:
                return                       # a concurrent caller already flushed
            batch = self.queue[:self.max_batch]
            self.queue = self.queue[self.max_batch:]
        results = await self.model.predict_batch([req for req, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

Model Monitoring

Key Metrics to Track

| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |

Drift Detection

Training Distribution ──┐
                        ├──► Statistical Test ──► Alert
Production Distribution ─┘
                         (PSI, KS test, JS divergence)

Population Stability Index (PSI):

  • PSI < 0.1: No significant change
  • 0.1 < PSI < 0.2: Moderate change, monitor
  • PSI > 0.2: Significant change, investigate
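
A PSI sketch with NumPy, binning by deciles of the training sample (assumes roughly continuous features; duplicate quantile edges would need handling in practice):

import numpy as np

def psi(expected, actual, bins=10):
    # Bin edges come from the training (expected) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range production values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)      # avoid log(0) and division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)))  # ~0.24: significant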

Quick Reference Tables

Model Selection by Use Case

| Use Case | Recommended Model | Latency | Cost |
|---|---|---|---|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection (Fast) | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |

Infrastructure Sizing

| Scale | GPU | Model Size | Throughput |
|---|---|---|---|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |
