multimodal-models

Pre-trained models for vision, audio, and cross-modal tasks.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "multimodal-models" with this command: npx skills add eyadsibai/ltk/eyadsibai-ltk-multimodal-models

Multimodal Models

Model Overview

| Model | Modality | Task |
| --- | --- | --- |
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |

CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.

CLIP Use Cases

| Task | How |
| --- | --- |
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |
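A minimal sketch of the zero-shot classification row, assuming the Hugging Face `transformers` implementation of CLIP and the `openai/clip-vit-base-patch32` checkpoint (neither is named in the source). The helpers `build_prompts` and `classify_image` are hypothetical names. The heavy imports sit inside the function so the module loads without them.

```python
def build_prompts(labels, template="a photo of a {}"):
    """Wrap bare class names in a descriptive prompt (hypothetical helper)."""
    return [template.format(label) for label in labels]


def classify_image(image_path, labels, checkpoint="openai/clip-vit-base-patch32"):
    """Return {label: probability} for one image.

    Assumes torch, transformers, and Pillow are installed.
    """
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=build_prompts(labels),
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds scaled image-text similarities, shape (1, len(labels)).
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}
```

A call might look like `classify_image("photo.jpg", ["cat", "dog", "car"])`, where `photo.jpg` is a placeholder path.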

CLIP Models

| Model | Parameters | Trade-off |
| --- | --- | --- |
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

CLIP Concepts

| Concept | Description |
| --- | --- |
| Dual encoder | Separate encoders for image and text |
| Contrastive learning | Trained to match image-text pairs |
| Normalization | Always normalize embeddings before similarity |
| Descriptive labels | Better labels = better zero-shot accuracy |

Key concept: CLIP embeds images and text in the same vector space. Classification means finding the text embedding nearest to the image embedding.
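The nearest-text-embedding idea can be sketched without any model at all. The example below uses toy 3-dimensional vectors as stand-ins for real CLIP embeddings (an assumption for illustration only) and shows why normalization matters: after normalizing, the dot product is the cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the closest label and the similarity score for every label."""
    scores = {
        label: cosine_similarity(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors, not real CLIP outputs.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
}

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)  # a photo of a cat
```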

CLIP Limitations

  • Weak at fine-grained classification (for example, telling apart similar species or car models)

  • No spatial understanding (whole image only)

  • May reflect training data biases

Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.

Whisper Use Cases

| Task | Configuration |
| --- | --- |
| Transcription | Default `transcribe` task |
| Translation to English | `task="translate"` |
| Subtitles | Output format SRT/VTT |
| Word timestamps | `word_timestamps=True` |
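The configurations above map onto keyword arguments of `transcribe()` in the `openai-whisper` package. The sketch below assumes that package; `transcribe_options`, `srt_timestamp`, `segments_to_srt`, and `transcribe` are hypothetical helper names. The default model is `small` rather than `turbo` because the turbo model is not trained for translation.

```python
def transcribe_options(language=None, translate=False,
                       word_timestamps=False, initial_prompt=None):
    """Build keyword arguments for whisper's transcribe() (hypothetical helper)."""
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        options["language"] = language
    if word_timestamps:
        options["word_timestamps"] = True
    if initial_prompt:
        options["initial_prompt"] = initial_prompt
    return options


def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm as used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn whisper-style segment dicts (start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe(audio_path, model_name="small", **kwargs):
    """Run openai-whisper; assumes the package and ffmpeg are installed."""
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, **transcribe_options(**kwargs))
```

For example, `segments_to_srt(transcribe("talk.mp3", language="en")["segments"])` would produce subtitle text, with `talk.mp3` as a placeholder path.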

Whisper Models

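| Model | Parameters | Speed | Recommendation |
| --- | --- | --- | --- |
| turbo | 809M | Fast | Recommended for transcription (not trained for translation) |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |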

Whisper Concepts

| Concept | Description |
| --- | --- |
| Language detection | Auto-detects, or specify for speed |
| Initial prompt | Improves accuracy on technical terms |
| Timestamps | Segment-level or word-level |
| faster-whisper | Alternative implementation, up to 4× faster |

Key concept: Specify the language when it is known, because auto-detection adds latency.
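A sketch of passing an explicit language, this time assuming the `faster-whisper` package mentioned above. The function names, the `small` model size, and the CPU/int8 settings are illustrative choices, not recommendations from the source.

```python
from types import SimpleNamespace


def join_segments(segments):
    """Concatenate the text of segment objects that expose a .text attribute."""
    return " ".join(seg.text.strip() for seg in segments)


def transcribe_fast(audio_path, language="en", model_size="small"):
    """Transcribe with faster-whisper; assumes the package is installed."""
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # Passing language skips the detection pass.
    segments, info = model.transcribe(audio_path, language=language)
    # segments is a lazy generator; decoding happens while it is consumed.
    return join_segments(segments), info.language


# Stand-in segments to show the joining step without running a model.
demo = [SimpleNamespace(text=" Hello"), SimpleNamespace(text=" world. ")]
print(join_segments(demo))  # Hello world.
```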

Whisper Limitations

  • May hallucinate on silence/noise

  • No speaker diarization (who said what)

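  • Accuracy can degrade on long recordings (over about 30 minutes); audio is processed in 30-second windows, so an error in one window can carry into the next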

  • Not suitable for real-time captioning
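One way to reduce hallucinated text on silence is to filter segments after transcription. The sketch below assumes the segment dictionaries returned by `openai-whisper`, which carry `no_speech_prob` and `avg_logprob` fields. The function name is hypothetical, and the thresholds mirror that library's default `no_speech_threshold` and `logprob_threshold`; tune them for your own audio.

```python
def drop_likely_hallucinations(segments, max_no_speech_prob=0.6,
                               min_avg_logprob=-1.0):
    """Keep segments unless they look like silence AND the model was unsure."""
    kept = []
    for seg in segments:
        looks_silent = seg.get("no_speech_prob", 0.0) > max_no_speech_prob
        low_confidence = seg.get("avg_logprob", 0.0) < min_avg_logprob
        if looks_silent and low_confidence:
            continue  # probably text invented over silence or noise
        kept.append(seg)
    return kept


# Made-up segment values for illustration.
segments = [
    {"text": "Welcome to the show.", "no_speech_prob": 0.02, "avg_logprob": -0.2},
    {"text": "Thanks for watching!", "no_speech_prob": 0.91, "avg_logprob": -1.4},
]
clean = drop_likely_hallucinations(segments)
print([seg["text"] for seg in clean])  # ['Welcome to the show.']
```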

Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.

SD Use Cases

| Task | Pipeline |
| --- | --- |
| Text-to-image | DiffusionPipeline |
| Style transfer | Image2Image |
| Fill regions | Inpainting |
| Guided generation | ControlNet |
| Custom styles | LoRA adapters |

SD Models

| Model | Resolution | Quality |
| --- | --- | --- |
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |

Key Parameters

| Parameter | Effect | Typical Value |
| --- | --- | --- |
| `num_inference_steps` | Quality vs speed | 20-50 |
| `guidance_scale` | Prompt adherence | 7-12 |
| `negative_prompt` | Avoid artifacts | "blurry, low quality" |
| `strength` (img2img) | How much to change | 0.5-0.8 |
| `seed` | Reproducibility | Fixed number |
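These parameters can be passed to a pipeline roughly as follows. This is a sketch assuming the `diffusers` library, a CUDA GPU, and the `stabilityai/stable-diffusion-xl-base-1.0` checkpoint (the source names SDXL but not a specific checkpoint). `generation_kwargs` and `generate` are hypothetical helpers.

```python
def generation_kwargs(prompt, negative_prompt="blurry, low quality",
                      steps=30, guidance_scale=7.5):
    """Collect the pipeline call arguments in one place (hypothetical helper)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(prompt, seed=42,
             model_id="stabilityai/stable-diffusion-xl-base-1.0", **kwargs):
    """Generate one image; assumes torch, diffusers, and a CUDA device."""
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    # A seeded generator makes the sampled noise repeatable.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(generator=generator, **generation_kwargs(prompt, **kwargs))
    return result.images[0]
```

Calling `generate("a lighthouse at dusk", seed=7, steps=25)` twice on the same setup should give matching images, since the seed fixes the starting noise.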

Control Methods

| Method | Input | Use Case |
| --- | --- | --- |
| ControlNet | Edge/depth/pose | Structural guidance |
| LoRA | Trained weights | Custom styles |
| Img2Img | Source image | Style transfer |
| Inpainting | Image + mask | Fill regions |
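For Img2Img, `strength` also affects run time. The sketch below assumes the behavior of the `diffusers` img2img pipelines, where only a `strength`-sized fraction of the scheduled denoising steps is actually run; the function name is hypothetical.

```python
def effective_img2img_steps(num_inference_steps, strength):
    """Approximate number of denoising steps actually run for img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_img2img_steps(40, 0.5))   # 20
print(effective_img2img_steps(40, 0.75))  # 30
```

A low `strength` therefore keeps more of the source image and also finishes sooner.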

Memory Optimization

| Technique | Effect |
| --- | --- |
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |
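A sketch of switching these techniques on, assuming a `diffusers` Stable Diffusion pipeline object. The helper names and the default flags are illustrative; CPU offload additionally needs the `accelerate` package, and xFormers needs the `xformers` package.

```python
def apply_memory_optimizations(pipe, cpu_offload=True, attention_slicing=True,
                               vae_tiling=False, xformers=False):
    """Enable the selected optimizations and return the names applied."""
    applied = []
    if cpu_offload:
        # Moves submodels to the GPU only while they run; skip pipe.to("cuda").
        pipe.enable_model_cpu_offload()
        applied.append("cpu_offload")
    if attention_slicing:
        pipe.enable_attention_slicing()
        applied.append("attention_slicing")
    if vae_tiling:
        # Decodes the image in tiles so large outputs fit in memory.
        pipe.vae.enable_tiling()
        applied.append("vae_tiling")
    if xformers:
        pipe.enable_xformers_memory_efficient_attention()
        applied.append("xformers")
    return applied


def use_dpm_scheduler(pipe):
    """Swap in a DPM-Solver scheduler; assumes diffusers is installed."""
    from diffusers import DPMSolverMultistepScheduler

    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe
```

With the DPM scheduler in place, a lower `num_inference_steps` is usually enough; the exact value is worth checking on your own prompts.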

Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.

SD Limitations

  • GPU strongly recommended (CPU very slow)

  • Large VRAM requirements for SDXL

  • May generate anatomical errors

  • Prompt engineering matters

Common Patterns

Embedding and Similarity

All three models use embeddings:

  • CLIP: Image/text embeddings for similarity

  • Whisper: The encoder turns audio into embeddings that the decoder transcribes

  • SD: Text embeddings for image conditioning

GPU Acceleration

| Model | VRAM Needed |
| --- | --- |
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |
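The VRAM figures above can drive a simple pre-flight check. This sketch copies the approximate values from the list; the 1 GB headroom and both function names are assumptions, and `available_vram_gb` assumes PyTorch with CUDA support.

```python
# Approximate requirements in GB, taken from the figures above.
APPROX_VRAM_GB = {
    "CLIP ViT-B/32": 2,
    "Whisper turbo": 6,
    "SD 1.5": 6,
    "SDXL": 10,
}


def models_that_fit(available_gb, headroom_gb=1.0):
    """List models whose approximate VRAM need fits in the remaining budget."""
    budget = available_gb - headroom_gb
    return [name for name, need in APPROX_VRAM_GB.items() if need <= budget]


def available_vram_gb():
    """Free memory on the current CUDA device in GiB, or 0.0 without a GPU."""
    import torch

    if not torch.cuda.is_available():
        return 0.0
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1024**3


print(models_that_fit(8))  # ['CLIP ViT-B/32', 'Whisper turbo', 'SD 1.5']
```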

Best Practices

| Practice | Why |
| --- | --- |
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |
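Caching embeddings can be as simple as keying them by a hash of the input bytes. The sketch below is an in-memory version; `EmbeddingCache` is a hypothetical class, and `fake_embed` stands in for a real CLIP image encoder.

```python
import hashlib


class EmbeddingCache:
    """Memoize an expensive embedding function by content hash."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # how many times embed_fn actually ran

    def get(self, data: bytes):
        key = hashlib.sha256(data).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(data)
        return self._store[key]


def fake_embed(data: bytes):
    """Placeholder encoder: returns a one-element vector."""
    return [float(len(data))]


cache = EmbeddingCache(fake_embed)
cache.get(b"image-bytes")
cache.get(b"image-bytes")  # served from the cache
print(cache.misses)  # 1
```

For a large image collection, the same idea extends to storing the vectors on disk so they survive restarts.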

Resources

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
