# HuggingFace Transformers

Access thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.
## When to Use

- Quick inference with pipelines
- Text generation, classification, QA, NER
- Image classification, object detection
- Fine-tuning on custom datasets
- Loading pre-trained models from the HuggingFace Hub
## Pipeline Tasks

### NLP Tasks

| Task | Pipeline Name | Output |
|---|---|---|
| Text Generation | `text-generation` | Completed text |
| Classification | `text-classification` | Label + confidence |
| Question Answering | `question-answering` | Answer span |
| Summarization | `summarization` | Shorter text |
| Translation | `translation_en_to_fr` | Translated text |
| NER | `ner` | Entity spans + types |
| Fill Mask | `fill-mask` | Predicted tokens |
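The pattern for all of these is the same: name the task, get a callable. A minimal sketch, assuming `transformers` and PyTorch are installed (the default checkpoint for each task is downloaded from the Hub on first use):

```python
# Quick inference with a pipeline; "text-classification" falls back to a
# default DistilBERT sentiment checkpoint when no model is specified.
from transformers import pipeline

classifier = pipeline("text-classification")
result = classifier("Transformers makes NLP easy!")
print(result)  # a list of dicts, e.g. [{'label': 'POSITIVE', 'score': ...}]
```

Passing a list of strings instead of a single string batches the inputs in one call.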
### Vision Tasks

| Task | Pipeline Name | Output |
|---|---|---|
| Image Classification | `image-classification` | Label + confidence |
| Object Detection | `object-detection` | Bounding boxes |
| Image Segmentation | `image-segmentation` | Pixel masks |
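Vision pipelines work the same way, taking a `PIL.Image`, a file path, or a URL. A sketch assuming `transformers`, PyTorch, and Pillow are installed (the blank image is a stand-in for a real photo):

```python
# Image classification with the default ViT checkpoint.
from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification")
image = Image.new("RGB", (224, 224), color="red")  # placeholder for a real photo
predictions = classifier(image)
for p in predictions[:3]:  # list of {"label": ..., "score": ...}, best first
    print(p["label"], round(p["score"], 3))
```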
### Audio Tasks

| Task | Pipeline Name | Output |
|---|---|---|
| Speech Recognition | `automatic-speech-recognition` | Transcribed text |
| Audio Classification | `audio-classification` | Label + confidence |
## Model Loading Patterns

### Auto Classes

| Class | Use Case |
|---|---|
| `AutoModel` | Base model (embeddings) |
| `AutoModelForCausalLM` | Text generation (GPT-style) |
| `AutoModelForSeq2SeqLM` | Encoder-decoder (T5, BART) |
| `AutoModelForSequenceClassification` | Classification head |
| `AutoModelForTokenClassification` | NER, POS tagging |
| `AutoModelForQuestionAnswering` | Extractive QA |

**Key concept:** Use Auto classes unless you need a specific architecture: they read the checkpoint's config and instantiate the correct model class automatically.
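A loading sketch using `distilgpt2` as a small, real example checkpoint; the Auto class resolves it to the concrete GPT-2 architecture:

```python
# Load a model and its matching tokenizer with Auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # vocab must match the model
model = AutoModelForCausalLM.from_pretrained(model_name)  # architecture inferred from config

inputs = tokenizer("Hello, world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10,
                         pad_token_id=tokenizer.eos_token_id)  # GPT-2 has no pad token
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```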
## Generation Parameters

| Parameter | Effect | Typical Values |
|---|---|---|
| `max_new_tokens` | Output length | 50-500 |
| `temperature` | Randomness (lower = more deterministic) | 0.1-1.0 |
| `top_p` | Nucleus sampling threshold | 0.9-0.95 |
| `top_k` | Limit vocabulary per step | 50 |
| `num_beams` | Beam search (used with sampling disabled) | 4-8 |
| `repetition_penalty` | Discourage repetition | 1.1-1.3 |

**Key concept:** Higher temperature gives more creative but less coherent output. For factual tasks, use a low temperature (0.1-0.3), or skip sampling entirely with `do_sample=False` for fully deterministic greedy decoding.
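These parameters are passed straight to `generate()`. A sketch on `distilgpt2` (chosen here only because it is small); note that `temperature`, `top_p`, and `top_k` only take effect when `do_sample=True`:

```python
# Sampling-based generation with common knobs set.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,        # cap on generated length
    do_sample=True,           # required for the sampling knobs below to apply
    temperature=0.3,          # low temperature for a factual prompt
    top_p=0.9,                # nucleus sampling
    repetition_penalty=1.2,   # discourage loops
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```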
## Memory Management

### Device Placement Options

| Option | When to Use |
|---|---|
| `device_map="auto"` | Let the library decide GPU allocation |
| `device_map="cuda:0"` | Pin to a specific GPU |
| `device_map="cpu"` | CPU only |

### Quantization Options

| Method | Memory Reduction | Quality Impact |
|---|---|---|
| 8-bit | ~50% | Minimal |
| 4-bit | ~75% | Small for most tasks |
| GPTQ | ~75% | Requires calibration |
| AWQ | ~75% | Activation-aware |

**Key concept:** Use `torch_dtype="auto"` to load weights in the model's native precision (often bfloat16) rather than the float32 default.
## Fine-Tuning Concepts

### Trainer Arguments

| Argument | Purpose | Typical Value |
|---|---|---|
| `num_train_epochs` | Training passes | 3-5 |
| `per_device_train_batch_size` | Samples per GPU | 8-32 |
| `learning_rate` | Step size | 2e-5 for fine-tuning |
| `weight_decay` | Regularization | 0.01 |
| `warmup_ratio` | LR warmup | 0.1 |
| `eval_strategy` (`evaluation_strategy` in older releases) | When to evaluate | `"epoch"` or `"steps"` |
### Fine-Tuning Strategies

| Strategy | Memory | Quality | Use Case |
|---|---|---|---|
| Full fine-tuning | High | Best | Small models, enough data |
| LoRA | Low | Good | Large models, limited GPU |
| QLoRA | Very Low | Good | 7B+ models on consumer GPUs |
| Prefix tuning | Low | Moderate | When you can't modify weights |
## Tokenization Concepts

| Parameter | Purpose |
|---|---|
| `padding` | Make sequences in a batch the same length |
| `truncation` | Cut sequences to `max_length` |
| `max_length` | Maximum tokens (model-specific) |
| `return_tensors` | Output format (`"pt"`, `"tf"`, `"np"`) |

**Key concept:** Always use the tokenizer that matches the model; different models use different vocabularies.
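A batching sketch using the `distilbert-base-uncased` tokenizer as an example; padding and truncation together give a rectangular tensor regardless of input lengths:

```python
# Tokenize a batch of uneven-length texts into a single padded tensor.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
batch = tokenizer(
    ["short text", "a much longer sentence that will be cut off by truncation"],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # drop anything past max_length
    max_length=16,
    return_tensors="pt",  # PyTorch tensors
)
print(batch["input_ids"].shape)  # (2, seq_len): both rows share one length
```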
## Best Practices

| Practice | Why |
|---|---|
| Use pipelines for inference | Handles pre- and post-processing automatically |
| Use `device_map="auto"` | Optimal GPU memory distribution |
| Batch inputs | Better throughput |
| Use quantization for large models | Run 7B+ models on consumer GPUs |
| Match tokenizer to model | Vocabularies differ between models |
| Use `Trainer` for fine-tuning | Built-in best practices |
## Resources

- Model Hub: https://huggingface.co/models
- Course: https://huggingface.co/course