HuggingFace Accelerate - Unified Distributed Training
Quick start
Accelerate adds distributed training to a plain PyTorch script with 4 lines of changes.
Installation:
```bash
pip install accelerate
```
Convert a PyTorch script (the 4 changed lines are marked with +):

```diff
 import torch
+from accelerate import Accelerator

+accelerator = Accelerator()

 model = torch.nn.Transformer()
 optimizer = torch.optim.Adam(model.parameters())
 dataloader = torch.utils.data.DataLoader(dataset)

+model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

 for batch in dataloader:
     optimizer.zero_grad()
     loss = model(batch)
-    loss.backward()
+    accelerator.backward(loss)
     optimizer.step()
```
Run (single command):
```bash
accelerate launch train.py
```
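The same script also runs as an ordinary single-process job, which is convenient for debugging before scaling out:

```bash
python train.py
```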
Common workflows
Workflow 1: From single GPU to multi-GPU
Original script:
```python
# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
```
With Accelerate (4 lines added):
```python
# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic after prepare()
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()
```
Configure (interactive):
```bash
accelerate config
```
Questions:
- Which machine? (single/multi GPU/TPU/CPU)
- How many machines? (1)
- Mixed precision? (no/fp16/bf16/fp8)
- DeepSpeed? (no/yes)
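Answers are written to a YAML file (by default `~/.cache/huggingface/accelerate/default_config.yaml`). A sketch of what an 8-GPU, bf16 answer set might produce (exact keys vary by Accelerate version):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 8
```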
Launch (works on any setup):
```bash
# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 0 \
    --main_process_ip $MASTER_ADDR \
    train.py
```
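The multi-node command is run once on every machine; only `--machine_rank` changes. On the second node, for example:

```bash
# Same command on node 2, with its own rank
accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 1 \
    --main_process_ip $MASTER_ADDR \
    train.py
```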
Workflow 2: Mixed precision training
Enable FP16/BF16:
```python
from accelerate import Accelerator

# FP16 (with automatic gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no loss scaling needed, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic
for batch in dataloader:
    with accelerator.autocast():  # Optional - applied automatically to the prepared model
        loss = model(batch)
    accelerator.backward(loss)
```
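If you clip gradients under fp16, use the accelerator's clipping helper, which unscales the gradients first. A minimal sketch of the update step (`max_norm=1.0` is an illustrative value):

```python
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    # Unscales fp16 gradients before clipping - safe replacement
    # for torch.nn.utils.clip_grad_norm_ under mixed precision
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```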
Workflow 3: DeepSpeed ZeRO integration
Enable DeepSpeed ZeRO-2:
```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                      # ZeRO-2: shard optimizer states and gradients
    offload_optimizer_device="none",   # keep optimizer states on GPU
    gradient_accumulation_steps=4,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)

# Same training code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```
Or via config:
```bash
accelerate config
```
Select: DeepSpeed → ZeRO-2
deepspeed_config.json:
{ "fp16": {"enabled": false}, "bf16": {"enabled": true}, "zero_optimization": { "stage": 2, "offload_optimizer": {"device": "cpu"}, "allgather_bucket_size": 5e8, "reduce_bucket_size": 5e8 } }
Launch. Note that `--config_file` expects the Accelerate config (the YAML written by `accelerate config`, which points at the DeepSpeed JSON through its `deepspeed_config_file` field), not the DeepSpeed JSON itself:

```bash
accelerate launch --config_file accelerate_config.yaml train.py
```
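Recent launcher versions also expose DeepSpeed options directly as flags (flag names as of Accelerate 1.x; check `accelerate launch -h` for your version):

```bash
accelerate launch --use_deepspeed --zero_stage 2 \
    --gradient_accumulation_steps 4 train.py
```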
Workflow 4: FSDP (Fully Sharded Data Parallel)
Enable FSDP:
```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",             # ZeRO-3 equivalent: shard params, grads, optimizer states
    auto_wrap_policy="transformer_based_wrap",  # wrap each transformer block as an FSDP unit
    cpu_offload=False,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```
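One FSDP-specific note from the Accelerate docs: prepare the model before constructing the optimizer, so the optimizer is built over the flattened, sharded parameters. A sketch:

```python
# Wrap with FSDP first, then build the optimizer over the sharded parameters
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is illustrative
optimizer, dataloader = accelerator.prepare(optimizer, dataloader)
```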
Or via config:
```bash
accelerate config
```
Select: FSDP → Full Shard → No CPU Offload
Workflow 5: Gradient accumulation
Accumulate gradients:
```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation across micro-steps
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
```
Effective batch size: batch_size × num_gpus × gradient_accumulation_steps. For example, batch_size=32 on 8 GPUs with gradient_accumulation_steps=4 gives 32 × 8 × 4 = 1024.
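Inside `accumulate()`, gradient synchronization is skipped on all but the final micro-step. The `accelerator.sync_gradients` flag reports whether the current iteration performed a real optimizer update, which is useful for step counting and logging; a sketch:

```python
global_step = 0
for batch in dataloader:
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
    if accelerator.sync_gradients:  # True only on iterations that updated the weights
        global_step += 1
```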
When to use vs alternatives
Use Accelerate when:
- Want the simplest distributed training setup
- Need a single script for any hardware
- Use the HuggingFace ecosystem
- Want backend flexibility (DDP/DeepSpeed/FSDP/Megatron)
- Need quick prototyping
Key advantages:
- 4 lines: Minimal code changes
- Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
- Automatic: Device placement, mixed precision, sharding
- Interactive config: No manual launcher setup
- Single launch command: Works everywhere
Use alternatives instead:
- PyTorch Lightning: Need callbacks and high-level abstractions
- Ray Train: Multi-node orchestration, hyperparameter tuning
- DeepSpeed: Direct API control, advanced features
- Raw DDP: Maximum control, minimal abstraction
Common issues
Issue: Wrong device placement
Don't manually move batches to a device:

```python
# WRONG - fights with Accelerate's automatic placement
batch = batch.to('cuda')

# CORRECT - Accelerate moves batches automatically after prepare()
```
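For tensors you create yourself inside the loop, target `accelerator.device` instead of hard-coding `'cuda'`; it resolves correctly on GPU, CPU, TPU, or MPS. A minimal sketch (the `labels` tensor is illustrative):

```python
labels = torch.zeros(batch.shape[0], dtype=torch.long, device=accelerator.device)
```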
Issue: Gradient accumulation not working
Use the `accumulate()` context manager:

```python
# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    optimizer.step()
```
Issue: Checkpointing in distributed
Use the accelerator's checkpoint methods. `save_state` must run on every process (it coordinates internally and handles sharded states), so don't guard it with `is_main_process`:

```python
# Save - call on all processes; Accelerate decides which ranks write
accelerator.wait_for_everyone()
accelerator.save_state('checkpoint/')

# Load - also on all processes
accelerator.load_state('checkpoint/')
```
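To export final weights for inference, rather than a resumable training state, a common pattern is to consolidate and save from the main process. A sketch using Accelerate's helpers (`model.pt` is an illustrative path):

```python
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)  # consolidates sharded weights (FSDP/ZeRO-3)
if accelerator.is_main_process:
    torch.save(state_dict, 'model.pt')
```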
Issue: Different results with FSDP
Ensure the same random seed on every process:

```python
from accelerate.utils import set_seed

set_seed(42)
```
Advanced topics
Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.
Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.
Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.
Hardware requirements
- CPU: Works (slow)
- Single GPU: Works
- Multi-GPU: DDP (default), DeepSpeed, or FSDP
- Multi-node: DDP, DeepSpeed, FSDP, Megatron
- TPU: Supported
- Apple MPS: Supported
Launcher requirements:
- DDP: torch.distributed.run (built-in)
- DeepSpeed: deepspeed (pip install deepspeed)
- FSDP: PyTorch 1.12+ (built-in)
- Megatron: Custom setup
Resources
- Version: 1.11.0+
- Tutorial: "Accelerate your scripts"
- Examples: https://github.com/huggingface/accelerate/tree/main/examples
- Used by: HuggingFace Transformers, TRL, PEFT, and other HF libraries