SimPO - Simple Preference Optimization
Quick start
SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model.
Installation:
Create environment
conda create -n simpo python=3.10 && conda activate simpo
Install PyTorch 2.2.2
Visit: https://pytorch.org/get-started/locally/
Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git cd alignment-handbook python -m pip install .
Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation
Training (Mistral 7B):
ACCELERATE_LOG_LEVEL=info accelerate launch
--config_file accelerate_configs/deepspeed_zero3.yaml
scripts/run_simpo.py
training_configs/mistral-7b-base-simpo.yaml
Common workflows
Workflow 1: Train from base model (Mistral 7B)
Config (mistral-7b-base-simpo.yaml ):
Model
model_name_or_path: mistralai/Mistral-7B-v0.1 torch_dtype: bfloat16
Dataset
dataset_mixer: HuggingFaceH4/ultrafeedback_binarized: 1.0 dataset_splits:
- train_prefs
- test_prefs
SimPO hyperparameters
beta: 2.0 # Reward scaling (2.0-10.0) gamma_beta_ratio: 0.5 # Target margin (0-1) loss_type: sigmoid # sigmoid or hinge sft_weight: 0.0 # Optional SFT regularization
Training
learning_rate: 5e-7 # Critical: 3e-7 to 1e-6 num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 8
Output
output_dir: ./outputs/mistral-7b-simpo
Launch training:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml
scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
Workflow 2: Fine-tune instruct model (Llama 3 8B)
Config (llama3-8b-instruct-simpo.yaml ):
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer: argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
beta: 2.5 gamma_beta_ratio: 0.5 learning_rate: 5e-7 sft_weight: 0.1 # Add SFT loss to preserve capabilities
num_train_epochs: 1 per_device_train_batch_size: 2 gradient_accumulation_steps: 4 output_dir: ./outputs/llama3-8b-simpo
Launch:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml
scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml
Workflow 3: Reasoning-intensive tasks (lower LR)
For math/code tasks:
model_name_or_path: deepseek-ai/deepseek-math-7b-base
dataset_mixer: argilla/distilabel-math-preference-dpo: 1.0
beta: 5.0 # Higher for stronger signal gamma_beta_ratio: 0.7 # Larger margin learning_rate: 3e-7 # Lower LR for reasoning sft_weight: 0.0
num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 16
When to use vs alternatives
Use SimPO when:
-
Want simpler training than DPO (no reference model)
-
Have preference data (chosen/rejected pairs)
-
Need better performance than DPO
-
Limited compute resources
-
Single-node training sufficient
Algorithm selection:
-
SimPO: Simplest, best performance, no reference model
-
DPO: Need reference model baseline, more conservative
-
PPO: Maximum control, need reward model, complex setup
-
GRPO: Memory-efficient RL, no critic
Use alternatives instead:
-
OpenRLHF: Multi-node distributed training, PPO/GRPO
-
TRL: Need multiple methods in one framework
-
DPO: Established baseline comparison
Common issues
Issue: Loss divergence
Reduce learning rate:
learning_rate: 3e-7 # Reduce from 5e-7
Reduce beta:
beta: 1.0 # Reduce from 2.0
Issue: Model forgets capabilities
Add SFT regularization:
sft_weight: 0.1 # Add SFT loss component
Issue: Poor preference separation
Increase beta and margin:
beta: 5.0 # Increase from 2.0 gamma_beta_ratio: 0.8 # Increase from 0.5
Issue: OOM during training
Reduce batch size:
per_device_train_batch_size: 1 gradient_accumulation_steps: 16 # Maintain effective batch
Enable gradient checkpointing:
gradient_checkpointing: true
Advanced topics
Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.
Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.
Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.
Hardware requirements
-
GPU: NVIDIA A100/H100 recommended
-
VRAM:
-
7B model: 1× A100 40GB (DeepSpeed ZeRO-3)
-
8B model: 2× A100 40GB
-
70B model: 8× A100 80GB
-
Single-node: DeepSpeed ZeRO-3 sufficient
-
Mixed precision: BF16 recommended
Memory optimization:
-
DeepSpeed ZeRO-3 (default config)
-
Gradient checkpointing
-
Flash Attention 2
Resources
-
Paper: https://arxiv.org/abs/2405.14734 (NeurIPS 2024)
-
Alignment Handbook: https://github.com/huggingface/alignment-handbook