miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
FP8 or INT4 quantization-aware training
Bit-wise identical train-inference alignment
Speculative RL for maximum throughput
Production stability with enterprise support

Consider alternatives when:

You want the research-grade original → use slime
You need flexible backend swapping → use verl
You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

Unified FP8: End-to-end FP8 for both inference and training
INT4 QAT: Fits 1TB-class models in the VRAM of a single H200 machine
Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

Speculative RL: 25%+ rollout speedup with online SFT draft models
Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
Partial Rollout: Recycle half-finished trajectories

Train-Inference Alignment

TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
Kernel-level optimization: FlashAttention-3, DeepGEMM integration

Installation

Recommended: Docker

docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

From source

git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:

python train.py \
  --advantage-estimator grpo \
  --model-name qwen3-30b-a3b \
  --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
  --rollout-batch-size 512 \
  --n-samples-per-prompt 8
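To make the grpo estimator and --n-samples-per-prompt 8 concrete, the sketch below computes group-relative advantages for the 8 responses sampled from one prompt. It is an illustration of the general GRPO idea, not the miles implementation: the function name grpo_advantages, the eps value, and the choice of population standard deviation are all assumptions.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples of the same prompt.

    Hypothetical helper: miles/slime may normalize differently
    (for example, sample vs. population standard deviation).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, 8 sampled responses with binary rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)

# A group where every response gets the same reward carries no signal.
flat_advantages = grpo_advantages([1.0] * 8)
```

Responses above the group mean get a positive advantage and those below get a negative one; a group with identical rewards yields all zeros, which is why more samples per prompt tends to give a more useful signal.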

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

H100/H200 GPUs with FP8 support
MoE model (DeepSeek V3, Qwen3-MoE)
Docker environment with miles

Step 1: Environment Setup

FP8 block scaling (recommended for stability)

export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

python train.py \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --hf-checkpoint /path/to/deepseek-v3 \
  --advantage-estimator grpo \
  --tensor-model-parallel-size 8 \
  --expert-model-parallel-size 4 \
  --prompt-data /path/to/data.jsonl \
  --num-rollout 3000

Verification Checklist

Model loads without errors
Routing decisions are consistent
No NaN/Inf in loss values

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

Small draft model generates candidate tokens
Target model verifies in parallel
Draft model updated via online SFT to track policy
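The draft-and-verify loop described above can be sketched with toy greedy models. This is a simplified illustration, not the SGLang or EAGLE implementation: speculative_step, draft_next, and target_next are hypothetical names, the "models" are plain functions over token lists, and verification runs sequentially here whereas a real target model scores all drafted positions in one forward pass.

```python
def speculative_step(prefix, draft_next, target_next, num_draft_tokens):
    """Run one greedy draft-then-verify step.

    Returns (new_tokens, num_accepted_draft_tokens).
    """
    # 1. The draft model proposes a short continuation.
    proposed, ctx = [], list(prefix)
    for _ in range(num_draft_tokens):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed token in order.
    new_tokens, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(token)
        ctx.append(token)

    # 3. Every draft token matched, so the target adds one bonus token.
    new_tokens.append(target_next(ctx))
    return new_tokens, len(proposed)

# Toy models: the target counts upward mod 10; the draft agrees
# except that it mispredicts whenever the last token is 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

tokens, accepted = speculative_step([1], draft_next, target_next, 4)
# tokens == [2, 3, 4], accepted == 2
```

The output always equals what the target model alone would have generated, so the speedup depends entirely on how many draft tokens are accepted per step. That is why the draft model is kept close to the policy with online SFT.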

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

python train.py \
  --actor-num-gpus-per-node 8 \
  --hf-checkpoint /path/to/target-model \
  --sglang-speculative-algorithm EAGLE \
  --sglang-speculative-num-steps 3 \
  --sglang-speculative-eagle-topk 1 \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-speculative-draft-model-path /path/to/draft-model \
  --advantage-estimator grpo \
  --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

To update the draft model with online SFT during training, add:

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

Standard rollout: Baseline
Speculative RL: 25-40% faster rollout
With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4  # MoE expert parallelism
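As a sanity check on the parallelism sizes, the arithmetic below shows how many GPUs one model replica occupies and how many data-parallel replicas fit in a given cluster. It is a rough sketch only: data_parallel_size is a hypothetical helper, the 32-GPU world size is an example value, and the calculation deliberately ignores context parallelism and any constraints tied to --expert-model-parallel-size, which depend on the Megatron version in use.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Number of model replicas for a given GPU count (hypothetical helper)."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"TP x PP = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica

# TP=8 and PP=2 mean one replica spans 16 GPUs.
replicas = data_parallel_size(world_size=32, tensor_parallel=8, pipeline_parallel=2)
# replicas == 2
```

With these settings a single 8-GPU node cannot hold a replica, so the training job would need at least two such nodes.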

Speculative Decoding (miles-specific)

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Key Features (Conceptual)

The following features are documented in miles, but the specific CLI flags may vary between versions. Consult the miles repository for the latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training. Using the same precision on both sides removes the quantization-induced train-inference discrepancy that can cause RL collapse in MoE models.

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

During SGLang inference, expert routing decisions are recorded
Routing decisions stored in sample.rollout_routed_experts
During Megatron training, routing is replayed instead of recomputed
Ensures identical expert selection between train and inference
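The record-and-replay idea can be illustrated for a single token and a single MoE layer. This is a toy sketch, not the miles code: topk_route and gate_weights are hypothetical helpers, the logits are made-up values, and recomputing the gate weights from training-side logits over the replayed experts is an assumption about how the replay is applied.

```python
import math

def topk_route(router_logits, k):
    """Indices of the k experts with the largest router logits."""
    order = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])
    return order[:k]

def gate_weights(router_logits, experts):
    """Softmax restricted to the given experts."""
    peak = max(router_logits[e] for e in experts)
    exps = [math.exp(router_logits[e] - peak) for e in experts]
    total = sum(exps)
    return [x / total for x in exps]

# Tiny numerical differences between engines flip a near-tie.
inference_logits = [2.00, 1.00, 0.99, 0.10]
training_logits = [2.00, 0.99, 1.00, 0.10]

recorded = topk_route(inference_logits, k=2)    # stored at rollout time
recomputed = topk_route(training_logits, k=2)   # what training would pick alone

# R3: ignore the recomputed choice and reuse the recorded experts.
replayed = recorded
weights = gate_weights(training_logits, replayed)
```

Here recorded is [0, 1] while recomputed is [0, 2], so without replay the token would pass through a different expert in training than it did during rollout.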

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

Model Size   BF16 VRAM   INT4 VRAM   Reduction
70B          140GB       45GB        3.1x
235B         470GB       150GB       3.1x
671B         1.3TB       420GB       3.1x
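The BF16 column follows directly from the parameter count at 2 bytes per parameter. The sketch below reproduces that arithmetic and also shows the nominal INT4 figure at 4 bits per parameter. weight_memory_gb is a hypothetical helper that counts weights only; the gap between the nominal INT4 value and the table's INT4 column is presumably overhead such as quantization scales and layers kept in higher precision, but that explanation is an assumption, not something the source states.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weight-only memory in GB (10^9 bytes); ignores activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e9

bf16_70b = weight_memory_gb(70e9, 16)   # 140.0, matches the table
int4_70b = weight_memory_gb(70e9, 4)    # 35.0 nominal, table lists 45GB

# Ratio implied by the table's own numbers for the 70B row.
table_reduction = 140 / 45
```

The nominal reduction from 16 to 4 bits is 4x, while the table's figures work out to roughly 3.1x once the extra INT4 overhead is included.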

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

Flash Attention 3
DeepGEMM
Batch-invariant kernels from Thinking Machines Lab
torch.compile integration

Sample Data Structure

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:

@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3

See slime API Reference for the complete Sample definition.

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
Reduce learning rate: --lr 5e-7
Ensure MoE routing is consistent between train/inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time

Solutions:

Enable online MTP training to keep draft model aligned
Reduce speculative steps: --sglang-speculative-num-steps 2
Use CPU backup: --sglang-enable-draft-weights-cpu-backup
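To see why a falling acceptance rate erodes the speedup, the sketch below uses the standard simplified model in which each drafted token is accepted independently with probability alpha. Under that assumption the expected number of tokens produced per verification step is (1 - alpha^(k+1)) / (1 - alpha) for k drafted tokens. The helper name, alpha, and k are illustrative and do not map one-to-one onto the SGLang flags.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens per verify step with k draft tokens.

    Assumes independent per-token acceptance with probability alpha,
    which is a simplification of real draft-model behavior.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

well_aligned = expected_tokens_per_step(0.8, 4)   # about 3.36
drifted = expected_tokens_per_step(0.5, 4)        # about 1.94
```

Dropping from 80% to 50% acceptance nearly halves the tokens gained per step in this model, which is the effect that online MTP training is meant to prevent.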

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
Verify log probs match between SGLang and Megatron
Enable R3 for MoE models
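The two correction schemes named above can be sketched per token from the log probabilities reported by the training and rollout engines. This is an illustration of the general techniques rather than the miles implementation: the function names and the cap, low, and high bounds are hypothetical, and they are not meant to mirror the semantics of --tis-threshold.

```python
import math

def truncated_is_weights(train_logps, rollout_logps, cap=2.0):
    """TIS: ratio pi_train / pi_rollout per token, truncated above at cap."""
    return [min(math.exp(t - r), cap) for t, r in zip(train_logps, rollout_logps)]

def masked_is_weights(train_logps, rollout_logps, low=0.5, high=2.0):
    """MIS: keep the ratio inside [low, high], zero out tokens outside it."""
    weights = []
    for t, r in zip(train_logps, rollout_logps):
        ratio = math.exp(t - r)
        weights.append(ratio if low <= ratio <= high else 0.0)
    return weights

# Per-token log probs from Megatron (train) and SGLang (rollout).
train_logps = [-1.0, -0.5, -2.0]
rollout_logps = [-1.0, -1.5, -0.5]

tis = truncated_is_weights(train_logps, rollout_logps)
mis = masked_is_weights(train_logps, rollout_logps)
# tis: first token 1.0, second capped at 2.0, third about 0.22
# mis: [1.0, 0.0, 0.0]
```

The resulting weights multiply the per-token policy-gradient loss. Truncation limits the influence of tokens the training engine finds far more likely than the rollout engine did, while masking drops badly mismatched tokens entirely.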

Supported Models

Family     Models                       MoE Support
DeepSeek   R1, V3, V3.2                 Full
Qwen       2, 2.5, 3 (including MoE)    Full
Llama      3, 3.1, 3.3, 4               Dense only
Gemma      2, 3, 3N                     Dense only
GLM        4.5, 4.6, 4.7                Dense only
MiniMax    M2, M2.1                     Full

Resources

GitHub: https://github.com/radixark/miles
Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
Slime (upstream): https://github.com/THUDM/slime
SGLang: https://github.com/sgl-project/sglang
