RWKV - Receptance Weighted Key Value
Quick start
RWKV (pronounced "RwaKuv") combines Transformer-style parallel training with RNN-style efficient inference.
Installation:
# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
# Install RWKV
pip install rwkv
Basic usage (GPT mode + RNN mode):
import os

# Set these before importing rwkv.model; the package reads them at import time
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'  # Use CUDA kernel for speed

from rwkv.model import RWKV

# Load model
model = RWKV(model='/path/to/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy())  # Logits

# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None)   # First 2 tokens
out, state = model.forward([1563], state)      # Next token
out, state = model.forward([310, 247], state)  # Last tokens
print(out.detach().cpu().numpy())  # Same logits as above, up to floating-point error
Common workflows
Workflow 1: Text generation (streaming)
Efficient token-by-token generation:
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")

# Initial prompt
prompt = "The future of AI is"
state = None

# Feed the whole prompt in one call to build the state
out, state = pipeline.model.forward(pipeline.encode(prompt), state)

# Continue generation: sample a token, print it, then feed it back
for _ in range(100):
    token = pipeline.sample_logits(out)
    print(pipeline.decode([token]), end='', flush=True)
    out, state = pipeline.model.forward([token], state)
Key advantage: Constant memory per token (no growing KV cache)
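The sampling step in the generation loop turns logits into a token ID. As a rough illustration of what temperature and nucleus (top-p) sampling do, here is a pure-Python sketch. It is not the library's `sample_logits` implementation; the function name `sample_top_p` and its defaults are assumptions chosen for the example.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.85, rng=random):
    """Sample an index from `logits` with temperature and top-p filtering."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens, weighted by their probabilities
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

token = sample_top_p([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_p=0.9)
print(token)
```

Lower `top_p` or `temperature` values make the output more deterministic; higher values make it more varied.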
Workflow 2: Long context processing (infinite context)
Process million-token sequences:
from rwkv.model import RWKV

model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')

# Process very long document
# load_document() and chunks() are placeholders for your own loader and chunking helper
state = None
long_document = load_document()  # Token IDs, e.g., 1M tokens

# Stream through entire document
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)

# State now summarizes the entire 1M token document in a fixed-size set of tensors
# Memory usage: O(1) (constant, not O(n)!)
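The loop above relies on a `chunks` helper that the snippet does not define. One possible version, assuming the document is already a list of token IDs, is sketched here; the name and signature simply mirror the call in the example.

```python
def chunks(tokens, chunk_size):
    """Yield consecutive slices of `tokens`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

# Ten token IDs in chunks of four: the last chunk is shorter
print(list(chunks(list(range(10)), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Using a generator keeps only one chunk in flight at a time, which matches the constant-memory goal of this workflow.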
Workflow 3: Fine-tuning RWKV
Standard fine-tuning workflow:
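# Training script (schematic). The pip `rwkv` package is built for inference;
# a trainable LightningModule comes from the RWKV-LM training code, so adjust
# the model import to match your checkout.
import pytorch_lightning as pl
from rwkv.model import RWKV  # placeholder for the trainable model class

# Configure model
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024,
}

# Setup trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1,
)

# Train (train_dataloader is your own DataLoader over tokenized data)
model = RWKV(config)
trainer.fit(model, train_dataloader)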
Workflow 4: RWKV vs Transformer comparison
Memory comparison (1M token sequence):
Transformer (GPT)
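Memory: O(n²) for naive attention scores, O(n) for the KV cache
KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
Example: 1M × 4096 × 24 × 2 ≈ 197 billion values, or ~400GB at FP16 (impractical!)
RWKV
Memory: O(1) per token
State: a fixed number of hidden_dim-sized vectors per layer (five in the RWKV-4 inference code), so 5 × 4096 × 24 ≈ 491K values, on the order of 1-2MB
Several hundred thousand times smaller at this sequence length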
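The memory arithmetic can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the function names are made up for illustration, 2 bytes per value assumes FP16 throughout, and `vectors_per_layer` is left as a parameter because the number of state tensors per layer depends on the RWKV version (five is assumed here, matching the RWKV-4 inference code).

```python
def kv_cache_bytes(n_tokens, hidden_dim, n_layers, bytes_per_value=2):
    """Keys and values for every token in every layer."""
    return n_tokens * hidden_dim * n_layers * 2 * bytes_per_value

def rwkv_state_bytes(hidden_dim, n_layers, vectors_per_layer=5, bytes_per_value=2):
    """Fixed-size recurrent state, independent of sequence length."""
    return vectors_per_layer * hidden_dim * n_layers * bytes_per_value

kv = kv_cache_bytes(1_000_000, 4096, 24)
state = rwkv_state_bytes(4096, 24)

print(f"KV cache: {kv / 1e9:.0f} GB")   # KV cache: 393 GB
print(f"State:    {state / 1e6:.2f} MB")  # State:    0.98 MB
print(f"Ratio:    {kv // state}x")      # Ratio:    400000x
```

The ratio grows linearly with sequence length, since only the KV cache term depends on `n_tokens`.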
Speed comparison (inference):
Transformer: O(n) per token (quadratic overall)
First token: 1 computation
Second token: 2 computations
...
1000th token: 1000 computations
RWKV: O(1) per token (linear overall)
Every token: 1 computation
1000th token: 1 computation (same as first!)
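Summing the per-token counts above shows where "quadratic overall" and "linear overall" come from. The sketch below counts abstract steps only (one step per attended position or per state update); it is not a benchmark, and the function names are illustrative.

```python
def transformer_steps(n_tokens):
    """Token t attends to t positions, so the total is 1 + 2 + ... + n."""
    return n_tokens * (n_tokens + 1) // 2

def rwkv_steps(n_tokens):
    """Each token performs one fixed-size state update."""
    return n_tokens

print(transformer_steps(1000))  # 500500
print(rwkv_steps(1000))         # 1000
```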
When to use vs alternatives
Use RWKV when:
- Need very long context (100K+ tokens)
- Want constant memory usage
- Building streaming applications
- Need RNN efficiency with Transformer performance
- Memory-constrained deployment
Key advantages:
- Linear time: O(n) vs O(n²) for Transformers
- No KV cache: Constant memory per token
- Infinite context: No fixed window limit
- Parallelizable training: Like GPT
- Sequential inference: Like RNN
Use alternatives instead:
- Transformers: Need absolute best performance, have compute
- Mamba: Want state-space models
- RetNet: Need retention mechanism
- Hyena: Want convolution-based approach
Common issues
Issue: Out of memory during training
Shard parameters and optimizer state with DeepSpeed ZeRO-3, and enable gradient checkpointing if your training script supports it:
trainer = pl.Trainer(
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16',
)
Issue: Slow inference
Enable the CUDA kernel (set this before importing rwkv.model):
os.environ["RWKV_CUDA_ON"] = '1'
Issue: Model not loading
Check model path and strategy:
model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16',  # Or 'cpu fp32' for CPU
)
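A quick check before constructing the model can turn a confusing load failure into a readable error. The helper below is a hypothetical sketch, not part of the `rwkv` package. It assumes the loader accepts a path with or without the `.pth` suffix, as the examples in this guide suggest.

```python
import os

def check_model_path(path):
    """Return the absolute checkpoint path, or raise with a readable message."""
    # Assumption: a missing ".pth" suffix is added by the loader, so add it here too
    candidate = path if path.endswith(".pth") else path + ".pth"
    candidate = os.path.abspath(os.path.expanduser(candidate))
    if not os.path.isfile(candidate):
        raise FileNotFoundError(f"No checkpoint found at {candidate}")
    return candidate

# Usage (path is a placeholder):
# model_path = check_model_path('/absolute/path/to/model.pth')
```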
Issue: State management in RNN mode
Always pass state between forward calls:
# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None)  # No context from tokens1!

# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state)  # Has context from tokens1
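One way to make the state harder to drop by accident is to wrap the forward call in a small object that owns the state. The class below is a hypothetical sketch, not part of the `rwkv` package. It works with any callable that takes `(tokens, state)` and returns `(out, state)`, and a toy forward function stands in for the model so the example runs on its own.

```python
class StatefulSession:
    """Carries the recurrent state across calls so callers cannot lose it."""

    def __init__(self, forward):
        self._forward = forward  # any callable: (tokens, state) -> (out, state)
        self.state = None

    def feed(self, tokens):
        out, self.state = self._forward(tokens, self.state)
        return out

    def reset(self):
        """Start a fresh context."""
        self.state = None


def toy_forward(tokens, state):
    """Stand-in for model.forward: the 'state' is a running sum of token IDs."""
    total = (state or 0) + sum(tokens)
    return total, total


session = StatefulSession(toy_forward)
print(session.feed([1, 2]))  # 3
print(session.feed([3]))     # 6
session.reset()
print(session.feed([3]))     # 3
```

With a real model, the same wrapper would be constructed as `StatefulSession(model.forward)`.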
Advanced topics
Time-mixing and channel-mixing: See references/architecture-details.md for WKV operation, time-decay mechanism, and receptance gates.
State management: See references/state-management.md for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.
RWKV-7 improvements: See references/rwkv7.md for latest architectural improvements (March 2025) and multimodal capabilities.
Hardware requirements
- GPU: NVIDIA (CUDA 11.6+) or CPU
- VRAM (FP16):
  - 169M model: 1GB
  - 430M model: 2GB
  - 1.5B model: 4GB
  - 3B model: 8GB
  - 7B model: 16GB
  - 14B model: 32GB
- Inference: O(1) memory per token
- Training: Parallelizable like GPT
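The VRAM figures above can be sanity-checked against the size of the weights alone. At FP16 each parameter takes 2 bytes, and the listed numbers leave headroom on top of that. The sketch below is a rough estimate that ignores activations, state, and framework overhead, and the function name is illustrative.

```python
def weight_gib(n_params, bytes_per_param=2):
    """Memory taken by the weights alone, in GiB (FP16 by default)."""
    return n_params * bytes_per_param / 2**30

for name, n_params in [("1.5B", 1.5e9), ("7B", 7e9), ("14B", 14e9)]:
    print(f"{name}: {weight_gib(n_params):.1f} GiB")
# 1.5B: 2.8 GiB
# 7B: 13.0 GiB
# 14B: 26.1 GiB
```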
Performance (vs Transformers):
- Speed: Similar training, faster inference
- Memory: Constant-size state instead of a KV cache that grows with sequence length (orders of magnitude less for long sequences)
- Scaling: Linear vs quadratic
Resources
- Paper (RWKV): https://arxiv.org/abs/2305.13048 (May 2023)
- Paper (RWKV-7): https://arxiv.org/abs/2503.14456 (March 2025)
- GitHub: https://github.com/BlinkDL/RWKV-LM ⭐ 12,000+
- Docs: https://wiki.rwkv.com/
- Models: https://huggingface.co/BlinkDL
- Linux Foundation AI: Official project
- Production: Microsoft Windows, Office integration, NeMo support