rwkv-architecture

RWKV - Receptance Weighted Key Value

Install skill "rwkv-architecture" with this command: npx skills add ovachiever/droid-tings/ovachiever-droid-tings-rwkv-architecture

Quick start

RWKV (pronounced "RwaKuv") combines Transformer-style parallel training with RNN-style inference, where each generated token costs constant time and memory.

Installation:

    # Install PyTorch
    pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121

    # Install dependencies
    pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade

    # Install RWKV
    pip install rwkv

Basic usage (GPT mode + RNN mode):

    import os

    # Set these before importing rwkv.model, which reads them at import time
    os.environ["RWKV_JIT_ON"] = '1'
    os.environ["RWKV_CUDA_ON"] = '1'  # Use CUDA kernel for speed

    from rwkv.model import RWKV

    # Load model
    model = RWKV(
        model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
        strategy='cuda fp16'
    )

    # GPT mode (parallel processing)
    out, state = model.forward([187, 510, 1563, 310, 247], None)
    print(out.detach().cpu().numpy())  # Logits

    # RNN mode (sequential processing, same result)
    out, state = model.forward([187, 510], None)    # First 2 tokens
    out, state = model.forward([1563], state)       # Next token
    out, state = model.forward([310, 247], state)   # Last tokens
    print(out.detach().cpu().numpy())  # Same logits as above (up to floating-point error)
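The two modes agree because the weighted key-value sum can be written either as a direct sum over all earlier positions or as a recurrence over a small running state. The following is a minimal sketch of that idea for a single channel, assuming the RWKV-4 style formulation with a scalar decay `w` and bonus `u`. The function names are illustrative, and the code omits the numerical stabilization that real kernels use, so it is only suitable for small inputs.

```python
import math

def wkv_parallel(k, v, w, u):
    """Direct form: each output sums over all earlier positions."""
    out = []
    for t in range(len(k)):
        # The current token gets the bonus u instead of a decay.
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_recurrent(k, v, w, u, state=(0.0, 0.0)):
    """Recurrent form: only the pair (a, b) is carried between calls."""
    a, b = state
    out = []
    for kt, vt in zip(k, v):
        cur = math.exp(u + kt)
        out.append((a + cur * vt) / (b + cur))
        # Decay the running sums, then add the current token.
        a = math.exp(-w) * a + math.exp(kt) * vt
        b = math.exp(-w) * b + math.exp(kt)
    return out, (a, b)

# Hypothetical toy inputs for one channel.
k = [0.1, -0.3, 0.5, 0.2, -0.1]
v = [1.0, 2.0, -1.0, 0.5, 3.0]
full = wkv_parallel(k, v, w=0.5, u=0.3)
first, state = wkv_recurrent(k[:2], v[:2], w=0.5, u=0.3)
rest, _ = wkv_recurrent(k[2:], v[2:], w=0.5, u=0.3, state=state)
print(max(abs(x - y) for x, y in zip(full, first + rest)))
```

Splitting the sequence at any point and passing the state along gives the same outputs as the direct form, which is the property the GPT-mode and RNN-mode calls rely on.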

Common workflows

Workflow 1: Text generation (streaming)

Efficient token-by-token generation:

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE

    model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
    pipeline = PIPELINE(model, "20B_tokenizer.json")

    # Initial prompt: encode it once and feed all tokens to build the state
    prompt = "The future of AI is"
    out, state = model.forward(pipeline.encode(prompt), None)

    # Continue generation: sample a token, print it, then feed it back
    for _ in range(100):
        token = pipeline.sample_logits(out)
        print(pipeline.decode([token]), end='', flush=True)
        out, state = model.forward([token], state)

Key advantage: Constant memory per token (no growing KV cache)
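The sampling step in the loop above turns logits into a token ID. As a rough illustration of what such a step does, here is a sketch of temperature plus top-p (nucleus) sampling in plain Python. It is not the library's `sample_logits` implementation; the function name `sample_top_p` and its defaults are assumptions made for this example.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.85, rng=random):
    """Sample an index from logits using temperature and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs behave the same
print(sample_top_p([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_p=0.9, rng=rng))
```

Lower `top_p` or `temperature` values make the output more deterministic; higher values make it more varied.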

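Workflow 2: Long context processing (unbounded sequence length)

Stream sequences of arbitrary length through a fixed-size state:

    model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')

    # Process very long document
    state = None
    long_document = load_document()  # placeholder: returns a list of token IDs, e.g., 1M tokens

    # Stream through entire document in fixed-size chunks
    chunk_size = 1024
    for i in range(0, len(long_document), chunk_size):
        out, state = model.forward(long_document[i:i + chunk_size], state)

    # State is now a fixed-size summary of the entire 1M token document

Memory usage: O(1) in sequence length (the state does not grow, unlike an O(n) KV cache). Because the state is a fixed-size summary, recall of early details is lossy and depends on the context length the model was trained with.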

Workflow 3: Fine-tuning RWKV

Standard fine-tuning workflow:

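    # Training script (sketch). The pip `rwkv` package is inference-only:
    # rwkv.model.RWKV loads a checkpoint and is not a LightningModule.
    # `model` below must be the LightningModule from your training code
    # (for example, the one in the RWKV-LM repository), built from `config`.
    import pytorch_lightning as pl

    # Configure model
    config = {
        'n_layer': 24,
        'n_embd': 1024,
        'vocab_size': 50277,
        'ctx_len': 1024
    }

    # Setup trainer
    trainer = pl.Trainer(
        accelerator='gpu',
        devices=8,
        precision='bf16',
        strategy='deepspeed_stage_2',
        max_epochs=1
    )

    # Train (model and train_dataloader come from your training code)
    trainer.fit(model, train_dataloader)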

Workflow 4: RWKV vs Transformer comparison

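Memory comparison (1M token sequence, fp16):

Transformer (GPT)

Memory: O(n) for the KV cache during inference (O(n²) if the full attention matrix is materialized)

KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)

Example: 1M × 4096 × 24 × 2 ≈ 197B values ≈ 400GB at 2 bytes each (impractical!)

RWKV

Memory: O(1) in sequence length

State (RWKV-4): 5 vectors × hidden_dim × n_layers = 5 × 4096 × 24 ≈ 492K values ≈ 1MB at 2 bytes each

Roughly 400,000× smaller at this sequence length, and the gap widens as the sequence grows.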
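The arithmetic behind this comparison can be checked with a short script. This is a sketch under stated assumptions: 2 bytes per value (fp16) and a configurable number of state vectors per layer, since that count differs between RWKV versions. The function names are illustrative.

```python
def kv_cache_bytes(seq_len, hidden_dim, n_layers, bytes_per_value=2):
    """Keys and values for every position in every layer."""
    return seq_len * hidden_dim * n_layers * 2 * bytes_per_value

def recurrent_state_bytes(hidden_dim, n_layers, vectors_per_layer, bytes_per_value=2):
    """Fixed-size state: independent of sequence length."""
    return hidden_dim * n_layers * vectors_per_layer * bytes_per_value

kv = kv_cache_bytes(1_000_000, 4096, 24)
# Assumption: 5 state vectors per layer, as in RWKV-4 style states.
state = recurrent_state_bytes(4096, 24, vectors_per_layer=5)

print(f"KV cache: {kv / 1e9:.1f} GB")
print(f"State:    {state / 1e6:.2f} MB")
print(f"Ratio:    {kv / state:,.0f}x")
```

Only the KV cache figure changes when `seq_len` changes; the state size stays the same for any sequence length.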

Speed comparison (inference):

Transformer: O(n) per token (quadratic overall)

First token: 1 computation

Second token: 2 computations

...

1000th token: 1000 computations

RWKV: O(1) per token (linear overall)

Every token: 1 computation

1000th token: 1 computation (same as first!)
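The totals implied by this per-token counting can be sketched as follows, treating one "computation" as one unit of work per attended position (Transformer) or per token (RWKV). The function names are illustrative, and the unit is a counting abstraction, not a FLOP measurement.

```python
def transformer_decode_steps(n_tokens):
    """Token t attends to t positions, so the total is 1 + 2 + ... + n."""
    return sum(range(1, n_tokens + 1))

def rwkv_decode_steps(n_tokens):
    """Each token updates a fixed-size state once."""
    return n_tokens

for n in (10, 1000, 100_000):
    print(n, transformer_decode_steps(n), rwkv_decode_steps(n))
```

The Transformer total grows as n(n+1)/2, which is where "quadratic overall" comes from, while the RWKV total grows linearly.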

When to use vs alternatives

Use RWKV when:

  • Need very long context (100K+ tokens)

  • Want constant memory usage

  • Building streaming applications

  • Need RNN efficiency with Transformer performance

  • Memory-constrained deployment

Key advantages:

  • Linear time: O(n) vs O(n²) for Transformers

  • No KV cache: Constant memory per token

  • Unbounded context: No fixed window limit (recall quality still depends on the training context length)

  • Parallelizable training: Like GPT

  • Sequential inference: Like RNN

Use alternatives instead:

  • Transformers: Need absolute best performance, have compute

  • Mamba: Want state-space models

  • RetNet: Need retention mechanism

  • Hyena: Want convolution-based approach

Common issues

Issue: Out of memory during training

Use gradient checkpointing (in your training code) and DeepSpeed ZeRO-3:

    trainer = pl.Trainer(
        strategy='deepspeed_stage_3',  # Full ZeRO-3
        precision='bf16'
    )

Issue: Slow inference

Enable the CUDA kernel. Set the variable before importing rwkv.model, which reads it at import time:

    import os
    os.environ["RWKV_CUDA_ON"] = '1'

Issue: Model not loading

Check model path and strategy:

    model = RWKV(
        model='/absolute/path/to/model.pth',
        strategy='cuda fp16'  # Or 'cpu fp32' for CPU
    )

Issue: State management in RNN mode

Always pass state between forward calls:

    # WRONG: State lost
    out1, _ = model.forward(tokens1, None)
    out2, _ = model.forward(tokens2, None)  # No context from tokens1!

    # CORRECT: State preserved
    out1, state = model.forward(tokens1, None)
    out2, state = model.forward(tokens2, state)  # Has context from tokens1
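A related pattern is reusing one prompt's state for several continuations. The sketch below uses a toy stand-in for `model.forward` and assumes the state is a mutable container that a forward call may update in place (an assumption for illustration, not something stated above). Under that assumption, taking a deep copy before each branch keeps the saved state intact.

```python
import copy

def toy_forward(tokens, state):
    """Toy stand-in for model.forward: tracks a running sum and count."""
    if state is None:
        state = [0, 0]
    for t in tokens:
        state[0] += t  # mutates the list in place
        state[1] += 1
    return state[0] / state[1], state

# Build the state for a shared prompt once and snapshot it.
_, prompt_state = toy_forward([1, 2, 3], None)
snapshot = copy.deepcopy(prompt_state)

# Each branch starts from its own copy of the snapshot.
out_a, _ = toy_forward([10], copy.deepcopy(snapshot))
out_b, _ = toy_forward([20], copy.deepcopy(snapshot))
print(out_a, out_b, snapshot)
```

Without the copies, the second branch would start from a state that already includes the first branch's tokens.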

Advanced topics

Time-mixing and channel-mixing: See references/architecture-details.md for WKV operation, time-decay mechanism, and receptance gates.

State management: See references/state-management.md for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.

RWKV-7 improvements: See references/rwkv7.md for latest architectural improvements (March 2025) and multimodal capabilities.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.6+) or CPU

  • VRAM (FP16):

      • 169M model: 1GB

      • 430M model: 2GB

      • 1.5B model: 4GB

      • 3B model: 8GB

      • 7B model: 16GB

      • 14B model: 32GB

  • Inference: O(1) memory per token

  • Training: Parallelizable like GPT
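The VRAM figures above are consistent with fp16 weights taking 2 bytes per parameter, plus headroom for activations, state, and framework overhead. A rough lower bound for the weights alone can be sketched like this; the function name is illustrative and the parameter counts are the nominal model sizes.

```python
def fp16_weight_gb(n_params, bytes_per_param=2):
    """Approximate weight memory in decimal gigabytes (weights only)."""
    return n_params * bytes_per_param / 1e9

nominal_sizes = {
    "169M": 169_000_000,
    "430M": 430_000_000,
    "1.5B": 1_500_000_000,
    "3B": 3_000_000_000,
    "7B": 7_000_000_000,
    "14B": 14_000_000_000,
}

for name, n_params in nominal_sizes.items():
    print(f"{name}: ~{fp16_weight_gb(n_params):.2f} GB for weights")
```

Actual usage depends on the strategy string, batch size, and kernel choices, so treat these numbers as a floor, not a measurement.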

Performance (vs Transformers):

  • Speed: Similar training, faster inference

  • Memory: Constant-size state vs a KV cache that grows linearly with sequence length

  • Scaling: Linear vs quadratic
