SGLang AMD Benchmark
Benchmark sglang LLM serving on AMD Instinct GPUs across parallel configurations (TP/DP/EP) and workload shapes (ISL/OSL/Concurrency). This skill runs in mix mode (non-disaggregated) — prefill and decode happen on the same GPUs. It produces a performance baseline and suggests config-level optimizations.
Run Rules (non-negotiable)
These rules apply to every benchmark run in this skill. (A profiling-stage-separation rule exists in the broader sglang-run guidance but is intentionally omitted here, since this skill does not profile.)
Rule 1 — Do NOT modify the sglang/aiter/mori environment
Never run pip install, pip uninstall, pip install --upgrade, or any equivalent reinstall command for sglang, aiter, mori, flydsl, or any related kernel/runtime package — even if a workload fails or imports look broken. The user's environments are hand-tuned dev installs (typically pip install -e .); a naive reinstall will silently overwrite local patches and destroy hours of work.
If the environment looks broken (missing module, version mismatch, ABI error, import crash), STOP and report the symptom to the user. Let the user decide whether to reinstall.
What you CAN do without asking:
- Inspect versions: `pip show sglang`, `python -c "import sglang; print(sglang.__file__)"`
- Read source files in the editable install
- Set environment variables for the run
What you MUST ask before doing:
- `pip install` / `pip uninstall` / `pip install -U` for any package above
- `git checkout` / `git pull` inside the editable source directories
- Modifying files inside the `sglang/`, `aiter/`, `mori/` source trees
Rule 2 — Always preserve server logs when launching an sglang server
Whenever you start an sglang server, redirect stdout+stderr to a real file. Never let server output go only to the terminal or to /dev/null. The Bash tool's run_in_background: true buffer is not a substitute — still redirect to a file.
In this skill, serve.sh writes to $LOG_DIR/server_<LABEL>.log automatically — that's what satisfies this rule, and what wait_for_server.py (Rule 3) reads.
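If you ever launch a server by hand (for example while debugging outside serve.sh), the same redirect pattern applies. A minimal sketch, where `launch_logged` is a hypothetical helper name and the log path mirrors serve.sh's convention:

```shell
# launch_logged LOG_DIR LABEL CMD... — run CMD in background with stdout+stderr
# going to a real file (never just the terminal or /dev/null), print the log path.
launch_logged() {
  local log_dir=$1 label=$2; shift 2
  mkdir -p "$log_dir"
  local log="$log_dir/server_${label}.log"
  "$@" > "$log" 2>&1 &            # both streams to the file, per Rule 2
  echo "$log"                     # caller hands this path to wait_for_server.py
}
# e.g.: SERVER_LOG=$(launch_logged "$LOG_DIR" "$LABEL" \
#         python3 -m sglang.launch_server --model-path "$MODEL_PATH")
```

The captured path is exactly what Rule 3's monitor expects as its argument.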
Rule 3 — Wait for the server with the bundled monitor, don't blind-sleep
After launching an sglang server, startup typically takes a few minutes (model load, weight shard, kernel warmup, graph capture; AITER may JIT-compile CK kernels for several minutes on first launch). Do not sleep 300 and hope. Use the bundled monitor — it polls the log and returns the moment the outcome is known:
# After 3-0 deploys it, the script lives at /sgl-workspace/wait_for_server.py inside the container.
python3 /sgl-workspace/wait_for_server.py "$SERVER_LOG"
# exit codes:
# 0 READY — saw "The server is fired up and ready to roll"
# 1 CRASHED — saw "Traceback"
# 2 HUNG — log's last line + line count unchanged for >5 min
# 3 TIMEOUT — overall timeout (default 30 min) exceeded
# 4 ERROR — log file unreadable / never appeared
Source lives at scripts/wait_for_server.py in this skill's directory; 3-0 copies it to /sgl-workspace/ alongside serve.sh / bench.sh. Detection logic:
- Success: substring `The server is fired up and ready to roll` appears.
- Crash: substring `Traceback` appears.
- Hang: each poll records `(line_count, last_non_empty_line)` of the log; unchanged for ≥5 minutes (`--stall-seconds`) → treated as failed.
Tunable flags: --success, --failure, --stall-seconds, --overall-timeout, --poll-seconds. Bump --stall-seconds consciously if a specific config genuinely has long quiet periods (e.g. very large weight downloads, prolonged AITER JIT).
On CRASHED / HUNG / TIMEOUT / ERROR: stop and report the log tail to the user; do NOT silently restart.
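The exit codes above map naturally onto a case dispatch. A minimal sketch (the `wait_and_report` name and the 50-line tail are illustrative choices, not part of the bundled script):

```shell
# Branch on wait_for_server.py's documented exit codes; on any failure,
# print a status word and dump the log tail to stderr for the user.
wait_and_report() {
  local server_log=$1
  python3 /sgl-workspace/wait_for_server.py "$server_log"
  case $? in
    0) echo "READY" ;;
    1) echo "CRASHED"; tail -n 50 "$server_log" >&2 ;;
    2) echo "HUNG";    tail -n 50 "$server_log" >&2 ;;
    3) echo "TIMEOUT"; tail -n 50 "$server_log" >&2 ;;
    *) echo "ERROR: log never appeared or unreadable" ;;
  esac
}
# e.g.: wait_and_report "$SERVER_LOG"   # anything non-READY: stop, report, don't restart
```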
Important Notes
- This skill covers mix mode only (no PD-disaggregation). Prefill and decode run on the same GPUs.
- `serve.sh` sets `SGLANG_USE_AITER=1` automatically, and `bench.sh` sets `PYTHONPATH` for sglang's benchmark module automatically. No need to set these manually.
- Use dummy weights by default (`LOAD_DUMMY=1`). Dummy weights are sufficient for benchmarking throughput, latency, and parallel config comparison — real weights produce the same performance characteristics. Only use `LOAD_DUMMY=0` if the user explicitly asks for real weights. Real weights take much longer to load (10+ minutes for large models) and are rarely needed for config benchmarking.
- `--random-range-ratio 1.0` ensures exact ISL/OSL lengths (no variation) for reproducible benchmarks.
- `bench.sh` uses `num_prompts = concurrency * 2` — this is handled by the script automatically.
- Between configs, fully kill the sglang server and wait for GPU memory to be freed before relaunching.
- If a benchmark run fails or hangs, check GPU memory usage with `rocm-smi` and server health with the `/health` endpoint.
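As a sketch, a quick triage pass along those lines might look like this (`triage` is a hypothetical helper; it assumes the default port 30000 and standard ROCm tooling):

```shell
# triage [PORT] — check server health, GPU VRAM usage, and leftover processes.
triage() {
  local port=${1:-30000}
  curl -s --max-time 5 "http://localhost:${port}/health" > /dev/null \
    && echo "health: OK" || echo "health: unreachable"
  rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi unavailable"
  pgrep -fa "sglang.launch_server" || echo "no sglang server process"
}
triage
```

"health: unreachable" plus live sglang processes usually means a hung server; no processes but VRAM still allocated means the kill has not fully propagated yet.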
Key Metrics
Every benchmark collects these metrics per (ISL, OSL, Concurrency) combination:
| Metric | Unit | Description |
|---|---|---|
| TTFT | ms | Time To First Token — latency from request to first token |
| TPOT | ms | Time Per Output Token — average inter-token latency |
| Input throughput | tok/s | Input tokens processed per second across all requests |
| Output throughput | tok/s | Output tokens generated per second across all requests |
| Total throughput | tok/s | Input + Output token throughput combined |
| Per-GPU throughput | tok/s | Total throughput / number of GPUs |
Per-GPU throughput is the most important efficiency metric — it shows how well each GPU is utilized. Two configs might have similar total throughput, but the one using fewer GPUs has better per-GPU throughput and is more cost-efficient.
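For instance, with hypothetical totals, a config on fewer GPUs can win on efficiency despite lower total throughput:

```shell
# Worked example with made-up numbers: per-GPU throughput = total / GPU count.
awk 'BEGIN {
  tp8_total = 52000; tp8_gpus = 8;    # hypothetical TP8 totals, tok/s
  tp4_total = 30000; tp4_gpus = 4;    # hypothetical TP4 totals, tok/s
  printf "TP8: %.0f tok/s/GPU\n", tp8_total / tp8_gpus;   # 6500
  printf "TP4: %.0f tok/s/GPU\n", tp4_total / tp4_gpus;   # 7500 — more cost-efficient
}'
```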
Common Workspace Layout
The standard development environment uses /sgl-workspace as the root workspace inside Docker containers:
/sgl-workspace/
├── sglang/ # sglang source (installed via pip -e, dev mode)
├── aiter/ # AITER source (AMD AI Tensor Engine)
├── mori/ # Mori (communication library)
└── <model_short>_<YYYYMMDD>/ # benchmark output directories (created by this skill)
All benchmark artifacts (logs, reports) are saved under /sgl-workspace/ by default. If the user specifies a different workspace, use that instead.
Core Principle: Ask First, Execute Later
Do NOT guess or assume any configuration. Every detail must be explicitly confirmed by the user before execution begins. The workflow has two distinct phases:
- Planning phase (Steps 0–1): Gather ALL information through conversation. Ask questions, wait for answers. Do not proceed to the next question until the current one is answered.
- Confirmation gate (Step 2): Present the complete plan as a summary. Get explicit "go ahead" from the user.
- Execution phase (Steps 3–4): Only after full confirmation, run the benchmarks.
If at any point you're unsure about a parameter, ask. Never fill in a value the user hasn't confirmed.
Workflow
Step 0: Model & Environment Discovery
Ask the user these questions one by one. Wait for each answer before asking the next.
0a. Model selection — ask this FIRST
"Which model do you want to benchmark?"
The user may respond with:
- A full HuggingFace model ID (e.g., `deepseek-ai/DeepSeek-R1-0528`)
- A short name (e.g., "DeepSeek R1", "Llama 70B", "Qwen 235B")
- A local path to the model weights
If the user gives a short name, confirm the exact model ID (e.g., "Do you mean deepseek-ai/DeepSeek-R1-0528?").
0b. Single-node or multi-node?
"Is this single-node or multi-node?"
- Single-node: 1 node, typically 8 GPUs
- Multi-node: ask how many nodes and GPUs per node
If multi-node, also ask for:
- Network interface (`GLOO_SOCKET_IFNAME`)
- InfiniBand HCAs (`NCCL_IB_HCA`)
- Head node IP (`SGLANG_HOST_IP`)
0c. Access the GPU node
"How do I access the GPU node?"
- SSH command? (e.g., `ssh user@gpu-node`)
- Docker container? (e.g., `docker exec -it <container> bash`)
- Already on the machine?
- For multi-node: ask about access to each node
0d. Probe the environment
Once connected, probe automatically (no need to ask — just run and report back):
- Run `rocm-smi --showid` → report GPU count, model (MI355X, MI300X, MI308X), architecture
- Run `pip show sgl-kernel 2>/dev/null && python3 -c "import sglang; print('sglang version:', sglang.__version__)"` → report sglang version
- Run `pip list | grep -i aiter` → report AITER status
- Check common paths: `/sgl-workspace/sglang`, `/sgl-workspace/aiter`, `/sgl-workspace/mori`
PYTHONPATH probe (important for Docker environments): When running inside Docker containers via docker exec -d (non-interactive), .bashrc is often not sourced due to [ -z "$PS1" ] && return guards. This can cause PYTHONPATH to be missing paths for editable installs (aiter, mori, sglang), leading to import errors like ImportError: aiter is required when SGLANG_USE_AITER is set to True. The serve.sh script auto-detects and adds common workspace paths (/sgl-workspace/aiter, /sgl-workspace/mori, /sgl-workspace/sglang/python) to PYTHONPATH if they exist but are missing. However, if you encounter import errors, compare the environments:
# Non-interactive PYTHONPATH (what docker exec -d sees)
docker exec <container> bash -c 'echo $PYTHONPATH'
# Interactive PYTHONPATH (what the user sees)
docker exec <container> bash -ic 'echo $PYTHONPATH' 2>/dev/null
If they differ, ensure the missing paths are exported before running serve.sh.
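A sketch of that fix, mirroring what serve.sh already does internally (the `append_pythonpath` helper name is illustrative):

```shell
# Append each existing directory to PYTHONPATH unless it's already present.
append_pythonpath() {
  local p
  for p in "$@"; do
    if [ -d "$p" ] && [[ ":$PYTHONPATH:" != *":$p:"* ]]; then
      export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$p"
    fi
  done
}
append_pythonpath /sgl-workspace/aiter /sgl-workspace/mori /sgl-workspace/sglang/python
echo "PYTHONPATH=$PYTHONPATH"
```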
If any probe reveals a broken package or missing dependency, follow Rule 1 above: report and stop. Do NOT pip install/uninstall sglang/aiter/mori or otherwise modify the environment yourself.
0e. Locate model weights
The user may or may not have specified where the model weights are stored. If they haven't provided a path, do a quick search — but don't waste time on this:
Quick places to check:
- `$HUGGINGFACE_HUB_CACHE` env var
- `~/.cache/huggingface/hub/`
- Common mount points: `/mnt`, `/raid`, `/data`
Note: HuggingFace cache stores models as models--<Org>--<Name>/snapshots/<hash>/. For example, Qwen/Qwen3.5-397B-A17B-FP8 would be at models--Qwen--Qwen3.5-397B-A17B-FP8/snapshots/<hash>/. Look for this pattern.
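A quick way to probe for that pattern (the model ID here is just an example; derive the cache name by swapping `/` for `--`):

```shell
# Translate a HF model ID into its cache directory name and check for snapshots.
HF_CACHE=${HUGGINGFACE_HUB_CACHE:-$HOME/.cache/huggingface/hub}
MODEL_ID="deepseek-ai/DeepSeek-R1-0528"          # example model ID
CACHE_NAME="models--${MODEL_ID//\//--}"          # models--deepseek-ai--DeepSeek-R1-0528
ls -d "$HF_CACHE/$CACHE_NAME"/snapshots/*/ 2>/dev/null \
  || echo "not found under $HF_CACHE"
```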
If you find a match, confirm with the user:
"I found what looks like the model weights at
/data/models/DeepSeek-R1-0528/. Is this the right location?"
If nothing turns up quickly, ask:
"I couldn't find the model weights on this machine. Where are they stored?"
The --model-path can be either:
- A local path directly to the weights (e.g., `/data/models/DeepSeek-R1/`)
- A HuggingFace model ID (e.g., `Qwen/Qwen3.5-397B-A17B-FP8`) — but only if the weights already exist in `$HUGGINGFACE_HUB_CACHE`. If the weights are at `$HUGGINGFACE_HUB_CACHE/models--<Org>--<Name>`, using the HF model ID is preferred. You can also `export HUGGINGFACE_HUB_CACHE=<path>` to point to the right cache dir.
Do NOT let sglang trigger a model download — the weights must already be on disk.
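One way to enforce this for the whole session, assuming the stack loads weights through huggingface_hub (which honors `HF_HUB_OFFLINE`):

```shell
# Forbid any HF Hub network fetch for this shell: loads then fail fast
# instead of silently downloading a multi-hundred-GB model.
export HF_HUB_OFFLINE=1
```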
0f. Report findings and confirm
Present everything you found to the user:
"Here's what I have so far:
- Model: deepseek-ai/DeepSeek-R1-0528
- Weights: /data/models/DeepSeek-R1-0528/
- GPUs: 8x MI355X (gfx950)
- sglang: v0.5.x at /sgl-workspace/sglang
- AITER: installed
- Setup: single-node
Does this look right? Anything I should know about this environment?"
Step 1: Configuration Planning
Ask each of these questions explicitly. Do not move forward until you have clear answers for ALL of them.
1a. MTP decision (if applicable)
If the model is MTP-capable (detected via mtp_num_hidden_layers in config.json, or known models like DeepSeek-R1/V3, Qwen3.5), ask:
"This model supports Multi-Token Prediction (MTP), which can improve decode throughput. MTP is configured by a step count N (MTP=0 disables it; MTP=N for N>0 enables N speculative steps). By default we run with MTP=0 for a clean baseline. What would you like to do?"
- Run with `MTP=0` only (baseline)
- Run with `MTP=N` for a chosen `N` (ask the user for `N`)
- Run both `MTP=0` and `MTP=N`, and compare
If the user wants MTP enabled, determine:
- MTP steps (`MTP=N`, where `N` is an integer ≥ 1, NOT a 0/1 toggle). If unsure, ask the user.
- MTP algorithm (`MTP_ALGO`): model-dependent — see `references/server_config.md` for the per-model table
serve.sh handles all speculative decoding flags (--speculative-algorithm, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens) automatically from MTP and MTP_ALGO.
1b. Server setup
Check if a sglang server is already running — don't ask the user, just probe:
curl -s http://localhost:30000/health && echo "Server is running" || echo "No server running"
pgrep -fa "sglang.launch_server" || true
- If a server is running: inform the user and ask whether to shut it down or use it as-is. By default, shut it down so the skill controls the server lifecycle for each config.
- If no server is running: good — the skill will launch one for each config.
Ask: "Any additional sglang launch flags you want to use?" (e.g., --quantization fp8, --chunked-prefill-size, --schedule-policy, etc.)
Note: `--disable-radix-cache` is enabled by default in serve.sh for benchmarking. The user can opt out with `DISABLE_RADIX_CACHE=0`.
1c. Parallel configurations
This is the most important decision in the benchmark. Read references/server_config.md for the full reference on parallelism types, naming conventions, EP modes, and how to reason about config choices.
Before asking the user, do the following:
- Read the model's `config.json` from the weights directory directly (it's short). Look for KV heads, Q heads, expert count, and detect attention type (MLA/GQA/MHA). See `references/server_config.md` for the key fields to look for — but note that field names vary across models, so read carefully.
- Analyze the 4 factors described in `references/server_config.md` → "How to Reason About Parallel Config":
  - Weight size vs GPU HBM → which TP values fit?
  - Attention type + KV heads → TP or DP-attention?
  - MoE vs Dense → EP applicable?
  - EP mode → all-to-all or all-reduce?
- Present your analysis to the user — show your reasoning (weight size calc, KV head implications, why certain configs are better). Then present a suggested config table and ask the user to pick.
- If EP is involved, ask which EP mode (all-to-all or all-reduce), or suggest benchmarking both.
Wait for the user to respond. If they say "try all of them" or "you decide", confirm your suggested set before proceeding.
1d. Benchmark sweep parameters
"What ISL (input sequence length), OSL (output sequence length), and concurrency levels do you want to sweep?"
If the user isn't sure, offer options but still ask them to pick:
"Some common approaches:
- Specific pairs — e.g., (ISL=512, OSL=256), (ISL=1024, OSL=512) — good for simulating real workloads
- Full sweep — provide separate ISL, OSL, and CON lists, benchmark all combinations
Which approach? And what values?"
If the user says "you pick" or "whatever makes sense", then suggest values and ask for confirmation before proceeding:
"Here's what I'd suggest:
- ISL: 128, 512, 1024, 2048, 4096
- OSL: 128, 512, 1024, 2048
- Concurrency: 1, 16, 64, 128, 256
That's 5 × 4 × 5 = 100 runs per config, times 2 configs = 200 total runs. Estimated ~3+ hours. Want to proceed with these, or adjust?"
Step 2: Confirmation Gate
Do NOT start any benchmark until this step is complete.
Naming convention
Use this pattern for directories:
BENCH_DIR=/sgl-workspace/<model_short>_<YYYYMMDD>
Per-config dirs: <CONFIG>_mtp<N> where N is the MTP step count (0 = off, e.g. DP8EP8_mtp0, TP8_mtp0, DP8EP8_mtp3)
Present the plan summary
Benchmark Plan Summary
| Item | Value |
|---|---|
| Model | deepseek-ai/DeepSeek-R1-0528 |
| GPU | 8x MI355X |
| Mode | Mix (non-disaggregated) |
| Bench dir | /sgl-workspace/DeepSeek-R1_20260322/ |
| Sweep | ISL=[128, 512, 1024, 2048], OSL=[128, 512, 1024], CON=[1, 16, 64, 128, 256] |
Confirm configs with dry-run
For each parallel config, actually run scripts/serve.sh with DRY_RUN=1 on the GPU node — do NOT construct the launch command manually. The dry-run output shows the exact command that will be executed, ensuring consistency between what the user confirms and what actually runs.
For a small number of configs (2-3), present all dry-run outputs at once. For many configs, present them one by one. Get confirmation before proceeding to execution.
BENCH_DIR=/sgl-workspace/<model_short>_$(date +%Y%m%d)
# Config 1 — dry run
MODEL_PATH=<MODEL_PATH> CONFIG=DP8EP8_A2A MTP=0 \
LOG_DIR=$BENCH_DIR/DP8EP8_A2A_mtp0 DRY_RUN=1 bash serve.sh
# Config 2 — dry run
MODEL_PATH=<MODEL_PATH> CONFIG=TP8 MTP=0 \
LOG_DIR=$BENCH_DIR/TP8_mtp0 DRY_RUN=1 bash serve.sh
Show the full dry-run output (including the complete formatted sglang launch command with all flags) to the user and ask: "Do these configs look right?"
If the user wants changes, adjust and re-run the dry run. Once confirmed, proceed to Step 3.
Step 3: Benchmark Execution
Only proceed here after the user has confirmed ALL configs in Step 2.
Always use serve.sh and bench.sh to launch the server and run benchmarks. Do NOT construct sglang commands manually — the scripts handle critical flags (--enable-dp-attention, --enable-dp-lm-head, SGLANG_USE_AITER, PYTHONPATH, etc.) that are easy to miss.
3-0. Deploy benchmark scripts to the remote node
The scripts/serve.sh, scripts/bench.sh, scripts/stop.sh, scripts/verify_stop.sh, and scripts/wait_for_server.py files live in the skill directory on the local machine. serve.sh/bench.sh/stop.sh/wait_for_server.py run inside the container; verify_stop.sh MUST run on the host (so it can see PIDs from sibling containers).
# From local: scripts → remote node → into container (verify_stop.sh stays on the host)
scp scripts/serve.sh scripts/bench.sh scripts/stop.sh scripts/verify_stop.sh scripts/wait_for_server.py <SSH_HOST>:/tmp/
ssh <SSH_HOST> "docker cp /tmp/serve.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/bench.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/stop.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/wait_for_server.py <CONTAINER>:/sgl-workspace/"
Alternatively, if you're already inside the container, write the script content directly using cat > /sgl-workspace/serve.sh << 'SCRIPT' ... SCRIPT.
Important: Avoid running scripts through nested ssh → docker exec → bash -c with inline heredocs — the quoting becomes unmanageable. Always copy scripts to the remote first, then run them simply with bash serve.sh.
For each parallel config:
3a. Launch sglang server
Launch in background so you can proceed to benchmarking:
MODEL_PATH=<MODEL_PATH> CONFIG=<CONFIG> MTP=<N> \
LOG_DIR=$BENCH_DIR/<CONFIG>_mtp<N> \
BACKGROUND=1 bash serve.sh
serve.sh writes the server's stdout+stderr to $LOG_DIR/server_<LABEL>.log, which is what satisfies Rule 2 (persistent server log) and what wait_for_server.py in 3b reads.
If the user already has a running server, skip the launch and use their URL.
3b. Wait for server ready
Per Rule 3 above, use the bundled scripts/wait_for_server.py — do NOT sleep blindly and do NOT roll your own tail -f | grep loop. The script already handles stall detection (≥ 5 min unchanged) and avoids matching benign substrings like Ignore import error / UserWarning.
# Script was copied to /sgl-workspace/ in 3-0 alongside serve.sh / bench.sh.
SERVER_LOG=$(ls -t $BENCH_DIR/<CONFIG>_mtp<N>/server_*.log | head -1)
python3 /sgl-workspace/wait_for_server.py "$SERVER_LOG"
# exit codes:
# 0 READY — saw "The server is fired up and ready to roll"
# 1 CRASHED — saw "Traceback"; stop and report tail of $SERVER_LOG to user
# 2 HUNG — log stalled ≥ --stall-seconds (default 300s); stop and report
# 3 TIMEOUT — overall --overall-timeout (default 1800s) exceeded
# 4 ERROR — log file unreadable / never appeared
If AITER JIT compilation legitimately produces long quiet periods on a particular config, bump --stall-seconds (and/or --overall-timeout) explicitly rather than swallowing a HUNG. On any non-zero exit, stop and report the log tail to the user — do NOT silently relaunch.
3c. Run benchmark
bench.sh no longer writes per-run logs itself. Set OUTPUT_DIR; per-run JSONL is written to ${OUTPUT_DIR}/jsonl_dir/ and you MUST capture stdout+stderr with 2>&1 | tee $OUTPUT_DIR/<name>.log.
OUTPUT_DIR=$BENCH_DIR/<CONFIG>_mtp<N> \
MODEL_PATH=<MODEL_PATH> ISL=<ISL> OSL=<OSL> \
CONCURRENCY="<CON1> <CON2> <CON3>" \
bash bench.sh 2>&1 | tee $OUTPUT_DIR/bench_ISL<X>_OSL<Y>.log
For multiple ISL/OSL combinations, loop (remember 2>&1 | tee per invocation):
export OUTPUT_DIR=$BENCH_DIR/<CONFIG>_mtp<N>
for ISL in 128 512 1024 2048; do
for OSL in 128 512 1024; do
MODEL_PATH=<MODEL_PATH> ISL=$ISL OSL=$OSL \
CONCURRENCY="1 16 64 128 256" \
bash bench.sh 2>&1 | tee $OUTPUT_DIR/bench_ISL${ISL}_OSL${OSL}.log
done
done
3d. Stop server and repeat
Kill sglang inside the container, then verify on the host (sibling-container PIDs are invisible from within the container):
ssh <SSH_HOST> "docker exec <CONTAINER> bash /sgl-workspace/stop.sh"
ssh <SSH_HOST> bash /tmp/verify_stop.sh # exit 0 = GPUs free; non-zero prints offending PIDs
If a config crashes: Report the error, run stop.sh then verify_stop.sh, and move on to the next config. Do NOT debug kernel issues or retry. Document the crash and error message in the final report.
Repeat 3a–3d for each parallel config.
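Putting 3a–3d together, the per-config cycle can be sketched as a loop. This is an illustrative outline, not a replacement for the step-by-step flow above; `run_config` is a hypothetical wrapper, and `BENCH_DIR`, `MODEL_PATH`, ISL/OSL/CONCURRENCY are the values confirmed in Step 2:

```shell
# run_config CONFIG MTP_N — one full 3a-3d cycle inside the container.
run_config() {
  local config=$1 mtp=$2
  local log_dir=$BENCH_DIR/${config}_mtp${mtp}
  MODEL_PATH=$MODEL_PATH CONFIG=$config MTP=$mtp LOG_DIR=$log_dir \
    BACKGROUND=1 bash serve.sh                                   # 3a: launch
  local server_log
  server_log=$(ls -t "$log_dir"/server_*.log 2>/dev/null | head -1)
  if python3 /sgl-workspace/wait_for_server.py "$server_log"; then   # 3b: wait
    OUTPUT_DIR=$log_dir MODEL_PATH=$MODEL_PATH ISL=$ISL OSL=$OSL \
      CONCURRENCY="$CONCURRENCY" \
      bash bench.sh 2>&1 | tee "$log_dir/bench_ISL${ISL}_OSL${OSL}.log"  # 3c: bench
  else
    echo "server for $config failed; see $server_log" >&2        # report, don't retry
    return 1
  fi
  bash /sgl-workspace/stop.sh      # 3d: stop; then run verify_stop.sh on the HOST
}
# e.g.: for CONFIG in DP8EP8_A2A TP8; do run_config "$CONFIG" 0; done
```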
Step 4: Report
After all configs are benchmarked, generate structured CSV data, a performance plot, and a Markdown report.
4a. Generate CSV from JSONL
For each config directory, run jsonl_to_csv.py to extract metrics into an InferenceX-compatible CSV:
python3 /sgl-workspace/jsonl_to_csv.py \
--jsonl-dir $BENCH_DIR/<CONFIG>_mtp<N>/jsonl_dir \
--hardware <HARDWARE> \
--precision <PRECISION> \
--model <MODEL_NAME> \
--date <YYYY-MM-DD> \
--output $BENCH_DIR/<CONFIG>_mtp<N>/<MODEL>_<HARDWARE>_<PRECISION>.csv
Required args:
- `--hardware`: GPU hardware name (e.g. `mi355x`, `b200`, `b300`)
- `--precision`: weight precision (e.g. `fp4`, `fp8`, `bf16`)
Optional args:
- `--model`: model display name (default: auto-detected from model path)
- `--date`: benchmark date (default: today)
- `--output`: output CSV path (default: auto-named in jsonl-dir parent)
The CSV follows InferenceX format with all standard columns (throughput/GPU, TTFT, TPOT, interactivity, ITL, E2E latency, etc.). Time values are stored in seconds (matching InferenceX convention, despite column headers saying "ms"). Interactivity = 1000 / TPOT(ms).
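As a quick sanity check of the interactivity formula:

```shell
# Interactivity = 1000 / TPOT(ms): a 12.5 ms TPOT means 80 tokens/s per user.
awk 'BEGIN { tpot_ms = 12.5; printf "interactivity: %.0f tok/s/user\n", 1000 / tpot_ms }'
```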
4b. Generate performance plot
Run plot_interactivity.py to produce a Token Throughput per GPU vs. Interactivity chart from one or more CSVs:
python3 /sgl-workspace/plot_interactivity.py \
$BENCH_DIR/<CONFIG1>/<CSV1>.csv \
$BENCH_DIR/<CONFIG2>/<CSV2>.csv \
-o $BENCH_DIR/interactivity_plot.png
You can also include reference CSVs (e.g. from InferenceX) alongside your benchmark CSVs to produce comparison plots. Optional args: --title, --subtitle, --dpi (default: 150).
4c. Write Markdown report
Write a Markdown report to $BENCH_DIR/benchmark_report.md that includes:
- Configuration summary (model, GPUs, mode, MTP status)
- Per-config results tables with all metrics + per-GPU throughput
- Cross-config comparison highlighting the best performer for each metric
- Reference to the generated CSV and plot files
Present the report to the user and walk them through the key findings.
File Organization
/sgl-workspace/<model_short>_<YYYYMMDD>/
├── benchmark_report.md # final report
├── DP4EP4_mtp0/ # per-config directory
│ ├── server_DP4EP4_mtp0.log # sglang server log (from serve.sh)
│ ├── bench_ISL4096_OSL1024.log # bench.sh stdout/stderr (you capture via `2>&1 | tee`)
│ └── jsonl_dir/ # raw JSONL written by bench.sh --output-file
│ ├── bench_ISL4096_OSL1024_CON64.jsonl
│ ├── bench_ISL4096_OSL1024_CON128.jsonl
│ └── ...
├── TP8_mtp0/
│ ├── server_TP8_mtp0.log
│ └── ...
└── DP8EP8_A2A_mtp1/
└── ...
Each config gets its own directory. serve.sh writes server_<LABEL>.log into LOG_DIR. bench.sh writes JSONL into OUTPUT_DIR; capture its stdout/stderr to the same OUTPUT_DIR via 2>&1 | tee $OUTPUT_DIR/<bench>.log.