qwen3-tts-rs Profiling & Benchmarking

Run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.

Prerequisites

Docker with --gpus all support
qwen3-tts:latest Docker image (has Rust toolchain + CUDA)
Model weights in test_data/models/ (1.7B-CustomVoice is the default)
tokenizer.json must be in the model directory

Docker Execution Pattern

The CUDA toolchain lives inside the Docker container. All cargo commands must run there. The workspace is bind-mounted at /workspace:

docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
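
Because every command below repeats the same docker run prefix, a small wrapper function can reduce copy-paste errors. This is a minimal sketch, not something shipped in the repository: the function name qtts_run and the QTTS_DRY_RUN variable are hypothetical, and arguments are joined with plain spaces, so inner commands that need their own quoting should still use the full pattern above.

```shell
# Hypothetical helper (not part of the repository): wraps the docker
# invocation above so that only the inner command changes between runs.
qtts_run() {
  # Toolchain path copied verbatim from the pattern above.
  _qtts_bin=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin
  # \$PATH stays literal here so that it is expanded inside the container.
  _qtts_inner="export PATH=${_qtts_bin}:\$PATH && $*"
  if [ -n "${QTTS_DRY_RUN:-}" ]; then
    # Dry run: print the inner command instead of starting a container.
    printf '%s\n' "$_qtts_inner"
    return 0
  fi
  docker run --rm --gpus all --entrypoint /bin/bash \
    -v "$(pwd):/workspace" -w /workspace \
    qwen3-tts:latest \
    -c "$_qtts_inner"
}
```

With this in place, a benchmark run would look like: qtts_run cargo run --release --features=cuda,cli --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1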

Profiling Modes

1. Chrome Trace (default — best for span hierarchy)

Produces trace.json for viewing in chrome://tracing or https://ui.perfetto.dev.

docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
        --model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'

Output: trace.json (~12MB for 3 sentences). Contains spans:

generate_frames — full generation loop
code_predictor / code_predictor_inner — per-frame acoustic code generation
talker_step — per-frame transformer forward pass
sampling / top_k / top_p — per-frame token sampling
gpu_sync trace events — marks every to_vec1() GPU→CPU sync

2. Per-Stage Timing (no profiling feature needed)

The e2e_bench binary reports stage breakdowns (prefill / generation / decode) even without the profiling feature:

docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --release --features=cuda,cli --bin e2e_bench -- \
        --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'

3. Streaming TTFA (Time to First Audio)

# Add --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --iterations 3 --warmup 1 --streaming

4. JSON Output

... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --json-output results.json --iterations 3

GPU Sync Audit

List all to_vec1() GPU→CPU synchronization points:

bash scripts/audit-gpu-syncs.sh
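
If the script is unavailable, a plain search gives a rough equivalent. The sketch below is a hypothetical stand-in, not the contents of scripts/audit-gpu-syncs.sh (which are not shown here). It assumes the Rust sources live under src/ and that every sync point is a textual to_vec1 call.

```shell
# Hypothetical stand-in for the audit script: prints file:line:text for
# every to_vec1 occurrence in .rs files under the given directory.
list_gpu_syncs() {
  # $1: directory to search; defaults to src/ (an assumption about the layout).
  # /dev/null forces grep to print file names even when only one file matches.
  find "${1:-src}" -name '*.rs' -exec grep -n 'to_vec1' /dev/null {} +
}
```

Piping the result through wc -l gives a quick count of call sites to compare against the ~18 syncs per frame seen in traces. Keep in mind that static call sites and per-frame runtime syncs are not the same number when a call sits inside a loop.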

Interpreting Results

Stage Breakdown Table

Label  Words  Wall (ms)  Audio (s)  RTF    Tok/s  Mem (MB)  Prefill     Generate      Decode
short     13    5235.2      3.68   1.423    8.8      858   21ms (1%)  2724ms (71%)  1109ms (29%)
medium    53   23786.3     34.00   0.700   17.9      859   20ms (0%)  22694ms (95%)  1057ms (4%)
long     115   43797.4     60.96   0.718   17.4      864   19ms (0%)  41861ms (96%)  1886ms (4%)

Key metrics:

RTF (real-time factor) = wall time / audio duration. RTF < 1.0 = faster than real-time.
Prefill: Should be <50ms on GPU. If high, check embedding/attention.
Generation: Dominates. ~18 GPU→CPU syncs per frame (16 code_predictor + 2 sampling).
Decode: ConvNeXt decoder. Scales with frame count. ~4% for long text.
Tok/s: Semantic tokens per second. Higher = better.
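
The RTF column can be reproduced from the Wall (ms) and Audio (s) columns, which is a handy sanity check when comparing runs. A minimal sketch using awk; the function name rtf is hypothetical and not part of e2e_bench.

```shell
# rtf WALL_MS AUDIO_S: prints the real-time factor (wall time / audio duration).
rtf() {
  awk -v wall_ms="$1" -v audio_s="$2" \
    'BEGIN { printf "%.3f\n", (wall_ms / 1000) / audio_s }'
}

rtf 5235.2 3.68    # short row,  prints 1.423
rtf 23786.3 34.00  # medium row, prints 0.700
rtf 43797.4 60.96  # long row,   prints 0.718
```

The three results match the RTF column of the table above. The short sentence is the only case slower than real time.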

Chrome Trace Analysis

In Perfetto/chrome://tracing:

Look for gaps between talker_step and code_predictor — that's CPU overhead
Check if sampling (top_k + top_p) is significant vs model forward passes
The gpu_sync events mark where the CPU blocks on a to_vec1() readback, waiting for the GPU to finish its queued work

Optimization Targets

The ~18 to_vec1() calls per frame are the main bottleneck:

16 in code_predictor (argmax per acoustic code group)
2 in sampling (read sampled token)

Batch these to reduce GPU→CPU round-trips.

Model Variants

Model	Dir	Notes
1.7B-CustomVoice	`test_data/models/1.7B-CustomVoice`	Default benchmark target
1.7B-Base	`test_data/models/1.7B-Base`	Voice cloning (needs ref audio)
1.7B-VoiceDesign	`test_data/models/1.7B-VoiceDesign`	Text-described voices

Reference Baseline (1.7B-CustomVoice, CUDA)

From January 2025 on DGX (A100):

Short (13 words): RTF 1.42, 8.8 tok/s
Medium (53 words): RTF 0.70, 17.9 tok/s
Long (115 words): RTF 0.72, 17.4 tok/s
Prefill: ~20 ms, Decode: ~1-2 s, Generation: 71-96% of the summed stage time
