# xiaoyuzhou-asr

Transcribe 小宇宙 (Xiaoyuzhou) podcast episodes to text using a local Qwen3-ASR model (Metal/CUDA accelerated).
## Prerequisites

- **xyz API server** — fetches episode data and audio URLs from 小宇宙:

  ```bash
  git clone https://github.com/ultrazg/xyz.git && cd xyz && go run .
  # Default port: 23020; change with -p
  ```

- **Access token** — log in via `POST /sendCode` then `POST /login` (see `references/xyz-api.md`)
- **ffmpeg** — audio format conversion (`brew install ffmpeg`)
- **Qwen3-ASR model** — download from the Hugging Face Hub (the Hub snapshot does NOT ship `tokenizer.json`):

  ```bash
  python3 -c "
  from huggingface_hub import snapshot_download
  snapshot_download('Qwen/Qwen3-ASR-0.6B', local_dir='models/0.6B')
  "
  ```

- **qwen3-asr-rs** — build from source:

  ```bash
  git clone https://github.com/alan890104/qwen3-asr-rs.git && cd qwen3-asr-rs
  cargo build --release --example local_transcribe
  ```

- **tokenizer.json** — auto-generated by the transcription script on first run (from `vocab.json` + `merges.txt`). No manual step needed.
## Workflow

### Step 1: Find Episode

```bash
TOKEN="$XYZ_ACCESS_TOKEN"
BASE="http://localhost:23020"

# Search episodes by keyword
curl -s -X POST "$BASE/search" \
  -H "x-jike-access-token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"keyword":"关键词","type":"EPISODE"}'

# Get episode detail (contains the audio URL)
curl -s -X POST "$BASE/episode_detail" \
  -H "x-jike-access-token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"eid":"EPISODE_ID"}'

# List episodes of a podcast
curl -s -X POST "$BASE/episode_list" \
  -H "x-jike-access-token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"pid":"PODCAST_ID","order":"desc"}'
```
### Step 2: Download and Convert Audio

The audio URL is in `data.data.media.source.url` (m4a format).

```bash
mkdir -p /tmp/xiaoyuzhou-audio
curl -L -o /tmp/xiaoyuzhou-audio/episode.m4a "$AUDIO_URL"
# Qwen3-ASR expects 16 kHz mono WAV
ffmpeg -y -i /tmp/xiaoyuzhou-audio/episode.m4a -ar 16000 -ac 1 /tmp/xiaoyuzhou-audio/episode.wav
```
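If you are scripting this in Python, a small helper can walk that field chain. The path is the one noted above; the function raises `KeyError` if the response shape differs:

```python
def extract_audio_url(detail: dict) -> str:
    """Pull the m4a URL from an xyz /episode_detail response.

    Walks data.data.media.source.url (the path documented in the
    workflow); raises KeyError if any level is missing.
    """
    return detail["data"]["data"]["media"]["source"]["url"]
```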
### Step 3: Split Long Audio (REQUIRED for >3 min)

Podcasts are continuous speech with few silence gaps, so fixed-interval splitting is the reliable default:

```bash
# Split into 3-minute segments (segments must stay ≤3 min for Metal GPU memory)
ffmpeg -y -i episode.wav -f segment -segment_time 180 -ar 16000 -ac 1 seg_%03d.wav
```

Alternatively, try silence-based splitting (it may find no usable gaps in continuous podcasts):

```bash
# Detect silences of ≥2 s below -30 dB, then cut at those timestamps
ffmpeg -i episode.wav -af "silencedetect=noise=-30dB:d=2" -f null - 2>&1 | grep silence_end
ffmpeg -i episode.wav -f segment -segment_times T1,T2 -ar 16000 -ac 1 seg_%03d.wav
```
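Turning the `silencedetect` log into the `-segment_times` value by hand is tedious. A sketch of a parser, assuming the common `silence_end: <seconds>` log line format (which can vary between ffmpeg versions), keeps cut points at least a minimum distance apart so segments don't get too short:

```python
import re

def segment_times_from_ffmpeg_log(log: str, min_len: float = 60.0) -> str:
    """Build an ffmpeg -segment_times value from silencedetect output.

    Collects silence_end timestamps, skipping any that would create
    a segment shorter than min_len seconds.
    """
    times, last = [], 0.0
    for m in re.finditer(r"silence_end:\s*([\d.]+)", log):
        t = float(m.group(1))
        if t - last >= min_len:
            times.append(t)
            last = t
    return ",".join(f"{t:.2f}" for t in times)
```

Feed the result to `ffmpeg ... -segment_times "$TIMES"` in place of the manual `T1,T2` list.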
### Step 4: Transcribe

```bash
MODEL_DIR="/path/to/models/0.6B"
ASR_BIN="qwen3-asr-rs/target/release/examples/local_transcribe"

# Transcribe each segment, keeping only the recognized text
for seg in seg_*.wav; do
  "$ASR_BIN" "$MODEL_DIR" "$seg" 2>/dev/null | grep "^Text :" | sed 's/^Text : //'
done
```
For efficiency, load the model once and reuse it across segments in Rust:

```rust
use qwen3_asr::{AsrInference, TranscribeOptions, best_device};

let engine = AsrInference::load("models/0.6B", best_device())?;
for seg in segments {
    let result = engine.transcribe(&seg, TranscribeOptions::default())?;
    output.push(result.text);
}
```
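After transcription, the per-segment texts need to be stitched back together in order. A sketch, assuming each segment's text was saved next to its WAV as `seg_XXX.txt` (a naming convention chosen here, not mandated by qwen3-asr-rs): sorting by filename restores the original order, since `ffmpeg -f segment` numbers files sequentially.

```python
from pathlib import Path

def combine_segments(out_dir: str) -> str:
    """Join per-segment transcripts (seg_*.txt) into one text,
    in filename order, skipping empty segments."""
    parts = []
    for txt in sorted(Path(out_dir).glob("seg_*.txt")):
        parts.append(txt.read_text(encoding="utf-8").strip())
    return "\n".join(p for p in parts if p)
```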
### Step 5: Format Output

Combine the transcript with episode metadata as markdown:

```markdown
# {title}

**节目**: {podcast.title} | **日期**: {pubDate} | **时长**: {duration}s

## 转录文本

{transcript}
```
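The template above can be rendered with a short helper. The field names (`title`, `podcast.title`, `pubDate`, `duration`) mirror the episode-detail metadata; adjust them if your payload differs:

```python
def format_transcript_md(meta: dict, transcript: str) -> str:
    """Render the episode markdown layout: title, metadata line,
    then the transcript under a 转录文本 heading."""
    return (
        f"# {meta['title']}\n\n"
        f"**节目**: {meta['podcast']['title']} | "
        f"**日期**: {meta['pubDate']} | "
        f"**时长**: {meta['duration']}s\n\n"
        f"## 转录文本\n\n"
        f"{transcript}\n"
    )
```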
## References

- xyz API endpoints and auth: `references/xyz-api.md`
- Qwen3-ASR usage and performance: `references/qwen3-asr.md`
## Token Management

- Tokens expire. If the API returns 401, refresh via `POST /refresh_token`
- Store tokens in env vars: `XYZ_ACCESS_TOKEN`, `XYZ_REFRESH_TOKEN`
- Prompt the user to log in if no valid token is available
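If you wrap the API in Python instead of curl, building requests in one place makes the 401-then-refresh flow easy to centralize. A minimal sketch using only the stdlib; the endpoint paths and `x-jike-access-token` header come from the workflow above, while the refresh request body is left to the caller since it depends on the API:

```python
import json
import urllib.request

def build_xyz_request(base: str, path: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated POST request for an xyz endpoint.

    Send with urllib.request.urlopen(); a 401 surfaces as HTTPError,
    signaling an expired token. Refresh via POST /refresh_token and retry.
    """
    return urllib.request.Request(
        base + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "x-jike-access-token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```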
## Constraints

- MUST split audio into ≤3-minute segments for Metal GPU stability
- Audio must be WAV, 16 kHz, mono
- `tokenizer.json` is not included in the HF download; the transcription script generates it from `vocab.json` + `merges.txt` on first run
- The `local_transcribe` example binary is required (the demo binary only runs built-in test samples)
- The xyz API requires login with a Chinese phone number (+86)
- All processing is local — audio never leaves the machine