SenseAudio Pronunciation Coach

Listen → Record → Compare → Drill. The loop that actually improves pronunciation.

Step 1: Choose Practice Material

Three input modes:

A — Direct input: User pastes a word, phrase, or sentence.

B — Scene presets: Offer these if the user isn't sure what to practice:

Scene	Sample phrase
Airport check-in	"I'd like a window seat, please."
Ordering at a restaurant	"Could I have the menu, please?"
Business meeting	"Let me walk you through the agenda."
Hotel check-in	"I have a reservation under my name."
Shopping	"Do you have this in a different size?"
Asking for directions	"Excuse me, how do I get to the station?"

C — Topic-based: The user asks for something like "practice the th sound" or "practice the difference between r and l". Generate 5 sentences targeting that phoneme.

Also ask which target language the user wants to practice (default: English).

Step 2: Generate Standard Pronunciation

Produce two versions — slow for learning, normal for natural rhythm:

# Slow version (speed 0.75)
curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"SenseAudio-TTS-1.0\",
    \"text\": \"<TEXT>\",
    \"stream\": false,
    \"voice_setting\": { \"voice_id\": \"<VOICE_ID>\", \"speed\": 0.75 },
    \"audio_setting\": { \"format\": \"mp3\" }
  }" -o slow.json
jq -r '.data.audio' slow.json | xxd -r -p > standard_slow.mp3

# Normal version (speed 1.0)
curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"SenseAudio-TTS-1.0\",
    \"text\": \"<TEXT>\",
    \"stream\": false,
    \"voice_setting\": { \"voice_id\": \"<VOICE_ID>\", \"speed\": 1.0 },
    \"audio_setting\": { \"format\": \"mp3\" }
  }" -o normal.json
jq -r '.data.audio' normal.json | xxd -r -p > standard_normal.mp3

Voice selection by language:

English: female_0006_a (clear, neutral accent)
Chinese: female_0008_c (standard Mandarin)
Default: female_0006_a
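
For callers who prefer a script to raw curl, the same request can be sketched in Python. This is a minimal sketch that mirrors the commands above. It assumes the response carries hex-encoded audio at data.audio (which is what the xxd -r -p step implies). The helper names build_payload, decode_audio, and synthesize are hypothetical, not part of any SenseAudio SDK.

```python
import json
import os
import urllib.request

API_URL = "https://api.senseaudio.cn/v1/t2a_v2"
VOICES = {"en": "female_0006_a", "zh": "female_0008_c"}
DEFAULT_VOICE = "female_0006_a"


def build_payload(text, language="en", speed=1.0):
    """Build the request body; json.dumps later takes care of quote escaping."""
    return {
        "model": "SenseAudio-TTS-1.0",
        "text": text,
        "stream": False,
        "voice_setting": {
            "voice_id": VOICES.get(language, DEFAULT_VOICE),
            "speed": speed,
        },
        "audio_setting": {"format": "mp3"},
    }


def decode_audio(response_body):
    """Decode the hex string at data.audio (assumed, as in the xxd -r -p step)."""
    return bytes.fromhex(response_body["data"]["audio"])


def synthesize(text, out_path, language="en", speed=1.0):
    """POST the payload and write the decoded MP3 to out_path."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text, language, speed)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['SENSEAUDIO_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    with open(out_path, "wb") as f:
        f.write(decode_audio(body))
```

Calling synthesize(text, "standard_slow.mp3", speed=0.75) and synthesize(text, "standard_normal.mp3", speed=1.0) would produce the two versions. Building the body with json.dumps avoids the shell-quoting problems that arise when the practice text itself contains double quotes.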

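Tell the user: "The slow and normal-speed versions are ready. Listen to the slow version first to hear how each sound is formed, then listen to the normal version to get the natural rhythm. When you're ready, record yourself reading along and send it to me."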

Step 3: Transcribe User Recording

When the user uploads their recording:

curl -s -X POST https://api.senseaudio.cn/v1/audio/transcriptions \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -F "file=@<USER_RECORDING>" \
  -F "model=sense-asr-pro" \
  -F "response_format=verbose_json" \
  -F "language=<LANGUAGE_CODE>" \
  -F "timestamp_granularities[]=word" \
  > asr_result.json

Language codes: English → en, Chinese → zh, Japanese → ja, French → fr, Spanish → es

Extract the transcript: jq -r '.text' asr_result.json
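
If later steps run in Python rather than jq, the saved response can be read as in this sketch. Only the top-level text field is confirmed by the jq command above. The words list with word, start, and end keys is an assumption about what the word-level timestamp option returns, so the code treats it as optional. load_transcript is a hypothetical helper name.

```python
import json


def load_transcript(path="asr_result.json"):
    """Return (text, words) from a saved transcription response."""
    with open(path, encoding="utf-8") as f:
        result = json.load(f)
    text = result.get("text", "").strip()
    # ASSUMPTION: word timings, if present, are in a top-level "words" list
    # of objects with "word", "start", and "end" keys.
    words = [
        (w.get("word", ""), w.get("start"), w.get("end"))
        for w in (result.get("words") or [])
    ]
    return text, words
```

When the words list is missing, the function returns an empty list and the comparison can still proceed on the plain text.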

Step 4: Word-by-Word Comparison (LLM task)

Compare the ASR transcript against the original text yourself. Align words and identify mismatches:

Comparison approach:

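Tokenize both original and ASR output into words, lowercasing and stripping punctuation so that case and punctuation differences are not flagged as errors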
Use sequence alignment (like diff) to match them
Flag words where ASR output differs from original
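
The three steps above can be sketched with Python's standard difflib module. This is one possible way to do the alignment, not a required implementation. The function names tokenize and compare are illustrative, and accuracy is taken here to mean matched words divided by the number of words in the original.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase, normalize curly apostrophes, and drop punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("\u2019", "'"))


def compare(original, transcript):
    """Align the two word lists; return (report, accuracy).

    report is a list of (expected, heard, is_correct) tuples, one per
    aligned span. Extra words the speaker added have expected == "".
    """
    ref, hyp = tokenize(original), tokenize(transcript)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    report = []
    correct = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        is_correct = tag == "equal"
        if is_correct:
            correct += i2 - i1
        report.append((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2]), is_correct))
    accuracy = correct / len(ref) if ref else 0.0
    return report, accuracy
```

For the original "I'd like a window seat, please." and the transcript "I'd like a winder seat pleas", this yields four spans, two of them flagged, and an accuracy of 4/6.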

Diagnosis output format:

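Read-along analysis:

✓ "I'd like a"  — correct
✗ "window"      — recognized as "winder" (possible problem with the final -ow sound)
✓ "seat"        — correct
✗ "please"      — recognized as "pleas" (the final -z sound may not be clear enough)

Accuracy: 4/6 words (67%)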

Common phoneme issues for Chinese speakers (English):

Misrecognized as	Likely problem	Phoneme
"free" for "three"	th → f	/θ/
"light" for "right"	r → l confusion	/r/
"wery" for "very"	v → w	/v/
"sit" for "seat"	short vs long vowel	/ɪ/ vs /iː/
"fink" for "think"	th → f	/θ/
dropped final consonant	final stop deletion	/t/, /d/, /k/

When a word is misrecognized, infer the likely phoneme issue and name it specifically.
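
As a rough illustration of that inference, the table above can be turned into a few spelling-based rules. This sketch is a heuristic under stated assumptions: it looks only at spelling, covers only the patterns listed in the table, and returns None for anything else, leaving those cases to judgment. diagnose is a hypothetical helper name.

```python
def diagnose(expected, heard):
    """Return (phoneme, problem) for a mismatched word pair, or None."""
    e, h = expected.lower(), heard.lower()
    if e == h:
        return None  # not a mismatch
    if e.startswith("th") and h.startswith("f"):
        return "/θ/", "th → f"
    if e.startswith("r") and h == "l" + e[1:]:
        return "/r/", "r → l confusion"
    if e.startswith("v") and h == "w" + e[1:]:
        return "/v/", "v → w"
    # Long vowel spelled "ea" or "ee" heard as a short "i".
    if e.replace("ea", "i") == h or e.replace("ee", "i") == h:
        return "/ɪ/ vs /iː/", "short vs long vowel"
    # Final stop consonant missing from the transcript.
    if e[-1:] in ("t", "d", "k") and h == e[:-1]:
        return f"/{e[-1]}/", "final stop deletion"
    return None
```

A pair such as "window" and "winder" falls through every rule and returns None, which is the signal to reason about the sound directly instead of relying on the lookup.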

Step 5: Targeted Drill

For each identified problem phoneme, generate a focused drill set:

Phoneme drill library:

Phoneme	Drill words
/θ/ (th)	think, three, through, both, teeth, thank, month
/r/	red, right, road, very, sorry, around, mirror
/r/ vs /l/	right/light, road/load, rice/lice, pray/play
/v/	very, voice, love, live, over, never, river
/iː/ vs /ɪ/	seat/sit, beat/bit, sheep/ship, feel/fill
final /t/	cat, hat, right, night, about, what, that
final /d/	road, said, good, food, bad, head

Present 3–5 drill words and generate slow TTS for each.
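
Selecting the drill words can be sketched as a lookup plus a random sample. The dictionary below copies part of the library above. The pick_drill name, the seed parameter, and the clamping of the count to the 3 to 5 range are choices made for this illustration.

```python
import random

# Subset of the drill library above, keyed by the same labels.
DRILLS = {
    "/θ/": ["think", "three", "through", "both", "teeth"],
    "/r/": ["red", "right", "road", "very", "sorry", "around", "mirror"],
    "/v/": ["very", "voice", "love", "live", "over", "never", "river"],
    "final /t/": ["cat", "hat", "right", "night", "about", "what", "that"],
    "final /d/": ["road", "said", "good", "food", "bad", "head"],
}


def pick_drill(phoneme, count=4, seed=None):
    """Return 3 to 5 random drill words for a phoneme ([] if unknown)."""
    words = DRILLS.get(phoneme, [])
    count = max(3, min(5, count))  # keep within the 3 to 5 word range
    rng = random.Random(seed)
    return rng.sample(words, min(count, len(words)))
```

Each returned word would then be sent through the Step 2 request with speed 0.75 to produce its slow reference audio.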

Step 6: Track Progress

Save session results to pronunciation_progress.json in the current directory:

{
  "sessions": [
    {
      "date": "<ISO date>",
      "text": "<practice text>",
      "accuracy": 0.67,
      "errors": ["window (/oʊ/)", "please (final /z/)"],
      "phonemes_drilled": ["/oʊ/", "/z/"]
    }
  ]
}
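
Appending a session to that file could look like the following sketch. It assumes the file is either absent or already holds the structure shown above. save_session is a hypothetical helper, and rounding accuracy to two decimals is a choice made here rather than something the format requires.

```python
import json
from datetime import date
from pathlib import Path


def save_session(text, accuracy, errors, phonemes,
                 path="pronunciation_progress.json"):
    """Append one session record, creating the file on first use."""
    progress_file = Path(path)
    if progress_file.exists():
        progress = json.loads(progress_file.read_text(encoding="utf-8"))
    else:
        progress = {"sessions": []}
    progress["sessions"].append({
        "date": date.today().isoformat(),
        "text": text,
        "accuracy": round(accuracy, 2),
        "errors": errors,
        "phonemes_drilled": phonemes,
    })
    # ensure_ascii=False keeps IPA symbols readable in the saved file.
    progress_file.write_text(
        json.dumps(progress, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return progress
```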

After 3+ sessions, show a summary:

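Pronunciation weak spots (last 5 sessions):

/θ/ (th)  ████████░░  4 errors  ← focus here
/r/       ████░░░░░░  2 errors
/iː/      ██░░░░░░░░  1 error

Suggestion: focus on the th sound. You can tell the user: "Place the tip of your tongue between your upper and lower teeth and blow gently."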
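
A summary like this can be derived from the saved sessions, as in the sketch below. It makes two assumptions: a phoneme's error count is the number of recent sessions whose phonemes_drilled list contains it, and each error fills two of the ten bar cells, which matches the sample bars. summarize is a hypothetical helper name.

```python
from collections import Counter


def summarize(sessions, last_n=5, width=10, min_sessions=3):
    """Return one bar-chart line per phoneme, most frequent first."""
    if len(sessions) < min_sessions:
        return []  # not enough history for a summary yet
    recent = sessions[-last_n:]
    counts = Counter(
        phoneme
        for session in recent
        for phoneme in session.get("phonemes_drilled", [])
    )
    lines = []
    for phoneme, n in counts.most_common():
        filled = min(width, n * 2)  # ASSUMPTION: two cells per error
        bar = "█" * filled + "░" * (width - filled)
        label = "error" if n == 1 else "errors"
        lines.append(f"{phoneme:<9} {bar}  {n} {label}")
    return lines
```

The first returned line is the phoneme to recommend for focused practice.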

Iteration

After each round, ask: "Go again, or try a different sentence?" Keep the loop going until the user is satisfied or accuracy reaches 90% or higher.