Zhipu AI Automatic Speech Recognition (ASR)
Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.
Setup
1. Get an API key from the Zhipu AI Console.
2. Set it in your environment:
export ZHIPU_API_KEY="your-key-here"
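Before running the script, it can help to confirm that the variable is actually set in the current shell. The following is a minimal sketch; the helper name require_api_key is illustrative and is not part of the bundled scripts.

```shell
# Hypothetical helper (not part of the bundled scripts):
# fail early if ZHIPU_API_KEY is missing or empty.
require_api_key() {
  if [ -z "${ZHIPU_API_KEY:-}" ]; then
    echo "ZHIPU_API_KEY is not set; export it before running the script." >&2
    return 1
  fi
}
```

A typical use would be to call require_api_key first and run scripts/speech_to_text.sh only if it succeeds.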
Supported Audio Formats
- WAV - Recommended, best quality
- MP3 - Widely supported
- OGG - Auto-converted to MP3
- M4A - Auto-converted to MP3
- AAC - Auto-converted to MP3
- FLAC - Auto-converted to MP3
- WMA - Auto-converted to MP3
Note: The API itself accepts only WAV and MP3. The script uses ffmpeg to convert any other format to MP3 before upload, so any input that ffmpeg can decode will work.
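To convert a file by hand, or to see roughly what such a conversion involves, a sketch along these lines may help. It is not the bundled script's actual logic: the function names are hypothetical, and the 16 kHz mono output settings are an assumption based on the recommendations in this document.

```shell
# Sketch only: the bundled script's real conversion logic may differ.

# Print "no" if the extension is .wav or .mp3 (any case), otherwise "yes".
needs_conversion() (
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    wav|mp3) echo "no" ;;
    *)       echo "yes" ;;
  esac
)

# Convert any ffmpeg-readable file to a 16 kHz mono MP3 next to the input.
# Requires ffmpeg. -y overwrites the output, -ar sets the sample rate,
# -ac sets the channel count.
to_mp3() {
  ffmpeg -y -loglevel error -i "$1" -ar 16000 -ac 1 "${1%.*}.mp3"
}
```

For example, needs_conversion memo.m4a prints yes, and to_mp3 memo.m4a would then write memo.mp3.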
File Constraints
- Maximum file size: 25 MB
- Maximum duration: 30 seconds
- Recommended sample rate: 16000 Hz or higher
- Audio channels: Mono or stereo
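The size and duration limits above can be checked locally before any request is sent. The sketch below assumes that 25 MB means 25 * 1024 * 1024 bytes and that ffprobe (shipped with ffmpeg) is available for reading the duration; all function names are illustrative.

```shell
# Assumption: "25 MB" is treated as 25 * 1024 * 1024 bytes.
MAX_BYTES=$((25 * 1024 * 1024))
MAX_SECONDS=30

# Succeeds if the file is no larger than the limit (second argument optional).
check_size() (
  size=$(( $(wc -c < "$1") ))
  [ "$size" -le "${2:-$MAX_BYTES}" ]
)

# Print the duration in seconds, e.g. 29.970000 (requires ffprobe).
audio_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeeds if a duration such as 29.97 is within the limit.
check_duration() {
  awk -v d="$1" -v max="${2:-$MAX_SECONDS}" 'BEGIN { exit !(d <= max) }'
}
```

A pre-flight check could then combine them, for example: check_size clip.wav && check_duration "$(audio_duration clip.wav)".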
Usage
Basic Transcription
Transcribe an audio file with default settings:
bash scripts/speech_to_text.sh recording.wav
Transcription with Context
Provide previous transcription or context for better accuracy:
bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容,有助于提高准确性"
Transcription with Hotwords
Use custom vocabulary to improve recognition of specific terms:
bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"
Full Options
Combine context and hotwords:
bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称"
Parameters:
- audio_file (required): Path to audio file (.wav or .mp3)
- prompt (optional): Previous transcription or context text (max 8000 chars)
- hotwords (optional): Comma-separated list of specific terms (max 100 words)
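The documented limits on prompt and hotwords can be validated before calling the script. This is a sketch with hypothetical helper names; it assumes hotwords are separated by ASCII commas, and the character count relies on the shell running under a UTF-8 locale.

```shell
# Count comma-separated terms; an empty string counts as zero.
count_hotwords() {
  if [ -z "$1" ]; then
    echo 0
  else
    printf '%s' "$1" | awk -F',' '{ print NF }'
  fi
}

# Succeeds if there are at most 100 hotwords.
check_hotwords() {
  [ "$(count_hotwords "$1")" -le 100 ]
}

# Succeeds if the prompt has at most 8000 characters.
# Note: wc -m is locale-dependent; under a non-UTF-8 locale, multi-byte
# Chinese text is counted in bytes and may be over-counted.
check_prompt() (
  len=$(( $(printf '%s' "$1" | wc -m) ))
  [ "$len" -le 8000 ]
)
```

For instance, count_hotwords "张三,李四,项目名称" reports three terms.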
Features
Context Prompts
Why use context prompts:
- Improves accuracy in long conversations
- Helps with domain-specific terminology
- Maintains consistency across multiple segments
When to use:
- Multi-part conversations or meetings
- Technical or specialized content
- Continuing from previous transcriptions
Example:
bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容:讨论了项目进展和下一步计划"
Hotwords
What are hotwords: A custom vocabulary list that boosts recognition accuracy for specific terms.
Best use cases:
- Proper names (people, places)
- Domain-specific terminology
- Company names and products
- Technical jargon
- Industry-specific terms
Examples:
# Medical transcription
bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案"
# Business meeting
bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算"
# Tech discussion
bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"
Workflow Examples
Transcribe a Meeting
# Part 1
bash scripts/speech_to_text.sh meeting_part1.wav
# Part 2 with context
bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称"
# Part 3 with context
bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"
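The same pattern can be automated by feeding each part's transcript to the next part as its context prompt. The sketch below makes two assumptions that this document does not confirm: that the script prints its JSON result to stdout, and that jq is installed. The wrapper names are hypothetical.

```shell
# Assumptions: the script prints its JSON result to stdout; jq is installed.
# transcribe and extract_text are hypothetical wrappers, kept separate so
# either one can be replaced.
transcribe() {
  bash scripts/speech_to_text.sh "$@"
}

extract_text() {
  jq -r '.text'
}

# Usage: transcribe_in_order "hotword1,hotword2" part1.wav part2.wav ...
# Each part receives the previous part's transcript as its context prompt;
# the first part is sent with an empty prompt.
transcribe_in_order() (
  hotwords=$1
  shift
  context=""
  for part in "$@"; do
    text=$(transcribe "$part" "$context" "$hotwords" | extract_text)
    printf '%s\n' "$text"
    context=$text
  done
)
```

Passing the full previous transcript is one design choice; a short hand-written summary, as in the commands above, works with the same prompt argument.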
Transcribe a Lecture
bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"
Process Multiple Files
for file in recording_*.wav; do
  bash scripts/speech_to_text.sh "$file"
done
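For larger batches it may be useful to save each result to its own file and to skip patterns that match nothing. A possible sketch follows; it assumes the script prints its JSON result to stdout, and STT_CMD is a hypothetical override hook rather than a setting the script itself reads.

```shell
# Assumption: the script prints its JSON result to stdout.
# STT_CMD is a hypothetical hook for substituting another command.
transcribe_all() (
  for file in "$@"; do
    [ -f "$file" ] || continue    # skip unmatched globs and missing files
    out="${file%.*}.json"
    if ${STT_CMD:-bash scripts/speech_to_text.sh} "$file" > "$out"; then
      echo "ok: $file -> $out"
    else
      echo "failed: $file" >&2
      rm -f "$out"                # do not leave partial output behind
    fi
  done
)
```

Calling transcribe_all recording_*.wav would write recording_1.json next to recording_1.wav, and so on.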
Audio Quality Tips
Best practices for accurate transcription:
- Clear audio source
  - Minimize background noise
  - Use good quality microphone
  - Speak clearly and at moderate pace
- Optimal audio settings
  - Sample rate: 16000 Hz or higher
  - Bit depth: 16-bit or higher
  - Single channel (mono) is sufficient
- File preparation
  - Remove silence from beginning/end
  - Normalize audio levels
  - Ensure consistent volume
Output Format
The script outputs JSON with:
- id: Task ID
- created: Request timestamp (Unix timestamp)
- request_id: Unique request identifier
- model: Model name used
- text: Transcribed text
Example output:
{
  "id": "task-12345",
  "created": 1234567890,
  "request_id": "req-abc123",
  "model": "glm-asr-2512",
  "text": "你好,这是转录的文本内容"
}
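When only the transcript is needed, the text field can be pulled out of this JSON with jq. The sketch assumes jq is installed and that the script writes its JSON to stdout; the function name is illustrative.

```shell
# Assumptions: jq is installed; the JSON result arrives on stdin.
# Print only the transcribed text from a result object.
transcript_text() {
  jq -r '.text'
}
```

Usage would look like: bash scripts/speech_to_text.sh recording.wav | transcript_text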
Troubleshooting
File Size Issues:
- Split audio files larger than 25 MB
- Reduce sample rate or bit depth (but keep at least 16000 Hz and 16-bit, as recommended above)
- Use compression (MP3) for smaller files
Duration Issues:
- Split recordings longer than 30 seconds
- Process segments separately
- Use context prompts to maintain continuity
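One way to split a long recording is ffmpeg's segment muxer. The sketch below assumes ffmpeg is installed; the 25-second default is an arbitrary choice that leaves headroom under the 30-second limit, and the function name is hypothetical. Because the audio is copied without re-encoding, cut points fall on packet boundaries and segment lengths are approximate.

```shell
# Assumption: ffmpeg is installed. 25 s is an arbitrary default below the
# 30-second limit.
# Usage: split_audio long.wav [seconds]  ->  long_000.wav, long_001.wav, ...
split_audio() {
  ffmpeg -loglevel error -i "$1" -f segment -segment_time "${2:-25}" \
    -c copy "${1%.*}_%03d.${1##*.}"
}
```

The resulting segments can then be transcribed in order, passing earlier text as the context prompt.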
Poor Accuracy:
- Improve audio quality
- Use hotwords for specific terms
- Provide context prompts
- Ensure clear speech and minimal noise
Format Issues:
- Use .wav or .mp3 where possible; other formats depend on ffmpeg for conversion
- Check file is not corrupted
- Verify audio can be played by standard players
Limitations
- Maximum audio duration: 30 seconds per request
- File size limit: 25 MB
- Maximum hotwords: 100 terms
- Context prompt limit: 8000 characters
- Best performance with Chinese language audio
Performance Notes
- Typical transcription time: 1-3 seconds
- Real-time or faster for most audio
- Processing time varies with audio length and quality