mm-voice-maker

Enables voice synthesis, voice cloning, voice design, and audio post-processing using MiniMax Voice API and FFmpeg. Use when converting text to speech, creating custom voices, or processing/merging audio.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "mm-voice-maker" with this command: npx skills add BLUE-coconut/mm-voice-maker

MiniMax Voice Maker

Professional text-to-speech skill with emotion detection, voice cloning, and audio processing capabilities powered by MiniMax Voice API and FFmpeg.

Capabilities

| Area | Features |
|---|---|
| TTS | Sync (HTTP/WebSocket), async (long text), streaming |
| Segment-based | Multi-voice, multi-emotion synthesis from segments.json, auto merge |
| Voice | Cloning (10s–5min), design (text prompt), management |
| Audio | Format conversion, merge, normalize, trim, remove silence (FFmpeg) |

File structure:

mmVoice_Maker/
├── SKILL.md                       # This overview
├── mmvoice.py                     # CLI tool (recommended for Agents)
├── check_environment.py           # Environment verification
├── requirements.txt
├── scripts/                       # Entry: scripts/__init__.py
│   ├── utils.py                   # Config, data classes
│   ├── sync_tts.py                # HTTP/WebSocket TTS
│   ├── async_tts.py               # Long text TTS
│   ├── segment_tts.py             # Segment-based TTS (multi-voice, multi-emotion)
│   ├── voice_clone.py             # Voice cloning
│   ├── voice_design.py            # Voice design
│   ├── voice_management.py        # List/delete voices
│   └── audio_processing.py        # FFmpeg audio tools
└── reference/                     # Load as needed
    ├── cli-guide.md               # CLI usage guide
    ├── getting-started.md         # Setup and quick test
    ├── tts-guide.md               # Sync/async TTS workflows
    ├── voice-guide.md             # Clone/design/manage
    ├── audio-guide.md             # Audio processing
    ├── script-examples.md         # Runnable code snippets
    ├── troubleshooting.md         # Common issues
    ├── api_documentation.md       # Complete API reference
    └── voice_catalog.md           # Voice selection guide

Main Workflow Guideline (Text to Speech)

6-step workflow:

Step 1. Verify environment.

Step 2. Process text into a script → <cwd>/audio/segments.json. ⚠️ NOTE: before processing the text, you must read voice_catalog.md for voice selection. Step 2.4 is critical: check it twice before sending the script to the user.

Step 2.5. ⚠️ Generate a preview for user confirmation (highly recommended for multi-voice content).

Step 3. Present the plan to the user for confirmation.

Step 4. Validate segments.json.

Step 5. Generate and merge audio → intermediate files in <cwd>/audio/tmp/, final output in <cwd>/audio/output.mp3.

Step 6. ⚠️ CRITICAL: the user confirms audio quality FIRST, THEN clean up temp files (only after the user is satisfied).

<cwd> is Claude's current working directory (not the skill directory). Audio files are saved relative to where Claude is running commands.

Step 1: Verify environment

python check_environment.py

Checks:

  • Python 3.8+
  • Required packages (requests, websockets)
  • FFmpeg installation
  • MINIMAX_VOICE_API_KEY environment variable

If the API key is not set, ask the user for the key and set it:

export MINIMAX_VOICE_API_KEY="your-api-key-here"
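For illustration, the same checks can be sketched in a few lines of Python. This is a simplified stand-in for `check_environment.py`, which also verifies package versions:

```python
import os
import shutil
import sys

def quick_env_check():
    """Simplified environment probe: Python version, FFmpeg, API key."""
    issues = []
    if sys.version_info < (3, 8):
        issues.append("Python 3.8+ required")
    if shutil.which("ffmpeg") is None:
        issues.append("FFmpeg not found on PATH")
    if not os.environ.get("MINIMAX_VOICE_API_KEY"):
        issues.append("MINIMAX_VOICE_API_KEY not set")
    return issues

print(quick_env_check())  # empty list when the environment is ready
```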

Step 2: Decision and Pre-processing

⚠️ MOST IMPORTANT PRINCIPLE: Gender Matching First

Before selecting voices, you MUST always match gender first. This is non-negotiable.

Golden Rule:

  • If a character is male → use a male voice
  • If a character is female → use a female voice
  • If a character is neutral/other → choose an appropriate neutral voice

Why this matters:

  • Violating gender matching (e.g., male character with female voice) breaks immersion
  • Even if personality traits match, gender comes first
  • This is especially critical for classic literature, historical content, and professional narration

Examples:

| Character | Wrong Voice | Correct Voice |
|---|---|---|
| 唐三藏 (male monk) | female-yujie | Chinese (Mandarin)_Gentleman |
| 林黛玉 (female) | male-qn-badao | female-shaonv |
| 曹操 (male warlord) | female-chengshu | Chinese (Mandarin)_Unrestrained_Young_Man |
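As a toy illustration of the rule, a matching check can be sketched as below. The gender labels here are assumptions based on the examples above; real voice attributes come from voice_catalog.md:

```python
# Hypothetical gender metadata; authoritative attributes live in voice_catalog.md.
VOICE_GENDER = {
    "female-shaonv": "female",
    "female-yujie": "female",
    "male-qn-badao": "male",
    "Chinese (Mandarin)_Gentleman": "male",
}

def gender_ok(character_gender: str, voice_id: str) -> bool:
    """Gender matching first: the voice's gender must equal the character's."""
    return VOICE_GENDER.get(voice_id) == character_gender

print(gender_ok("male", "Chinese (Mandarin)_Gentleman"))  # True
print(gender_ok("male", "female-yujie"))                  # False
```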

Decision guide: Evaluate based on:

  • Does the user specify a model? → Use that model; otherwise use the default (speech-2.8-hd)
  • Is multi-voice needed? → Different voice_id per speaker/character
  • For speech-2.8: emotion is auto-matched (leave emotion empty)
  • For older models: manually specify emotion tags

Use case scenarios:

| Scenario | Description | Segments | Voice Selection |
|---|---|---|---|
| Single Voice | User needs one voice for the entire content. Segment only by length (≤1,000,000 chars per segment). | Split by length only | One voice_id for all segments |
| Multi-Voice | Multiple characters/speakers, each with a different voice. Segment by speaker/role changes. | Split by logical unit (speaker, dialogue, etc.) | Different voice_id per role |
| Podcast/Interview | Host and guest speakers with distinct voices. | Split by speaker | Voice per host/guest |
| Audiobook/Fiction | Narrator and character voices. | Split by narration vs. dialogue | Voice per narrator/character |
| Documentary | Mostly narration with occasional quotes. | Keep as one segment | Single narrator voice |
| Report/Announcement | Formal content with consistent tone. | Keep as one segment | Professional voice |

Processing Workflow (4 sub-steps):

Step 2.1: Text Segmentation and Role Analysis First, segment your text into logical units and identify the role/character for each segment.

Key principle (Important!): Split by logical unit, NOT simply by sentence

When to split (Important!):

  • Different speakers clearly marked
  • Narrator vs. character dialogue (in fiction/audiobooks/interview etc.)
  • In scenarios where the speaker's identity matters (audiobooks, multi-voice fiction, etc.), split when narration and dialogue mix in the same sentence

When NOT to split (Important!):

  • Third-person narration like "John said..." or "The reporter noted..."
  • Quoted speech within narration (documentary/podcast/report, etc.) should stay in the narrator's voice
  • Keep the narrator's voice unless specific characterization is needed

Decision depends on use case:

| Use case | Example | Split strategy |
|---|---|---|
| Single Voice | Long article, news piece, announcement | Split by length (≤1,000,000 chars), same voice for all |
| Podcast/Interview | "Host: Welcome to the show. Guest: Thank you for having me." | Split by speaker |
| Documentary narration | "The scientist explained, 'The results are promising.'" | Keep as one segment (narrator voice) |
| Audiobook/Fiction | "'Who's there?' she whispered." | Split: "'Who's there?'" goes to the character voice, "she whispered." to the narrator's voice |
| Report | "According to the report, the economy is growing." | Keep as one segment |

Example 1: Single Voice (speech-2.8)
For single-voice content (e.g., news, announcements, articles), segment only by length while maintaining the same voice:

[
  {"text": "First part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""},
  {"text": "Second part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""},
  {"text": "Third part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""}
]

Example 2: Audiobook with characters (speech-2.8)
In audiobooks (multi-voice fiction), split when narration and dialogue mix in the same sentence:

[
  {"text": "The detective entered the room.", "role": "narrator", "voice_id": "", "emotion": ""},
  {"text": "\"Who's there?\"", "role": "female_character", "voice_id": "", "emotion": ""},
  {"text": "she whispered.", "role": "narrator", "voice_id": "", "emotion": ""},
  {"text": "\"It's me,\"", "role": "male_character", "voice_id": "", "emotion": ""},
  {"text": "he replied calmly.", "role": "narrator", "voice_id": "", "emotion": ""}
]

Example 3: Documentary/podcast narration (speech-2.8)
Quoted speech in narration stays in the narrator's voice (no need to split):

[
  {
    "text": "The scientist explained, \"The results show significant improvement in all test groups.\"",
    "role": "narrator",
    "voice_id": "",
    "emotion": ""
  },
  {
    "text": "According to the latest report, the economy has grown by 3% this quarter.",
    "role": "narrator",
    "voice_id": "",
    "emotion": ""
  }
]

**Note:** In the preliminary `segments.json`:
- Fill in the `text` field with segment content
- Fill in the `role` field to identify the character (narrator, male_character, female_character, host, guest, etc.)
- Leave `voice_id` empty (to be filled in Step 2.2)
- Leave `emotion` empty for speech-2.8 models


**Step 2.2: Voice Selection**

After segmenting and labeling roles, analyze all detected characters in your text. Consult [voice_catalog.md](reference/voice_catalog.md) **Section 1 "How to Choose a Voice"** to match voices to characters.

**⚠️ CRITICAL: Follow the two-step selection process below**

**Path A — Professional domains (Story/Narration, News/Announcements, Documentary):**
If the content belongs to one of these three professional domains, prioritize selecting from the recommended voices in **voice_catalog.md Section 2.1** (filter by scenario + gender). These voices are specifically optimized for their professional use cases.

**Path B — All other scenarios:**
Select from **voice_catalog.md Section 2.2**, following this strict priority hierarchy:

1. **First: Match Gender** (non-negotiable) — Male characters MUST use male voices, female characters MUST use female voices
2. **Second: Match Language** — The voice MUST match the content language (Chinese content → Chinese voice, Korean content → Korean voice, English content → English voice, etc.). Never assign a voice from the wrong language.
3. **Third: Match Age** — Determine the age group (Children / Youth / Adult / Elderly / Professional) and select from the corresponding subsection in Section 2.2
4. **Fourth: Match Personality & Role** — Choose the best fit based on personality traits, tone, and character role

**Voice Selection Decision Tree:**

Is this a professional domain (Story/News/Documentary)?
├── YES → Select from voice_catalog Section 2.1 (filter by scenario + gender)
└── NO → Select from voice_catalog Section 2.2:
    Step 1: Match Gender
    ├── Male character → Male voices only
    └── Female character → Female voices only
    Step 2: Match Language
    └── Filter to voices matching the content language
    Step 3: Match Age Group
    └── Children / Youth / Adult / Elderly / Professional
    Step 4: Match Personality & Role
    └── Choose best fit by tone, personality, character role


**Step 2.3: Emotions Segmentation** *(For non-2.8 series models only)*
For models other than speech-2.8 series, analyze emotions in your segments:
- For **long segments**, split further based on **emotional transitions**
- Add appropriate **emotion tags** to each segment
- Refer to Section 3 in [text-processing.md](reference/text-processing.md) for emotion tags and examples
- Skip this step for speech-2.8 models (emotion is auto-matched)

**Emotion Tags:**
- For speech-2.6 series (speech-2.6-hd and speech-2.6-turbo): happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper
- For older models: happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions)
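These rules can be sketched as a small lookup. This mirrors the guidance in this document, not an official API check; model names outside the speech-2.8/2.6 series are treated as "older":

```python
EMOTIONS_7 = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm"}
EMOTIONS_9 = EMOTIONS_7 | {"fluent", "whisper"}

def emotion_allowed(model: str, emotion: str) -> bool:
    """Sketch of the per-model emotion rules described above."""
    if model.startswith("speech-2.8"):
        return emotion == ""  # auto-matched: leave the field empty
    allowed = EMOTIONS_9 if model.startswith("speech-2.6") else EMOTIONS_7
    return emotion in allowed

print(emotion_allowed("speech-2.8-hd", ""))         # True
print(emotion_allowed("speech-2.6-hd", "whisper"))  # True
```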


**Step 2.4: Check and Post-processing**
Finally, review and optimize your script:
- Verify segment length limits (async TTS ≤1,000,000 characters)
- Clean up conversational text (remove speaker names if needed)
- Ensure consistency in voice and emotion tags
- **Critical check for multi-voice content**: For audiobooks, multi-voice fiction, or content where dialogue is presented from a first-person perspective, verify that narration and dialogue mixed in the same sentence are properly split.

  **When splitting IS needed (first-person dialogue in fiction/audiobooks):**
  
  Example: `"John asked, 'Where are you going?'"` should be split into:
  - Segment 1: `"John asked, "` - uses narrator voice (describes who is speaking)
  - Segment 2: `"Where are you going?"` - uses the character's voice (actual dialogue in first-person)

  This ensures proper voice differentiation: descriptive narration uses the narrator's voice, while the character's spoken words use the character's designated voice.

  **When splitting is NOT needed (third-person quotes in podcast/documentary/news):**
  
  In podcasts, documentaries, or news reports, quoted speech is typically presented in third-person narrative style - the speaker's words are being reported, not performed. Keep these as one segment with the narrator's voice and remove the speaker's name at the beginning:
  
  - `"Welcome to our show."` → narrator voice; remove the speaker attribution (e.g., "The host said:") at the beginning
  - `"According to experts, 'This technology represents a significant breakthrough.'"` → keep as one segment (narrator voice)
  - `"Scientists noted, 'The experimental results exceeded our expectations.'"` → keep as one segment (narrator voice)
- **If the split is missing**: Go back to Step 2.1 and ensure dialogue portions are separated from narration with appropriate role labels.
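The first-person split described above can be sketched with a regex. This is illustrative only: it handles a single double-quoted span per sentence, and real texts still need review:

```python
import re

# One double-quoted span per sentence; straight or curly quotes.
QUOTE_RE = re.compile(r'^(.*?)(["\u201c].*["\u201d])(.*)$')

def split_dialogue(sentence, narrator="narrator", character="character"):
    """Split mixed narration+dialogue into separate role-labeled segments."""
    m = QUOTE_RE.match(sentence)
    if not m:
        return [{"text": sentence, "role": narrator}]
    pre, quote, post = (part.strip() for part in m.groups())
    segments = []
    if pre:
        segments.append({"text": pre, "role": narrator})
    segments.append({"text": quote, "role": character})
    if post:
        segments.append({"text": post, "role": narrator})
    return segments

print(split_dialogue('John asked, "Where are you going?"'))
```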

**Create segments.json:**
After completing all 4 sub-steps, save the final `segments.json` to `<cwd>/audio/segments.json`.


### Step 2.5: Generate Preview for User Confirmation (Highly Recommended)

**For multi-voice content (audiobooks, dramas, etc.), always generate a preview first.**

This saves time and prevents waste when voice selections need adjustment.

**How to generate a preview:**
1. Create a smaller segments file with 10-20 representative segments (include all characters)
2. Generate the preview audio
3. Ask user to listen and confirm voice choices

**Preview segments.json example:**
```json
[
  {"text": "Narration opening...", "role": "narrator", "voice_id": "...", "emotion": ""},
  {"text": "Male character speaks...", "role": "male_character", "voice_id": "...", "emotion": ""},
  {"text": "Female character speaks...", "role": "female_character", "voice_id": "...", "emotion": ""},
  {"text": "More dialogue...", "role": "...", "voice_id": "...", "emotion": ""}
]
```

Preview command:

python mmvoice.py generate segments_preview.json -o preview.mp3

When user confirms preview:

  • Use the same voice selections for the full segments.json
  • No need to re-select voices

Step 3: Present plan to user for confirmation

Before proceeding to validation and generation, present the segmentation plan to the user and wait for confirmation:

Present to the user:

  • Roles identified: List all characters/speakers in the text
  • Voice assignments: Show which voice_id is assigned to each role (include voice characteristics from voice_catalog.md)
  • Model being used: Explain why this model was selected
  • Language: Confirm the primary language of the content
  • Emotion approach: Auto-matched (speech-2.8) or manual tags (older models)

Example confirmation message:

I've analyzed the text and created a segmentation plan:

**Roles and Voices:**
- Narrator: male-qn-jingying (deep, authoritative, suitable for storytelling)
- Protagonist: female-shaonv (bright, energetic, youthful)
- Antagonist: male-qn-qingse (cool, menacing)

**Model:** speech-2.8-hd (recommended - automatic emotion matching)
**Language:** Chinese
**Segments:** 8 segments total

Please review and confirm:
1. ⚠️ **Gender Verification**: Do the voice genders match the character genders?
   - [Narrator: Male ✓] [Protagonist: Female ✓] [Antagonist: Male ✓]
2. ⚠️ **Language Verification**: Do the voice languages match the content language?
   - [All voices: Chinese ✓]
3. Are the voice assignments appropriate for each character (age, personality)?
4. Should any segments be combined or split differently?
5. Any other changes you'd like to make?

**After generation:**
- I'll generate a preview first for you to review
- Only after you confirm the audio quality will I clean up temporary files
- If not satisfied, I'll re-generate and we iterate until you're happy

Reply "confirm" to proceed, or let me know what to adjust.

Wait for user response:

  • If user confirms → Proceed to Step 4 (validate)
  • If user suggests changes → Update segments.json and present the plan again for confirmation

Step 4: Validate segments.json (model, emotion, voice_id validation)

Before generating audio, validate the segments file:

# Default: speech-2.8-hd (auto emotion matching)
python mmvoice.py validate <cwd>/audio/segments.json

# Specify model for context-specific validation
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd

# Validate voice_ids against available voices (slower, requires API call)
python mmvoice.py validate <cwd>/audio/segments.json --validate-voices

# Combined options (recommended)
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices

# Use `--verbose` to see segment details
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices --verbose

Emotion Validation checks:

| Model | Emotion Validation |
|---|---|
| speech-2.8-hd/turbo | Emotion can be empty (auto emotion matching) |
| speech-2.6-hd/turbo | All 9 emotions supported |
| Older models | happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions) |

Voice ID validation: With --validate-voices:

  • Calls API once to get all available voices
  • Validates each voice_id against the list
  • Shows errors for invalid voice_ids (blocks validation)
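Conceptually, the check is a set difference. This is a sketch; the real implementation fetches the available voice list from the API once:

```python
def find_invalid_voice_ids(segments, available_ids):
    """Return voice_ids used in segments that are not in the available list."""
    used = {seg["voice_id"] for seg in segments if seg.get("voice_id")}
    return sorted(used - set(available_ids))

segments = [
    {"text": "Hello", "voice_id": "female-shaonv"},
    {"text": "Hi there", "voice_id": "not-a-real-voice"},
]
print(find_invalid_voice_ids(segments, ["female-shaonv", "male-qn-jingying"]))
# ['not-a-real-voice']
```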

Step 5: Generate and merge audio

Generate audio for all segments and merge into final output.

File placement (default behavior if user doesn't specify):

<cwd>/                      # Claude's current working directory
└── audio/                  # Created automatically
    ├── tmp/                # Intermediate segment files
    │   ├── segment_0000.mp3
    │   ├── segment_0001.mp3
    │   └── ...
    └── <custom_audio_name>.mp3             # Final merged audio, name can be customized

Where <cwd> is Claude's current working directory (where commands are executed).

  • If -o is not specified, output goes to <cwd>/audio/output.mp3
  • Intermediate files go to <cwd>/audio/tmp/
  • After user confirms the final audio, ask whether to delete <cwd>/audio/tmp/

Basic usage:

# Default: speech-2.8-hd, output to <cwd>/audio/output.mp3
python mmvoice.py generate <cwd>/audio/segments.json

# Specify output path
python mmvoice.py generate <cwd>/audio/segments.json -o <cwd>/audio/<custom_audio_name>.mp3

# Specify model if needed
python mmvoice.py generate <cwd>/audio/segments.json --model speech-2.6-hd

Skip existing segments (for rate limit retries):

# Only generate segments that don't exist yet - skips already-generated files
python mmvoice.py generate <cwd>/audio/segments.json --skip-existing

Error handling:

  • If a segment fails, the script reports which segment and why
  • Use --continue-on-error to generate remaining segments despite failures
  • Use --skip-existing to skip already successfully generated segments (recommended for retries after rate limit)
  • The script automatically uses fallback merging if FFmpeg filter_complex fails
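A minimal sketch of the --skip-existing idea, assuming the segment_NNNN.mp3 naming shown in the tmp/ layout above:

```python
import tempfile
from pathlib import Path

def segments_to_generate(num_segments, tmp_dir):
    """Indices of segments whose segment_NNNN.mp3 output is missing."""
    tmp = Path(tmp_dir)
    return [i for i in range(num_segments)
            if not (tmp / f"segment_{i:04d}.mp3").exists()]

# Demo: segment 0 already rendered; 1 and 2 still need generation.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "segment_0000.mp3").touch()
    print(segments_to_generate(3, tmp_dir))  # [1, 2]
```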

Step 6: Confirm and cleanup

⚠️ CRITICAL: Never delete temp files until user confirms!

After generation completes, you MUST follow this exact sequence:

Step 6.1: Report generation result to user

✓ Audio saved to: <output_path>
  Generated: X/Y segments
  Intermediate files in: <cwd>/audio/tmp/

Step 6.2: Ask user to confirm audio quality Ask the user to listen to the audio and confirm:

  1. Is the audio quality satisfactory?
  2. Are all voices appropriate?
  3. Any adjustments needed?

Step 6.3: Wait for user response

Step 6.4: Only after user confirms, offer cleanup

After confirming audio quality, temporary files can be deleted with:
rm -rf <cwd>/audio/tmp/

NEVER execute rm -rf on temp files without explicit user confirmation!

If user is NOT satisfied:

  • Do NOT delete temp files
  • Discuss what needs to be adjusted
  • Re-generate affected segments if needed
  • Ask for confirmation again

Other Usage

Use the following when the task involves voice creation, single-voice TTS (sync/async), or audio processing instead of the main segment-based workflow. Each subsection gives CLI commands, script paths, and the reference doc to open for details.

Voice creation (clone / design / list)

  • Purpose: Create custom voices from audio (clone) or from a text description (design); list system and custom voices.
  • CLI (entry point: mmvoice.py):
    python mmvoice.py clone AUDIO_FILE --voice-id VOICE_ID   # Clone from 10s–5min audio
    python mmvoice.py design "DESCRIPTION" --voice-id ID      # Design from text
    python mmvoice.py list-voices                             # List all voices
    
  • Scripts: scripts/voice_clone.py (clone), scripts/voice_design.py (design), scripts/voice_management.py (list/manage).
  • Documentation: reference/voice-guide.md — cloning (quick + high-quality + step-by-step), design workflow, management.

Text-to-speech (sync / async)

  • Purpose: Single-voice TTS: sync for short text (≤10k chars), async for long text (up to 1M chars); optional streaming.
  • CLI:
    python mmvoice.py tts "TEXT" -o OUTPUT.mp3 [-v VOICE_ID] [--model MODEL]
    
  • Scripts: scripts/sync_tts.py (HTTP/WebSocket sync), scripts/async_tts.py (async task + poll).
  • Documentation: reference/tts-guide.md — sync TTS, async TTS, streaming, segment-based production.

Audio processing (merge / convert / normalize)

  • Purpose: Merge files (with optional crossfade), convert format, normalize loudness, trim.
  • CLI:
    python mmvoice.py merge FILE1 [FILE2 ...] -o OUTPUT [--crossfade MS]
    python mmvoice.py convert INPUT -o OUTPUT [--format FORMAT]
    
  • Script: scripts/audio_processing.py (merge, convert, normalize, trim).
  • Documentation: reference/audio-guide.md — format conversion, merging (filter_complex + concat demuxer fallback), normalization, trimming, optimization.

Segment-based TTS (main workflow)

  • CLI: validate and generate as in Steps 4–5 above.
  • Script: scripts/segment_tts.py.
  • Documentation: reference/cli-guide.md, reference/api_documentation.md.

Reference documents (on-demand)

Open these when you need concrete usage, parameters, or troubleshooting. Paths are relative to the skill root.

| Document | Content for the Agent |
|---|---|
| reference/cli-guide.md | All CLI commands (validate, generate, tts, clone, design, list-voices, merge, convert, check-env) with options and examples. Use for correct CLI invocation. |
| reference/getting-started.md | Environment setup (venv, pip install, FFmpeg), MINIMAX_VOICE_API_KEY, basic synthesis test. Use for first-time setup or "env not working". |
| reference/tts-guide.md | Sync TTS (short text), async TTS (long text), streaming TTS, multi-segment production. Use for sync/async/streaming logic and parameters. |
| reference/voice-guide.md | Voice cloning (quick, high-quality with prompt audio, step-by-step), voice design, voice management. Use for custom voice creation flows. |
| reference/audio-guide.md | Format conversion, merging (including crossfade and fallback), normalization, trimming, optimization. Use for merge/convert/normalize behavior and options. |
| reference/script-examples.md | Copy-paste runnable examples for sync TTS, async TTS, segment-based TTS, audio processing, voice clone/design/management. Use for quick Python snippets. |
| reference/troubleshooting.md | Environment (API key, FFmpeg), API errors, segment-based TTS, audio, voice. Use when an error message or unexpected behavior appears. |
| reference/api_documentation.md | Full API reference: config, sync/async TTS, emotion parameter, segment-based TTS, voice clone/design/management, audio processing, common parameters, error handling. Use for exact function signatures and parameter details. |
| reference/voice_catalog.md | System voices list (male/female/beta), selection guide, voice parameters, custom voices, voice IDs. Use to choose or look up voice_id. |

Important notes

Requirements

  • Python: 3.8 or higher
  • API Key: MINIMAX_VOICE_API_KEY environment variable must be set
  • FFmpeg: Required for audio processing (merge, convert, normalize)
    • Install: brew install ffmpeg (macOS) or sudo apt install ffmpeg (Ubuntu)

Limits and constraints

  • Text length: Sync TTS ≤10,000 chars; async TTS ≤1,000,000 chars
  • Voice cloning: Audio must be 10s–5min duration, ≤20MB, formats: mp3/wav/m4a
  • Voice expiration: Custom voices (cloned/designed) expire after 7 days if not used with TTS
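The two text-length limits imply a simple routing rule, sketched here (the function name is illustrative, not part of the skill's API):

```python
def tts_mode(text: str) -> str:
    """Route by length: sync TTS up to 10,000 chars, async up to 1,000,000."""
    if len(text) <= 10_000:
        return "sync"
    if len(text) <= 1_000_000:
        return "async"
    raise ValueError("text exceeds the async TTS limit; split into segments")

print(tts_mode("Hello, world"))  # sync
```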

Special features

  • Pause insertion: Use <#x#> in text where x = pause duration in seconds (0.01–99.99)
    • Example: "Hello<#1.5#>world" creates 1.5s pause between words
  • Supported emotions: happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper
    • speech-2.8: automatic matching; speech-2.6: all 9; older models: first 7
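Pause markers can be extracted with a regex. This is a loose sketch of the syntax above; markers the pattern does not recognize are simply ignored:

```python
import re

# <#x#> pause markers: up to two digits, optional two-decimal fraction.
PAUSE_RE = re.compile(r"<#(\d{1,2}(?:\.\d{1,2})?)#>")

def pause_durations(text: str):
    """Extract pause durations (in seconds) from the text."""
    return [float(x) for x in PAUSE_RE.findall(text)]

print(pause_durations("Hello<#1.5#>world"))  # [1.5]
```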
