comfyui-voice-pipeline

Generate character voices using TTS, voice cloning, and lip-sync tools. Supports Chatterbox, F5-TTS, TTS Audio Suite, RVC, and ElevenLabs. Use when creating speech audio for characters or syncing audio to video.


ComfyUI Voice Pipeline

Creates character voices through TTS/voice cloning and synchronizes them with generated video.

Voice Generation Decision Tree

VOICE REQUEST
    |
    |-- Have reference audio of target voice?
    |   |-- Yes (5+ seconds) → Chatterbox (MIT, paralinguistic tags)
    |   |-- Yes (10-15 seconds) → F5-TTS (fastest zero-shot)
    |   |-- Yes (10+ minutes) → RVC training (highest fidelity)
    |   |-- Yes (any length, budget) → ElevenLabs (production quality)
    |
    |-- No reference audio?
    |   |-- Need emotion control → IndexTTS-2 (8-emotion vectors)
    |   |-- Need multi-language → TTS Audio Suite (23 languages)
    |   |-- Need voice design → ElevenLabs Voice Design (describe voice)
    |   |-- Quick prototype → Any TTS with default voice
    |
    |-- Need multi-speaker dialog?
    |   |-- Chatterbox (4 voices) or TTS Audio Suite (character switching)
    |
    |-- Need lip-sync?
    |   |-- Best accuracy → Wav2Lip + CodeFormer
    |   |-- Need head movement → SadTalker
    |   |-- Full expression control → LivePortrait
    |   |-- Unlimited length → InfiniteTalk

Tool Reference

Chatterbox (Recommended Open-Source)

Strengths: MIT license, preferred over ElevenLabs in 63.8% of blind tests, clones from a 5-second sample, emotion control, sub-200ms latency.

Paralinguistic tags:

[laugh] [chuckle] [sigh] [gasp] [cough] [clear throat]
[whisper] [excited] [sad] [angry] [surprised]

Key parameter: exaggeration (0.25-2.0) controls expressiveness.

Limit: 40-second generation cap. Split longer content.
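The 40-second cap can be handled with a sentence-boundary splitter before generation. A minimal sketch, assuming an average speech rate of ~150 words per minute (the rate constant is an assumption for estimating duration, not a Chatterbox parameter):

```python
import re

WORDS_PER_SECOND = 2.5  # assumption: ~150 wpm average speech rate

def split_for_chatterbox(text, max_seconds=40):
    """Split text at sentence boundaries so each chunk fits the 40s cap."""
    max_words = int(max_seconds * WORDS_PER_SECOND)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk before it would exceed the word budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Generate each chunk separately, then concatenate the audio (see FFmpeg Processing below).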

F5-TTS

Strengths: Fastest zero-shot cloning, <15 second samples, MIT license, multi-language.

Requirements: Each reference must be a .wav file paired with a .txt file containing its matching transcription.

Languages: English, German, Spanish, French, Japanese, Hindi, Thai, Portuguese.
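A quick pre-flight check for the .wav + .txt pairing requirement (`check_f5_references` is a hypothetical helper, not part of F5-TTS itself):

```python
from pathlib import Path

def check_f5_references(ref_dir):
    """Return the .wav files missing the .txt transcription F5-TTS expects."""
    missing = []
    for wav in sorted(Path(ref_dir).glob("*.wav")):
        if not wav.with_suffix(".txt").exists():
            missing.append(wav.name)
    return missing
```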

TTS Audio Suite

Strengths: Unified multi-engine platform, 23 languages, character switching.

Special features:

  • Character switching: [CharacterName] tags
  • Language switching: [de:Alice], [fr:Bob]
  • Pause control: [pause:1s]
  • SRT timing sync

Integrates: F5-TTS, Chatterbox, Higgs Audio 2, VibeVoice, IndexTTS-2, RVC.

IndexTTS-2

Strengths: 8-emotion vector control with per-segment parameters.

Emotions: happy, angry, sad, surprised, afraid, disgusted, calm, melancholic.

RVC (Voice Conversion)

Use case: Train a model on target voice (10+ min audio), then convert any TTS output.

Pipeline: Text → Any TTS → Base Audio → RVC Model → Character Voice

Training: 300-500 epochs, RMVPE feature extraction.

ElevenLabs (Commercial)

Tiers:

  • Instant Clone: 1-minute sample, good quality
  • Professional Clone: 30+ minutes (3h ideal), near-indistinguishable
  • Voice Design: Describe voice in text (no sample needed)
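As a sketch, a generation call against the ElevenLabs REST API can be assembled with the standard library. The endpoint path and xi-api-key header follow the public v1 API; the default model_id is an assumption and should be checked against the current ElevenLabs documentation:

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"  # public v1 REST base

def build_tts_request(voice_id, text, api_key, model_id="eleven_multilingual_v2"):
    """Build (not send) an ElevenLabs text-to-speech POST request.

    Send it with urllib.request.urlopen(); the response body is audio bytes.
    """
    body = json.dumps({"text": text, "model_id": model_id}).encode()
    return urllib.request.Request(
        f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```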

Voice Profile Setup

For each character, establish a voice profile in projects/{project}/characters/{name}/profile.yaml:

voice:
  cloned: true
  model: "chatterbox"
  sample_file: "references/voice_sample.wav"
  settings:
    exaggeration: 1.2
    default_emotion: "neutral"
  notes: "Warm, confident tone. Slight Italian-American undertones."
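Profiles like this can be sanity-checked before generation. A minimal sketch (`validate_profile` is a hypothetical helper; 0.25-2.0 is the documented Chatterbox exaggeration range):

```python
def validate_profile(profile):
    """Sanity-check a parsed profile.yaml dict before running TTS."""
    voice = profile["voice"]
    if voice.get("cloned") and not voice.get("sample_file"):
        raise ValueError("cloned voice needs a sample_file")
    exaggeration = voice.get("settings", {}).get("exaggeration", 1.0)
    if not 0.25 <= exaggeration <= 2.0:  # documented Chatterbox range
        raise ValueError(f"exaggeration {exaggeration} outside 0.25-2.0")
    return profile
```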

Script Preparation

Text Formatting for TTS

  1. Punctuation matters: Commas create pauses, periods create stops
  2. Phonetic hints: Spell unusual words phonetically if mispronounced
  3. Emotion cues: Use Chatterbox tags or split by emotion for IndexTTS-2
  4. Length: Split into 30-40 second segments for Chatterbox limit

Multi-Speaker Script

[Sage] Hello! *laughs* I've been looking forward to this.
[pause:0.5s]
[Alex] [excited] Same here! Let's dive right in.
[Sage] [whisper] But first, I need to tell you something...
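Scripts in this format are easy to pre-parse before dispatching segments to different voices. A sketch assuming the tag syntax shown above; the (speaker, text) cue list is a hypothetical intermediate format, not the TTS Audio Suite API:

```python
import re

def parse_dialog(script):
    """Parse [Name] speaker tags and [pause:Xs] markers into a cue list."""
    cues = []
    for raw in script.strip().splitlines():
        line = raw.strip()
        # Stand-alone pause marker, e.g. [pause:0.5s]
        pause = re.fullmatch(r"\[pause:(\d+(?:\.\d+)?)s\]", line)
        if pause:
            cues.append(("pause", float(pause.group(1))))
            continue
        # Leading [Name] tag; emotion tags stay inside the text
        m = re.match(r"\[(\w+)\]\s*(.*)", line)
        if m:
            cues.append((m.group(1), m.group(2)))
    return cues
```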

Audio Post-Processing

Requirements for Lip-Sync Input

  • Sample rate: 16-24kHz (model dependent)
  • Format: WAV (uncompressed)
  • Mono channel
  • Trim leading silence
  • Add 0.2s trailing silence
  • Normalize to -3dB peak
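The sample-rate and channel requirements can be verified with Python's standard wave module before handing audio to a lip-sync node (a minimal sketch; it does not check silence trimming or peak level):

```python
import wave

def check_lipsync_audio(path):
    """Return a list of problems against the mono / 16-24kHz requirements."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        rate = w.getframerate()
        if not 16000 <= rate <= 24000:
            problems.append(f"sample rate {rate} outside 16-24kHz")
    return problems
```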

FFmpeg Processing

# Convert to mono 24kHz WAV, normalized
ffmpeg -i input.wav -ac 1 -ar 24000 -af "loudnorm=I=-16:TP=-3" output.wav

# Trim silence from start/end
ffmpeg -i input.wav -af "silenceremove=start_periods=1:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" trimmed.wav

# Concatenate segments
ffmpeg -f concat -safe 0 -i filelist.txt -c copy combined.wav
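The concat command expects filelist.txt in the concat demuxer's `file 'path'` format, which can be generated like this (`write_concat_list` is a hypothetical helper):

```python
from pathlib import Path

def write_concat_list(segment_paths, list_path="filelist.txt"):
    """Write the filelist.txt that ffmpeg's concat demuxer reads."""
    lines = [f"file '{Path(p).resolve()}'" for p in segment_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return list_path
```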

Lip-Sync Methods

Wav2Lip (Best Accuracy)

Settings:

wav2lip_model: "wav2lip_gan.pth"  # Better than wav2lip.pth
face_detect_batch: 16
nosmooth: false
pad_bottom: 10

Always post-process Wav2Lip output with CodeFormer (fidelity 0.7).

SadTalker (Head Movement)

Settings:

preprocess: "full"     # Better for novel faces
enhancer: "gfpgan"
pose_style: 10-20      # Natural conversation range

LivePortrait (Expression Control)

Settings:

lip_zero: 0.03         # Reduces unnatural lip movement
stitching: true        # Seamless face blending

Best for: Premium avatar creation, expression transfer from driving video.

LatentSync 1.6 (Newest, Highest Quality)

ByteDance model trained at 512x512 with TREPA modules for temporal consistency.

InfiniteTalk (Unlimited Length)

For videos longer than standard lip-sync limits. Integrates with Wan for joint generation.

Complete Talking Head Workflow

Pipeline A: Quick (Image → Talk)

1. [Text] → Chatterbox/F5-TTS → audio.wav
2. [Character Image] + audio.wav → SadTalker → video.mp4
3. video.mp4 → GFPGAN/CodeFormer → final.mp4

Time: ~2 minutes. Quality: Good.

Pipeline B: Quality (Image → Video → Lip-Sync)

1. [Text] → Chatterbox → audio.wav
2. [Character Image] → Wan I2V → base_video.mp4
   Prompt: "person talking, slight head movement, indoor"
3. base_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
4. lipsync.mp4 → FaceDetailer batch → enhanced.mp4
5. enhanced.mp4 → Color correct + Deflicker → final.mp4

Time: ~10 minutes. Quality: Production.

Pipeline C: Premium (Expression Transfer)

1. Record driving video (actor performing lines)
2. [Text] → Voice Clone TTS → audio.wav
3. [Character Image] + driving.mp4 → LivePortrait → expression_video.mp4
4. expression_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
5. lipsync.mp4 → CodeFormer → final.mp4

Time: ~15 minutes. Quality: Premium.

Troubleshooting

Issue → Solution

  • Audio out of sync → Offset with ffmpeg: ffmpeg -itsoffset 0.1 -i audio.wav ...
  • Subtle mouth movements → Use wav2lip_gan.pth, increase audio volume
  • Face artifacts → Post-process with CodeFormer (fidelity 0.6-0.8)
  • Robotic voice clone → Use a longer/cleaner reference, increase exaggeration
  • Unnatural head movement → Lower SadTalker pose_style to 0-10

Reference

  • references/voice-synthesis.md - Full voice tool documentation
  • references/models.md - Voice model download links
  • Character voice profiles in projects/{project}/characters/

