ElevenLabs Audio Generation
Requires ELEVENLABS_API_KEY in .env .
Text-to-Speech
from elevenlabs.client import ElevenLabs from elevenlabs import save, VoiceSettings import os
client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
audio = client.text_to_speech.convert( text="Welcome to my video!", voice_id="JBFqnCBsd6RMkjVDRZzb", model_id="eleven_multilingual_v2", voice_settings=VoiceSettings( stability=0.5, similarity_boost=0.75, style=0.5, speed=1.0 ) ) save(audio, "voiceover.mp3")
Models
Model Quality SSML Support Notes
eleven_multilingual_v2
Highest consistency None Stable, production-ready, 29 languages
eleven_flash_v2_5
Good <break> , <phoneme>
Fast, supports pause/pronunciation tags
eleven_turbo_v2_5
Good <break> , <phoneme>
Fastest latency
eleven_v3
Most expressive None Alpha — unreliable, needs prompt engineering
Choose: multilingual_v2 for reliability, flash/turbo for SSML control, v3 for maximum expressiveness (expect retakes).
Voice Settings by Style
Style stability similarity style speed
Natural/professional 0.75-0.85 0.9 0.0-0.1 1.0
Conversational 0.5-0.6 0.85 0.3-0.4 0.9-1.0
Energetic/YouTuber 0.3-0.5 0.75 0.5-0.7 1.0-1.1
Pauses Between Sections
With flash/turbo models: Use SSML break tags inline:
...end of section. <break time="1.5s" /> Start of next...
Max 3 seconds per break. Excessive breaks can cause speed artifacts.
With multilingual_v2 / v3: No SSML support. Options:
-
Paragraph breaks (blank lines) — creates ~0.3-0.5s natural pause
-
Post-process with ffmpeg: split audio and insert silence
WARNING: ... (ellipsis) is NOT a reliable pause — it can be vocalized as a word/sound. Do not use ellipsis as a pause mechanism.
Pronunciation Control
Phonetic spelling (any model): Write words as you want them pronounced:
-
Janus → Jan-us
-
nginx → engine-x
-
Use dashes, capitals, apostrophes to guide pronunciation
SSML phoneme tags (flash/turbo only):
<phoneme alphabet="ipa" ph="ˈdʒeɪnəs">Janus</phoneme>
Iterative Workflow
-
Generate → listen → identify pronunciation/pacing issues
-
Adjust: phonetic spellings, break tags, voice settings
-
Regenerate. If pauses aren't precise enough, add silence in post with ffmpeg rather than fighting the TTS engine.
Voice Cloning
Instant Voice Clone
with open("sample.mp3", "rb") as f: voice = client.voices.ivc.create( name="My Voice", files=[f], remove_background_noise=True ) print(f"Voice ID: {voice.voice_id}")
-
Use client.voices.ivc.create() (not client.voices.clone() )
-
Pass file handles in binary mode ("rb" ), not paths
-
Convert m4a first: ffmpeg -i input.m4a -codec:a libmp3lame -qscale:a 2 output.mp3
-
Multiple samples (2-3 clips) improve accuracy
-
Save voice ID for reuse
Professional Voice Clone: Requires Creator plan+, 30+ min audio. See reference.md.
Sound Effects
Max 22 seconds per generation.
result = client.text_to_sound_effects.convert( text="Thunder rumbling followed by heavy rain", duration_seconds=10, prompt_influence=0.3 ) with open("thunder.mp3", "wb") as f: for chunk in result: f.write(chunk)
Prompt tips: Be specific — "Heavy footsteps on wooden floorboards, slow and deliberate, with creaking"
Music Generation
10 seconds to 5 minutes. Use client.music.compose() (not .generate() ).
result = client.music.compose( prompt="Upbeat indie rock, catchy guitar riff, energetic drums, travel vlog", music_length_ms=60000, force_instrumental=True ) with open("music.mp3", "wb") as f: for chunk in result: f.write(chunk)
Prompt structure: Genre, mood, instruments, tempo, use case. Add "no vocals" or use force_instrumental=True for background music.
Remotion Integration
Complete Workflow: Script to Synchronized Scene
VOICEOVER-SCRIPT.md → voiceover.py → public/audio/ → Remotion composition ↓ ↓ ↓ ↓ Scene narration Generate MP3 Audio files <Audio> component with durations per scene with timing synced to scenes
Step 1: Generate Per-Scene Audio
Use the toolkit's voiceover tool to generate audio for each scene:
Generate voiceover files for each scene
python tools/voiceover.py --scene-dir public/audio/scenes --json
Output:
public/audio/scenes/
├── scene-01-title.mp3
├── scene-02-problem.mp3
├── scene-03-solution.mp3
└── manifest.json (durations for each file)
The manifest.json contains timing info:
{ "scenes": [ { "file": "scene-01-title.mp3", "duration": 4.2 }, { "file": "scene-02-problem.mp3", "duration": 12.8 }, { "file": "scene-03-solution.mp3", "duration": 15.3 } ], "totalDuration": 32.3 }
Step 2: Use Audio in Remotion Composition
// src/Composition.tsx import { Audio, staticFile, Series, useVideoConfig } from 'remotion';
// Import scene components import { TitleSlide } from './scenes/TitleSlide'; import { ProblemSlide } from './scenes/ProblemSlide'; import { SolutionSlide } from './scenes/SolutionSlide';
// Scene durations (from manifest.json, converted to frames at 30fps) const SCENE_DURATIONS = { title: Math.ceil(4.2 * 30), // 126 frames problem: Math.ceil(12.8 * 30), // 384 frames solution: Math.ceil(15.3 * 30), // 459 frames };
export const MainComposition: React.FC = () => { return ( <> {/* Scene sequence */} <Series> <Series.Sequence durationInFrames={SCENE_DURATIONS.title}> <TitleSlide /> </Series.Sequence> <Series.Sequence durationInFrames={SCENE_DURATIONS.problem}> <ProblemSlide /> </Series.Sequence> <Series.Sequence durationInFrames={SCENE_DURATIONS.solution}> <SolutionSlide /> </Series.Sequence> </Series>
{/* Audio track - plays continuously across all scenes */}
<Audio src={staticFile('audio/voiceover.mp3')} volume={1} />
{/* Optional: Background music at lower volume */}
<Audio src={staticFile('audio/music.mp3')} volume={0.15} />
</>
); };
Step 3: Per-Scene Audio (Alternative)
For more control, add audio to each scene individually:
// src/scenes/ProblemSlide.tsx import { Audio, staticFile, useCurrentFrame } from 'remotion';
export const ProblemSlide: React.FC = () => { const frame = useCurrentFrame();
return ( <div style={{ /* slide styles / }}> <h1>The Problem</h1> {/ Scene content */}
{/* Audio starts when this scene starts (frame 0 of this sequence) */}
<Audio src={staticFile('audio/scenes/scene-02-problem.mp3')} />
</div>
); };
Syncing Visuals to Voiceover
Calculate scene duration from audio, not the other way around:
// src/config/timing.ts import manifest from '../../public/audio/scenes/manifest.json';
const FPS = 30;
// Convert audio durations to frame counts export const sceneDurations = manifest.scenes.reduce((acc, scene) => { const name = scene.file.replace(/^scene-\d+-/, '').replace('.mp3', ''); acc[name] = Math.ceil(scene.duration * FPS); return acc; }, {} as Record<string, number>);
// Usage in composition: // <Series.Sequence durationInFrames={sceneDurations.title}>
Audio Timing Patterns
import { Audio, Sequence, interpolate, useCurrentFrame } from 'remotion';
// Fade in audio export const FadeInAudio: React.FC<{ src: string; fadeFrames?: number }> = ({ src, fadeFrames = 30 }) => { const frame = useCurrentFrame(); const volume = interpolate(frame, [0, fadeFrames], [0, 1], { extrapolateRight: 'clamp', }); return <Audio src={src} volume={volume} />; };
// Delayed audio start export const DelayedAudio: React.FC<{ src: string; delayFrames: number }> = ({ src, delayFrames }) => ( <Sequence from={delayFrames}> <Audio src={src} /> </Sequence> );
// Usage: // <FadeInAudio src={staticFile('audio/music.mp3')} fadeFrames={60} /> // <DelayedAudio src={staticFile('audio/sfx/whoosh.mp3')} delayFrames={45} />
Voiceover + Demo Video Sync
When a scene has both voiceover and demo video:
import { Audio, OffthreadVideo, staticFile, useVideoConfig } from 'remotion';
export const DemoScene: React.FC = () => { const { durationInFrames, fps } = useVideoConfig();
// Calculate playback rate to fit demo into voiceover duration const demoDuration = 45; // seconds (original demo length) const sceneDuration = durationInFrames / fps; // seconds (from voiceover) const playbackRate = demoDuration / sceneDuration;
return ( <> <OffthreadVideo src={staticFile('demos/feature-demo.mp4')} playbackRate={playbackRate} /> <Audio src={staticFile('audio/scenes/scene-04-demo.mp3')} /> </> ); };
Error Handling
import { Audio, staticFile, delayRender, continueRender } from 'remotion'; import { useEffect, useState } from 'react';
export const SafeAudio: React.FC<{ src: string }> = ({ src }) => { const [handle] = useState(() => delayRender()); const [audioReady, setAudioReady] = useState(false);
useEffect(() => {
const audio = new window.Audio(src);
audio.oncanplaythrough = () => {
setAudioReady(true);
continueRender(handle);
};
audio.onerror = () => {
console.error(Failed to load audio: ${src});
continueRender(handle); // Continue without audio rather than hang
};
}, [src, handle]);
if (!audioReady) return null; return <Audio src={src} />; };
Toolkit Command: /generate-voiceover
The /generate-voiceover command handles the full workflow:
/generate-voiceover
- Reads VOICEOVER-SCRIPT.md
- Extracts narration for each scene
- Generates audio via ElevenLabs API
- Saves to public/audio/scenes/
- Creates manifest.json with durations
- Updates project.json with timing info
Popular Voices
-
George: JBFqnCBsd6RMkjVDRZzb (warm narrator)
-
Rachel: 21m00Tcm4TlvDq8ikWAM (clear female)
-
Adam: pNInz6obpgDQGcFmaJgB (professional male)
List all: client.voices.get_all()
For full API docs, see reference.md.