Text-to-Voice with Kyutai Pocket TTS

Convert text to natural-sounding speech with Kyutai's Pocket TTS, a lightweight 100M-parameter model that runs efficiently on CPU.

Installation

pip install pocket-tts
# or use uvx to run without installing:
uvx pocket-tts generate

Requires Python 3.10+ and PyTorch 2.5+. A GPU is not required.

CLI Usage

Basic Generation

# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate

# Specify text
pocket-tts generate --text "Hello, this is my message."

# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav

# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav

CLI Options

Option	Default	Description
`--text`	"Hello world..."	Text to convert to speech
`--voice`	alba	Voice name, local file path, or HuggingFace URL
`--output-path`	`./tts_output.wav`	Where to save the generated audio file
`--temperature`	0.7	Generation temperature (higher = more expressive)
`--lsd-decode-steps`	1	Quality steps (higher = better quality, slower)
`--eos-threshold`	-4.0	End detection threshold (lower = finish earlier)
`--frames-after-eos`	auto	Extra frames after end (each frame = 80ms)
`--device`	cpu	Device to use (cpu/cuda)
`-q, --quiet`	false	Disable logging output
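
The `--frames-after-eos` value is counted in frames rather than seconds. As a rough worked example, assuming the 80 ms frame length from the table and the 24 kHz output rate listed under Output Format, the conversion could look like this (the function names are illustrative and not part of pocket-tts):

```python
FRAME_MS = 80          # frame length stated in the options table
SAMPLE_RATE = 24_000   # output sample rate stated under Output Format


def frames_to_ms(frames: int) -> int:
    """Trailing audio in milliseconds for a given --frames-after-eos value."""
    return frames * FRAME_MS


def frames_to_samples(frames: int) -> int:
    """Trailing audio in samples at the 24 kHz output rate."""
    return frames * FRAME_MS * SAMPLE_RATE // 1000


print(frames_to_ms(3))       # 240
print(frames_to_samples(3))  # 5760
```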

Voice Selection (CLI)

# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"

# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"

# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"

Quality Tuning (CLI)

# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav

# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav

# Shorter output (a threshold below the -4.0 default ends speech earlier)
pocket-tts generate --eos-threshold -5.0 --output-path shorter.wav
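
When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags listed under CLI Options; `build_generate_command` is a hypothetical helper, not part of pocket-tts, and the command is only printed here rather than executed.

```python
import shlex


def build_generate_command(text, voice="alba", output_path="./tts_output.wav",
                           temperature=None, lsd_decode_steps=None,
                           eos_threshold=None):
    """Return the argv list for a `pocket-tts generate` call."""
    cmd = ["pocket-tts", "generate",
           "--text", text,
           "--voice", voice,
           "--output-path", output_path]
    optional = {
        "--temperature": temperature,
        "--lsd-decode-steps": lsd_decode_steps,
        "--eos-threshold": eos_threshold,
    }
    # Only pass tuning flags that were set, so CLI defaults apply otherwise.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_generate_command("Welcome to the demo.",
                             output_path="high_quality.wav",
                             temperature=0.5, lsd_decode_steps=5)
print(shlex.join(cmd))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)`; this assumes `pocket-tts` is installed and on the PATH.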

Local Web Server

For quick iteration with multiple voices/texts:

uvx pocket-tts serve
# Open http://localhost:8000

Available Voices

Pre-made voices (use name directly with --voice):

Voice	Gender	License	Description
`alba`	Female	CC BY 4.0	Casual voice
`marius`	Male	CC0	Voice donation
`javert`	Male	CC0	Voice donation
`jean`	Male	CC-NC	EARS dataset
`fantine`	Female	CC BY 4.0	VCTK dataset
`cosette`	Female	CC-NC	Expresso dataset
`eponine`	Female	CC BY 4.0	VCTK dataset
`azelma`	Female	CC BY 4.0	VCTK dataset

Full voice catalog: https://huggingface.co/kyutai/tts-voices

For detailed voice information, see references/voices.md.
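
Because the pre-made voices carry different licenses, a project may want to filter them before choosing one. The sketch below copies the license column from the table above into a dictionary and lists the voices without a non-commercial clause. The helper name is illustrative, and the table is only a summary, so check the voice catalog for the authoritative license terms.

```python
# License column copied from the voice table above.
VOICE_LICENSES = {
    "alba": "CC BY 4.0",
    "marius": "CC0",
    "javert": "CC0",
    "jean": "CC-NC",
    "fantine": "CC BY 4.0",
    "cosette": "CC-NC",
    "eponine": "CC BY 4.0",
    "azelma": "CC BY 4.0",
}


def voices_without_nc(licenses=VOICE_LICENSES):
    """Return voice names whose listed license has no non-commercial clause."""
    return sorted(name for name, lic in licenses.items() if "NC" not in lic)


print(voices_without_nc())
# ['alba', 'azelma', 'eponine', 'fantine', 'javert', 'marius']
```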

Voice Cloning

Clone any voice from an audio sample. For best results:

Use clean audio with minimal background noise
Provide at least 10 seconds of speech
Consider cleaning noisy samples with Adobe Podcast Enhance

pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
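
To check a sample against the 10-second recommendation before cloning, the length of a WAV file can be read with Python's standard library. This is a minimal sketch: the helper names are illustrative, and the `wave` module only reads uncompressed PCM WAV files, so other formats would need a different reader.

```python
import wave

MIN_SECONDS = 10.0  # recommended minimum sample length from the list above


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """True if the sample meets the recommended minimum length."""
    return wav_duration_seconds(path) >= min_seconds


# Example (assumes the file exists):
# print(is_long_enough("./my_recording.wav"))
```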

Output Format

Sample Rate: 24kHz
Channels: Mono
Format: 16-bit PCM WAV
Default location: ./tts_output.wav
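
For code that needs to produce a file in the same format itself, the sketch below writes mono 16-bit PCM at 24 kHz using only the standard library. It assumes the samples are floats in the range [-1.0, 1.0], which is an assumption about the audio data rather than something stated above, and it uses a generated tone as a stand-in for model output. `write_pcm16_wav` is a hypothetical helper.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write floats in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip out-of-range values, then scale to the signed 16-bit range.
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(ints)}h", *ints))


# Stand-in signal: one second of a 440 Hz tone instead of model output.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
write_pcm16_wav("tone_24khz.wav", tone)
```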

Python API

For programmatic use:

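from pathlib import Path

import scipy.io.wavfile
from pocket_tts import TTSModel

# Load the model and build a voice state from a pre-made voice
tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to a specific location, creating the directory first if needed
output_path = Path("./audio/output.wav")
output_path.parent.mkdir(parents=True, exist_ok=True)
scipy.io.wavfile.write(str(output_path), tts_model.sample_rate, audio.numpy())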

TTSModel.load_model()

model = TTSModel.load_model(
    variant="b6369a24",      # Model variant
    temp=0.7,                # Temperature (0.0-1.0)
    lsd_decode_steps=1,      # Generation steps
    noise_clamp=None,        # Max noise value
    eos_threshold=-4.0       # End-of-sequence threshold
)

Voice State

# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")

# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")

# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")

Generate Audio

audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)

Streaming

for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
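
One way to use the streaming loop is to measure how quickly the first chunk arrives and how much audio was produced in total. The sketch below works on any iterable of chunks that support `len()`, which is assumed to hold for the chunks yielded by `generate_audio_stream`. `summarize_stream`, `fake_stream`, and the 1920-sample chunk size are stand-ins for illustration so the example runs without the model.

```python
import time


def summarize_stream(chunks, sample_rate=24_000):
    """Consume a chunk iterator and report latency and total duration."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_samples += len(chunk)
    return {
        "first_chunk_latency_s": first_chunk_latency,
        "total_samples": total_samples,
        "duration_s": total_samples / sample_rate,
    }


def fake_stream(n_chunks=5, chunk_samples=1920):
    """Stand-in generator that yields silent chunks."""
    for _ in range(n_chunks):
        yield [0.0] * chunk_samples


stats = summarize_stream(fake_stream())
print(stats["total_samples"], stats["duration_s"])  # 9600 0.4
```

With the real model, the call would be `summarize_stream(model.generate_audio_stream(voice_state, "Long text..."), model.sample_rate)`.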

Properties

model.sample_rate - 24000 Hz
model.device - "cpu" or "cuda"

Performance

~200ms latency to first audio chunk
~6x real-time on MacBook Air M4 CPU
Uses only 2 CPU cores
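
As a back-of-the-envelope reading of the real-time figure: if generation runs at roughly 6x real time, the compute time is the audio length divided by 6. The function below is only that arithmetic, with the factor as an adjustable assumption, since actual throughput depends on the hardware.

```python
def estimated_generation_seconds(audio_seconds, realtime_factor=6.0):
    """Estimate compute time for a clip, given a real-time speed factor."""
    return audio_seconds / realtime_factor


print(estimated_generation_seconds(60))  # 10.0
```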

Limitations

English only
No built-in pause/silence control
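
One possible workaround for the missing pause control is to generate each sentence separately and join the results with explicit silence. The sketch below shows only the joining step on plain lists of samples. `join_with_pauses` is a hypothetical helper, and whether separately generated segments sound natural when joined is not covered by anything above.

```python
SAMPLE_RATE = 24_000


def join_with_pauses(segments, pause_seconds=0.5, sample_rate=SAMPLE_RATE):
    """Concatenate audio segments with silence inserted between them."""
    silence = [0.0] * int(pause_seconds * sample_rate)
    joined = []
    for index, segment in enumerate(segments):
        if index > 0:
            joined.extend(silence)  # pause before every segment but the first
        joined.extend(segment)
    return joined


parts = [[0.1] * 100, [0.2] * 100]
print(len(join_with_pauses(parts)))  # 12200
```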
