# Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.
## Overview
Piper produces fast, offline TTS suitable for embedded devices. Training involves:
- Corpus preparation (text covering the phonetic range)
- Audio generation or recording
- Quality validation via Whisper transcription
- Fine-tuning from an existing checkpoint (recommended) or training from scratch
- ONNX export for deployment
**Fine-tuning vs from scratch:**
- Fine-tuning: ~1,300 phrases + 1,000 epochs (days on a modest GPU)
- From scratch: ~13,000+ phrases + 2,000+ epochs (weeks to months)
## Workflow
### 1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:
- Use the piper-recording-studio corpus as a base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths
**Critical for non-US English:** ensure the corpus uses correct regional spelling. See the Localisation section below.
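To check that a corpus actually exercises a broad phonetic range, you can phonemise it and count distinct symbols. A minimal sketch, reusing the `phonemize_text` helper shown in the validation step below and assuming a hypothetical `corpus.txt` with one phrase per line (it treats the helper's output as an iterable of phoneme symbols):

```python
from piper_phonemize import phonemize_text  # same helper as in the validation step

# Hypothetical corpus file: one phrase per line
with open("corpus.txt", encoding="utf-8") as f:
    phrases = [line.strip() for line in f if line.strip()]

seen = set()
for phrase in phrases:
    # Accumulate every phoneme symbol the corpus exercises
    seen.update(phonemize_text(phrase, "en-gb"))

print(f"{len(phrases)} phrases covering {len(seen)} distinct phonemes")
```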
### 2. Audio Generation

Generate or record training audio as 22050 Hz mono WAV.
If using voice cloning (e.g., Chatterbox TTS):
- Generate at the source sample rate (often 24 kHz)
- Convert to 22050 Hz: `sox -v 0.95 input.wav -r 22050 -t wav output.wav`
- The `-v 0.95` attenuates the signal slightly to prevent clipping during resampling (see the batch sketch after this list)
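A minimal batch-conversion sketch wrapping the same sox invocation, assuming hypothetical `generated_24khz/` source and `dataset/wavs/` destination directories and that `sox` is on the PATH:

```python
import subprocess
from pathlib import Path

src = Path("generated_24khz")  # hypothetical: voice-clone output at 24 kHz
dst = Path("dataset/wavs")
dst.mkdir(parents=True, exist_ok=True)

for wav in sorted(src.glob("*.wav")):
    # -v 0.95 attenuates slightly so resampling artifacts cannot clip
    subprocess.run(
        ["sox", "-v", "0.95", str(wav), "-r", "22050", "-t", "wav", str(dst / wav.name)],
        check=True,
    )
```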
Recording requirements:
- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not a "reading voice")
### 3. Quality Validation with Whisper

Automate quality checks rather than listening manually:
```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()
    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")
    return expected_phonemes == transcribed_phonemes
```
Retry failed samples up to 3 times (see the driver-loop sketch below). Target 95%+ dataset coverage.
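A sketch of that retry loop, assuming a hypothetical `generate_audio(text, path)` wrapper around your synthesis or recording step and a `corpus` list of `(sample_id, text)` pairs:

```python
MAX_RETRIES = 3

def build_sample(sample_id, text):
    """Generate one sample, re-generating up to MAX_RETRIES times if validation fails."""
    path = f"dataset/wavs/{sample_id}.wav"
    for _ in range(MAX_RETRIES):
        generate_audio(text, path)  # hypothetical synthesis/recording wrapper
        if validate_sample(path, text):
            return True
    return False  # leave for manual review

passed = sum(build_sample(sid, text) for sid, text in corpus)
print(f"coverage: {passed}/{len(corpus)} ({passed / len(corpus):.1%}, target 95%+)")
```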
### 4. Dataset Format (LJSpeech)

Structure your dataset:
```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```
`metadata.csv` format: `{id}|{text}` (pipe-separated, no header row):

```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```
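A sketch that writes the metadata file in this format, assuming the same `corpus` list of `(sample_id, text)` pairs as in the validation sketch above:

```python
# No header row and no quoting: plain pipe-separated id|text lines
with open("dataset/metadata.csv", "w", encoding="utf-8") as f:
    for sample_id, text in corpus:
        f.write(f"{sample_id}|{text}\n")
```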
### 5. Preprocessing

Convert the dataset to PyTorch tensors:
```bash
python3 -m piper_train.preprocess \
  --language en-gb \
  --input-dir dataset/ \
  --output-dir piper_training_dir/ \
  --dataset-format ljspeech
```
Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).
### 6. Training

Fine-tuning (recommended):
```bash
python3 -m piper_train \
  --dataset-dir piper_training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 12 \
  --max_epochs 3000 \
  --resume_from_checkpoint ljspeech-2000.ckpt \
  --checkpoint-epochs 100 \
  --quality high \
  --precision 32
```
Key parameters:
- `--batch-size`: reduce if VRAM-limited (12 works on 8 GB)
- `--resume_from_checkpoint`: start from the LJSpeech high-quality checkpoint
- `--precision 32`: more stable than mixed precision
- `--validation-split 0.0 --num-test-examples 0`: skip validation for small datasets
Monitor training with TensorBoard; watch `loss_disc_all` for convergence.
### 7. ONNX Export
```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```
Create the metadata file `output.onnx.json` from the training `config.json`.
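A sketch of that last step plus a quick load check, assuming the training directory layout used above and that onnxruntime is installed:

```python
import shutil

import onnxruntime

# The voice metadata is the training config, named after the model file
shutil.copy("piper_training_dir/config.json", "output.onnx.json")

# Sanity-check that the optimized model loads, and list its expected inputs
session = onnxruntime.InferenceSession("output.onnx")
print([inp.name for inp in session.get_inputs()])
```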
## Localisation for Australian, New Zealand and UK English
Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.
Corpus preparation:
- Run `scripts/convert_spelling.py` on corpus text before training (a sketch of the idea follows the table below)
- Use the `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms
Common spelling conversions:
| American | Australian/UK |
|----------|---------------|
| -ize     | -ise          |
| -or      | -our          |
| -er      | -re           |
| -og      | -ogue         |
| -ense    | -ence         |
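A minimal sketch of the word-map approach such a script can take (blind suffix rules misfire: "size" is not "sise"); the mapping here is a tiny illustrative subset, not the real script's word list:

```python
import re

# Illustrative subset only; a real conversion list covers thousands of words
US_TO_UK = {
    "organize": "organise",
    "color": "colour",
    "center": "centre",
    "catalog": "catalogue",
    "defense": "defence",
}

def convert_spelling(text: str) -> str:
    def repl(match):
        word = match.group(0)
        uk = US_TO_UK.get(word.lower(), word)
        # Preserve a leading capital ("Color" -> "Colour")
        return uk.capitalize() if word[0].isupper() else uk
    return re.sub(r"[A-Za-z]+", repl, text)

print(convert_spelling("Center the color catalog."))  # -> "Centre the colour catalogue."
```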
Phoneme considerations:
- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)
For complete word lists and phonetic details, see `references/localisation.md`.
Validation: use Whisper with `language="en"` and verify transcriptions match expected regional forms, as sketched below.
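A sketch of that check, flagging transcripts containing common American forms; the pattern is a small illustrative subset, not an exhaustive list:

```python
import re

import whisper

# Illustrative subset of American spellings worth flagging for review
AMERICANISMS = re.compile(r"\b(color|center|organize\w*|\w+ization)\b", re.IGNORECASE)

model = whisper.load_model("base")

def regional_flags(audio_path):
    # Pin the decoding language rather than relying on auto-detection
    result = model.transcribe(audio_path, language="en")
    return AMERICANISMS.findall(result["text"])  # non-empty => review the sample
```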
## Dependencies
Pin versions to avoid API breakage:
```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```
Docker containerisation is recommended for reproducibility.
## Hardware Requirements
Minimum (fine-tuning):
- 8 GB VRAM GPU (Pascal or newer)
- 8 GB system RAM
- ~5 days for 1,000 epochs on a Tesla P4
From scratch: Multiply time by ~200x.
## Troubleshooting
| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce `--batch-size` (try 8 or 4) |
| Checkpoint won't load | Check that the pytorch-lightning version matches the checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check the espeak-ng language code and corpus spelling |