
Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:

  • Corpus preparation (text covering phonetic range)

  • Audio generation or recording

  • Quality validation via Whisper transcription

  • Fine-tuning from existing checkpoint (recommended) or training from scratch

  • ONNX export for deployment

Fine-tuning vs from-scratch:

  • Fine-tuning: ~1,300 phrases + 1,000 epochs (days on modest GPU)

  • From scratch: ~13,000+ phrases + 2,000+ epochs (weeks/months)

Workflow

  1. Corpus Preparation

Gather 1,300-1,500+ phrases covering broad phonetic range:

  • Use piper-recording-studio corpus as base

  • Add domain-specific phrases for your use case

  • Include varied sentence structures and lengths

Critical for non-US English: Ensure corpus uses correct regional spelling. See Localisation.
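Corpus breadth can be sanity-checked with a short script. The sketch below is illustrative only: it uses letter bigrams as a crude proxy for phonetic coverage (a real check would phonemize with espeak-ng), and `bigram_coverage` is a hypothetical helper, not part of any Piper tooling.

```python
from collections import Counter

def bigram_coverage(phrases):
    """Crude proxy for phonetic coverage: count distinct letter bigrams
    across the corpus. Real coverage checks should use espeak-ng phonemes,
    but this catches obviously narrow corpora."""
    bigrams = Counter()
    for phrase in phrases:
        text = "".join(ch for ch in phrase.lower() if ch.isalpha() or ch == " ")
        for word in text.split():
            padded = f"_{word}_"  # mark word boundaries; they matter phonetically
            for i in range(len(padded) - 1):
                bigrams[padded[i:i + 2]] += 1
    return bigrams

corpus = ["The quick brown fox jumps over the lazy dog.",
          "Pack my box with five dozen liquor jugs."]
coverage = bigram_coverage(corpus)
rare = [bg for bg, n in coverage.items() if n == 1]
print(f"{len(coverage)} distinct bigrams, {len(rare)} seen only once")
```

Bigrams seen only once suggest sounds the model will rarely hear during training; adding phrases that repeat them is cheap insurance.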

  2. Audio Generation

Generate or record training audio at 22050Hz mono WAV.

If using voice cloning (e.g., Chatterbox TTS):

  • Generate at source sample rate (often 24kHz)

  • Convert to 22050Hz: sox -v 0.95 input.wav -r 22050 -t wav output.wav

  • The -v 0.95 prevents clipping during resampling
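Batch conversion of a directory of generated clips can be scripted around that same sox invocation. This is a minimal sketch; `sox_cmd` and `convert_all` are hypothetical helper names, and it assumes `sox` is on the PATH.

```python
import subprocess
from pathlib import Path

def sox_cmd(src: Path, dst: Path) -> list:
    # -v 0.95 scales down slightly to avoid clipping during resampling;
    # -r 22050 -t wav set the output rate and format expected by Piper.
    return ["sox", "-v", "0.95", str(src), "-r", "22050", "-t", "wav", str(dst)]

def convert_all(src_dir: str, dst_dir: str) -> None:
    """Resample every WAV in src_dir into dst_dir at 22050Hz."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(sox_cmd(wav, out / wav.name), check=True)
```

`check=True` makes a failed conversion raise immediately rather than silently leaving a gap in the dataset.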

Recording requirements:

  • Consistent microphone position and room acoustics

  • Minimal background noise

  • Natural speaking pace (not reading voice)

  3. Quality Validation with Whisper

Automate quality checks rather than manual listening:

import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()

    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")

    return expected_phonemes == transcribed_phonemes

Retry failed samples up to 3 times. Target 95%+ dataset coverage.
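The retry-and-coverage loop can be sketched as follows. This is an illustrative wrapper, not part of piper_train: the validator is injected as a callable (e.g. `validate_sample` above) so the retry logic can be tested without audio, and the regeneration step is left as a comment.

```python
def validate_dataset(samples, validate, max_retries=3):
    """samples: list of (audio_path, expected_text) pairs.
    validate: callable(audio_path, expected_text) -> bool.
    Returns (passed, failed) lists of audio paths and prints coverage."""
    passed, failed = [], []
    for audio_path, text in samples:
        ok = False
        for _ in range(max_retries):
            if validate(audio_path, text):
                ok = True
                break
            # In a real pipeline, regenerate or re-record the clip here
            # before retrying validation.
        (passed if ok else failed).append(audio_path)
    coverage = len(passed) / max(len(samples), 1)
    print(f"coverage: {coverage:.1%} ({len(failed)} failed)")
    return passed, failed
```

Samples still failing after three attempts are usually better dropped than kept; a mistranscribed clip teaches the model the wrong pronunciation.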

  4. Dataset Format (LJSpeech)

Structure your dataset:

dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...

metadata.csv format: {id}|{text} (pipe-separated, no headers)

sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
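Writing and checking this layout is easy to script. The sketch below assumes the structure described above (pipe-separated `metadata.csv`, clips in `wavs/`); the helper names are illustrative.

```python
from pathlib import Path

def write_metadata(rows, dataset_dir):
    """rows: list of (sample_id, text). Writes LJSpeech-style
    metadata.csv: pipe-separated, no header, one sample per line."""
    path = Path(dataset_dir) / "metadata.csv"
    with open(path, "w", encoding="utf-8") as f:
        for sample_id, text in rows:
            f.write(f"{sample_id}|{text}\n")
    return path

def check_dataset(dataset_dir):
    """Return the sample IDs listed in metadata.csv that have no
    matching WAV file under wavs/."""
    base = Path(dataset_dir)
    missing = []
    with open(base / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            sample_id = line.split("|", 1)[0]
            if not (base / "wavs" / f"{sample_id}.wav").exists():
                missing.append(sample_id)
    return missing
```

Running `check_dataset` before preprocessing catches orphaned metadata rows early, when they are cheap to fix.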

  5. Preprocessing

Convert to PyTorch tensors:

python3 -m piper_train.preprocess \
  --language en-gb \
  --input-dir dataset/ \
  --output-dir piper_training_dir/ \
  --dataset-format ljspeech

Use en-gb for Australian/NZ/UK voices (espeak-ng phoneme set).

  6. Training

Fine-tuning (recommended):

python3 -m piper_train \
  --dataset-dir piper_training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 12 \
  --max_epochs 3000 \
  --resume_from_checkpoint ljspeech-2000.ckpt \
  --checkpoint-epochs 100 \
  --quality high \
  --precision 32

Key parameters:

  • --batch-size: Reduce if VRAM is limited (12 works on 8GB)

  • --resume_from_checkpoint: Start from the LJSpeech high-quality checkpoint

  • --precision 32: More stable than mixed precision

  • --validation-split 0.0 --num-test-examples 0: Skip validation for small datasets

Monitor with TensorBoard: watch loss_disc_all for convergence.

  7. ONNX Export

python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx

Create the metadata file output.onnx.json from the training config.json.
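A minimal sketch of that last step, assuming the training config.json can be written out unmodified next to the model (Piper loads the JSON named after the .onnx file); `write_onnx_metadata` is a hypothetical helper name.

```python
import json
from pathlib import Path

def write_onnx_metadata(config_path, onnx_path):
    """Copy the training config to <model>.onnx.json, where Piper
    looks for it. Edit the loaded dict here if any fields need
    adjusting for deployment (assumption: none do)."""
    dest = Path(str(onnx_path) + ".json")
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    dest.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return dest
```

Round-tripping through `json.loads`/`json.dumps` rather than a raw file copy also validates that the config is well-formed JSON.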

Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.

Corpus preparation:

  • Run scripts/convert_spelling.py on corpus text before training

  • Use en-gb or en-au espeak-ng voice for phonemisation

  • Review generated phonemes for Americanisms

Common spelling conversions:

  American    Australian/UK
  -ize        -ise
  -or         -our
  -er         -re
  -og         -ogue
  -ense       -ence
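The conversion itself is a whole-word substitution rather than a blind suffix rewrite (naive `-or` → `-our` would mangle words like "for"). The sketch below is a tiny illustration of the idea behind scripts/convert_spelling.py, not that script itself; the mapping covers only a handful of example words.

```python
import re

# Minimal illustrative mapping; a real conversion script would carry
# full word lists (see references/localisation.md).
US_TO_UK = {
    "organize": "organise", "color": "colour", "center": "centre",
    "dialog": "dialogue", "defense": "defence",
}

def convert_spelling(text: str) -> str:
    def repl(match):
        word = match.group(0)
        uk = US_TO_UK.get(word.lower())
        if uk is None:
            return word
        # Preserve the capitalisation of the original word
        return uk.capitalize() if word[0].isupper() else uk
    pattern = r"\b(" + "|".join(US_TO_UK) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(convert_spelling("Organize the color dialog."))
# → "Organise the colour dialogue."
```

The `\b` word boundaries keep substrings safe: "dialogue" already contains "dialog" but is left untouched.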

Phoneme considerations:

  • /r/ linking and intrusion patterns differ

  • Vowel sounds in words like "dance", "bath", "castle"

  • Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see references/localisation.md.

Validation: Use Whisper with language="en" and verify transcriptions match expected regional forms.

Dependencies

Pin versions to avoid API breakage:

pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim

Docker containerisation recommended for reproducibility.

Hardware Requirements

Minimum (fine-tuning):

  • 8GB VRAM GPU (Pascal or newer)

  • 8GB system RAM

  • ~5 days for 1,000 epochs on Tesla P4

From scratch: Multiply time by ~200x.

Troubleshooting

  Issue                   Solution
  CUDA OOM                Reduce batch-size (try 8 or 4)
  Checkpoint won't load   Check pytorch-lightning version matches checkpoint
  Garbled output          Insufficient training epochs or dataset too small
  Wrong accent            Check espeak-ng language code and corpus spelling
