parlor-on-device-ai

On-device, real-time multimodal AI voice and vision assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally behind a FastAPI WebSocket server.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.

Install skill "parlor-on-device-ai" with this command: npx skills add aradotso/trending-skills/aradotso-trending-skills-parlor-on-device-ai

Parlor On-Device AI

Skill by ara.so — Daily 2026 Skills collection.

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.

Architecture

Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)

Key features:

  • Silero VAD in browser — hands-free, no push-to-talk
  • Barge-in — interrupt AI mid-sentence by speaking
  • Sentence-level TTS streaming — audio starts before full response is ready
  • Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux

Requirements

  • Python 3.12+
  • macOS with Apple Silicon or Linux with a supported GPU
  • ~3 GB free RAM
  • uv package manager

Installation

git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py

Open http://localhost:8000, grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Configuration

Set environment variables before running:

# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py
Variable     Default                          Description
MODEL_PATH   auto-download from HuggingFace   Path to local .litertlm model file
PORT         8000                             Server port
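
A minimal sketch of how these variables might be read on the server side. The variable names follow the table above; check the actual server.py for the authoritative wiring.

# Hypothetical sketch: mirrors the table above, not necessarily the real server.py
import os
import uvicorn

MODEL_PATH = os.environ.get("MODEL_PATH")    # None -> auto-download from HuggingFace
PORT = int(os.environ.get("PORT", "8000"))   # server port, default 8000

if __name__ == "__main__":
    uvicorn.run("server:app", host="127.0.0.1", port=PORT)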

Project Structure

src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison

Key Components

server.py — FastAPI WebSocket Server

The server handles two WebSocket connections: one for receiving audio and video from the browser, and one for streaming synthesized audio back. The simplified pattern below folds both directions onto a single socket.

# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket
import asyncio

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
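
Barge-in can be layered onto this loop by tracking the in-flight reply and cancelling it when a new utterance arrives. A minimal sketch, reusing the run_gemma_inference and run_tts helpers assumed above; the actual interruption logic in server.py may differ.

# Hypothetical barge-in handling: stop the current reply when new speech arrives
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    current_reply: asyncio.Task | None = None

    async def reply(utterance: bytes):
        text = await run_gemma_inference(utterance)
        for chunk in await run_tts(text):
            await websocket.send_bytes(chunk)

    async for data in websocket.iter_bytes():
        # If the assistant is still speaking, cancel it before handling the new turn
        if current_reply and not current_reply.done():
            current_reply.cancel()
        current_reply = asyncio.create_task(reply(data))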

tts.py — Platform-Aware TTS

tts.py selects the Kokoro backend based on the platform:

# tts.py uses platform detection
import platform
import re

def get_tts_backend():
    if platform.system() == "Darwin":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Naive sentence splitter (stand-in for whatever helper the project actually uses)
def split_sentences(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Sentence-level streaming — yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio

Gemma 4 E2B Inference via LiteRT-LM

# LiteRT-LM inference pattern
from litert_lm import LiteRTLM
import os

model_path = os.environ.get("MODEL_PATH", None)

# Auto-downloads if MODEL_PATH not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg
    
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response
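
Sentence-level streaming can also start one step earlier: instead of waiting for the full response text, each sentence can be synthesized as soon as decoding completes it. A rough sketch under the same assumptions, reusing the model object above and the tts object from tts.py; the actual pipeline in server.py may differ.

# Hypothetical pipelining: speak each sentence as soon as decoding finishes it
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

async def generate_and_speak(inputs: dict, websocket):
    buffer = ""
    async for token in model.generate_stream(**inputs):
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:              # every completed sentence
            await websocket.send_bytes(tts.synthesize(sentence))
        buffer = parts[-1]                       # keep the unfinished tail
    if buffer.strip():                           # flush whatever remains
        await websocket.send_bytes(tts.synthesize(buffer))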

Running Benchmarks

cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py

Performance Reference (Apple M3 Pro)

Stage                               Time
Speech + vision understanding       ~1.8–2.2s
Response generation (~25 tokens)    ~0.3s
Text-to-speech (1–3 sentences)      ~0.3–0.7s
Total end-to-end                    ~2.5–3.0s

Decode speed: ~83 tokens/sec on GPU. At that rate, a ~25-token reply decodes in roughly 0.3s, consistent with the response-generation row above.

Common Patterns

Extending the System Prompt

Modify the prompt in server.py to change the AI's persona or task:

SYSTEM_PROMPT = """You are a helpful language tutor. 
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""

Adding a New Language for TTS

Kokoro supports multiple language codes. Set lang_code in tts.py:

# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish

Customizing VAD Sensitivity (index.html)

The Silero VAD threshold can be tuned in the frontend:

// In index.html — lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // default ~0.8, lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // how quickly it stops detecting speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});

Sending Frames Programmatically (WebSocket Client Example)

import asyncio
import websockets
import json
import base64

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()
        
        await ws.send(json.dumps(payload))
        
        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk

Troubleshooting

Model download fails

# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py

Microphone/camera not working in browser

  • Must access via http://localhost (not IP address) — browsers block media APIs on non-localhost HTTP
  • Check browser permissions: address bar → lock icon → reset permissions

TTS not loading on Linux

# Ensure ONNX runtime is installed
uv add onnxruntime
# Or for GPU:
uv add onnxruntime-gpu

High latency or slow inference

  • Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
  • Close other GPU-heavy applications
  • On Linux, confirm CUDA drivers match installed onnxruntime-gpu version

Port already in use

export PORT=8080
uv run server.py
# Or kill the existing process:
lsof -ti:8000 | xargs kill

uv sync fails — Python version mismatch

# Parlor requires Python 3.12+
python3 --version
# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync

Dependencies (pyproject.toml)

Key packages installed by uv sync:

  • litert-lm — Google AI Edge inference runtime for Gemma
  • fastapi + uvicorn — async web/WebSocket server
  • kokoro — Kokoro TTS ONNX backend
  • kokoro-mlx — Kokoro TTS MLX backend (Mac only)
  • silero-vad — voice activity detection (browser-side via CDN)
  • huggingface-hub — model auto-download
