parlor-on-device-ai

On-device, real-time multimodal AI voice and vision assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally behind a FastAPI WebSocket server.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.

Install skill "parlor-on-device-ai" with this command: npx skills add aradotso/trending-skills/aradotso-trending-skills-parlor-on-device-ai

Parlor On-Device AI

Skill by ara.so — Daily 2026 Skills collection.

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.

Architecture

Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)

Key features:

  • Silero VAD in browser — hands-free, no push-to-talk
  • Barge-in — interrupt AI mid-sentence by speaking
  • Sentence-level TTS streaming — audio starts before full response is ready
  • Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux

Requirements

  • Python 3.12+
  • macOS with Apple Silicon or Linux with a supported GPU
  • ~3 GB free RAM
  • uv package manager

Installation

git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py

Open http://localhost:8000, grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Configuration

Set environment variables before running:

# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py
Variable     Default                          Description
MODEL_PATH   auto-download from HuggingFace   Path to local .litertlm model file
PORT         8000                             Server port
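
A minimal sketch of how these variables might be read on the server side. The variable names follow the table above; check the actual server.py for the authoritative wiring.

# Hypothetical sketch: mirrors the table above, not necessarily the real server.py
import os
import uvicorn

MODEL_PATH = os.environ.get("MODEL_PATH")    # None -> auto-download from HuggingFace
PORT = int(os.environ.get("PORT", "8000"))   # server port, default 8000

if __name__ == "__main__":
    uvicorn.run("server:app", host="127.0.0.1", port=PORT)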

Project Structure

src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison

Key Components

server.py — FastAPI WebSocket Server

The server handles two WebSocket connections: one for receiving audio and video from the browser, and one for streaming synthesized audio back. The simplified pattern below folds both directions onto a single socket.

# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket
import asyncio

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
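
Barge-in can be layered onto this loop by tracking the in-flight reply and cancelling it when a new utterance arrives. A minimal sketch, reusing the run_gemma_inference and run_tts helpers assumed above; the actual interruption logic in server.py may differ.

# Hypothetical barge-in handling: stop the current reply when new speech arrives
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    current_reply: asyncio.Task | None = None

    async def reply(utterance: bytes):
        text = await run_gemma_inference(utterance)
        for chunk in await run_tts(text):
            await websocket.send_bytes(chunk)

    async for data in websocket.iter_bytes():
        # If the assistant is still speaking, cancel it before handling the new turn
        if current_reply and not current_reply.done():
            current_reply.cancel()
        current_reply = asyncio.create_task(reply(data))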

tts.py — Platform-Aware TTS

tts.py selects the Kokoro backend based on the platform:

# tts.py uses platform detection
import platform
import re

def get_tts_backend():
    if platform.system() == "Darwin":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Naive sentence splitter (stand-in for whatever helper the project actually uses)
def split_sentences(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Sentence-level streaming — yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio

Gemma 4 E2B Inference via LiteRT-LM

# LiteRT-LM inference pattern
from litert_lm import LiteRTLM
import os

model_path = os.environ.get("MODEL_PATH", None)

# Auto-downloads if MODEL_PATH not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg
    
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response
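
Sentence-level streaming can also start one step earlier: instead of waiting for the full response text, each sentence can be synthesized as soon as decoding completes it. A rough sketch under the same assumptions, reusing the model object above and the tts object from tts.py; the actual pipeline in server.py may differ.

# Hypothetical pipelining: speak each sentence as soon as decoding finishes it
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

async def generate_and_speak(inputs: dict, websocket):
    buffer = ""
    async for token in model.generate_stream(**inputs):
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:              # every completed sentence
            await websocket.send_bytes(tts.synthesize(sentence))
        buffer = parts[-1]                       # keep the unfinished tail
    if buffer.strip():                           # flush whatever remains
        await websocket.send_bytes(tts.synthesize(buffer))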

Running Benchmarks

cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py

Performance Reference (Apple M3 Pro)

Stage                               Time
Speech + vision understanding       ~1.8–2.2s
Response generation (~25 tokens)    ~0.3s
Text-to-speech (1–3 sentences)      ~0.3–0.7s
Total end-to-end                    ~2.5–3.0s

Decode speed: ~83 tokens/sec on GPU. At that rate, a ~25-token reply decodes in roughly 0.3s, consistent with the response-generation row above.

Common Patterns

Extending the System Prompt

Modify the prompt in server.py to change the AI's persona or task:

SYSTEM_PROMPT = """You are a helpful language tutor. 
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""

Adding a New Language for TTS

Kokoro supports multiple language codes. Set lang_code in tts.py:

# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish

Customizing VAD Sensitivity (index.html)

The Silero VAD threshold can be tuned in the frontend:

// In index.html — lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // default ~0.8, lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // how quickly it stops detecting speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});

Sending Frames Programmatically (WebSocket Client Example)

import asyncio
import websockets
import json
import base64

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()
        
        await ws.send(json.dumps(payload))
        
        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk

Troubleshooting

Model download fails

# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py

Microphone/camera not working in browser

  • Must access via http://localhost (not IP address) — browsers block media APIs on non-localhost HTTP
  • Check browser permissions: address bar → lock icon → reset permissions

TTS not loading on Linux

# Ensure ONNX runtime is installed
uv add onnxruntime
# Or for GPU:
uv add onnxruntime-gpu

High latency or slow inference

  • Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
  • Close other GPU-heavy applications
  • On Linux, confirm CUDA drivers match installed onnxruntime-gpu version

Port already in use

export PORT=8080
uv run server.py
# Or kill the existing process:
lsof -ti:8000 | xargs kill

uv sync fails — Python version mismatch

# Parlor requires Python 3.12+
python3 --version
# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync

Dependencies (pyproject.toml)

Key packages installed by uv sync:

  • litert-lm — Google AI Edge inference runtime for Gemma
  • fastapi + uvicorn — async web/WebSocket server
  • kokoro — Kokoro TTS ONNX backend
  • kokoro-mlx — Kokoro TTS MLX backend (Mac only)
  • silero-vad — voice activity detection (browser-side via CDN)
  • huggingface-hub — model auto-download
