# MLX Local Inference Stack

Full local AI inference on Apple Silicon Macs. All services expose OpenAI-compatible APIs.
## Services Overview
| Service | Port | Access | Models |
|---|---|---|---|
| LLM + Whisper + Embedding | 8787 | LAN (0.0.0.0) | qwen3-14b, gemma-3-12b, whisper-large-v3-turbo, qwen3-embedding-0.6b/4b |
| ASR (Qwen3-ASR) | 8788 | localhost only | Qwen3-ASR-1.7B-8bit |
| Transcribe Daemon | — | file-based | Uses ASR + LLM |
LaunchAgents: `com.mlx-server` (8787), `com.mlx-audio-server` (8788), `com.mlx-transcribe-daemon`
## 1. LLM — Local Chat Completions

### Models

| Model ID | Params | Best For |
|---|---|---|
| `qwen3-14b` | 14B 4bit | Chinese, deep reasoning (built-in think mode) |
| `gemma-3-12b` | 12B 4bit | English, code generation |
### API

```bash
curl -X POST http://localhost:8787/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 2048
  }'
```

Add `"stream": true` for streaming.
### Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
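The same client also streams; a minimal sketch, assuming the server emits standard OpenAI-style chunks when `stream=True` is set:

```python
stream = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```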
### Qwen3 Think Mode

Qwen3 may include `<think>...</think>` chain-of-thought tags. Strip them:

```python
import re

text = re.sub(r'<think>.*?</think>\s*', '', text, flags=re.DOTALL)
```
### Model Selection Guide

| Scenario | Recommended |
|---|---|
| Chinese text | `qwen3-14b` |
| Cantonese | `qwen3-14b` |
| English writing | `gemma-3-12b` |
| Code generation | Either |
| Deep reasoning | `qwen3-14b` (think mode) |
| Quick Q&A | `gemma-3-12b` |
## 2. ASR — Speech-to-Text

### Qwen3-ASR (best for Chinese/Cantonese)

```bash
curl -X POST http://127.0.0.1:8788/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=mlx-community/Qwen3-ASR-1.7B-8bit" \
  -F "language=zh"
```
### Whisper (multilingual, 99 languages)

```bash
curl -X POST http://localhost:8787/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3-turbo"
```
### ASR Model Comparison

| | Qwen3-ASR (port 8788) | Whisper (port 8787) |
|---|---|---|
| Chinese/Cantonese | Strong | Average |
| Multilingual | No | Yes (99 langs) |
| LAN access | No (localhost) | Yes |
| Loading | On-demand | Always loaded |
### Supported audio formats

`wav`, `mp3`, `m4a`, `flac`, `ogg`, `webm`
### Long audio

Split into 10-min chunks first:

```bash
ffmpeg -y -ss 0 -t 600 -i long.wav -ar 16000 -ac 1 chunk_000.wav
```
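A loop version of the same idea, sketched in Python: `ffprobe` (ships with ffmpeg) reports the duration, then each 10-minute chunk is cut and sent to the Whisper endpoint above. The `transcribe_long` helper is illustrative, not part of the stack:

```python
import subprocess
import requests

CHUNK = 600  # seconds, matching the ffmpeg example above

def transcribe_long(path: str) -> str:
    # Probe total duration of the input file
    dur = float(subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        check=True, capture_output=True, text=True).stdout.strip())
    parts = []
    for i, start in enumerate(range(0, int(dur), CHUNK)):
        chunk = f"chunk_{i:03d}.wav"
        # Cut one 16 kHz mono chunk starting at `start`
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(CHUNK),
             "-i", path, "-ar", "16000", "-ac", "1", chunk],
            check=True, capture_output=True)
        with open(chunk, "rb") as f:
            r = requests.post("http://localhost:8787/v1/audio/transcriptions",
                              files={"file": f},
                              data={"model": "whisper-large-v3-turbo"})
        r.raise_for_status()
        parts.append(r.json()["text"])
    return "\n".join(parts)
```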
## 3. Embeddings — Text Vectorization

### Models

| Model ID | Size | Use Case |
|---|---|---|
| `qwen3-embedding-0.6b` | 0.6B 4bit | Fast retrieval, low latency |
| `qwen3-embedding-4b` | 4B 4bit | High-accuracy semantic matching |
### API

```bash
curl -X POST http://localhost:8787/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embedding-0.6b", "input": "text to embed"}'
```
### Batch

```bash
curl -X POST http://localhost:8787/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embedding-4b", "input": ["text 1", "text 2"]}'
```
## 4. OCR — Image Text Extraction

### Default Model: PaddleOCR-VL-1.5-6bit

| Item | Value |
|---|---|
| Model ID | `paddleocr-vl-6bit` |
| Speed | ~185 t/s |
| Memory | ~3.3 GB |
| Prompt | `OCR:` |
### CLI

Use the venv's interpreter directly (`cd` alone does not activate a venv):

```bash
~/.mlx-server/venv/bin/python -m mlx_vlm.generate \
  --model mlx-community/PaddleOCR-VL-1.5-6bit \
  --image image.jpg \
  --prompt "OCR:" \
  --max-tokens 512 --temp 0.0
```
### Python

```python
from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("mlx-community/PaddleOCR-VL-1.5-6bit")
config = load_config("mlx-community/PaddleOCR-VL-1.5-6bit")
prompt = apply_chat_template(processor, config, "OCR:", num_images=1)
out = generate(model, processor, prompt, "image.jpg",
               max_tokens=512, temperature=0.0, verbose=False)
print(out.text if hasattr(out, "text") else out)
```
### Notes

- Prompt must be exactly `OCR:` for PaddleOCR-VL
- `temperature=0.0` for deterministic output
- RGBA images must be converted to RGB first (see the sketch below)
- Venv: `~/.mlx-server/venv`
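A minimal pre-processing sketch for the RGBA case, using Pillow (an assumption; any library that flattens the alpha channel works):

```python
from PIL import Image

img = Image.open("image.png")
if img.mode == "RGBA":
    # Flatten onto white; OCR models generally expect 3-channel input
    background = Image.new("RGB", img.size, (255, 255, 255))
    background.paste(img, mask=img.split()[3])  # alpha channel as mask
    img = background
img.save("image_rgb.jpg")
```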
## 5. TTS — Text-to-Speech

### Model: Qwen3-TTS (cached, not auto-served)

| Item | Value |
|---|---|
| Model | `Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit` |
| Memory | ~2 GB |
| Feature | Custom voice cloning |
### CLI

```bash
# The sample text means "Hello, this is a test speech clip"
~/.mlx-server/venv/bin/mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
  --text "你好,这是一段测试语音"
```
### As API (via mlx_audio.server on port 8788)

```bash
curl -X POST http://127.0.0.1:8788/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "input": "你好世界"
  }' --output speech.wav
```
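The same call from Python, sketched with `requests` (assuming the endpoint returns raw WAV bytes, as `--output speech.wav` above suggests):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8788/v1/audio/speech",
    json={
        "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
        "input": "你好世界",  # "Hello, world"
    },
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```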
## 6. Transcribe Daemon — Automatic Batch Transcription

Drop audio files into `~/transcribe/` for automatic processing (a simplified sketch of the loop follows the list):

- Daemon detects file (polls every 15s)
- Phase 1: Transcribe via Qwen3-ASR → `filename_raw.md`
- Phase 2: Correct via Qwen3-14B LLM → `filename_corrected.md`
- Move results to `~/transcribe/done/`
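A heavily simplified sketch of the daemon's two-phase loop; the system prompt wording and error handling are illustrative, and the real daemon is the installed LaunchAgent:

```python
import shutil
import time
from pathlib import Path

import requests

WATCH = Path.home() / "transcribe"
DONE = WATCH / "done"
EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}

def process(audio: Path) -> None:
    # Phase 1: raw transcript from Qwen3-ASR (localhost-only port)
    with audio.open("rb") as f:
        r = requests.post(
            "http://127.0.0.1:8788/v1/audio/transcriptions",
            files={"file": f},
            data={"model": "mlx-community/Qwen3-ASR-1.7B-8bit", "language": "zh"},
        )
    r.raise_for_status()
    raw = r.json()["text"]
    audio.with_name(audio.stem + "_raw.md").write_text(raw)

    # Phase 2: LLM correction pass (system prompt here is illustrative)
    r = requests.post(
        "http://localhost:8787/v1/chat/completions",
        json={
            "model": "qwen3-14b",
            "messages": [
                {"role": "system", "content": "Fix homophones, add punctuation "
                 "and paragraphs, keep Cantonese characters, remove filler words."},
                {"role": "user", "content": raw},
            ],
        },
    )
    r.raise_for_status()
    corrected = r.json()["choices"][0]["message"]["content"]
    audio.with_name(audio.stem + "_corrected.md").write_text(corrected)

    DONE.mkdir(exist_ok=True)
    shutil.move(str(audio), str(DONE / audio.name))

while True:
    for f in sorted(WATCH.glob("*")):
        if f.suffix.lower() in EXTS:
            process(f)
    time.sleep(15)  # documented poll interval
```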
### LLM Correction Rules
- Fix homophone errors (的/得/地, 在/再)
- Preserve Cantonese characters (嘅、唔、咁、喺、冇、佢)
- Add punctuation and paragraphs
- Remove filler words
### Supported formats

`wav`, `mp3`, `m4a`, `flac`, `ogg`, `webm`
## Service Management

```bash
# LLM + Whisper + Embedding server (port 8787)
launchctl kickstart -k gui/$(id -u)/com.mlx-server

# ASR server (port 8788)
launchctl kickstart -k gui/$(id -u)/com.mlx-audio-server

# Transcribe daemon
launchctl kickstart gui/$(id -u)/com.mlx-transcribe-daemon

# Logs
tail -f ~/.mlx-server/logs/server.log
tail -f ~/.mlx-server/logs/mlx-audio-server.err.log
tail -f ~/.mlx-server/logs/transcribe-daemon.err.log
```
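After a restart, listing models makes a quick liveness check, assuming the server exposes the standard OpenAI `GET /v1/models` route (an assumption, not confirmed here):

```python
import requests

# Model listing doubles as a health check for the OpenAI-compatible server
r = requests.get("http://localhost:8787/v1/models", timeout=5)
r.raise_for_status()
print([m["id"] for m in r.json()["data"]])
```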
## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+ with `mlx`, `mlx-lm`, `mlx-audio`, `mlx-vlm`
- Recommended: 32 GB+ RAM for running multiple models