Linux AI Server — Headless AI Inference Cluster
Turn your Linux servers into a distributed AI inference cluster. No GUI, no Docker, no Kubernetes — just Linux + pip install. Your rack-mounted servers, cloud VMs, and spare Linux boxes all serve AI through one endpoint.
Why Linux AI server
- Zero GUI overhead — headless Linux AI uses all resources for inference, not desktops
- systemd native — Linux AI server starts on boot, restarts on failure, logs to journald
- SSH management — manage your Linux AI server cluster entirely over SSH
- Any Linux distro — Ubuntu, Debian, RHEL, Fedora, Arch, Alpine — if it runs Ollama, it joins the fleet
- NVIDIA CUDA — Linux AI server uses NVIDIA GPUs natively. No compatibility issues.
- Fleet routing — multiple Linux AI servers share the load. 7-signal scoring picks the best one.
Linux AI server setup
Quick install on each Linux server
# Install Ollama on Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Install the Linux AI router
pip install ollama-herd
Linux AI server router (pick one server)
herd # start Linux AI server router on port 11435
herd-node # register this Linux AI server
Linux AI server nodes (all other servers)
herd-node # auto-discovers the Linux AI server router
# Or explicit: herd-node --router-url http://router-ip:11435
Linux AI server systemd services
# /etc/systemd/system/herd-router.service
[Unit]
Description=Linux AI Server Router
After=network.target ollama.service
[Service]
Type=simple
ExecStart=/usr/local/bin/herd
Restart=always
RestartSec=5
User=ollama
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/herd-node.service
[Unit]
Description=Linux AI Server Node
After=network.target ollama.service
[Service]
Type=simple
ExecStart=/usr/local/bin/herd-node
Restart=always
RestartSec=5
User=ollama
[Install]
WantedBy=multi-user.target
sudo systemctl enable --now herd-router # on the Linux AI router
sudo systemctl enable --now herd-node # on all Linux AI nodes
Linux AI server hardware guide
| Linux AI Server | GPU | RAM | Best Linux AI models |
|---|---|---|---|
| Rack server (NVIDIA A100) | 80GB | 256GB | deepseek-v3, qwen3.5:72b — frontier |
| Rack server (NVIDIA L40S) | 48GB | 128GB | llama3.3:70b, qwen3.5:32b |
| Desktop server (RTX 4090) | 24GB | 64GB | llama3.3:70b (Q4), deepseek-r1:32b |
| Mini PC / NUC (no GPU) | CPU | 32GB | phi4, gemma3:12b — CPU inference |
| Cloud VM (no GPU) | CPU | 16GB | phi4-mini, gemma3:4b |
| Raspberry Pi 5 | CPU | 8GB | gemma3:1b, phi4-mini — edge AI |
Linux AI server works with NVIDIA CUDA GPUs, AMD ROCm (experimental), and CPU-only inference.
Use your Linux AI server
OpenAI SDK
from openai import OpenAI
# Your Linux AI server endpoint
client = OpenAI(base_url="http://linux-ai-server:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
model="llama3.3:70b",
messages=[{"role": "user", "content": "Write a Terraform module for AWS ECS"}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or "", end="")
curl from any machine
# Hit your Linux AI server from anywhere on the network
curl http://linux-ai-server:11435/api/chat -d '{
"model": "codestral",
"messages": [{"role": "user", "content": "Write a Dockerfile for a FastAPI app"}],
"stream": false
}'
Linux AI server environment
# Optimize Linux AI server Ollama
sudo systemctl edit ollama
# Add under [Service]:
# Environment="OLLAMA_KEEP_ALIVE=-1"
# Environment="OLLAMA_MAX_LOADED_MODELS=-1"
# Environment="OLLAMA_NUM_PARALLEL=2"
sudo systemctl restart ollama
Linux AI server firewall
# UFW (Ubuntu/Debian)
sudo ufw allow 11435/tcp
# firewalld (RHEL/Fedora)
sudo firewall-cmd --add-port=11435/tcp --permanent && sudo firewall-cmd --reload
Linux AI server monitoring
# Linux AI server fleet status
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# Linux AI server health — 15 automated checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
# Linux AI server traces — recent requests
curl -s "http://localhost:11435/dashboard/api/traces?limit=10" | python3 -m json.tool
# Linux AI server logs
journalctl -u herd-router -f
tail -f ~/.fleet-manager/logs/herd.jsonl.$(date +%Y-%m-%d)
Dashboard at http://linux-ai-server:11435/dashboard — access from any browser on the network.
Also available on Linux AI server
Image generation
curl http://localhost:11435/api/generate-image \
-d '{"model": "z-image-turbo", "prompt": "server rack visualization", "width": 1024, "height": 1024}'
Embeddings
curl http://localhost:11435/api/embed \
-d '{"model": "nomic-embed-text", "input": "Linux AI server headless inference"}'
Full documentation
Contribute
Ollama Herd is open source (MIT). Linux server admins welcome:
Guardrails
- Linux AI server model downloads require explicit user confirmation.
- Linux AI server model deletion requires explicit user confirmation.
- Never delete or modify files in
~/.fleet-manager/. - No models are downloaded automatically — all pulls are user-initiated or require opt-in.