Linux Ollama — Fleet Routing for Ollama on Linux
Run Ollama on Linux with multi-machine load balancing. Linux Ollama Herd turns multiple Linux machines into one smart Ollama endpoint. Your server rack, your desktop, your edge device — all serving AI through one Linux Ollama URL.
Linux Ollama setup
Step 1: Install Ollama on Linux
curl -fsSL https://ollama.ai/install.sh | sh
Step 2: Install Linux Ollama Herd
pip install ollama-herd
Step 3: Start the Linux Ollama router
On one Linux machine (your router):
herd # starts Linux Ollama router on port 11435
herd-node # registers this Linux machine
On every other Linux machine:
herd-node # auto-discovers the Linux Ollama router via mDNS
No mDNS? Connect Linux nodes directly:
herd-node --router-url http://router-ip:11435
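To confirm every node registered, query the router's fleet status endpoint (the same endpoint used in the monitoring section below):
# List registered Linux nodes
curl -s http://localhost:11435/fleet/status | python3 -m json.tool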
Linux Ollama systemd integration
Run Linux Ollama Herd as a systemd service for automatic startup:
# /etc/systemd/system/ollama-herd.service
[Unit]
Description=Linux Ollama Herd Router
After=network.target ollama.service
[Service]
Type=simple
ExecStart=/usr/local/bin/herd
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable ollama-herd
sudo systemctl start ollama-herd
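Confirm the router came up, and follow its logs, with standard systemd tooling:
systemctl status ollama-herd
journalctl -u ollama-herd -f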
Node agent as a Linux systemd service:
# /etc/systemd/system/ollama-herd-node.service
[Unit]
Description=Linux Ollama Herd Node Agent
After=network.target ollama.service
[Service]
Type=simple
ExecStart=/usr/local/bin/herd-node
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
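Reload systemd and enable the node agent the same way:
sudo systemctl daemon-reload
sudo systemctl enable ollama-herd-node
sudo systemctl start ollama-herd-node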
Use Linux Ollama
OpenAI SDK
from openai import OpenAI
# Your Linux Ollama fleet
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Write a systemd service file for a Python API"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
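The fleet also answers plain HTTP without the SDK. A minimal curl sketch, assuming the standard OpenAI-compatible /v1/chat/completions route behind the /v1 base URL used above:
# assumes the OpenAI-compatible /v1/chat/completions route
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [{"role": "user", "content": "Explain cgroups in one paragraph"}]
  }'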
curl (Ollama format)
# Linux Ollama inference
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3.5:32b",
  "messages": [{"role": "user", "content": "Explain Linux process scheduling"}],
  "stream": false
}'
Linux Ollama environment setup
# Optimize Linux Ollama performance via systemd
sudo systemctl edit ollama
# Add under [Service]:
# Environment="OLLAMA_KEEP_ALIVE=-1"          # keep models loaded in memory indefinitely
# Environment="OLLAMA_MAX_LOADED_MODELS=-1"   # no cap on concurrently loaded models
# Environment="OLLAMA_NUM_PARALLEL=2"         # serve two requests per model in parallel
sudo systemctl restart ollama
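Verify the overrides took effect after the restart:
systemctl show ollama --property=Environment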
Or via shell profile (this affects only an ollama serve started from your shell, not the systemd service):
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.bashrc
echo 'export OLLAMA_MAX_LOADED_MODELS=-1' >> ~/.bashrc
source ~/.bashrc
Linux Ollama GPU support
| Linux GPU | VRAM | Best Linux Ollama models |
|---|---|---|
| NVIDIA RTX 4090 | 24GB | llama3.3:70b, qwen3.5:32b |
| NVIDIA A100 | 40/80GB | deepseek-v3, qwen3.5:72b |
| NVIDIA L40S | 48GB | llama3.3:70b (full precision) |
| AMD ROCm (experimental) | varies | depends on available VRAM |
| CPU only | system RAM | phi4-mini, gemma3:1b — slower but works |
Linux Ollama supports NVIDIA CUDA, experimental AMD ROCm, and CPU-only inference.
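To see how much VRAM a node actually has before choosing a model (NVIDIA; rocm-smi gives a similar readout on AMD):
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv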
Linux Ollama firewall
# UFW (Ubuntu/Debian)
sudo ufw allow 11435/tcp
# firewalld (RHEL/Fedora)
sudo firewall-cmd --add-port=11435/tcp --permanent
sudo firewall-cmd --reload
# iptables
sudo iptables -A INPUT -p tcp --dport 11435 -j ACCEPT
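Confirm the router is listening on the opened port:
ss -tlnp | grep 11435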
Monitor Linux Ollama
# Linux Ollama fleet status
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# Linux Ollama health — 15 automated checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
# Models on Linux Ollama nodes
curl -s http://localhost:11435/api/ps | python3 -m json.tool
Dashboard at http://localhost:11435/dashboard — live Linux Ollama monitoring.
Linux Ollama logs
# JSONL structured logs
tail -f ~/.fleet-manager/logs/herd.jsonl.$(date +%Y-%m-%d) | python3 -m json.tool --json-lines
# Check for Linux Ollama errors
grep '"level":"ERROR"' ~/.fleet-manager/logs/herd.jsonl.$(date +%Y-%m-%d)
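A sketch that scans the past week of Linux Ollama logs for errors, assuming the date-suffixed file naming shown above and GNU date:
# assumes herd.jsonl.YYYY-MM-DD naming and GNU date
for d in $(seq 0 6); do
  f="$HOME/.fleet-manager/logs/herd.jsonl.$(date -d "-$d day" +%Y-%m-%d)"
  [ -f "$f" ] && grep '"level":"ERROR"' "$f"
done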
Also available on Linux Ollama
Image generation
curl http://localhost:11435/api/generate-image \
-d '{"model": "z-image-turbo", "prompt": "Linux penguin in cyberspace", "width": 1024, "height": 1024}'
Embeddings
curl http://localhost:11435/api/embed \
-d '{"model": "nomic-embed-text", "input": "Linux Ollama local inference"}'
Contribute
Ollama Herd is open source (MIT). Contributions from Linux Ollama users are welcome.
Guardrails
- Linux Ollama model downloads require explicit user confirmation.
- Linux Ollama model deletion requires explicit user confirmation.
- Never delete or modify files in ~/.fleet-manager/.
- No models are downloaded automatically; all pulls are user-initiated or require opt-in.