RTX Local AI — Your Gaming PC Is an AI Server
Your RTX GPU already runs games at 4K. Put it to work running LLMs too. An RTX 4090 with 24GB of VRAM can run 70B-parameter models at Q4 quantization (with partial CPU offload). An RTX 4080 with 16GB runs 14B-34B models fast. Stack multiple RTX PCs into a fleet and route AI requests to the best available RTX GPU.
RTX GPU model guide
| RTX GPU | vRAM | Best RTX models | RTX performance |
|---|---|---|---|
| RTX 4090 | 24GB | llama3.3:70b (Q4), qwen3.5:32b, deepseek-r1:32b | Top consumer RTX; runs 70B (Q4) with partial CPU offload |
| RTX 4080 | 16GB | qwen3.5:14b, phi4, codestral, mistral-nemo | RTX sweet spot for most tasks |
| RTX 4070 Ti | 12GB | phi4, gemma3:12b, llama3.2:3b | Budget RTX with solid performance |
| RTX 4070 | 12GB | phi4-mini, gemma3:4b, qwen3.5:7b | Entry-level RTX for local AI |
| RTX 3090 | 24GB | Same as RTX 4090 | Last-gen RTX, still great for AI |
| RTX 3080 | 10GB | phi4-mini, llama3.2:3b | Older RTX, lightweight models |
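How much VRAM a model needs follows from simple arithmetic: a Q4-quantized model stores roughly 4 bits per weight, so the weights alone take about 0.5 GB per billion parameters, plus headroom for the KV cache and runtime. A minimal sketch of that rule of thumb (the 4-bit math is exact; the 1.2x overhead factor is an assumption, not a measured constant):

```python
def fits_in_vram(params_b: float, vram_gb: float, bits: int = 4) -> bool:
    """Rough check: does a quantized model fit entirely in VRAM?

    params_b : model size in billions of parameters
    bits     : quantization width (Q4 -> 4 bits per weight)
    The 1.2x multiplier is an assumed allowance for KV cache and
    runtime overhead, not a measured constant.
    """
    weights_gb = params_b * bits / 8  # 70B at Q4 ~= 35 GB of weights
    return weights_gb * 1.2 <= vram_gb

print(fits_in_vram(14, 16))  # True  -- qwen3.5:14b fits an RTX 4080
print(fits_in_vram(32, 24))  # True  -- a 32B model (Q4) fits an RTX 4090
print(fits_in_vram(70, 24))  # False -- 70B (Q4) needs partial CPU offload
```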
Cross-platform: RTX Local AI works on Windows and Linux. Most RTX gaming PCs run Windows — that's fine.
Set up your RTX AI server
```bash
pip install ollama-herd  # PyPI: https://pypi.org/project/ollama-herd/
```
Single RTX gaming PC
```bash
herd       # start the RTX router
herd-node  # register this RTX machine
```
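To confirm the router and node are talking, list the models the router exposes. This assumes the OpenAI-compatible surface at /v1 (used throughout this README) includes the standard model-listing endpoint:

```python
from openai import OpenAI

# Point the stock OpenAI SDK at the local herd router.
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

# Standard OpenAI model listing; if this prints your pulled models,
# the router and node registered correctly.
for model in client.models.list():
    print(model.id)
```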
Multiple RTX PCs (RTX fleet)
On one RTX PC (the router):
```bash
herd
herd-node
```
On every other RTX PC:
```bash
herd-node  # auto-discovers the RTX router via mDNS
```
That's it. Every RTX PC in your fleet now shares the AI workload.
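To double-check that every node actually joined, you can query the /fleet/status endpoint from the monitoring section below. A minimal sketch; the JSON schema is herd-specific, so this just pretty-prints whatever the router returns:

```python
import json
import urllib.request

# Same endpoint as the monitoring section; no extra dependencies needed.
with urllib.request.urlopen("http://localhost:11435/fleet/status") as resp:
    status = json.load(resp)

# Schema is herd-specific -- dump it to inspect the registered nodes.
print(json.dumps(status, indent=2))
```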
Use your RTX for AI
OpenAI SDK
```python
from openai import OpenAI

# Your RTX GPU serves this
rtx_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

# RTX 4090 handles 70B models (Q4, partially offloaded)
response = rtx_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Write a game engine ECS system in Rust"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```
RTX-powered code generation
```python
# Your RTX runs Codestral for code
response = rtx_client.chat.completions.create(
    model="codestral",
    messages=[{"role": "user", "content": "Optimize this HLSL shader for RTX ray tracing"}],
)
print(response.choices[0].message.content)
```
curl
```bash
# RTX inference
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3.5:32b",
  "messages": [{"role": "user", "content": "Explain GPU memory architecture"}],
  "stream": false
}'
```
RTX vs cloud — cost comparison
| Option | Monthly cost | RTX advantage |
|---|---|---|
| RTX 4090 (one-time $1,599) | ~$0/month (electricity only) | Your RTX runs unlimited inference forever |
| Cloud A100 (AWS) | $3.06/hour (~$2,200/month) | RTX pays for itself in weeks |
| OpenAI GPT-4o API | ~$100-500/month at scale | RTX has zero per-token cost |
| RTX 4080 (one-time $1,199) | ~$0/month (electricity only) | Even budget RTX beats cloud |
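The payback period is plain arithmetic: divide the one-time card price by the monthly cloud spend it replaces. Using the table's figures, with $300 taken as an illustrative midpoint of the $100-500 API range and electricity excluded:

```python
# Months until a one-time GPU purchase matches recurring cloud spend.
def breakeven_months(gpu_price: float, cloud_per_month: float) -> float:
    return gpu_price / cloud_per_month

print(breakeven_months(1599, 2200))  # RTX 4090 vs cloud A100: ~0.7 months
print(breakeven_months(1199, 300))   # RTX 4080 vs API spend:  ~4 months
```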
Monitor your RTX fleet
```bash
# RTX fleet overview
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# Check RTX GPU health
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

# Models loaded on RTX GPUs
curl -s http://localhost:11435/api/ps | python3 -m json.tool
```
Dashboard at http://localhost:11435/dashboard — live RTX performance monitoring.
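For unattended monitoring, the same health endpoint can be polled from a script. A minimal sketch that only checks reachability; parsing per-GPU fields would depend on the herd-specific response schema:

```python
import time
import urllib.request
from urllib.error import URLError

HEALTH_URL = "http://localhost:11435/dashboard/api/health"

# Poll every minute and flag outages. Reachability check only --
# the response body's schema is herd-specific.
while True:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            print("fleet healthy" if resp.status == 200 else f"HTTP {resp.status}")
    except URLError as exc:
        print(f"fleet unreachable: {exc}")
    time.sleep(60)
```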
Optimize Ollama for RTX
Keep models loaded in your RTX VRAM permanently:
```powershell
# Windows (most RTX gaming PCs)
[System.Environment]::SetEnvironmentVariable("OLLAMA_KEEP_ALIVE", "-1", "User")
[System.Environment]::SetEnvironmentVariable("OLLAMA_MAX_LOADED_MODELS", "-1", "User")
# Restart Ollama from the system tray
```

```bash
# Linux
sudo systemctl edit ollama
# Add: Environment="OLLAMA_KEEP_ALIVE=-1"
# Add: Environment="OLLAMA_MAX_LOADED_MODELS=-1"
sudo systemctl restart ollama
```
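Alternatively, Ollama's chat API accepts a per-request keep_alive field, where -1 means hold the model in VRAM indefinitely. Sending it through the herd router assumes the router forwards the field to the node unchanged:

```python
import json
import urllib.request

# keep_alive: -1 asks Ollama to keep this model loaded indefinitely;
# we assume the herd router passes the field through unchanged.
payload = {
    "model": "phi4",
    "messages": [{"role": "user", "content": "warm up"}],
    "stream": False,
    "keep_alive": -1,
}
req = urllib.request.Request(
    "http://localhost:11435/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["message"]["content"])
```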
Also available on your RTX fleet
Image generation
```bash
curl http://localhost:11435/api/generate-image \
  -d '{"model": "z-image-turbo", "prompt": "RTX-powered cyberpunk cityscape", "width": 1024, "height": 1024}'
```
Embeddings
```bash
curl http://localhost:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": "NVIDIA RTX local AI inference"}'
```
Full documentation
Contribute
Ollama Herd is open source (MIT). RTX gamers and AI builders are welcome.
Guardrails
- RTX model downloads require explicit user confirmation — models range from 1GB to 400GB+.
- RTX model deletion requires explicit user confirmation.
- Never delete or modify files in `~/.fleet-manager/`.
- No models are downloaded automatically; all pulls are user-initiated or require opt-in.