Local LLM Provider
Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill lets the agent run inference on local GPU/CPU hardware and stay reliable by falling back to the next provider when a local server does not respond.
When to Use
- Running LLM inference locally for privacy (data never leaves your machine)
- Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
- Reducing API costs for high-volume tasks
- Working offline or with intermittent connectivity
- Needing low-latency responses for interactive tasks
Setup
No additional setup is required if Ollama is already running. Otherwise, install and start one of the servers below:
Ollama Setup
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.2
# Start the server (default: http://localhost:11434)
ollama serve
llama.cpp Server Setup
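# Build llama-server from a llama.cpp checkout (current versions build with CMake; the binary is written to build/bin/)
cmake -B build
cmake --build build --config Release --target llama-server
# Start the server
./build/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080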
vLLM Server Setup
# Install vLLM
pip install vllm
# Start the server (default: http://localhost:8000)
vllm serve meta-llama/Llama-3.1-8B-Instruct
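Once a server is started, it can be worth confirming that it answers before pointing the skill at it. The sketch below is not the skill's health.js, whose internals are not shown here. It is a hypothetical `probe` helper that assumes Node 18+ (global `fetch`), Ollama's `GET /api/tags` endpoint, and the `GET /health` endpoint exposed by llama.cpp's server and by vLLM, all on their default ports.

```javascript
// Hypothetical helper, not part of the skill: probes each local server once.
// Assumed endpoints: Ollama GET /api/tags, llama.cpp and vLLM GET /health.
const HEALTH_URLS = {
  ollama: 'http://localhost:11434/api/tags',
  llamacpp: 'http://localhost:8080/health',
  vllm: 'http://localhost:8000/health',
};

// fetchImpl is injectable so the function can be exercised without a server.
async function probe(name, fetchImpl = fetch, timeoutMs = 3000) {
  const url = HEALTH_URLS[name];
  if (!url) throw new Error(`Unknown provider: ${name}`);
  try {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { name, url, healthy: res.ok, status: res.status };
  } catch (err) {
    // Connection refused, DNS failure, and timeouts all land here.
    return { name, url, healthy: false, error: err.message };
  }
}

// Probe all three servers in parallel and return one result per server.
async function probeAll(fetchImpl = fetch) {
  return Promise.all(Object.keys(HEALTH_URLS).map((n) => probe(n, fetchImpl)));
}
```

Calling `probeAll().then(console.table)` would print one row per server. A server that was never started simply shows up with `healthy: false` rather than throwing.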
Usage
# Query a local model
node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2
# Query with custom parameters
node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
# List available models
node /job/.pi/skills/local-llm-provider/list-models.js
# Check server health
node /job/.pi/skills/local-llm-provider/health.js
# Stream responses
node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
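The flags above presumably map onto each server's request format, though the internals of query.js are not shown. As a rough sketch under that assumption: Ollama's native `/api/generate` endpoint takes sampling settings inside an `options` object (`temperature`, `num_predict`), whereas OpenAI-compatible servers such as llama.cpp and vLLM take `temperature` and `max_tokens` at the top level of `/v1/chat/completions`. `buildRequest` is a hypothetical helper written for illustration.

```javascript
// Hypothetical mapping from CLI-style options to an HTTP request; not the skill's actual code.
function buildRequest(provider, prompt, opts = {}) {
  const { model = 'llama3.2', temp, maxTokens, stream = false } = opts;
  if (provider.name === 'ollama') {
    // Ollama native API: sampling parameters live under "options".
    const options = {};
    if (temp !== undefined) options.temperature = temp;
    if (maxTokens !== undefined) options.num_predict = maxTokens;
    return {
      url: `${provider.url}/api/generate`,
      body: { model, prompt, stream, options },
    };
  }
  // llama.cpp and vLLM: OpenAI-compatible chat completions (provider.url ends in /v1).
  const body = { model, messages: [{ role: 'user', content: prompt }], stream };
  if (temp !== undefined) body.temperature = temp;
  if (maxTokens !== undefined) body.max_tokens = maxTokens;
  return { url: `${provider.url}/chat/completions`, body };
}

// Mirrors the "custom parameters" command above.
const req = buildRequest(
  { name: 'ollama', url: 'http://localhost:11434' },
  'Explain quantum computing',
  { model: 'mixtral', temp: 0.8, maxTokens: 500 },
);
console.log(JSON.stringify(req, null, 2));
```

Keeping the mapping in one pure function makes the per-provider differences easy to test without a running server.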
Configuration
Create a config.json in the skill directory for persistent settings:
{
  "providers": [
    { "name": "ollama", "url": "http://localhost:11434", "enabled": true, "fallback_order": 1 },
    { "name": "llamacpp", "url": "http://localhost:8080/v1", "enabled": false, "fallback_order": 2 },
    { "name": "vllm", "url": "http://localhost:8000/v1", "enabled": false, "fallback_order": 3 }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
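One plausible reading of this config is that only providers with `enabled: true` are tried, in ascending `fallback_order`, with the cloud provider appended last when `fallback_to_cloud` is set. The sketch below shows that interpretation only. `resolveOrder` is a hypothetical helper, and the skill's real selection logic may differ.

```javascript
// Hypothetical: derive the order in which providers would be tried.
function resolveOrder(config) {
  const order = (config.providers || [])
    .filter((p) => p.enabled) // assumption: disabled providers are skipped
    .sort((a, b) => a.fallback_order - b.fallback_order)
    .map((p) => p.name);
  // Assumption: the cloud provider goes last when fallback_to_cloud is true.
  if (config.fallback_to_cloud && config.cloud_provider) {
    order.push(config.cloud_provider);
  }
  return order;
}

// Same values as the config.json shown above.
const config = {
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true, fallback_order: 1 },
    { name: 'llamacpp', url: 'http://localhost:8080/v1', enabled: false, fallback_order: 2 },
    { name: 'vllm', url: 'http://localhost:8000/v1', enabled: false, fallback_order: 3 },
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true,
  cloud_provider: 'anthropic',
  timeout_ms: 120000,
};

console.log(resolveOrder(config)); // → [ 'ollama', 'anthropic' ]
```

Under this reading, the sample config never reaches llama.cpp or vLLM until their `enabled` flags are switched to `true`.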
Provider Fallback
The skill tries providers in a fixed order and moves on to the next one when a request fails:
- Primary: Try local Ollama first
- Secondary: Try llama.cpp server
- Tertiary: Try vLLM server
- Fallback: Use cloud provider (if enabled)
A failed request, such as a refused connection, is retried automatically on the next available provider.
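The chain above can be sketched as a loop that walks the providers in order and returns the first successful answer. This is an illustrative sketch, not the skill's provider.js: `completeWithFallback` and the injected `callProvider` function are hypothetical names, and the failure object simply mirrors the shape documented under Output Format.

```javascript
// Hypothetical fallback loop; callProvider stands in for the real HTTP call
// and is expected to reject (throw) when a provider is unreachable.
async function completeWithFallback(providers, prompt, callProvider) {
  const tried = [];
  let lastError = null;
  for (const provider of providers) {
    tried.push(provider.name);
    try {
      const response = await callProvider(provider, prompt);
      // First provider that answers wins; later ones are never contacted.
      return { success: true, provider: provider.name, response };
    } catch (err) {
      lastError = err.message; // remember why this provider failed, then move on
    }
  }
  return {
    success: false,
    error: 'All providers failed',
    providers_tried: tried,
    last_error: lastError,
  };
}
```

Injecting `callProvider` keeps the loop independent of any one server's API, so the same control flow covers local and cloud providers.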
Supported Models
Ollama
- llama3.2, llama3.1, llama3
- mistral, mixtral
- qwen2.5, qwen2
- phi3, phi4
- gemma2, gemma
- codellama
- and many more
llama.cpp
- Any GGUF format model
- Mistral variants
- Llama variants
- Qwen variants
vLLM
- Llama 3.1, Llama 3
- Mistral
- Qwen
- Any Hugging Face model whose architecture vLLM supports
API Integration
As a library
const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

// CommonJS modules have no top-level await, so wrap the call in an async function.
async function main() {
  const response = await provider.complete('Hello, how are you?');
  console.log(response);
}

main().catch(console.error);
Output Format
The query returns JSON:
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}
When streaming:
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
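Assuming the streamed output arrives as one JSON object per line, with each `response` holding the next text fragment and `done: true` on the final object (the exact framing is not specified here), a consumer could reassemble the text roughly as follows. `collectStream` is a hypothetical name and the sample chunks are made up for illustration.

```javascript
// Hypothetical consumer for newline-delimited JSON chunks.
function collectStream(ndjson) {
  let text = '';
  let done = false;
  for (const line of ndjson.split('\n')) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const chunk = JSON.parse(line);
    if (!chunk.success) throw new Error(chunk.error || 'stream failed');
    text += chunk.response; // each chunk carries only the newest fragment
    if (chunk.done) {
      done = true;
      break;
    }
  }
  return { text, done };
}

// Made-up sample in the documented chunk shape.
const sample = [
  '{"success":true,"provider":"ollama","model":"llama3.2","response":"I","tokens":1,"done":false}',
  '{"success":true,"provider":"ollama","model":"llama3.2","response":" am","tokens":2,"done":false}',
  '{"success":true,"provider":"ollama","model":"llama3.2","response":" here.","tokens":3,"done":true}',
].join('\n');

console.log(collectStream(sample)); // → { text: 'I am here.', done: true }
```

A `done: false` result after the input ends would indicate that the stream was cut off before the final chunk.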
When every provider fails:
{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
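Callers will usually want to branch on `success` instead of reading `response` directly. A minimal sketch, assuming only the two result shapes documented in this section, is shown below. `unwrap` is a hypothetical helper, not something the skill exports.

```javascript
// Hypothetical: return the text on success, or throw an Error describing the failure.
function unwrap(result) {
  if (result.success) return result.response;
  const tried = (result.providers_tried || []).join(', ');
  throw new Error(`${result.error} (tried: ${tried}; last error: ${result.last_error})`);
}

try {
  unwrap({
    success: false,
    error: 'All providers failed',
    providers_tried: ['ollama', 'llamacpp'],
    last_error: 'Connection refused',
  });
} catch (err) {
  console.error(err.message);
  // prints: All providers failed (tried: ollama, llamacpp; last error: Connection refused)
}
```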
Environment Variables
| Variable | Description | Default |
| --- | --- | --- |
| OLLAMA_BASE_URL | Ollama server URL | http://localhost:11434 |
| LLAMACPP_BASE_URL | llama.cpp server URL | http://localhost:8080/v1 |
| VLLM_BASE_URL | vLLM server URL | http://localhost:8000/v1 |
| LOCAL_LLM_DEFAULT_MODEL | Default model to use | llama3.2 |
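These variables and defaults could be resolved in code along the following lines. This is a sketch only: `settingsFromEnv` is a hypothetical helper, and how the skill prioritizes environment variables against config.json is not stated here, so no precedence is implied.

```javascript
// Hypothetical: read the documented variables, falling back to the documented defaults.
function settingsFromEnv(env = process.env) {
  return {
    ollama: env.OLLAMA_BASE_URL || 'http://localhost:11434',
    llamacpp: env.LLAMACPP_BASE_URL || 'http://localhost:8080/v1',
    vllm: env.VLLM_BASE_URL || 'http://localhost:8000/v1',
    defaultModel: env.LOCAL_LLM_DEFAULT_MODEL || 'llama3.2',
  };
}

// Example: point Ollama at another machine (hypothetical host name) and keep the other defaults.
console.log(settingsFromEnv({ OLLAMA_BASE_URL: 'http://gpu-box:11434' }));
```

Passing the environment in as an argument keeps the function easy to test without mutating `process.env`.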
Limitations
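- Requires a local server to be running (unless cloud fallback is enabled)
- The model sizes you can run, and how fast they respond, depend on local hardware
- Not all models support all features (e.g., function calling)
- API formats differ between providers: Ollama is addressed at its base URL, while llama.cpp and vLLM expose OpenAI-compatible endpoints under /v1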
Tips
- For best performance: Use Ollama with GPU acceleration
- For variety: Pull multiple models (ollama pull mixtral)
- For privacy: Always use local providers first
- For reliability: Keep cloud fallback enabled for critical tasks
- For speed: Use smaller models (7B) for simple tasks, larger (70B) for complex reasoning