Local LLM Provider

Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill lets the agent run inference on a local GPU or CPU and stays reliable by moving to the next provider, and optionally to a cloud provider, when one fails.

When to Use

Running LLM inference locally for privacy (data never leaves your machine)
Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
Reducing API costs for high-volume tasks
Working offline or with intermittent connectivity
Needing low-latency responses for interactive tasks

Setup

If Ollama is already running, no additional setup is required. Otherwise, set up at least one of the servers below:

Ollama Setup

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Pull a model

ollama pull llama3.2

Start the server (default: http://localhost:11434)

ollama serve

llama.cpp Server Setup

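Build llama-server (run from a llama.cpp checkout; current versions build with CMake and write the binary to build/bin/)

cmake -B build
cmake --build build --config Release --target llama-server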

Start the server

llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080

vLLM Server Setup

Install vLLM

pip install vllm

Start the server (default: http://localhost:8000)

vllm serve meta-llama/Llama-3.1-8B-Instruct

Usage

Query a local model

node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2

Query with custom parameters

node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
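How --temp and --max-tokens reach each backend is not documented here. The sketch below shows one plausible mapping, based on the public request formats: Ollama's native /api/generate takes sampling settings under options (temperature, num_predict), while the OpenAI-compatible /chat/completions endpoint served by llama.cpp and vLLM takes temperature and max_tokens at the top level. The function name buildRequest and the default values are hypothetical, not taken from query.js.

```javascript
// Hypothetical helper: maps CLI-style options onto each backend's request shape.
// Field names follow the public Ollama and OpenAI-compatible APIs; how query.js
// actually builds its requests is an assumption. Defaults are illustrative only.
function buildRequest(provider, prompt, { model, temp = 0.7, maxTokens = 256, stream = false } = {}) {
  if (provider.name === 'ollama') {
    return {
      url: `${provider.url}/api/generate`,
      body: {
        model,
        prompt,
        stream,
        options: { temperature: temp, num_predict: maxTokens },
      },
    };
  }
  // llama.cpp and vLLM: provider.url is assumed to already end in /v1.
  return {
    url: `${provider.url}/chat/completions`,
    body: {
      model,
      messages: [{ role: 'user', content: prompt }],
      temperature: temp,
      max_tokens: maxTokens,
      stream,
    },
  };
}

const req = buildRequest(
  { name: 'ollama', url: 'http://localhost:11434' },
  'Explain quantum computing',
  { model: 'mixtral', temp: 0.8, maxTokens: 500 }
);
console.log(JSON.stringify(req, null, 2));
```

Keeping the mapping in one function means the rest of the code can pass the same options regardless of which provider ends up serving the request.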

List available models

node /job/.pi/skills/local-llm-provider/list-models.js
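The backends report their models in different shapes: Ollama's GET /api/tags returns a models array with name fields, while the OpenAI-compatible GET /v1/models on llama.cpp and vLLM returns a data array with id fields. A minimal sketch of normalizing both follows; extractModelNames and listModels are hypothetical names, and list-models.js may be implemented differently.

```javascript
// Hypothetical normalizer: turns each backend's model-list response into plain names.
function extractModelNames(providerName, body) {
  if (providerName === 'ollama') {
    // Ollama: GET /api/tags -> { models: [{ name: "llama3.2:latest", ... }] }
    return (body.models || []).map((m) => m.name);
  }
  // OpenAI-compatible (llama.cpp, vLLM): GET /v1/models -> { data: [{ id: "..." }] }
  return (body.data || []).map((m) => m.id);
}

// Fetches and normalizes the list for one provider (global fetch needs Node 18+).
// provider.url is assumed to end in /v1 for llama.cpp and vLLM.
async function listModels(provider) {
  const path = provider.name === 'ollama' ? '/api/tags' : '/models';
  const res = await fetch(`${provider.url}${path}`);
  if (!res.ok) throw new Error(`${provider.name} returned HTTP ${res.status}`);
  return extractModelNames(provider.name, await res.json());
}

// Example with canned response bodies, so no server is needed:
console.log(extractModelNames('ollama', { models: [{ name: 'llama3.2:latest' }] }));
console.log(extractModelNames('vllm', { data: [{ id: 'meta-llama/Llama-3.1-8B-Instruct' }] }));
```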

Check server health

node /job/.pi/skills/local-llm-provider/health.js
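What health.js probes is not specified. One simple approach, sketched below under that caveat, is to request a cheap read-only endpoint with a short timeout and treat any error as "down". The names probeUrl and isHealthy, the 3-second timeout, and the choice of endpoints (Ollama's /api/tags, /models under the /v1 base URL for the others) are assumptions.

```javascript
// Hypothetical probe; the skill's health.js may work differently.
function probeUrl(provider) {
  // Ollama is probed on its native model-list endpoint; the OpenAI-compatible
  // servers are probed via /models under their /v1 base URL.
  return provider.name === 'ollama' ? `${provider.url}/api/tags` : `${provider.url}/models`;
}

// Resolves to true only if the server answers with a 2xx status in time.
// fetchImpl defaults to the global fetch (Node 18+) and can be swapped for testing.
async function isHealthy(provider, { timeoutMs = 3000, fetchImpl = fetch } = {}) {
  try {
    const res = await fetchImpl(probeUrl(provider), { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    // Connection refused, DNS failure, and timeout all count as "down".
    return false;
  }
}

// Usage with a stubbed fetch, so the example runs without a server:
const stubFetch = async () => ({ ok: true });
isHealthy({ name: 'llamacpp', url: 'http://localhost:8080/v1' }, { fetchImpl: stubFetch })
  .then((ok) => console.log('llamacpp healthy:', ok));
```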

Stream responses

node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream

Configuration

Create a config.json in the skill directory for persistent settings:

{
  "providers": [
    { "name": "ollama", "url": "http://localhost:11434", "enabled": true, "fallback_order": 1 },
    { "name": "llamacpp", "url": "http://localhost:8080/v1", "enabled": false, "fallback_order": 2 },
    { "name": "vllm", "url": "http://localhost:8000/v1", "enabled": false, "fallback_order": 3 }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
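To illustrate how the enabled and fallback_order fields could be interpreted, here is a small sketch that keeps only enabled providers and sorts them by fallback_order. The helper name orderedProviders is hypothetical, and the skill's real loader may treat these fields differently.

```javascript
// Hypothetical helper: picks enabled providers and orders them by fallback_order.
function orderedProviders(config) {
  return (config.providers || [])
    .filter((p) => p.enabled) // filter() returns a new array, so sort() leaves config untouched
    .sort((a, b) => a.fallback_order - b.fallback_order);
}

const config = {
  providers: [
    { name: 'vllm', url: 'http://localhost:8000/v1', enabled: true, fallback_order: 3 },
    { name: 'ollama', url: 'http://localhost:11434', enabled: true, fallback_order: 1 },
    { name: 'llamacpp', url: 'http://localhost:8080/v1', enabled: false, fallback_order: 2 },
  ],
};

console.log(orderedProviders(config).map((p) => p.name)); // [ 'ollama', 'vllm' ]
```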

Provider Fallback

The skill tries providers in fallback_order and stops at the first one that responds:

Primary: Try local Ollama first
Secondary: Try llama.cpp server
Tertiary: Try vLLM server
Fallback: Use cloud provider (if enabled)

When a provider fails (for example, the connection is refused or the request times out), the skill automatically retries the request with the next enabled provider.
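The fallback chain described above can be sketched as a simple loop: try each provider in order, return on the first success, and report every provider tried if all of them fail. This is an illustrative sketch, not the skill's code; completeWithFallback and callProvider are hypothetical names, and the result fields mirror the documented output format.

```javascript
// Hypothetical fallback loop. `callProvider` is an assumed async function
// (provider, prompt) => string that throws when the provider cannot answer.
async function completeWithFallback(providers, prompt, callProvider) {
  const tried = [];
  let lastError = null;
  for (const provider of providers) {
    tried.push(provider.name);
    try {
      const response = await callProvider(provider, prompt);
      return { success: true, provider: provider.name, response };
    } catch (err) {
      lastError = err.message; // remember why this provider failed, then move on
    }
  }
  return { success: false, error: 'All providers failed', providers_tried: tried, last_error: lastError };
}

// Demo with a fake caller: Ollama is "down", llama.cpp answers.
const fakeCall = async (provider) => {
  if (provider.name === 'ollama') throw new Error('Connection refused');
  return `hello from ${provider.name}`;
};

completeWithFallback([{ name: 'ollama' }, { name: 'llamacpp' }], 'Hi', fakeCall)
  .then((result) => console.log(result));
```

Passing the caller in as an argument keeps the loop independent of any one backend's HTTP details and makes it easy to exercise without a running server.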

Supported Models

Ollama

llama3.2, llama3.1, llama3
mistral, mixtral
qwen2.5, qwen2
phi3, phi4
gemma2, gemma
codellama
and many more

llama.cpp

Any GGUF format model
Mistral variants
Llama variants
Qwen variants

vLLM

Llama 3.1, Llama 3
Mistral
Qwen
Most Hugging Face models whose architecture vLLM supports

API Integration

As a library

const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

// CommonJS modules do not allow top-level await, so wrap the call.
async function main() {
  const response = await provider.complete('Hello, how are you?');
  console.log(response);
}

main().catch(console.error);

Output Format

The query returns JSON:

{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}

When streaming:

{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
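A consumer of the streamed output needs to stitch the partial response values back together. The sketch below assumes query.js --stream prints one JSON object per line and sets done to true on the final chunk; neither detail is confirmed by this document, and assembleStream is a hypothetical name. The sample chunks are made up and carry only the two fields the function reads.

```javascript
// Assumption: one JSON object per line, with "done": true on the last chunk.
function assembleStream(lines) {
  let text = '';
  let done = false;
  for (const line of lines) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const chunk = JSON.parse(line);
    text += chunk.response || '';
    if (chunk.done) {
      done = true;
      break;
    }
  }
  return { text, done };
}

// Made-up sample chunks for illustration:
const sample = [
  '{"response":"I","done":false}',
  '{"response":" am fine","done":false}',
  '{"response":"","done":true}',
];
console.log(assembleStream(sample)); // { text: 'I am fine', done: true }
```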

When every provider fails:

{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}

Environment Variables

Variable                 Description           Default
OLLAMA_BASE_URL          Ollama server URL     http://localhost:11434
LLAMACPP_BASE_URL        llama.cpp server URL  http://localhost:8080/v1
VLLM_BASE_URL            vLLM server URL       http://localhost:8000/v1
LOCAL_LLM_DEFAULT_MODEL  Default model to use  llama3.2
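A minimal sketch of how these variables could be resolved against their documented defaults is shown below. resolveSettings and the returned property names are hypothetical, and the precedence between environment variables and config.json is not stated here, so this sketch covers the environment only.

```javascript
// Hypothetical resolver: an environment variable, when set, overrides its documented default.
function resolveSettings(env = process.env) {
  return {
    ollamaUrl: env.OLLAMA_BASE_URL || 'http://localhost:11434',
    llamacppUrl: env.LLAMACPP_BASE_URL || 'http://localhost:8080/v1',
    vllmUrl: env.VLLM_BASE_URL || 'http://localhost:8000/v1',
    defaultModel: env.LOCAL_LLM_DEFAULT_MODEL || 'llama3.2',
  };
}

// Passing a plain object instead of process.env keeps the example deterministic.
// The host name gpu-box is made up.
console.log(resolveSettings({ OLLAMA_BASE_URL: 'http://gpu-box:11434' }));
```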

Limitations

Requires local server to be running
Output quality and speed depend on which model sizes your local hardware can run
Not all models support all features (e.g., function calling)
Providers expose different API formats (Ollama has its own native API; llama.cpp and vLLM servers expose OpenAI-compatible /v1 endpoints)

Tips

For best performance: Use Ollama with GPU acceleration
For variety: Pull multiple models (ollama pull mixtral)
For privacy: Always use local providers first
For reliability: Keep cloud fallback enabled for critical tasks
For speed: Use smaller models (7B) for simple tasks, larger (70B) for complex reasoning

