local-llm-provider


Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.


Install the skill "local-llm-provider" with this command: npx skills add winsorllc/upgraded-carnival/winsorllc-upgraded-carnival-local-llm-provider

Local LLM Provider

Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill enables the agent to leverage local GPU/CPU inference while maintaining reliability through intelligent fallback.

When to Use

  • Running LLM inference locally for privacy (data never leaves your machine)

  • Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)

  • Reducing API costs for high-volume tasks

  • Working offline or with intermittent connectivity

  • Needing low-latency responses for interactive tasks

Setup

No additional setup required if Ollama is already running. Otherwise:

Ollama Setup

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Pull a model

ollama pull llama3.2

Start the server (default: http://localhost:11434)

ollama serve

llama.cpp Server Setup

Build llama-server

make llama-server

Start the server

llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080

vLLM Server Setup

Install vLLM

pip install vllm

Start the server

vllm serve meta-llama/Llama-3.1-8B-Instruct

Usage

Query a local model

node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2

Query with custom parameters

node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500

List available models

node /job/.pi/skills/local-llm-provider/list-models.js

Check server health

node /job/.pi/skills/local-llm-provider/health.js

Stream responses

node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
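Under the hood, an Ollama query is a POST to the server's /api/generate endpoint. The sketch below shows how the CLI flags above could map to a request body; the option names (`temperature`, `num_predict`) follow the Ollama REST API, but `buildGeneratePayload` is an illustrative helper, not the actual implementation of query.js.

```javascript
// Build the JSON body for Ollama's POST /api/generate endpoint from
// CLI-style arguments. Illustrative sketch; query.js may differ.
function buildGeneratePayload(prompt, { model = "llama3.2", temp, maxTokens, stream = false } = {}) {
  const body = { model, prompt, stream };
  const options = {};
  if (temp !== undefined) options.temperature = temp;
  if (maxTokens !== undefined) options.num_predict = maxTokens; // Ollama's max-token option
  if (Object.keys(options).length > 0) body.options = options;
  return body;
}

console.log(buildGeneratePayload("What is 2+2?"));
// → { model: 'llama3.2', prompt: 'What is 2+2?', stream: false }
```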

Configuration

Create a config.json in the skill directory for persistent settings:

{
  "providers": [
    { "name": "ollama", "url": "http://localhost:11434", "enabled": true, "fallback_order": 1 },
    { "name": "llamacpp", "url": "http://localhost:8080/v1", "enabled": false, "fallback_order": 2 },
    { "name": "vllm", "url": "http://localhost:8000/v1", "enabled": false, "fallback_order": 3 }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
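A consumer of this config would typically pick out the enabled providers in fallback order. A minimal sketch (the helper name `activeProviders` is illustrative, not part of the skill's API):

```javascript
// Select enabled providers from a parsed config.json, ordered by
// fallback_order. activeProviders is a hypothetical helper.
function activeProviders(config) {
  return config.providers
    .filter((p) => p.enabled)
    .sort((a, b) => a.fallback_order - b.fallback_order)
    .map((p) => p.name);
}

const config = {
  providers: [
    { name: "vllm", url: "http://localhost:8000/v1", enabled: true, fallback_order: 3 },
    { name: "ollama", url: "http://localhost:11434", enabled: true, fallback_order: 1 },
    { name: "llamacpp", url: "http://localhost:8080/v1", enabled: false, fallback_order: 2 },
  ],
  default_model: "llama3.2",
};

console.log(activeProviders(config)); // → [ 'ollama', 'vllm' ]
```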

Provider Fallback

The skill implements intelligent fallback:

  • Primary: Try local Ollama first

  • Secondary: Try llama.cpp server

  • Tertiary: Try vLLM server

  • Fallback: Use cloud provider (if enabled)

Each provider failure triggers automatic retry with the next available provider.

Supported Models

Ollama

  • llama3.2, llama3.1, llama3

  • mistral, mixtral

  • qwen2.5, qwen2

  • phi3, phi4

  • gemma2, gemma

  • codellama

  • and many more

llama.cpp

  • Any GGUF format model

  • Mistral variants

  • Llama variants

  • Qwen variants

vLLM

  • Llama 3.1, 3.0

  • Mistral

  • Qwen

  • Any HuggingFace model

API Integration

As a library

const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

const response = await provider.complete('Hello, how are you?');
console.log(response);

Output Format

The query returns JSON:

{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}

When streaming:

{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}

On fallback failure:

{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
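When streaming, each chunk arrives as one JSON object per line, and the consumer concatenates the partial `response` fields until a chunk reports `done: true`. A sketch (the helper name `accumulate` is illustrative):

```javascript
// Assemble a streamed reply from newline-delimited JSON chunks of the
// shape shown above. accumulate() is a hypothetical helper.
function accumulate(ndjson) {
  let text = "";
  let done = false;
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;
    const chunk = JSON.parse(line);
    if (chunk.response) text += chunk.response;
    if (chunk.done) done = true;
  }
  return { text, done };
}

const stream = [
  '{"success":true,"provider":"ollama","model":"llama3.2","response":"I","tokens":1,"done":false}',
  '{"success":true,"provider":"ollama","model":"llama3.2","response":" am fine.","tokens":3,"done":true}',
].join("\n");

console.log(accumulate(stream)); // → { text: 'I am fine.', done: true }
```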

Environment Variables

Variable                   Description             Default

OLLAMA_BASE_URL            Ollama server URL       http://localhost:11434

LLAMACPP_BASE_URL          llama.cpp server URL    http://localhost:8080/v1

VLLM_BASE_URL              vLLM server URL         http://localhost:8000/v1

LOCAL_LLM_DEFAULT_MODEL    Default model to use    llama3.2
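Resolving these variables against their defaults could look like the following sketch (the `resolveSetting` helper is illustrative; the skill's scripts may do this differently):

```javascript
// Resolve a setting from the environment, falling back to the documented defaults.
const DEFAULTS = {
  OLLAMA_BASE_URL: "http://localhost:11434",
  LLAMACPP_BASE_URL: "http://localhost:8080/v1",
  VLLM_BASE_URL: "http://localhost:8000/v1",
  LOCAL_LLM_DEFAULT_MODEL: "llama3.2",
};

function resolveSetting(name, env = process.env) {
  return env[name] !== undefined ? env[name] : DEFAULTS[name];
}

console.log(resolveSetting("OLLAMA_BASE_URL", {}));
// → http://localhost:11434
console.log(resolveSetting("OLLAMA_BASE_URL", { OLLAMA_BASE_URL: "http://gpu-box:11434" }));
// → http://gpu-box:11434
```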

Limitations

  • Requires local server to be running

  • Model quality depends on local hardware

  • Not all models support all features (e.g., function calling)

  • Some providers have different API formats

Tips

  • For best performance: Use Ollama with GPU acceleration

  • For variety: Pull multiple models (ollama pull mixtral)

  • For privacy: Always use local providers first

  • For reliability: Keep cloud fallback enabled for critical tasks

  • For speed: Use smaller models (7B) for simple tasks, larger (70B) for complex reasoning

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. No summaries were provided by the upstream source; each needs review.

  • robot-personality (Automation)

  • delegate-multi-agent (Automation)

  • delegate-agent (Automation)

  • agent-delegate (Automation)