Ollama

When to Use

User needs to install, run, integrate, tune, or debug Ollama for local or self-hosted model workflows. Agent handles smoke tests, model selection, API usage, Modelfile customization, embeddings, RAG fit checks, and safe operations.

Use this instead of generic AI advice when the blocker is specific to local runtime behavior: wrong model tag, broken JSON output, poor retrieval, slow inference, context sizing, GPU fallback, or unsafe remote exposure.

Architecture

Memory lives in ~/ollama/. If ~/ollama/ does not exist, run setup.md. See memory-template.md for structure.

~/ollama/
|-- memory.md          # Durable context and activation boundaries
|-- environment.md     # Host, GPU, OS, runtime, and service notes
|-- model-registry.md  # Approved models, tags, quants, and fit notes
|-- modelfiles.md      # Reusable Modelfile patterns and parameter decisions
|-- rag-notes.md       # Embedding choices, chunking, retrieval checks, vector dimensions
`-- incident-log.md    # Repeated failures, fixes, and rollback notes

Quick Reference

Load only the file needed for the current blocker.

Topic	File
Setup guide	`setup.md`
Memory template	`memory-template.md`
Install and smoke-test workflow	`install-and-smoke-test.md`
Local API and OpenAI-compatible patterns	`api-patterns.md`
Modelfile creation and context control	`modelfile-workflows.md`
Embeddings and local RAG checks	`embeddings-and-rag.md`
Runtime operations and performance tuning	`operations-and-performance.md`
Failure recovery and incident triage	`troubleshooting.md`

Requirements

Local ollama access on the target machine, or permission to guide installation.
Enough RAM, VRAM, and disk for the exact model and context window being proposed.
Explicit user approval before exposing Ollama beyond localhost, changing service managers, or deleting model files.
Exact model tags and runtime facts must be verified with live commands such as ollama list, ollama ps, and ollama show.

Never assume model capabilities, context length, quantization, or GPU usage from memory alone.

Operating Coverage

This skill is for practical Ollama execution, not abstract local-LLM discussion. It covers:

local installs on macOS, Linux, and Windows
CLI workflows for pull, run, copy, show, create, and remove
REST API usage on http://127.0.0.1:11434/api and OpenAI-compatible usage on /v1
hardware-aware model sizing, context tuning, and throughput tradeoffs
Modelfile-based customization for prompts, parameters, adapters, and reproducible model names
embeddings and local RAG pipelines where indexing, querying, and retrieval must stay consistent

Data Storage

Keep only durable operational context in ~/ollama/:

host facts that materially change advice: OS, GPU class, CPU-only constraints, service manager, remote or local deployment
approved model tags, copied aliases, quant choices, and context limits that worked in practice
Modelfile defaults, JSON output patterns, and safe OpenAI-compatible mappings
embedding model choices, vector dimensions, chunking defaults, and retrieval checks
recurring failures such as partial pulls, CPU fallback, port conflicts, or broken upgrades

Core Rules

1. Verify the Runtime Before Giving Advice

Confirm ollama is installed and reachable before proposing any deeper fix.
Start with the smallest factual checks: ollama --version, ollama list, ollama ps, and one minimal generation or /api/tags request.
Treat "it runs" and "it runs correctly" as different states.

2. Pin Exact Model Names and Inspect Them Live

Use exact tags, not vague family names, for anything reproducible or production-adjacent.
Inspect the real model with ollama show or /api/show before claiming context length, quantization, or capabilities.
Avoid silent drift from floating tags when stability matters.

3. Separate Runtime, Modelfile, and App Prompt Responsibilities

Debug local behavior in layers: runtime first, then model definition, then application prompt.
If output quality changed, check whether SYSTEM, TEMPLATE, or PARAMETER settings in the Modelfile are fighting the app prompt.
Put durable defaults in a named model, not in ad hoc copy-pasted prompts.

4. Choose Models by Hardware and Latency Budget

A model that technically loads but falls back to CPU or swaps memory is not a good fit.
Use ollama ps to confirm processor split before promising performance.
Keep separate defaults for chat, coding, extraction, vision, and embeddings instead of forcing one model to do everything.

5. Make API and Structured Output Flows Deterministic

Prefer non-streaming responses when the next step needs strict parsing.
Use format: "json" or a JSON schema, set low temperature, and validate the parsed result before taking downstream actions.
For OpenAI-compatible clients, verify /v1 assumptions instead of assuming every feature maps 1:1.

6. Treat Embeddings and RAG as a Single System

Use the same embedding model for indexing and querying unless you intentionally migrate and re-index.
Inspect retrieved chunks before blaming the model for weak answers.
Fix chunking, metadata, top-k, and vector dimensions before increasing prompt size.

7. Treat Remote Access and Upgrades as Operational Changes

Do not bind Ollama to non-localhost or open port 11434 without explicit approval and a minimal-risk network plan.
Record service manager changes, environment variables, and rollback steps before upgrading.
Protect model storage and disk headroom before large pulls or replacements.

Ollama Traps

Using latest everywhere -> upgrades silently change behavior and break reproducibility.
Testing only with ollama run -> app integration still fails on /api or /v1.
Assuming slow responses mean "bad model" -> often it is CPU fallback, oversized context, or disk pressure.
Letting app prompts and Modelfile instructions fight each other -> outputs become inconsistent and hard to debug.
Re-indexing with one embedding model and querying with another -> retrieval quality collapses without obvious errors.
Exposing the API on a LAN without auth or scoping -> local convenience becomes a security problem.
Chasing larger context before fixing retrieval or prompt shape -> memory use rises while answer quality barely improves.

External Endpoints

Use external network access only when the task requires model downloads, official docs lookup, or optional cloud execution explicitly approved by the user.

Endpoint	Data Sent	Purpose
https://ollama.com/*	model identifiers, optional doc queries, and optional cloud API requests	Official docs, library lookups, model pulls managed by the Ollama runtime, and optional cloud execution

No other data is sent externally.

Security & Privacy

Data that leaves your machine:

model identifiers and download requests when pulling models through Ollama
optional prompts and attachments only if the user explicitly chooses https://ollama.com/api instead of local inference
optional documentation lookups against official Ollama pages

Data that stays local:

prompts and outputs served through the local Ollama runtime on the user machine
durable workflow notes under ~/ollama/
local Modelfiles, retrieval notes, and performance baselines unless the user exports them

This skill does NOT:

expose Ollama remotely without explicit approval
store OLLAMA_API_KEY or other secrets in skill files
mix local and cloud execution silently
invent unsupported model features, GPU behavior, or API compatibility
recommend remote installers or destructive cleanup without explaining risk first

Trust

By using this skill, model pulls and optional cloud requests may go to Ollama infrastructure when the user explicitly chooses those paths. Only install if you trust Ollama with that data.

Scope

This skill ONLY:

installs, verifies, operates, and troubleshoots Ollama workflows
helps choose, pin, inspect, and customize models with reproducible patterns
keeps local memory for host constraints, model defaults, and recurring failure fixes

This skill NEVER:

claim that every Ollama model supports the same tools, context, or JSON reliability
recommend unauthenticated remote exposure as a default
treat local RAG quality as solved without checking embeddings, chunking, and retrieval results
modify its own skill files

Related Skills

Install with clawhub install <slug> if user confirms:

ai - Frame when local Ollama is the right fit versus cloud inference.
models - Compare local model families, sizes, and capability tradeoffs before pinning defaults.
api - Reuse robust HTTP request, retry, and parsing patterns around local services.
embeddings - Extend vector search and chunking strategy beyond the Ollama runtime itself.
langchain - Integrate Ollama into multi-step chains, agents, and retrieval pipelines.

Feedback

If useful: clawhub star ollama
Stay updated: clawhub sync

Safety Notice

Copy this and send it to your AI assistant to learn

When to Use

Architecture

Quick Reference

Requirements

Operating Coverage

Data Storage

Core Rules

1. Verify the Runtime Before Giving Advice

2. Pin Exact Model Names and Inspect Them Live

3. Separate Runtime, Modelfile, and App Prompt Responsibilities

4. Choose Models by Hardware and Latency Budget

5. Make API and Structured Output Flows Deterministic

6. Treat Embeddings and RAG as a Single System

7. Treat Remote Access and Upgrades as Operational Changes

Ollama Traps

External Endpoints

Security & Privacy

Trust

Scope

Related Skills

Feedback

Source Transparency

Related Skills

Talagent

GoHighLevel

Agent Memory System v8

Billions Network - Verified Agent Identity