distributed-inference

Distributed inference for Llama, Qwen, and DeepSeek across heterogeneous hardware. Self-hosted: scatter requests across macOS, Linux, Windows, and any machine running Ollama. Thermal-aware scheduling, 7-signal scoring, adaptive capacity learning, and context-aware model placement. No orchestration layer and no container runtime: just HTTP and mDNS.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "distributed-inference" with this command: npx skills add distributed-inference

Distributed Inference

A coordination layer for distributed inference across heterogeneous machines. Each node is autonomous: it runs its own Ollama, manages its own models, and works fine standalone. The coordinator routes each request to the optimal node using a multi-signal scoring function and records every routing decision for analysis.

Install Distributed Inference

pip install ollama-herd
herd              # start the coordinator
herd-node         # start a node agent on each machine

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Distributed Inference Architecture

Distributed Inference Coordinator (:11435)    Node Agents
┌──────────────────────┐     ┌─────────────────────┐
│ Distributed Scoring  │◄────│ Heartbeat + Metrics │  (mDNS or explicit URL)
│ Inference Queue Mgr  │     │ Capacity Learner    │
│ Streaming Proxy      │     └─────────────────────┘
│ Trace Store          │     ┌─────────────────────┐
│ Latency Store        │     │ Heartbeat + Metrics │  (N nodes)
└──────────────────────┘     └─────────────────────┘
        │
        ▼
   Ollama instances (one per node)

Nodes discover the coordinator via mDNS (_fleet-manager._tcp.local.) or connect explicitly with --router-url. Each node sends a heartbeat every 5 seconds containing: CPU utilization, memory usage and pressure classification, disk metrics, loaded models with their context lengths, available models, and an optional capacity score from the behavioral model.
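
To sanity-check discovery or bypass it entirely, you can browse for the advertised service with standard mDNS tools and fall back to an explicit URL. This is a sketch: the service type and the --router-url flag come from this page, while coordinator-host is a placeholder for your machine.

# Browse for the coordinator's mDNS advertisement
dns-sd -B _fleet-manager._tcp                 # macOS
avahi-browse -r _fleet-manager._tcp           # Linux (Avahi)

# Connect a node agent without relying on mDNS
herd-node --router-url http://coordinator-host:11435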

Distributed Inference Scoring Function

The coordinator evaluates every online node for every request using 7 weighted signals:

| Signal | Max weight | What it measures |
|---|---|---|
| Thermal state | +50 | Is the model already loaded in GPU memory? Hot (+50), warm (+30), cold (+10) |
| Memory fit | +20 | Available memory headroom relative to model size |
| Queue depth | -30 | Pending and in-flight requests on this node:model pair |
| Wait time | -25 | Estimated wait based on p75 historical latency × queue depth |
| Role affinity | +15 | Large models prefer high-memory nodes |
| Availability trend | +10 | Capacity learner's prediction of node availability |
| Context fit | +15 | Does the loaded model's context window fit the estimated token count? |

Nodes with insufficient memory, critical memory pressure, or a missing model are eliminated before scoring; the highest-scoring remaining node wins. For example, a node holding the model hot in GPU memory starts with a +50 thermal score but can still lose to a cold node (+10) whose queue-depth and wait-time penalties are smaller.

Adaptive Distributed Inference Capacity

Nodes can optionally learn usage patterns and constrain their own availability:

  • 168-slot behavioral model — one slot per hour of the week (24 × 7), learning when the machine is typically free
  • Dynamic memory ceiling — maps the availability score to how much RAM the coordinator may use on that node

Enable it with FLEET_NODE_ENABLE_CAPACITY_LEARNING=true on the node agent, as shown below.
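
A minimal sketch of turning this on, assuming the node agent reads the flag from its process environment (the variable itself is documented above):

# Start a node agent with the 168-slot behavioral model enabled
FLEET_NODE_ENABLE_CAPACITY_LEARNING=true herd-node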

Context-aware Distributed Inference Model Placement

The coordinator protects against a known Ollama behavior where changing num_ctx at runtime triggers a full model reload. For an 89 GB model, this causes multi-minute hangs.

  • num_ctx ≤ loaded context → stripped from the request
  • num_ctx > loaded context → the coordinator searches loaded models across all nodes for one with sufficient context
  • Configurable: FLEET_CONTEXT_PROTECTION=strip|warn|passthrough
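
For example, to log context mismatches without rewriting requests, you could run the coordinator in warn mode. A sketch, assuming the coordinator reads the variable from its environment:

# Warn on num_ctx mismatches instead of stripping them
FLEET_CONTEXT_PROTECTION=warn herd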

Distributed Inference API

Distributed Inference Coordinator State

# distributed_inference_fleet_state — full distributed inference topology
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# distributed_inference_models — models across all distributed inference nodes
curl -s http://localhost:11435/api/tags | python3 -m json.tool

# distributed_inference_hot_models — models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

Distributed Inference (OpenAI-compatible)

# distributed_inference_chat — route via distributed inference scoring
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello via distributed inference"}]}'

Distributed Inference (Ollama-native)

curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello via distributed inference"}]}'

Distributed Inference Model Fallback Chains

curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","fallback_models":["qwen2.5:32b","qwen2.5:7b"],"messages":[{"role":"user","content":"Hello with distributed inference fallback"}]}'

Distributed Inference Trace Analysis

# distributed_inference_traces — recent routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

# distributed_inference_score_breakdown
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, node_id, score, scores_breakdown FROM request_traces ORDER BY timestamp DESC LIMIT 1"

Distributed Inference Node Performance

sqlite3 ~/.fleet-manager/latency.db "
  SELECT node_id, model, COUNT(*) AS n,
         ROUND(AVG(latency_ms)/1000.0, 1) AS avg_s,
         ROUND(AVG(COALESCE(completion_tokens,0) * 1000.0 / NULLIF(latency_ms,0)), 1) AS tok_per_s
  FROM request_traces
  WHERE status='completed'
  GROUP BY node_id, model
  HAVING n > 10
  ORDER BY tok_per_s DESC"
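
A complementary query for reliability: per-node failure counts. This assumes rows that did not complete carry a status other than 'completed'; the exact failure statuses are not documented on this page.

sqlite3 ~/.fleet-manager/latency.db "
  SELECT node_id,
         COUNT(*) AS total,
         SUM(CASE WHEN status <> 'completed' THEN 1 ELSE 0 END) AS not_completed
  FROM request_traces
  GROUP BY node_id
  ORDER BY not_completed DESC"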

Distributed Inference Health and Capacity

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Distributed Inference Model Lifecycle

# distributed_inference_model_inventory
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull model to a distributed inference node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Remove model from a distributed inference node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'

Distributed Inference Fault Tolerance

| Mechanism | Behavior |
|---|---|
| Auto-retry | If a node fails before the first chunk, re-score and retry on the next-best node |
| Holding queue | When all nodes are saturated, requests queue for up to 30 seconds |
| Zombie reaper | Background task reclaims stuck in-flight slots |
| VRAM fallback | Routes to a loaded model in the same category rather than cold-loading |
| Auto-pull | Pulls missing models onto the node with the most available memory |
| Graceful drain | SIGTERM triggers a drain: in-flight requests finish, pending requests redistribute |
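
Graceful drain can be triggered from the shell. The pgrep pattern below is a guess at the agent's process name; adjust it to match what your process list actually shows:

# SIGTERM drains the node: in-flight requests finish, pending ones redistribute
kill -TERM $(pgrep -f herd-node)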

Distributed Inference Data Model

All coordinator state is stored in SQLite at ~/.fleet-manager/latency.db:

-- Distributed inference request traces (every routing decision)
SELECT * FROM request_traces LIMIT 1;
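
To see every column a trace row carries (the queries on this page reference request_id, model, node_id, score, scores_breakdown, latency_ms, completion_tokens, status, and timestamp), dump the table definition:

# Inspect the full request_traces schema
sqlite3 ~/.fleet-manager/latency.db ".schema request_traces"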

Structured logs are written to ~/.fleet-manager/logs/herd.jsonl with daily rotation and 30-day retention.

Distributed Inference Dashboard

http://localhost:11435/dashboard — eight tabs covering fleet overview, trends, model insights, per-app analytics, benchmarks, health checks, model recommendations, and settings.

Distributed Inference Constraints

  • Never restart services or modify ~/.fleet-manager/ without explicit user confirmation.
  • Model pull/delete operations require user confirmation (transfers can run 10-100+ GB).
  • If the coordinator is unreachable, suggest herd or uv run herd.
  • If no nodes are online, suggest running herd-node on the target machines.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

Local Llm Router

Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on macOS, Linux, a...

General

Ollama Ollama Herd

Ollama Ollama Herd — multimodal Ollama model router that herds your Ollama LLMs into one smart Ollama endpoint. Route Ollama Llama, Qwen, DeepSeek, Phi, Mist...

Coding

Gpu Cluster Manager

Turn your spare GPUs into one inference endpoint. Auto-discovers machines on your network, routes requests to the best available device, learns when your mac...

General

Ollama Herd

Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral — plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama...
