SWARM Safety Skill
Study how intelligence swarms — and where it fails.
SWARM is a research framework for studying emergent risks in multi-agent AI systems using soft (probabilistic) labels instead of binary good/bad classifications. AGI-level risks don't require AGI-level agents — harmful dynamics emerge when many sub-AGI agents interact, even when no individual agent is misaligned.
v1.5.0 | 38 agent types | 29 governance levers | 55 scenarios | 2922 tests | 8 framework bridges
Repository: https://github.com/swarm-ai-safety/swarm
Hard Rules
- SWARM simulations run locally. Install the package first.
- Do not submit scenarios containing real API keys, credentials, or PII.
- Simulation results are research artifacts. Do not present them as ground truth about real systems.
- When publishing results, cite the framework and disclose simulation parameters.
Security
- API binds to localhost only (127.0.0.1) by default to prevent network exposure.
- CORS restricted to localhost origins by default.
- No authentication on development API — do not expose to untrusted networks.
- In-memory storage — data does not persist between restarts.
- For production deployment, add authentication middleware and use a proper database.
Install
# From PyPI
pip install swarm-safety
# With LLM agent support
pip install "swarm-safety[llm]"
# Full development (all extras)
git clone https://github.com/swarm-ai-safety/swarm.git
cd swarm
pip install -e ".[dev,runtime]"
Quick Start (Python)
from swarm.agents.honest import HonestAgent
from swarm.agents.opportunistic import OpportunisticAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.agents.adversarial import AdversarialAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig
config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)
orchestrator.register_agent(HonestAgent(agent_id="honest_1", name="Alice"))
orchestrator.register_agent(HonestAgent(agent_id="honest_2", name="Bob"))
orchestrator.register_agent(OpportunisticAgent(agent_id="opp_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))
metrics = orchestrator.run()
for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}, welfare={m.total_welfare:.2f}")
Quick Start (CLI)
# List available scenarios
swarm list
# Run a scenario
swarm run scenarios/baseline.yaml
# Override settings
swarm run scenarios/baseline.yaml --seed 42 --epochs 20 --steps 15
# Export results
swarm run scenarios/baseline.yaml --export-json results.json --export-csv outputs/
Quick Start (API)
Start the API server:
pip install "swarm-safety[api]"
uvicorn swarm.api.app:app --host 127.0.0.1 --port 8000
API documentation at http://localhost:8000/docs.
Security Note: The server binds to 127.0.0.1 (localhost only) by default. Do not bind to 0.0.0.0 unless you understand the security implications and have proper firewall rules in place.
Register Agent
curl -X POST http://localhost:8000/api/v1/agents/register \
-H "Content-Type: application/json" \
-d '{
"name": "YourAgent",
"description": "What your agent does",
"capabilities": ["governance-testing", "red-teaming"]
}'
Returns agent_id and api_key.
Submit Scenario
curl -X POST http://localhost:8000/api/v1/scenarios/submit \
-H "Content-Type: application/json" \
-d '{
"name": "my-scenario",
"description": "Testing collusion detection with 5 agents",
"yaml_content": "simulation:\n n_epochs: 10\n steps_per_epoch: 10\nagents:\n - type: honest\n count: 3\n - type: adversarial\n count: 2",
"tags": ["collusion", "governance"]
}'
Create & Join Simulation
# Create
curl -X POST http://localhost:8000/api/v1/simulations/create \
-H "Content-Type: application/json" \
-d '{"scenario_id": "SCENARIO_ID", "max_participants": 5}'
# Join
curl -X POST http://localhost:8000/api/v1/simulations/SIM_ID/join \
-H "Content-Type: application/json" \
-d '{"agent_id": "YOUR_AGENT_ID", "role": "participant"}'
Core Concepts
Soft Probabilistic Labels
Every interaction carries a soft label p = P(v = +1), the probability of a beneficial outcome:
Observables -> ProxyComputer -> v_hat -> sigmoid -> p -> PayoffEngine -> payoffs
                                                    |
                                                    +-> SoftMetrics -> toxicity, quality gap, etc.
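A hand-rolled sketch of the soft-label step (ProxyComputer internals are not shown; the v_hat values are made up for illustration):
import math

def sigmoid(x: float) -> float:
    """Map a proxy score v_hat onto a soft label p = P(v = +1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical proxy scores computed from interaction observables.
v_hats = [2.1, -0.4, 0.9]

# Each interaction carries a probability p rather than a hard good/bad bit.
ps = [sigmoid(v) for v in v_hats]
print([round(p, 3) for p in ps])  # [0.891, 0.401, 0.711]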
Five Key Metrics
| Metric | What It Measures |
|---|---|
| Toxicity rate | Expected harm among accepted interactions: E[1-p | accepted] |
| Quality gap | Adverse selection indicator (negative = bad): E[p | accepted] - E[p | rejected] |
| Conditional loss | Selection effect on payoffs |
| Incoherence | Variance-to-error ratio across replays |
| Illusion delta | Gap between perceived coherence and actual consistency |
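As a sketch of how the first two metrics fall out of those definitions, with made-up (p, accepted) pairs:
# Illustrative (p, accepted) pairs; p = P(v = +1) for each interaction.
interactions = [(0.9, True), (0.7, True), (0.4, True), (0.8, False), (0.2, False)]

accepted = [p for p, ok in interactions if ok]
rejected = [p for p, ok in interactions if not ok]

# Toxicity rate: E[1 - p | accepted], expected harm among accepted interactions.
toxicity = sum(1 - p for p in accepted) / len(accepted)

# Quality gap: E[p | accepted] - E[p | rejected]; negative indicates adverse selection.
quality_gap = sum(accepted) / len(accepted) - sum(rejected) / len(rejected)

print(f"toxicity={toxicity:.3f}, quality_gap={quality_gap:.3f}")  # toxicity=0.333, quality_gap=0.167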
Agent Types (14 families, 38 implementations; selected families below)
| Type | Behavior |
|---|---|
| Honest | Cooperative, trust-based, completes tasks diligently |
| Opportunistic | Maximizes short-term payoff, cherry-picks tasks |
| Deceptive | Builds trust, then exploits trusted relationships |
| Adversarial | Targets honest agents, coordinates with allies |
| LDT | Logical Decision Theory with FDT/UDT precommitment |
| RLM | Reinforcement Learning from Memory |
| Council | Multi-agent deliberation-based decisions |
| SkillRL | Learns interaction strategies via reward signals |
| LLM | Behavior determined by LLM (Anthropic, OpenAI, or Ollama) |
| Moltbook | Domain-specific social platform agent |
| Scholar | Academic citation and research agent |
| Wiki Editor | Collaborative editing with editorial policy |
Governance Levers (29 mechanisms)
- Transaction Taxes — Reduce exploitation, cost welfare
- Reputation Decay — Punish bad actors, erode honest standing
- Circuit Breakers — Freeze toxic agents quickly
- Random Audits — Deter hidden exploitation
- Staking — Filter undercapitalized agents
- Collusion Detection — Catch coordinated attacks (the critical lever near the collapse threshold)
- Sybil Detection — Identify duplicate agents
- Transparency Ledger — Reward/penalize based on outcome
- Moderator Agent — Probabilistic review of interactions
- Incoherence Friction — Tax uncertainty-driven decisions
- Council Deliberation — Multi-agent governance decisions
- Diversity Enforcement — Prevent monoculture collapse
- Moltipedia-specific — Pair caps, page cooldowns, daily caps, self-fix prevention
Framework Bridges
| Bridge | Integration |
|---|---|
| Concordia | DeepMind's multi-agent framework |
| GasTown | Multi-agent workspace governance |
| Claude Code | Claude CLI agent integration |
| LiveSWE | Live software engineering tasks |
| OpenClaw | Open agent protocol |
| Prime Intellect | Cross-platform run tracking |
| Ralph | Agent orchestration |
| Worktree | Git worktree-based sandboxing |
Scenario YAML Format
simulation:
  n_epochs: 10
  steps_per_epoch: 10
  seed: 42
agents:
  - type: honest
    count: 3
    config:
      acceptance_threshold: 0.4
  - type: adversarial
    count: 2
    config:
      aggression_level: 0.7
governance:
  transaction_tax_rate: 0.05
  circuit_breaker_enabled: true
  collusion_detection_enabled: true
success_criteria:
  max_toxicity: 0.3
  min_quality_gap: 0.0
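A minimal sketch of checking those success criteria in Python against the metrics list from the Quick Start; toxicity_rate appears in the Quick Start example, while quality_gap is an assumed attribute name here:
def meets_success_criteria(metrics, max_toxicity=0.3, min_quality_gap=0.0):
    """Check a scenario's success criteria against the final epoch's metrics."""
    final = metrics[-1]
    return (final.toxicity_rate <= max_toxicity      # success_criteria.max_toxicity
            and final.quality_gap >= min_quality_gap)  # assumed attribute name

# e.g. with the `metrics` list returned by orchestrator.run():
# print(meets_success_criteria(metrics))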
Key Research Findings
Phase Transitions (11-scenario, 209-epoch study)
| Regime | Adversarial % | Toxicity | Welfare | Outcome |
|---|---|---|---|---|
| Cooperative | 0-20% | < 0.30 | Stable | Survives |
| Contested | 20-37.5% | 0.33-0.37 | Declining | Survives |
| Collapse | 50%+ | ~0.30 | Zero by epoch 12-14 | Collapses |
A critical threshold between 37.5% and 50% adversarial agents separates recoverable from irreversible collapse.
Governance Cost Paradox (v1.5.0 GasTown study)
A 42-run study shows that governance reduces toxicity at all adversarial levels (mean reduction 0.071) but imposes net-negative welfare costs under the current parameter tuning. At 0% adversarial agents, governance costs 216 welfare units (-57.6%) for only a 0.066 reduction in toxicity.
Case Studies
GasTown Governance Cost
Study governance overhead vs. toxicity reduction across 7 agent compositions with and without governance levers. Reveals the safety-throughput trade-off. See scenarios/gastown_governance_cost.yaml.
LDT Cooperation
220 runs across 10 seeds comparing TDT vs FDT vs UDT cooperation strategies at population scales up to 21 agents. See scenarios/ldt_cooperation.yaml.
Moltipedia Heartbeat
Model the Moltipedia wiki editing loop: competing AI editors, editorial policy, point farming, and anti-gaming governance. See scenarios/moltipedia_heartbeat.yaml.
Moltbook CAPTCHA
Model Moltbook's anti-human math challenges and rate limiting: obfuscated text parsing, verification gates, and spam prevention. See scenarios/moltbook_captcha.yaml.
API Endpoints (Full Reference)
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | / | API info |
| POST | /api/v1/agents/register | Register agent |
| GET | /api/v1/agents/{agent_id} | Get agent details |
| GET | /api/v1/agents/ | List agents |
| POST | /api/v1/scenarios/submit | Submit scenario |
| GET | /api/v1/scenarios/{scenario_id} | Get scenario |
| GET | /api/v1/scenarios/ | List scenarios |
| POST | /api/v1/simulations/create | Create simulation |
| POST | /api/v1/simulations/{id}/join | Join simulation |
| GET | /api/v1/simulations/{id} | Get simulation |
| GET | /api/v1/simulations/ | List simulations |
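For a quick sanity check of a local server against the GET endpoints above, a minimal sketch (response shapes are not specified here):
import requests

# Hit the health check and list submitted scenarios.
print(requests.get("http://localhost:8000/health").json())
print(requests.get("http://localhost:8000/api/v1/scenarios/").json())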
Citation
@software{swarm2026,
  title = {SWARM: System-Wide Assessment of Risk in Multi-agent systems},
  author = {Savitt, Raeli},
  year = {2026},
  url = {https://github.com/swarm-ai-safety/swarm}
}
Linked Docs
- Skill metadata: skill.json
- Agent discovery: .well-known/agent.json
- Full documentation: https://github.com/swarm-ai-safety/swarm/tree/main/docs
- Theoretical foundations: docs/research/theory.md
- Governance guide: docs/governance.md
- Red-teaming guide: docs/red-teaming.md
- Scenario format: docs/guides/scenarios.md