# Agent Security Audit

Performs a security audit of an AI agent system. Applies patterns 18-21 from *Patterns for Building AI Agents* (Bhagwat & Gienow, 2025): preventing the lethal trifecta, sandboxing code execution, granular access control, and input/output guardrails.
## When to use

Use this skill when the user needs to:

- Audit an existing agent for security vulnerabilities
- Design security controls for a new agent
- Prevent prompt injection and data exfiltration
- Set up sandboxing for code execution
- Design access control and guardrails
## Instructions

### Step 1: Understand the Agent

Use the AskUserQuestion tool to gather context:

- What does the agent do?
- Does it access private/sensitive data? (user data, internal docs, credentials)
- Does it process untrusted input? (public content, user uploads, external APIs)
- Can it communicate externally? (send emails, create PRs, call APIs, write files)
- Does it execute code? (run scripts, shell commands, code generation)
- What authentication/authorization exists today?

Read any existing spec documents (`.specs/<spec-name>/`) before proceeding.
### Step 2: Lethal Trifecta Analysis (Pattern 18)

The "lethal trifecta" (coined by Simon Willison) is the combination of:

- **Access to private data** — the agent can read sensitive information
- **Exposure to untrusted content** — the agent processes external/user-generated input
- **Exfiltration capability** — the agent can send data outside the system

When all three are present, prompt injection attacks become possible: malicious instructions hidden in external content trick the agent into accessing private data and sending it to an attacker.
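The three-leg rule can be expressed as a small helper to make the classification mechanical. A minimal sketch; the `AgentProfile` fields and the "two legs = AT RISK" threshold are illustrative assumptions, not definitions from the book:

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    """Illustrative capability profile for an agent under audit."""
    reads_private_data: bool         # Leg 1: PII, internal docs, secrets, DB access
    processes_untrusted_input: bool  # Leg 2: web pages, uploads, third-party APIs
    can_exfiltrate: bool             # Leg 3: email, PRs, external APIs, public writes

def trifecta_status(agent: AgentProfile) -> str:
    """Return SAFE, AT RISK, or VULNERABLE per the lethal-trifecta rule."""
    legs = [agent.reads_private_data,
            agent.processes_untrusted_input,
            agent.can_exfiltrate]
    if all(legs):
        return "VULNERABLE"  # all three legs present: injection can exfiltrate data
    if sum(legs) == 2:
        return "AT RISK"     # one capability away from the full trifecta
    return "SAFE"
```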
Analyze the agent:

```markdown
## Lethal Trifecta Analysis

### Leg 1: Private Data Access
- [ ] Reads user PII
- [ ] Accesses internal documents
- [ ] Has database read access
- [ ] Can read credentials/secrets
- [ ] Accesses private repositories

Risk level: [None / Low / Medium / High]

### Leg 2: Untrusted Content Exposure
- [ ] Processes user-generated content
- [ ] Reads public web pages
- [ ] Parses uploaded files
- [ ] Ingests third-party API responses
- [ ] Reads public issues/tickets/comments

Risk level: [None / Low / Medium / High]

### Leg 3: Exfiltration Capability
- [ ] Can send emails
- [ ] Can create PRs/issues
- [ ] Can call external APIs
- [ ] Can write to public endpoints
- [ ] Can modify shared state

Risk level: [None / Low / Medium / High]

**Trifecta Status: [SAFE / AT RISK / VULNERABLE]**
```
If all three legs are present, the agent is VULNERABLE. Recommend removing at least one leg:

- **Easiest: remove exfiltration** — constrain the agent's actions after it has processed untrusted input
- **Alternative: isolate data access** — use separate agents for private data vs. untrusted content
- **Alternative: sanitize input** — add middleware to intercept and clean untrusted content before it reaches the LLM
Use AskUserQuestion to recommend and confirm the mitigation approach.
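The "sanitize input" option can be prototyped as a thin middleware layer in front of the model. A minimal sketch; the suspicious-pattern list and the `<untrusted_content>` delimiters are illustrative assumptions, and a rule-based filter on its own is a first layer, not a complete defense:

```python
import re

# Phrases commonly seen in prompt-injection payloads (illustrative, not exhaustive).
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard your (system )?prompt",
    r"reveal your (system )?prompt",
]

def sanitize_untrusted(text: str) -> tuple[str, list[str]]:
    """Strip instruction-like phrases from untrusted content before LLM ingestion.

    Returns the cleaned, delimited text plus the list of patterns that matched,
    so every trigger can be logged and reviewed.
    """
    hits = []
    cleaned = text
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, cleaned, flags=re.IGNORECASE):
            hits.append(pattern)
            cleaned = re.sub(pattern, "[REDACTED]", cleaned, flags=re.IGNORECASE)
    # Delimit the content so the model can be told to treat it as data, not instructions.
    wrapped = f"<untrusted_content>\n{cleaned}\n</untrusted_content>"
    return wrapped, hits
```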
### Step 3: Sandbox Assessment (Pattern 19)

If the agent executes code, audit the sandbox:
```markdown
## Code Execution Sandbox

### Current State
- [ ] Code runs in an isolated container
- [ ] Network access restricted
- [ ] File system access restricted
- [ ] Resource limits set (CPU, memory, time)
- [ ] No access to production credentials
- [ ] No access to host file system

### Threats
| Threat | Risk | Mitigation |
|---|---|---|
| Secret exfiltration | [Risk] | [Mitigation] |
| Environment deletion | [Risk] | [Mitigation] |
| Resource abuse (crypto mining) | [Risk] | [Mitigation] |
| Accidental resource hogging | [Risk] | [Mitigation] |

### Recommendations
- Runtime: [Docker / E2B / Daytona / other]
  - Note: Docker has 10-20s cold starts; consider agentic runtimes for sub-second startup
- Resource limits: CPU: [X], Memory: [X], Timeout: [X]
- Network policy: [Allow-list specific endpoints / Block all / etc.]
```
If the agent does NOT execute code, note this and skip to Step 4.
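As one concrete reference point, a Docker-based sandbox can enforce the limits above through `docker run` flags. A minimal sketch that builds the command; the specific limit values, base image, and scratch-space size are illustrative assumptions to adjust per agent:

```python
def sandbox_run_command(
    image: str = "python:3.12-slim",  # illustrative base image
    cpus: float = 1.0,
    memory: str = "512m",
    timeout_s: int = 30,
    workdir: str = "/sandbox",
) -> list[str]:
    """Build a `docker run` invocation that isolates untrusted code.

    Network is disabled, the root filesystem is read-only, CPU and memory
    are capped, and wall-clock time is bounded by the wrapping `timeout`.
    """
    return [
        "timeout", str(timeout_s),            # hard wall-clock limit
        "docker", "run", "--rm",
        "--network", "none",                  # no exfiltration over the network
        "--cpus", str(cpus),                  # CPU cap
        "--memory", memory,                   # memory cap
        "--read-only",                        # immutable root filesystem
        "--tmpfs", f"{workdir}:rw,size=64m",  # small writable scratch space
        "--workdir", workdir,
        image,
        "python", "-c", "print('hello from the sandbox')",
    ]
```

On a host with Docker installed this runs via `subprocess.run(sandbox_run_command())`; agentic runtimes such as E2B or Daytona expose equivalent isolation controls through their own APIs.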
### Step 4: Access Control Review (Pattern 20)

Agents need MORE granular access control than humans because they are:

- **Infinitely diligent** — security by obscurity doesn't work
- **Ephemeral** — sessions are short-lived, so credentials need scoping
- **Unpredictable** — LLM behavior is nondeterministic
```markdown
## Access Control Review

### Authentication
- [ ] Agent has its own identity (not using a shared service account)
- [ ] OAuth flow implemented for user-delegated access
- [ ] Credentials are scoped to specific operations
- [ ] Credentials are short-lived / rotated

### Authorization
| Tool/Action | Current Access | Recommended Access | Justification |
|---|---|---|---|
| [Database read] | [Full access] | [Read-only, filtered by user] | [Least privilege] |
| [API call X] | [Admin] | [Scoped to operation] | [Least privilege] |
| [File write] | [Unrestricted] | [Specific directory only] | [Blast radius reduction] |

### Permission Modes
- **Planning mode** — agent has reduced permissions during reasoning
  - Restrict: UPDATE, DELETE, external API calls
  - Allow: SELECT, read-only operations
- **Execution mode** — elevated permissions only for confirmed actions
  - Requires: explicit user approval or an automated policy check

### Just-in-Time Access
- [ ] Credentials granted per-task, not per-session
- [ ] Access scoped to the specific user context
- [ ] Unused permissions revoked after task completion
```
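The planning/execution split can be enforced with a simple gate in front of every tool call. A minimal sketch; the mode names and the read-only SQL allow-list follow the checklist above, while the `ToolCall` shape is an illustrative assumption:

```python
from dataclasses import dataclass

PLANNING_ALLOWED_SQL = {"SELECT", "EXPLAIN"}  # read-only verbs only

@dataclass
class ToolCall:
    tool: str    # e.g. "sql", "http", "file_write"
    detail: str  # e.g. the SQL statement or the request target

def is_allowed(call: ToolCall, mode: str, user_approved: bool = False) -> bool:
    """Gate a tool call by permission mode (Pattern 20), deny by default."""
    if mode == "planning":
        # Planning mode: only read-only SQL; no external calls or writes.
        if call.tool == "sql":
            verb = call.detail.strip().split()[0].upper()
            return verb in PLANNING_ALLOWED_SQL
        return False
    if mode == "execution":
        # Execution mode: elevated actions require explicit approval or a policy check.
        return user_approved
    return False  # unknown mode: deny
```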
### Step 5: Guardrails Design (Pattern 21)

Design input and output guardrails — live, low-latency checks that prevent harm in real time.
```markdown
## Guardrails

### Input Guardrails
Intercept incoming inputs BEFORE they reach the LLM.

| Guard | Description | Action on Trigger |
|---|---|---|
| Prompt Injection | Detect attempts to override system instructions | Block + return default message |
| Jailbreak Detection | Detect attempts to bypass safety constraints | Block + log + alert |
| PII Detection | Detect sensitive personal information in input | Redact or block |
| Off-Topic | Detect requests outside the agent's domain | Redirect to appropriate handler |
| On-Brand | Ensure input aligns with acceptable use | Block inappropriate content |

### Output Guardrails
Screen generated output BEFORE it reaches the user or tools.

| Guard | Description | Action on Trigger |
|---|---|---|
| Data Leakage | Detect private data in output | Redact + log |
| Hallucination Check | Verify factual claims against source data | Flag for review |
| Toxicity | Detect harmful, biased, or inappropriate content | Block + regenerate |
| Format Validation | Ensure output matches expected schema | Retry with format instructions |
| Action Validation | Verify tool calls are within authorized scope | Block unauthorized actions |

### Implementation Notes
- Guardrails must be LOW LATENCY — they run on every request
- Use specialized lightweight models or rule-based systems for speed
- Log all guardrail triggers for monitoring and tuning
- Guardrails complement evals — evals are after-the-fact, guardrails are real-time
```
Use AskUserQuestion to prioritize which guardrails to implement first based on the agent's risk profile.
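A rule-based input/output pair makes a reasonable low-latency first layer before adding model-based checks. A minimal sketch; the regexes (an injection phrase, an email pattern, and a secret-like token pattern) are illustrative assumptions and deliberately crude:

```python
import re

INJECTION_RE = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")              # crude data-leakage pattern
SECRET_RE = re.compile(r"(sk|key|token)[-_][A-Za-z0-9]{16,}")  # crude secret pattern

def input_guard(text: str) -> tuple[bool, str]:
    """Return (allowed, message). Blocks obvious prompt-injection attempts."""
    if INJECTION_RE.search(text):
        return False, "Request blocked by input guardrail."
    return True, text

def output_guard(text: str) -> str:
    """Redact likely data leakage (emails, secret-like tokens) from output."""
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    text = SECRET_RE.sub("[REDACTED SECRET]", text)
    return text
```

Both guards are pure string checks, so they add microseconds per request; log every trigger so the patterns can be tuned against false positives.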
### Step 6: Generate Security Report

Compile all outputs into `.specs/<spec-name>/agent-security.md`:

```markdown
# Agent Security Audit: [System Name]

## Executive Summary
- Overall Risk: [Low / Medium / High / Critical]
- Lethal Trifecta: [SAFE / AT RISK / VULNERABLE]
- Immediate Actions Required: [Count]

## Lethal Trifecta Analysis
[From Step 2]

## Sandbox Assessment
[From Step 3]

## Access Control
[From Step 4]

## Guardrails
[From Step 5]

## Priority Actions
| # | Action | Severity | Effort |
|---|---|---|---|
| 1 | [Action] | Critical | [Low/Med/High] |
| 2 | [Action] | High | [Low/Med/High] |
```
### Step 7: Offer Next Steps

Use AskUserQuestion to offer:

- **Implement top-priority fix** — start with the highest-severity action item
- **Full review** — run `agent:review` to validate against all 22 patterns
- **Re-audit** — run `agent:secure` again after implementing fixes
## Arguments

- `<args>` — optional spec name or path to agent code
  - `<spec-name>` — reads the existing agent design from `.specs/<spec-name>/`
  - `<path>` — analyzes agent code at the given path

Examples:

- `agent:secure customer-support` — audit the customer-support agent
- `agent:secure src/agents/` — audit agent code in the given directory