AI Engineering
Overview
Build effective agentic systems using proven patterns. Start simple, add complexity only when needed.
For specialized prompt design guidance (techniques, patterns, examples for agentic systems), see the prompt-engineering skill.
Core Principle
Find the simplest solution first. Agentic systems trade latency and cost for better task performance. Only increase complexity when simpler solutions fall short.
- Start with optimized single LLM calls (retrieval, in-context examples)
- Add workflows for predictable, multi-step tasks
- Use agents when flexibility and autonomous decision-making are required
When to Build an Agent
Before committing to an agent, validate that your use case truly requires agentic capabilities. Consider alternatives first—deterministic solutions are simpler, faster, and more reliable.
Use agents when workflows involve:
| Criteria | Description | Example |
|---|---|---|
| Complex decision-making | Nuanced judgment, exceptions, context-sensitive decisions | Refund approval with edge cases |
| Brittle rule systems | Rulesets that are unwieldy, costly to maintain, or error-prone | Vendor security reviews |
| Unstructured data | Interpreting natural language, documents, or conversational input | Processing insurance claims |
If your use case doesn't clearly fit these criteria, a deterministic or simple LLM solution may suffice.
Agentic System Taxonomy
Understanding the spectrum of agentic capabilities helps you choose the right level of complexity for your use case.
| Level | Name | Description | Use Case |
|---|---|---|---|
| Level 0 | Core Reasoning System | LLM operates in isolation, responding based on pre-trained knowledge only | Explaining concepts, general knowledge |
| Level 1 | Connected Problem-Solver | LLM connects to external tools to retrieve real-time information and take actions | Answering "What's the score?", querying databases |
| Level 2 | Strategic Problem-Solver | Agent actively curates context, plans multi-step tasks, and engineers focused queries for each step | "Find coffee shops halfway between two locations" |
| Level 3 | Collaborative Multi-Agent System | Multiple specialized agents coordinate under a central manager or through peer handoffs | Product launch with research, marketing, and web dev agents |
| Level 4 | Self-Evolving System | Agents can dynamically create new tools or agents to fill capability gaps | Agent creates sentiment analysis agent when needed |
Progression guidance: Start at Level 0 or 1. Only increase levels when the current level cannot handle your use case effectively.
Prompt Engineering
Effective prompts are critical to agentic system performance. When designing or refining prompts for LLM calls, workflows, or agents, leverage the prompt-engineering skill if available. It provides specialized guidance for crafting prompts that produce reliable, high-quality outputs.
Context Engineering
Context engineering is the practice of dynamically assembling and managing information within an LLM's context window to enable stateful, intelligent agents. It extends prompt engineering: where a prompt supplies static instructions, context engineering assembles the entire payload (instructions, session history, tool results, memory) at run time.
Key principles:
- Curate attention: Prevent context overload by including only relevant information for each step
- Dynamic filtering: Transform previous outputs into focused queries for the next step
- Progressive refinement: Each step should produce a distilled, actionable input for the next
Example: Instead of passing an entire document to summarize, extract key entities first, then retrieve only relevant context about those entities.
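The entity-first approach in the example above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production retriever: `extract_entities` uses a naive capitalized-word heuristic as a stand-in for an LLM extraction call, and `select_context` and the sample paragraphs are hypothetical.

```python
import re

def extract_entities(text: str) -> set[str]:
    """Naive stand-in for an LLM extraction step: capitalized words of 3+ letters."""
    return set(re.findall(r"\b[A-Z][a-z]{2,}\b", text))

def select_context(question: str, paragraphs: list[str], limit: int = 2) -> list[str]:
    """Keep only the paragraphs that mention entities found in the question."""
    entities = extract_entities(question)
    scored = [(sum(e in p for e in entities), p) for p in paragraphs]
    # Highest-overlap paragraphs first; paragraphs with no overlap are dropped.
    relevant = [p for score, p in sorted(scored, key=lambda x: -x[0]) if score > 0]
    return relevant[:limit]

paragraphs = [
    "Acme signed a supply contract with Globex in March.",
    "The cafeteria menu changes every week.",
    "Globex reported a shipping delay affecting Acme orders.",
]
context = select_context("What is the status of Acme and Globex?", paragraphs)
```

Only the two paragraphs that mention the extracted entities reach the next step; the unrelated paragraph never enters the context window.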
For comprehensive guidance on sessions, memory, and context management, see references/context-engineering.md.
Agentic Problem-Solving Process
All autonomous agents operate on a continuous cyclical process. Understanding this loop is fundamental to building effective agents.
The 5-Step Loop:
1. Get the Mission - Receive a high-level goal from user or automated trigger
2. Scan the Scene - Gather context from available resources: instructions, session history, available tools, long-term memory
3. Think It Through - Analyze mission against scene, devise a plan using chain-of-reasoning
4. Take Action - Execute the first concrete step by invoking a tool or generating response
5. Observe and Iterate - Observe the outcome, add to context/memory, loop back to step 3
This "Think, Act, Observe" cycle continues until the mission is complete or an exit condition is reached.
Code example (Think, Act, Observe with tools):
import anthropic
client = anthropic.Anthropic()
# search_tool, read_page_tool, and finish_tool (tool definitions) and the
# search() / read_page() helpers are assumed to be defined elsewhere.
def agent_loop(mission: str, max_iterations: int = 10) -> str:
    """Run the Think-Act-Observe loop until the mission is complete."""
    context = f"Mission: {mission}\nAvailable tools: search, read_page, finish"
    for _ in range(max_iterations):
        # THINK: the LLM analyzes the current state and plans the next action
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,  # required by the Messages API
            messages=[{"role": "user", "content": context}],
            tools=[search_tool, read_page_tool, finish_tool],
        )
        # Extract the model's reasoning and intended action
        for block in response.content:
            if block.type == "text":
                print(f"Thought: {block.text}")
            elif block.type == "tool_use" and block.name == "search":
                # ACT: execute the tool
                result = search(block.input["query"])
                # OBSERVE: add the result to context; the loop continues
                context += f"\nObservation: {result}"
            elif block.type == "tool_use" and block.name == "read_page":
                # Same ACT/OBSERVE step; the "url" input key is an assumption
                context += f"\nObservation: {read_page(block.input['url'])}"
            elif block.type == "tool_use" and block.name == "finish":
                # EXIT: mission complete
                return block.input["summary"]
    return "Max iterations reached"
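The loop above references `search_tool`, `read_page_tool`, and `finish_tool` without defining them. One possible shape is sketched below, assuming the Anthropic Messages API tool format (`name`, `description`, and a JSON Schema under `input_schema`). The `make_tool` helper, the descriptions, and the `url` parameter are illustrative assumptions; `query` and `summary` match the keys the loop reads.

```python
def make_tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build a tool definition whose input is described by a JSON Schema object."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": properties, "required": required},
    }

search_tool = make_tool(
    "search",
    "Search the web and return the top results for a query.",
    {"query": {"type": "string", "description": "Search terms"}},
    ["query"],
)
read_page_tool = make_tool(
    "read_page",
    "Fetch a page and return its text content.",
    {"url": {"type": "string", "description": "Absolute URL of the page"}},  # assumed name
    ["url"],
)
finish_tool = make_tool(
    "finish",
    "Call this when the mission is complete.",
    {"summary": {"type": "string", "description": "Final answer for the user"}},
    ["summary"],
)
```

A dedicated `finish` tool gives the loop an explicit, machine-checkable exit condition instead of relying on the wording of a free-text reply.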
Pattern Selection Guide
| Pattern | Use When | Key Benefit |
|---|---|---|
| Augmented LLM | Single task needing external data/tools | Retrieval, tools, memory |
| Prompt Chaining | Task decomposes into fixed subtasks | Trade latency for accuracy |
| Routing | Distinct categories need separate handling | Separation of concerns |
| Parallelization | Subtasks are independent OR multiple attempts needed | Speed OR confidence |
| Orchestrator-Workers | Subtasks unpredictable, input-dependent | Dynamic task breakdown |
| Evaluator-Optimizer | Clear evaluation criteria, iteration adds value | Iterative refinement |
| Autonomous Agent | Open-ended problems, unpredictable steps | Flexibility at scale |
Decision Framework
Is the task solvable with a single well-crafted prompt?
├─ Yes → Optimize with retrieval/examples → Done
└─ No → Are subtasks fixed and predictable?
    ├─ Yes → Use Workflow (chaining/routing/parallelization)
    └─ No → Are subtasks input-dependent?
        ├─ Yes → Use Orchestrator-Workers
        └─ No → Is the problem open-ended with unpredictable steps?
            ├─ Yes → Use Autonomous Agent
            └─ No → Reconsider approach
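The decision tree above can also be written as a small function, which is handy for design reviews or checklists. This is only a sketch: the function name and boolean parameters are hypothetical, and each flag still requires a human judgment about the task.

```python
def select_pattern(
    single_prompt_ok: bool,
    subtasks_fixed: bool,
    subtasks_input_dependent: bool,
    open_ended: bool,
) -> str:
    """Walk the decision framework top to bottom and return the first match."""
    if single_prompt_ok:
        return "Optimized single LLM call"
    if subtasks_fixed:
        return "Workflow (chaining/routing/parallelization)"
    if subtasks_input_dependent:
        return "Orchestrator-Workers"
    if open_ended:
        return "Autonomous Agent"
    return "Reconsider approach"

choice = select_pattern(
    single_prompt_ok=False,
    subtasks_fixed=False,
    subtasks_input_dependent=True,
    open_ended=False,
)
```

The order of the checks mirrors the tree: simpler options are always tested before more complex ones.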
Workflow Patterns
For detailed workflow implementations with code examples, see references/workflows.md.
When to use workflows: Predictable, multi-step tasks whose subtasks are fixed or input-dependent.
Quick reference:
- Prompt Chaining - Sequential LLM calls, each processing previous output
- Routing - Classify input and direct to specialized handler
- Parallelization - Sectioning (independent subtasks) or Voting (multiple attempts)
- Orchestrator-Workers - Central LLM breaks down tasks, delegates to workers, synthesizes results
- Evaluator-Optimizer - One LLM generates, another evaluates and provides feedback in a loop
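As an illustration of the Routing entry above, the sketch below classifies an input and dispatches it to a specialized handler. The keyword-based `classify` function is a stand-in for an LLM classification call, and the categories and handlers are hypothetical.

```python
from typing import Callable

def classify(message: str) -> str:
    """Stand-in for an LLM classifier: simple keyword matching."""
    text = message.lower()
    if "refund" in text or "charge" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"

# Each handler would normally wrap its own specialized prompt and tools.
HANDLERS: dict[str, Callable[[str], str]] = {
    "billing": lambda m: f"[billing prompt] {m}",
    "technical": lambda m: f"[technical prompt] {m}",
    "general": lambda m: f"[general prompt] {m}",
}

def route(message: str) -> str:
    """Classify the input, then hand it to the matching handler."""
    category = classify(message)
    handler = HANDLERS.get(category, HANDLERS["general"])  # fall back on unknown labels
    return handler(message)

reply = route("I was charged twice, please refund me")
```

Falling back to a general handler keeps the router safe if the classifier ever returns a label that has no handler.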
Code example (Orchestrator-Workers):
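from concurrent.futures import ThreadPoolExecutor
# task, llm(prompt) -> str, and execute(subtask) -> str are assumed to be defined elsewhere
# Orchestrator breaks the task into one subtask per line
plan = llm(f"Break down into one subtask per line: {task}")
subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
# Workers execute in parallel
with ThreadPoolExecutor() as pool:
    results = list(pool.map(execute, subtasks))
# Orchestrator synthesizes
final = llm(f"Synthesize results: {results}")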
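The voting variant of Parallelization from the quick reference can be sketched as follows: the same prompt runs several times in parallel and the majority answer wins. The `vote` helper is hypothetical, and `fake_ask` replays canned answers in place of a real LLM call so the example runs offline.

```python
import queue
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def vote(ask: Callable[[str], str], prompt: str, attempts: int = 5) -> tuple[str, float]:
    """Run the same prompt `attempts` times in parallel; return (majority answer, agreement)."""
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        answers = list(pool.map(ask, [prompt] * attempts))
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / attempts

# Canned answers standing in for five independent LLM calls (thread-safe queue).
canned: "queue.Queue[str]" = queue.Queue()
for a in ["yes", "yes", "no", "yes", "yes"]:
    canned.put(a)

def fake_ask(prompt: str) -> str:
    return canned.get_nowait()

answer, agreement = vote(fake_ask, "Is this code vulnerable?", attempts=5)
```

The agreement ratio is a rough confidence signal: a low value suggests the question should be escalated or rephrased rather than answered.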
Code example (Prompt Chaining - complete):
import anthropic
client = anthropic.Anthropic()
def analyze_document(text: str) -> str:
    """Complete prompt chaining: extract → summarize → recommend."""
    # STEP 1: Extract key entities
    step1 = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        messages=[{
            "role": "user",
            "content": f"Extract all entities (people, orgs, dates) from:\n{text}"
        }]
    )
    entities = step1.content[0].text
    # STEP 2: Summarize using extracted entities
    step2 = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Summarize this document using these entities: {entities}\n\nDocument: {text}"
        }]
    )
    summary = step2.content[0].text
    # STEP 3: Generate recommendations based on summary
    step3 = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on this summary, provide 3 actionable recommendations:\n{summary}"
        }]
    )
    return step3.content[0].text
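The Evaluator-Optimizer pattern from the quick reference follows the same call-by-call style but adds a feedback loop. The sketch below is an assumption-laden outline: `refine`, `fake_generate`, and `fake_evaluate` are hypothetical names, and the two stubs stand in for a generator LLM call and an evaluator LLM call.

```python
from typing import Callable, Optional

def refine(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Generate, evaluate, and regenerate with feedback until accepted or out of rounds."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            break
    return draft

def fake_generate(task: str, feedback: Optional[str]) -> str:
    # Stand-in generator: folds the evaluator's feedback into the next draft.
    return task if feedback is None else f"{task} (revised: {feedback})"

def fake_evaluate(draft: str) -> tuple[bool, str]:
    # Stand-in evaluator: accepts only drafts that have been revised once.
    ok = "revised" in draft
    return ok, "" if ok else "add a citation"

result = refine(fake_generate, fake_evaluate, "Summarize the report")
```

The `max_rounds` cap matters: without it, a generator that never satisfies the evaluator would loop indefinitely and keep accruing cost.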
Error Handling & Guardrails
Guardrails are a layered defense. No single layer is sufficient—combine multiple specialized checks for resilient agents.
Layered Defense Pattern:
Input → Relevance Check → Safety Filter → Agent → Tool Safeguards → Output Validation → Response
        ↓block            ↓block                  ↓risk-rating      ↓block
For a complete implementation with code examples and tests, see references/agent-design.md.
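A minimal sketch of the layered pattern is shown below. Every check is a deliberately crude stand-in (keyword and regex matching instead of classifier or guard models), all names are hypothetical, and the tool-safeguard layer is omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def relevance_check(text: str) -> Verdict:
    # Stand-in for a relevance classifier scoped to a support agent.
    on_topic = any(w in text.lower() for w in ("order", "refund", "shipping"))
    return Verdict(on_topic, "" if on_topic else "off-topic request")

def safety_filter(text: str) -> Verdict:
    # Stand-in for a prompt-injection / safety classifier.
    unsafe = "ignore previous instructions" in text.lower()
    return Verdict(not unsafe, "possible prompt injection" if unsafe else "")

def output_validation(text: str) -> Verdict:
    # Stand-in for a PII filter: block anything that looks like an email address.
    leaked = re.search(r"\S+@\S+", text) is not None
    return Verdict(not leaked, "output contains an email address" if leaked else "")

def guarded_run(user_input: str, agent: Callable[[str], str]) -> str:
    """Run input checks, then the agent, then output validation; stop at the first block."""
    for check in (relevance_check, safety_filter):
        verdict = check(user_input)
        if not verdict.allowed:
            return f"Blocked: {verdict.reason}"
    output = agent(user_input)
    verdict = output_validation(output)
    return output if verdict.allowed else f"Blocked: {verdict.reason}"

reply = guarded_run("Where is my order?", lambda q: "Your order shipped yesterday.")
```

Each layer is independent, so a check can be replaced with a stronger model-based version without touching the others.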
Agent Design
For comprehensive agent design patterns, characteristics, and best practices, see references/agent-design.md.
Core agent characteristics:
- Explicit Role & Responsibility - Clearly defined mandate
- Single-Purpose Focus - Narrow scope, high performance
- Minimal, Purpose-Built Tooling - Only necessary tools
- Deterministic Orchestration - Clear execution structure
- Cooperation & Delegation - Structured interaction
- Self-Constraint & Guardrails - Prevents scope creep
- State Awareness - Session memory for tasks
- Long-Term Memory - Curated, retrievable knowledge
- Observability - Inspectable decisions and outcomes
- Failure Awareness - Graceful recovery
Key topics:
- Autonomous Agents and the Run Loop - The "Think, Act, Observe" cycle with exit conditions
- Guardrails - Layered defense: relevance classifiers, safety filters, PII filters, tool safeguards
- Multi-Agent Patterns - Manager (agents as tools), Decentralized (handoffs), Sequential, Iterative Refinement
- Real-World Examples - Customer support agents, coding agents with test verification
Agent-Computer Interface (ACI)
Tool design matters as much as prompt engineering. For comprehensive tool design patterns, see references/aci.md.
Core principles:
- Give tokens to think - Don't force the model into corners
- Keep formats natural - Match patterns from training data
- Minimize overhead - Avoid line counting, escape sequences
- Publish tasks, not APIs - Tools should encapsulate user-facing actions
Key patterns:
- Tool Types - Information Retrieval, Action/Execution, System/API Integration, Human-in-the-Loop
- Output Design - Return references for large data, descriptive error messages for recovery
- Input Validation - Schema validation for runtime checks and LLM guidance
- Documentation - Clear descriptions, examples, edge cases, parameter constraints
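To make the Output Design and Input Validation points concrete, here is a hedged sketch of a task-level tool that validates its input and returns error messages the model can recover from. The `get_order_status` tool, the `ORD-` identifier format, and the in-memory `store` are all hypothetical.

```python
def get_order_status(order_id: str, store: dict[str, str]) -> dict:
    """Look up an order and return a result the model can act on.

    Errors say what went wrong AND what to try next, so the agent can recover.
    """
    if not isinstance(order_id, str) or not order_id.startswith("ORD-"):
        return {
            "ok": False,
            "error": "order_id must look like 'ORD-12345'. Ask the user for their order number.",
        }
    if order_id not in store:
        return {
            "ok": False,
            "error": f"No order found for {order_id}. Check for typos or ask the user to confirm it.",
        }
    return {"ok": True, "order_id": order_id, "status": store[order_id]}

orders = {"ORD-1001": "shipped"}
found = get_order_status("ORD-1001", orders)
missing = get_order_status("ORD-9999", orders)
malformed = get_order_status("1001", orders)
```

Returning a structured error instead of raising keeps the failure inside the conversation, where the model can read it and choose a corrective step.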
Model Context Protocol (MCP)
MCP is an open standard for connecting AI applications to external tools and data sources. For comprehensive coverage, see references/mcp.md.
What it solves: The "N×M integration problem" - without a standard, every model-tool pairing requires custom connectors.
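The arithmetic behind the integration problem can be sketched in a few lines. The assumption here is the usual framing: without a standard, each model-tool pair needs its own connector, while with a shared protocol each side implements the protocol once. The function name and the sample counts are hypothetical.

```python
def connectors_needed(models: int, tools: int) -> tuple[int, int]:
    """Return (custom point-to-point connectors, adapters with a shared protocol)."""
    return models * tools, models + tools

custom, shared = connectors_needed(models=4, tools=25)
```

With 4 model applications and 25 tools, pairwise integration needs 100 connectors, while a shared protocol needs 29 implementations.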
Core architecture:
- Host - Manages UX, orchestrates tools, enforces security
- Client - Maintains server connections, manages sessions
- Server - Advertises tools, executes commands, handles governance
Key capabilities:
- Tools - Standardized function definitions with JSON Schema
- Resources - Static data access (validate trusted sources only)
- Prompts - Reusable prompt templates (use rarely - security risk)
- Sampling - Server can request LLM completion from client
- Elicitation - Server can request user input via client UI
When to use MCP:
- Multi-environment deployments
- Sharing tools across applications
- Dynamic tool discovery needs
- Ecosystem participation
Security considerations:
- Dynamic Capability Injection, Tool Shadowing, Confused Deputy
- Requires multi-layered defense: Human-in-the-Loop (HIL) → API Gateway → SDK Allowlists → Schema Validation
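The allowlist and schema-validation layers above can be approximated with a small gate in front of every tool call. This is a simplified sketch, not part of any MCP SDK: `ALLOWED_TOOLS`, `validate_tool_call`, and the `search_docs` tool are hypothetical, and a real deployment would validate against full JSON Schema.

```python
# Hypothetical allowlist: tool name -> required parameters and their Python types.
ALLOWED_TOOLS: dict[str, dict[str, type]] = {
    "search_docs": {"query": str},
}

def validate_tool_call(name: str, arguments: dict) -> tuple[bool, str]:
    """Reject tools that are not allowlisted and arguments that do not match the schema."""
    if name not in ALLOWED_TOOLS:
        return False, f"tool '{name}' is not on the allowlist"
    schema = ALLOWED_TOOLS[name]
    unexpected = set(arguments) - set(schema)
    if unexpected:
        return False, f"unexpected arguments: {sorted(unexpected)}"
    for param, expected_type in schema.items():
        if not isinstance(arguments.get(param), expected_type):
            return False, f"'{param}' must be of type {expected_type.__name__}"
    return True, "ok"

accepted = validate_tool_call("search_docs", {"query": "mcp sampling"})
rejected = validate_tool_call("delete_all", {})
```

Because the gate keys on exact tool names, a tool injected at runtime or a lookalike that shadows a trusted tool is rejected unless someone deliberately adds it to the allowlist.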
Implementation Guidance
For practical implementation guidance including model selection, task decomposition, and debugging, see references/implementation.md.
Quick start:
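import anthropic
client = anthropic.Anthropic()
# Single call with retrieval
# query, search_tool, and database_tool are assumed to be defined elsewhere
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": query}],
    tools=[search_tool, database_tool],
)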
Key topics:
- Start Simple - Optimize single calls first, add complexity only when needed
- Framework Considerations - Claude Agent SDK, Agno, CrewAI, LangChain (or direct APIs)
- Model Selection - Prototype with best, optimize cost/latency with smaller models
- Task Decomposition - Break down until each step is automatable or human-gated
- Performance & Scalability - Context window management, dynamic tool loading, state management
- Debugging - Common issues: tool usage, loops, edge cases, compounding errors
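For the loop issue named under Debugging, one simple diagnostic is to count repeated identical tool calls in a trajectory. The sketch below is one possible heuristic, not a prescribed method; `detect_loop` and its `threshold` default are hypothetical.

```python
import json
from collections import Counter

def detect_loop(tool_calls: list[tuple[str, dict]], threshold: int = 3) -> bool:
    """Return True if any identical (tool, arguments) call appears `threshold` or more times."""
    counts = Counter(
        (name, json.dumps(args, sort_keys=True))  # canonical form so key order is ignored
        for name, args in tool_calls
    )
    return any(n >= threshold for n in counts.values())

trajectory = [
    ("search", {"query": "refund policy"}),
    ("search", {"query": "refund policy"}),
    ("read_page", {"url": "https://example.com/policy"}),
    ("search", {"query": "refund policy"}),
]
stuck = detect_loop(trajectory)
```

When the check fires, the run loop can stop early or inject a hint into the context instead of spending the remaining iterations on the same call.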
Operations & Security
For production operations, security, and agent learning patterns, see references/operations.md.
Agent Ops (GenAIOps):
- Evaluation Strategy - Define success metrics first, use LLM-as-a-Judge, metrics-driven development
- Observability - OpenTelemetry traces for full trajectory: prompts, reasoning, tool calls, observations
- Human Feedback Loop - Collect failures, convert to test cases, "close the loop" on error classes
Agent Identity & Security:
- Agent as Principal - Distinct from users and service accounts, requires verifiable identity with least privilege
- Security Layers - Deterministic guardrails (rules) + Reasoning-based defenses (guard models)
- Tool Security Threats - Dynamic Capability Injection, Tool Shadowing, Confused Deputy, Malicious Definitions
Multi-Layered Defense:
Human-in-the-Loop → API Gateway → SDK Allowlists → Schema Validation → Secure Design
Quality & Evaluation
For comprehensive agent quality frameworks, evaluation strategies, and observability practices, see references/quality-evaluation.md.
Four Pillars of Agent Quality:
- Effectiveness - Goal completion, accuracy, instruction following
- Efficiency - Latency, cost per interaction, token usage
- Robustness - Edge case handling, error recovery, consistency
- Safety - Guardrails, content filtering, policy compliance
Evaluation Hierarchy:
- End-to-End (Black Box) - Measure final outputs against golden dataset
- Trajectory (Glass Box) - Inspect intermediate steps, tool calls, reasoning
Evaluators:
- Automated Metrics - Exact match, similarity scores, rule-based checks
- LLM-as-a-Judge - Use powerful model to assess against rubric
- Agent-as-a-Judge - Specialized evaluator agent critiques outputs
- Human-in-the-Loop - Authoritative feedback for edge cases
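The automated metrics and the two evaluation levels above can be illustrated with standard-library helpers. This is a rough sketch: the function names are hypothetical, `difflib.SequenceMatcher` is only one of many possible similarity measures, and the sample outputs are made up.

```python
from difflib import SequenceMatcher

def exact_match(output: str, expected: str) -> bool:
    """End-to-end check: final answer equals the golden answer (case/whitespace-insensitive)."""
    return output.strip().lower() == expected.strip().lower()

def similarity(output: str, expected: str) -> float:
    """Character-level similarity in [0, 1] for answers that need not match exactly."""
    return SequenceMatcher(None, output, expected).ratio()

def trajectory_matches(actual_tools: list[str], expected_tools: list[str]) -> bool:
    """Trajectory check: did the agent call the expected tools in the expected order?"""
    return actual_tools == expected_tools

final_ok = exact_match(" Paris ", "paris")
path_ok = trajectory_matches(["search", "read_page", "finish"], ["search", "finish"])
```

Here the end-to-end check passes while the trajectory check fails, which is the kind of discrepancy glass-box evaluation is meant to surface: a correct answer reached through an unexpected path.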
Resources
- Workflows Reference - Detailed workflow patterns with code examples
- Context Engineering - Sessions, memory, and context management
- Agent Design - Agent characteristics, ACI, guardrails, multi-agent patterns
- Implementation Guide - Practical implementation guidance and debugging
- Operations & Security - Production operations, security, and agent learning
- Quality & Evaluation - Agent quality frameworks, evaluation strategies, observability
- ACI Guide - Agent-Computer Interface deep dive with tool design patterns
- MCP Guide - Model Context Protocol for tool interoperability
- Examples - Real-world implementations and case studies