# Context Optimization Techniques

Context optimization extends usable capacity through strategic compression, masking, caching, and partitioning. Applied well, these techniques can double or triple effective context capacity.
## When to Activate

- Context limits constrain task complexity
- Optimizing for cost reduction (fewer tokens = lower costs)
- Reducing latency for long conversations
- Building production systems at scale
## Core Strategies

### Compaction

Summarize context contents when approaching limits, then reinitialize with the summary.

Priority for compression:

- Tool outputs → replace with summaries
- Old turns → summarize early conversation
- Retrieved docs → summarize if recent versions exist
- Never compress the system prompt

Summary preservation by type:

- Tool outputs: key findings, metrics, conclusions
- Conversations: key decisions, commitments, context shifts
- Documents: key facts and claims
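A minimal compaction sketch, not a definitive implementation: it assumes hypothetical `count_tokens` and `summarize` helpers (the latter backed by a cheap model call) and follows the priority order above.

```python
def compact(messages: list[dict], limit: int, target_ratio: float = 0.5) -> list[dict]:
    """Compress history to roughly target_ratio * limit tokens; never touch the system prompt."""
    system, history = messages[0], messages[1:]

    # 1. Tool outputs first: they usually dominate token usage.
    for msg in history:
        if count_tokens([system] + history) <= limit * target_ratio:
            return [system] + history
        if msg["role"] == "tool":
            msg["content"] = summarize(msg["content"], keep="key findings, metrics, conclusions")

    # 2. Fold the oldest turns into a rolling summary, keeping the latest turns intact.
    folded = []
    while len(history) > 2 and count_tokens([system] + history) > limit * target_ratio:
        folded.append(history.pop(0)["content"])
    if folded:
        digest = summarize("\n".join(folded), keep="key decisions, commitments, context shifts")
        history.insert(0, {"role": "assistant", "content": f"Summary of earlier conversation: {digest}"})

    return [system] + history
```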
### Observation Masking

Tool outputs can account for 80%+ of token usage. Replace verbose outputs with compact references once their purpose is served.

Masking strategy:

| Action | Applies to |
| --- | --- |
| Never mask | Current task observations, most recent turn, active reasoning |
| Consider masking | Observations from 3+ turns ago, verbose outputs with extractable key points |
| Always mask | Repeated outputs, boilerplate, content already summarized |
Example:

```python
def mask_observation(observation: str, max_length: int) -> str:
    # Replace a verbose output with a compact reference once its purpose is served.
    if len(observation) > max_length:
        ref_id = store_observation(observation)  # persist full text for later retrieval
        return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
    return observation
```
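Keeping the full observation stored under `ref_id` preserves the option to re-expand it later if the details turn out to matter, at the cost of an extra retrieval step.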
### KV-Cache Optimization

Reuse cached computations across requests that share an identical prefix.

Cache-friendly ordering:

- System prompt (stable, first)
- Tool definitions (stable)
- Frequently reused elements
- Unique content (last)

Design tips:

- Avoid dynamic content such as timestamps in the stable prefix
- Use consistent formatting
- Keep structure stable across sessions
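A minimal sketch of cache-friendly assembly, assuming the provider caches on exact prefix match; `SYSTEM_PROMPT` and `TOOL_DEFS` are illustrative constants, not real configuration:

```python
# Stable elements live in module-level constants so they are byte-identical on every request.
SYSTEM_PROMPT = "You are a research assistant. ..."  # never varies between requests
TOOL_DEFS = "search(query), read_file(path), ..."    # never varies between requests

def build_prompt(history: list[str], user_turn: str) -> str:
    # Stable prefix first so it hits the KV cache; no timestamps or request IDs up front,
    # because any byte difference in the prefix invalidates the cached computation.
    return "\n\n".join([
        SYSTEM_PROMPT,
        TOOL_DEFS,
        "\n".join(history),  # frequently reused elements
        user_turn,           # unique content goes last
    ])
```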
### Context Partitioning

Split work across sub-agents with isolated contexts. Each operates in a clean context focused on its own subtask.

Aggregation pattern:

- Validate that all partitions completed
- Merge compatible results
- Summarize if the merged result is still too large
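A sketch of the fan-out-and-aggregate flow, assuming hypothetical `run_subagent` and `summarize` helpers:

```python
def partition_and_aggregate(subtasks: list[str], shared_brief: str, limit: int) -> str:
    # Fan out: each sub-agent sees only the shared brief plus its own subtask,
    # never the other partitions' intermediate context.
    results = [run_subagent(task, context=shared_brief) for task in subtasks]

    # Aggregate: validate, merge, and compress only if the merged result is still too large.
    if any(r is None for r in results):
        raise RuntimeError("A partition failed; rerun it before merging.")
    merged = "\n\n".join(results)
    if len(merged) // 4 > limit:  # rough chars-per-token estimate; swap in a real tokenizer
        merged = summarize(merged, keep="per-subtask conclusions")
    return merged
```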
### Budget Management

Design explicit token budgets:

- System prompt: X tokens
- Tool definitions: Y tokens
- Retrieved docs: Z tokens
- Message history: W tokens
- Reserved buffer: 10-20% of the window
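One way to make the budget explicit is a small config object. The numbers below are illustrative placeholders for a hypothetical 128k-token window, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    window: int = 128_000           # model context window (example value)
    system_prompt: int = 2_000      # X
    tool_definitions: int = 3_000   # Y
    retrieved_docs: int = 40_000    # Z
    message_history: int = 60_000   # W

    @property
    def reserved_buffer(self) -> int:
        # Whatever is left unallocated; here ~18% of the window, within the 10-20% target.
        return self.window - (self.system_prompt + self.tool_definitions
                              + self.retrieved_docs + self.message_history)
```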
Trigger optimization when:

- Token utilization > 70%
- Response quality degrades
- Costs increase due to long contexts
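A sketch of the utilization trigger, reusing the hypothetical `ContextBudget` and `count_tokens` helpers from the sketches above:

```python
def should_optimize(messages: list[dict], budget: ContextBudget) -> bool:
    # Fire compaction or masking once usage passes 70% of the window,
    # leaving the reserved buffer free for the next response.
    return count_tokens(messages) / budget.window > 0.70
```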
## Decision Framework

| Dominant component | Apply |
| --- | --- |
| Tool outputs | Observation masking |
| Retrieved documents | Summarization or partitioning |
| Message history | Compaction with summarization |
| Multiple | Combine strategies |
## Performance Targets

- Compaction: 50-70% reduction with <5% quality degradation
- Masking: 60-80% reduction in masked observations
- Cache optimization: 70%+ hit rate for stable workloads
## Guidelines

- Measure before optimizing: know the current state
- Apply compaction before masking when possible
- Design for cache stability with consistent prompts
- Partition before context becomes problematic
- Balance token savings against quality preservation