# Context Optimization Techniques

Extend effective context capacity through compression, masking, caching, and partitioning. Done well, these techniques can extend effective context capacity 2-3x without moving to a larger model.
## Optimization Strategies

| Strategy | Token Reduction | Use Case |
|---|---|---|
| Compaction | 50-70% | Message history dominates |
| Observation Masking | 60-80% | Tool outputs dominate |
| KV-Cache Optimization | 70%+ cache hits | Stable workloads |
| Context Partitioning | Variable | Complex multi-task |
## Compaction

Summarize context when approaching limits:

```python
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```
Priority for compression:

1. Tool outputs → replace with summaries
2. Old turns → summarize early conversation
3. Retrieved docs → summarize if recent versions exist
4. Never compress the system prompt
Summary generation by type:

- Tool outputs: Preserve findings, metrics, conclusions
- Conversational: Preserve decisions, commitments, context shifts
- Documents: Preserve key facts, remove supporting evidence
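The priority order above can be sketched as a single pass over the message list. This is a minimal illustration, not a production implementation: `summarize` here just truncates so the example is self-contained, where a real agent would call a cheap summarization model, and the role names and `keep_recent` window are assumptions.

```python
# Stand-in for a real summarization call; truncation keeps the sketch runnable.
def summarize(text: str, max_chars: int = 80) -> str:
    return text if len(text) <= max_chars else text[:max_chars] + " ...[summarized]"

def compact_context(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Compress messages in priority order; never touch the system prompt."""
    compacted = []
    for i, msg in enumerate(messages):
        recent = i >= len(messages) - keep_recent
        if msg["role"] == "system" or recent:
            compacted.append(msg)  # never compress system prompt or recent turns
        elif msg["role"] == "tool":
            # Highest priority: tool outputs shrink the most
            compacted.append({**msg, "content": summarize(msg["content"])})
        else:
            # Old conversational turns get a looser limit
            compacted.append({**msg, "content": summarize(msg["content"], 120)})
    return compacted
```

Note the window of recent turns is exempt regardless of role, matching the rule that the most recent context is never compressed.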
## Observation Masking

Tool outputs can account for 80%+ of tokens. Replace verbose outputs with references:

```python
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```
Masking rules:

- Never mask: content critical to the current task, the most recent turn, active reasoning
- Consider masking: 3+ turns old, key points extractable, purpose served
- Always mask: repeated outputs, boilerplate, already-summarized content
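A minimal store-and-reference implementation of the snippet above might look like the following. The in-memory dict, the `obs-N` reference scheme, and the first-line `extract_key` heuristic are all illustrative assumptions; a real agent would persist observations and extract domain-specific key points.

```python
# In-memory observation store; a real system would persist this.
_store: dict[str, str] = {}

def extract_key(observation: str) -> str:
    """Placeholder heuristic: first line, capped at 60 chars."""
    return observation.splitlines()[0][:60]

def mask_observation(observation: str, max_length: int = 500) -> str:
    """Replace a verbose observation with a reference plus its key point."""
    if len(observation) <= max_length:
        return observation  # short outputs pass through unmasked
    ref_id = f"obs-{len(_store)}"
    _store[ref_id] = observation  # full text stays retrievable on demand
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"

def expand(ref_id: str) -> str:
    """Recover the full observation when the agent needs it again."""
    return _store[ref_id]
```

Keeping the full text retrievable is what makes masking safe: if the agent later needs details, it can expand the reference rather than re-run the tool.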
## KV-Cache Optimization

Cache key/value tensors across requests that share an identical prefix:

```python
# Cache-friendly ordering: stable content first
context = [
    system_prompt,     # cacheable
    tool_definitions,  # cacheable
    reused_templates,  # reusable
    unique_content,    # unique per request
]
```
Design for cache stability:

- Avoid dynamic content (e.g. timestamps) in the prefix
- Use consistent formatting
- Keep structure stable across sessions
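Prefix caching only fires when the prefix is byte-identical across requests, so serialization must be deterministic. A sketch under that assumption: tool definitions are sorted and dumped with fixed key order, and anything request-specific goes last. The function names here are illustrative, not from any particular serving API.

```python
import hashlib
import json

def build_prompt(system_prompt: str, tools: list[dict], user_input: str) -> str:
    """Assemble a prompt whose cacheable prefix is byte-identical across requests."""
    # Deterministic bytes: sort tools by name, serialize with fixed key order.
    stable_tools = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    prefix = system_prompt + "\n" + stable_tools  # cacheable across requests
    return prefix + "\n" + user_input             # unique suffix per request

def prefix_hash(prompt: str, prefix_len: int) -> str:
    """Hash the prefix to verify two requests would share a cache entry."""
    return hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
```

Even trivial nondeterminism, such as dict iteration order or a timestamp in the system prompt, changes the bytes and forfeits the cache hit.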
## Context Partitioning

Split work across sub-agents with isolated contexts:

```python
# Each sub-agent has a clean, focused context
results = await gather(
    research_agent.search("topic A"),
    research_agent.search("topic B"),
    research_agent.search("topic C"),
)

# The coordinator synthesizes without carrying the full context
synthesized = await coordinator.synthesize(results)
```
## Budget Management

```python
context_budget = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "retrieved_docs": 10000,
    "message_history": 15000,
    "reserved_buffer": 2000,
}
```

Monitor utilization and trigger optimization at 70-80%.
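A monitor over that budget might look like the sketch below. The thresholds mirror the 70%/80% triggers used elsewhere in this document; the function names and return values are illustrative.

```python
def utilization(usage: dict[str, int], budget: dict[str, int]) -> float:
    """Fraction of the total token budget currently in use."""
    return sum(usage.values()) / sum(budget.values())

def optimization_action(usage: dict[str, int], budget: dict[str, int]) -> str:
    """Map utilization to an action, per the 70%/80% thresholds."""
    u = utilization(usage, budget)
    if u > 0.8:
        return "compact"   # apply compaction
    if u > 0.7:
        return "monitor"   # start watching closely
    return "ok"
```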
## When to Optimize

| Signal | Action |
|---|---|
| Utilization >70% | Start monitoring |
| Utilization >80% | Apply compaction |
| Quality degradation | Investigate cause |
| Tool outputs dominate | Observation masking |
| Docs dominate | Summarization/partitioning |
## Performance Targets

- Compaction: 50-70% reduction, <5% quality loss
- Masking: 60-80% reduction in masked observations
- Cache: 70%+ hit rate for stable workloads
## Best Practices

- Measure before optimizing
- Apply compaction before masking
- Design for cache stability
- Partition before context becomes problematic
- Monitor effectiveness over time
- Balance token savings against quality
- Test at production scale
- Implement graceful degradation