# Context Engineering

*Managing context windows for optimal agent performance*
Context engineering is the art of finding the smallest possible set of high-signal tokens that maximizes the likelihood of the desired outcome.

The context window is not unlimited. As tokens accumulate, the model's ability to recall and reason over them weakens. Context engineering means being intentional about what goes into the window.
## The Problem
A typical model exposes a context window of roughly 200K tokens, and that is your entire budget. A naive agent spends it like this:

```text
System prompt:     2K tokens
Conversation:     50K tokens (growing)
Tool results:    100K tokens (accumulated)
Memory:           30K tokens (all memories loaded)
─────────────────────────────
Total:           182K tokens → model struggles to focus
```

After ~100K tokens, most models show degraded recall. The best agents actively manage context rather than simply fill it.
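The arithmetic above is easy to automate. A minimal sketch (the helper names are hypothetical, not part of any SDK) that sums per-component token budgets and reports overall utilization:

```ts
// Hypothetical helper: track how each component spends the context budget.
type ContextBudget = Record<string, number> // component name → token count

function utilization(budget: ContextBudget, windowSize: number): number {
  const used = Object.values(budget).reduce((sum, t) => sum + t, 0)
  return used / windowSize
}

const budget: ContextBudget = {
  systemPrompt: 2_000,
  conversation: 50_000,
  toolResults: 100_000,
  memory: 30_000,
}

console.log(`Context utilization: ${(utilization(budget, 200_000) * 100).toFixed(1)}%`) // 91.0%
```

Tracking this per component makes it obvious which consumer (here, tool results) to optimize first.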
## Strategy 1: Sub-Agent Isolation
Instead of one agent accumulating all tool results, delegate to sub-agents that explore independently and return summaries:
```ts
const researcher = new Agent({
  name: 'researcher',
  model: claude('claude-sonnet-4-6'),
  instructions: `Research the given topic thoroughly.
Return a concise summary of your findings (max 2000 words).
Include key facts, sources, and your assessment.`,
  tools: [webSearch, readUrl],
  maxTurns: 15, // Can do extensive research
})

const mainAgent = new Agent({
  name: 'main',
  model: claude('claude-sonnet-4-6'),
  instructions: 'Help users with their questions. Delegate research to the research tool.',
  tools: [
    researcher.asTool({
      name: 'research',
      description: 'Research a topic and return a summary',
    }),
  ],
})
```

The researcher may burn 50K tokens exploring, but it returns only a ~2K-token summary. The main agent's context stays clean.
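Framework aside, the pattern is "expensive exploration behind a cheap interface." A plain-TypeScript sketch (the `search` callback is a stand-in for real tools) of why the parent's context stays bounded:

```ts
// Sketch: the sub-agent accumulates many tool results internally,
// but only a capped summary ever crosses the boundary to the parent.
async function research(
  topic: string,
  search: (q: string) => Promise<string>,
): Promise<string> {
  let explored = ''
  for (const query of [topic, `${topic} examples`, `${topic} pitfalls`]) {
    explored += await search(query) // grows inside the sub-agent only
  }
  // A real agent would summarize with an LLM; here we just truncate.
  const maxSummaryChars = 200
  return explored.slice(0, maxSummaryChars)
}
```

However large `explored` grows, the parent only ever receives `maxSummaryChars` characters.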
## Strategy 2: Context Compaction
Automatically summarize and compress older messages when context grows large:
```ts
const agent = new Agent({
  name: 'long-conversation',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  context: {
    compaction: {
      enabled: true,
      /** Trigger compaction when context exceeds this fraction of the model's window */
      threshold: 0.7, // 70% full
      /** Strategy for compaction */
      strategy: 'summarize', // or 'truncate' or 'selective'
      /** What to preserve during compaction */
      preserve: ['system', 'last_5_turns', 'tool_definitions'],
    },
  },
})
```

### Compaction Strategies
| Strategy | Description | Best For |
|---|---|---|
| `summarize` | Use an LLM to summarize older messages | Long conversations |
| `truncate` | Drop the oldest messages | Simple chat |
| `selective` | Keep messages matching criteria, summarize the rest | Task-oriented |
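As a sketch of what a `truncate`-style pass does under the hood (the `Message` shape and the ≈4-characters-per-token estimate are assumptions, not the SDK's implementation): drop the oldest non-system messages until the estimated size fits the budget.

```ts
// Hypothetical message shape and a rough token estimate (≈4 chars per token).
interface Message { role: 'system' | 'user' | 'assistant'; content: string }
const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4)

function truncateToBudget(messages: Message[], maxTokens: number): Message[] {
  const system = messages.filter(m => m.role === 'system') // always preserved
  const rest = messages.filter(m => m.role !== 'system')
  let total = [...system, ...rest].reduce((n, m) => n + estimateTokens(m), 0)
  // Drop from the front (oldest first) until we fit the budget.
  while (rest.length > 0 && total > maxTokens) {
    total -= estimateTokens(rest.shift()!)
  }
  return [...system, ...rest]
}
```

A `selective` pass would replace the simple `shift()` loop with a predicate deciding, per message, whether to keep, drop, or summarize.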
### Custom Compaction
```ts
const agent = new Agent({
  name: 'custom-compaction',
  model: claude('claude-sonnet-4-6'),
  context: {
    compaction: {
      enabled: true,
      threshold: 0.7,
      strategy: 'custom',
      compact: async (messages, context) => {
        // Your custom logic
        const systemMessages = messages.filter(m => m.role === 'system')
        const recentMessages = messages.slice(-10)
        const olderMessages = messages.slice(0, -10).filter(m => m.role !== 'system')

        // Summarize older messages
        const summary = await Runner.run(
          new Agent({
            name: 'summarizer',
            model: claude('claude-haiku-4-5'),
            instructions: 'Summarize the key points of this conversation.',
          }),
          { messages: olderMessages },
        )

        return [
          ...systemMessages,
          { role: 'system', content: `Previous conversation summary:\n${summary.output}` },
          ...recentMessages,
        ]
      },
    },
  },
})
```

## Strategy 3: Just-in-Time Retrieval
Don't pre-load everything. Let the agent pull in information as needed:
```ts
// Bad: load all docs into context upfront
const eagerAgent = new Agent({
  instructions: `Here are all our docs:\n${allDocs}`, // 50K tokens wasted
  tools: [],
})

// Good: let the agent search and read on demand
const lazyAgent = new Agent({
  instructions: 'Use the search and read tools to find relevant documentation.',
  tools: [
    Tool.create({
      name: 'search_docs',
      description: 'Search documentation by keyword',
      parameters: z.object({ query: z.string() }),
      execute: async ({ query }) => searchIndex.search(query, { limit: 5 }),
    }),
    Tool.create({
      name: 'read_doc',
      description: 'Read a specific documentation page',
      parameters: z.object({ path: z.string() }),
      execute: async ({ path }) => docs.get(path),
    }),
  ],
})
```

## Strategy 4: Smart Memory Recall
Don't load all memories. Search for relevant ones:
```ts
const agent = new Agent({
  name: 'assistant',
  model: claude('claude-sonnet-4-6'),
  memory: Memory.persistent({
    store: Memory.stores.postgres({ connectionString }),
    namespace: 'user-123',
    auto: {
      recall: true,
      // Only load the top 10 memories, ranked by relevance to the current query
      maxRecall: 10,
    },
  }),
})
```

## Strategy 5: Caching
Cache hit rate is one of the most important metrics for production agents, since it directly drives cost and latency:
```ts
const agent = new Agent({
  name: 'cached',
  model: claude('claude-sonnet-4-6'),
  context: {
    caching: {
      enabled: true,
      /** Cache the system prompt (changes rarely) */
      cacheSystemPrompt: true,
      /** Cache tool definitions (change rarely) */
      cacheToolDefinitions: true,
      /** Cache the first N messages (for multi-turn conversations) */
      cacheMessagePrefix: true,
    },
  },
})
```

With prompt caching, repeated context (system prompt, tool definitions) is charged at a fraction of the normal input token rate.
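To see why this matters for cost, a back-of-the-envelope sketch. The 0.1× cached-read multiplier and the $3-per-million rate are assumptions; check your provider's actual pricing:

```ts
// Estimate input-token cost with and without prompt caching.
// Assumes cached reads cost 10% of the normal input rate (provider-dependent).
function inputCost(
  totalTokens: number,
  cachedTokens: number,
  ratePerToken: number,
  cachedReadMultiplier = 0.1,
): number {
  const uncached = totalTokens - cachedTokens
  return uncached * ratePerToken + cachedTokens * ratePerToken * cachedReadMultiplier
}

const rate = 3 / 1_000_000 // assumed $3 per million input tokens
const withoutCache = inputCost(100_000, 0, rate)       // $0.30
const withCache = inputCost(100_000, 80_000, rate)     // $0.084
console.log(withoutCache.toFixed(4), withCache.toFixed(4))
```

An 80K-token cached prefix on a 100K-token request cuts the input cost by 72% under these assumed rates.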
## Measuring Context Health
```ts
const result = await Runner.run(agent, { messages })

console.log(`Context usage: ${result.usage.contextTokens} / ${result.usage.contextWindow}`)
console.log(`Utilization: ${(result.usage.contextTokens / result.usage.contextWindow * 100).toFixed(1)}%`)
console.log(`Compactions: ${result.usage.compactions}`)
console.log(`Cache hit rate: ${(result.usage.cacheHitRate * 100).toFixed(1)}%`)
```

## Anti-Patterns
### Don't: Stuff everything into instructions
```ts
// Bad
const bloatedAgent = new Agent({
  instructions: `${systemPrompt}\n${allFAQs}\n${allPolicies}\n${allExamples}`, // 30K tokens
})

// Good
const leanAgent = new Agent({
  instructions: systemPrompt, // 500 tokens
  tools: [searchFAQs, searchPolicies, searchExamples], // Agent retrieves as needed
})
```

### Don't: Accumulate tool results indefinitely
```ts
// Bad: every tool result stays in context forever.
// After 20 tool calls, the context is full of stale results.

// Good: use sub-agents for exploration-heavy tasks.
// The sub-agent accumulates results → returns a summary → the parent context stays clean.
```

### Don't: Load all memories
```ts
// Bad
memory: Memory.persistent({ auto: { recall: true, maxRecall: 100 } }) // 100 memories = noise

// Good
memory: Memory.persistent({ auto: { recall: true, maxRecall: 10 } }) // Top 10 = focused
```

## Best Practices
- **Monitor context utilization.** Track how full the context window gets during typical runs. If it regularly exceeds 50%, optimize.
- **Use sub-agents for exploration.** Any task that involves heavy searching, reading, or exploring should be delegated to a sub-agent that returns a summary.
- **Enable caching.** If your system prompt and tools don't change between runs, cache them. This can reduce costs by up to 90% on the cached portion.
- **Compact early.** Don't wait until the context is 95% full; start compacting at around 70% to leave the model room to work.
- **Measure, then optimize.** Don't guess at context usage. Measure it per run, identify the biggest consumers, and optimize those first.
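The first and last practices can be wired into a small guard that runs after each call. A sketch, assuming a `usage` shape like the one in the measuring example (hypothetical, adapt to your SDK's actual result type):

```ts
// Warn when a run's context utilization crosses a configurable threshold.
interface Usage { contextTokens: number; contextWindow: number }

function checkUtilization(
  usage: Usage,
  warnAt = 0.5,
): { utilization: number; warn: boolean } {
  const utilization = usage.contextTokens / usage.contextWindow
  return { utilization, warn: utilization > warnAt }
}

const report = checkUtilization({ contextTokens: 120_000, contextWindow: 200_000 })
if (report.warn) {
  console.warn(`Context at ${(report.utilization * 100).toFixed(1)}%: time to optimize`)
}
```

Logging this on every run turns "measure, then optimize" from advice into an alert you can act on.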