Agent SDK

Context Engineering

Managing context windows for optimal agent performance

Context engineering is the art of finding the smallest possible set of high-signal tokens that maximizes the likelihood of desired outcomes.

The context window is not unlimited. As more tokens accumulate, the model's ability to recall and reason degrades. Context engineering means being intentional about what goes into the window.

The Problem

Context window: 200K tokens
Your budget:    ~200K tokens

Naive approach:
  System prompt:    2K tokens
  Conversation:     50K tokens (growing)
  Tool results:     100K tokens (accumulated)
  Memory:           30K tokens (all memories loaded)
  ─────────────────────────
  Total:            182K tokens → model struggles to focus

After ~100K tokens, most models show degraded recall. The best agents actively manage context, not just fill it.
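The arithmetic above can be expressed as a quick budget check. This is a standalone sketch, not part of the SDK; the token counts are the illustrative figures from the table, and real numbers would come from your tokenizer or usage API:

```typescript
// Illustrative token budget for the naive setup above.
// All counts are hypothetical examples, not measured values.
const CONTEXT_WINDOW = 200_000

const budget = {
  systemPrompt: 2_000,
  conversation: 50_000,
  toolResults: 100_000,
  memory: 30_000,
}

// Sum all budget line items
function totalTokens(b: Record<string, number>): number {
  return Object.values(b).reduce((sum, n) => sum + n, 0)
}

// Fraction of the context window consumed
function utilization(b: Record<string, number>, window: number): number {
  return totalTokens(b) / window
}

console.log(totalTokens(budget))                    // 182000
console.log(utilization(budget, CONTEXT_WINDOW))    // 0.91
```

At 91% utilization there is almost no headroom left for the model's own reasoning and output, which is what the strategies below address.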

Strategy 1: Sub-Agent Isolation

Instead of one agent accumulating all tool results, delegate to sub-agents that explore independently and return summaries:

const researcher = new Agent({
  name: 'researcher',
  model: claude('claude-sonnet-4-6'),
  instructions: `Research the given topic thoroughly.
Return a concise summary of your findings (max 2000 words).
Include key facts, sources, and your assessment.`,
  tools: [webSearch, readUrl],
  maxTurns: 15, // Can do extensive research
})

const mainAgent = new Agent({
  name: 'main',
  model: claude('claude-sonnet-4-6'),
  instructions: 'Help users with their questions. Delegate research to the research tool.',
  tools: [
    researcher.asTool({
      name: 'research',
      description: 'Research a topic and return a summary',
    }),
  ],
})

The researcher may spend 50K tokens exploring, but returns only a summary of a few thousand tokens. The main agent's context stays clean.

Strategy 2: Context Compaction

Automatically summarize and compress older messages when context grows large:

const agent = new Agent({
  name: 'long-conversation',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  context: {
    compaction: {
      enabled: true,
      /** Trigger compaction when context exceeds this percentage of the model's window */
      threshold: 0.7, // 70% full
      /** Strategy for compaction */
      strategy: 'summarize', // or 'truncate' or 'selective'
      /** What to preserve during compaction */
      preserve: ['system', 'last_5_turns', 'tool_definitions'],
    },
  },
})

Compaction Strategies

Strategy     Description                                        Best For
summarize    Use an LLM to summarize older messages             Long conversations
truncate     Drop oldest messages                               Simple chat
selective    Keep messages matching criteria, summarize rest    Task-oriented
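The truncate and selective strategies can be sketched without the SDK at all; the logic is plain list manipulation. In this sketch the `Message` shape and helper names are assumptions for illustration, not the SDK's types:

```typescript
type Message = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string }

// truncate: always keep system messages, drop the oldest of the rest
function truncate(messages: Message[], keepLast: number): Message[] {
  const system = messages.filter(m => m.role === 'system')
  const rest = messages.filter(m => m.role !== 'system')
  return [...system, ...rest.slice(-keepLast)]
}

// selective: keep messages matching a predicate, collect the rest for summarization
function selective(
  messages: Message[],
  keep: (m: Message) => boolean,
): { kept: Message[]; toSummarize: Message[] } {
  const kept = messages.filter(m => m.role === 'system' || keep(m))
  const toSummarize = messages.filter(m => m.role !== 'system' && !keep(m))
  return { kept, toSummarize }
}
```

The summarize strategy is the selective case where everything outside the recent window goes into `toSummarize` and is replaced by an LLM-generated digest, as the custom example below shows.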

Custom Compaction

const agent = new Agent({
  name: 'custom-compaction',
  model: claude('claude-sonnet-4-6'),
  context: {
    compaction: {
      enabled: true,
      threshold: 0.7,
      strategy: 'custom',
      compact: async (messages, context) => {
        // Keep system messages and the last 10 turns; summarize everything
        // older. Filter system messages out of the recent/older slices so
        // they are not duplicated in the result.
        const systemMessages = messages.filter(m => m.role === 'system')
        const recentMessages = messages.slice(-10).filter(m => m.role !== 'system')
        const olderMessages = messages.slice(0, -10).filter(m => m.role !== 'system')

        // Summarize older messages
        const summary = await Runner.run(
          new Agent({
            name: 'summarizer',
            model: claude('claude-haiku-4-5'),
            instructions: 'Summarize the key points of this conversation.',
          }),
          { messages: olderMessages },
        )

        return [
          ...systemMessages,
          { role: 'system', content: `Previous conversation summary:\n${summary.output}` },
          ...recentMessages,
        ]
      },
    },
  },
})

Strategy 3: Just-in-Time Retrieval

Don't pre-load everything. Let the agent pull in information as needed:

// Bad: load all docs into context upfront
const naiveAgent = new Agent({
  instructions: `Here are all our docs:\n${allDocs}`, // 50K tokens wasted
  tools: [],
})

// Good: let the agent search and read on demand
const agent = new Agent({
  instructions: 'Use the search and read tools to find relevant documentation.',
  tools: [
    Tool.create({
      name: 'search_docs',
      description: 'Search documentation by keyword',
      parameters: z.object({ query: z.string() }),
      execute: async ({ query }) => searchIndex.search(query, { limit: 5 }),
    }),
    Tool.create({
      name: 'read_doc',
      description: 'Read a specific documentation page',
      parameters: z.object({ path: z.string() }),
      execute: async ({ path }) => docs.get(path),
    }),
  ],
})

Strategy 4: Smart Memory Recall

Don't load all memories. Search for relevant ones:

const agent = new Agent({
  name: 'assistant',
  model: claude('claude-sonnet-4-6'),
  memory: Memory.persistent({
    store: Memory.stores.postgres({ connectionString }),
    namespace: 'user-123',
    auto: {
      recall: true,
      maxRecall: 10,  // Only load top 10 relevant memories
      // Memories are ranked by relevance to the current query
    },
  }),
})
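Under the hood, relevance-ranked recall is just a top-k search over the memory store. A minimal, SDK-free sketch, assuming memories carry precomputed embeddings (the `MemoryEntry` shape and `recall` helper are illustrative, not the SDK's API):

```typescript
type MemoryEntry = { text: string; embedding: number[] }

// Cosine similarity between two equal-length vectors
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

// Return only the k memories most similar to the query embedding
function recall(memories: MemoryEntry[], query: number[], k: number): MemoryEntry[] {
  return [...memories]
    .sort((m1, m2) => cosine(m2.embedding, query) - cosine(m1.embedding, query))
    .slice(0, k)
}
```

Whatever the ranking function, the point is the same: the context receives the 10 most relevant memories, not all of them.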

Strategy 5: Caching

For production agents, cache hit rate is one of the most important cost and latency metrics:

const agent = new Agent({
  name: 'cached',
  model: claude('claude-sonnet-4-6'),
  context: {
    caching: {
      enabled: true,
      /** Cache the system prompt (changes rarely) */
      cacheSystemPrompt: true,
      /** Cache tool definitions (changes rarely) */
      cacheToolDefinitions: true,
      /** Cache the first N messages (for multi-turn conversations) */
      cacheMessagePrefix: true,
    },
  },
})

With prompt caching, repeated context (system prompt, tool definitions) is charged at a fraction of the normal input token rate.
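The savings are easy to estimate with back-of-the-envelope arithmetic. The rates below are hypothetical placeholders (check your provider's actual pricing); the point is the shape of the calculation:

```typescript
// Estimated input cost with prompt caching.
// Both rates are illustrative placeholders, not real pricing.
const INPUT_RATE = 3.0    // $ per million fresh input tokens (hypothetical)
const CACHED_RATE = 0.3   // $ per million cached input tokens (hypothetical 10x discount)

function inputCost(cachedTokens: number, freshTokens: number): number {
  return (cachedTokens * CACHED_RATE + freshTokens * INPUT_RATE) / 1_000_000
}

// A 50K-token cached prefix (system prompt + tools) plus 5K fresh tokens,
// versus paying for all 55K at the full rate:
console.log(inputCost(50_000, 5_000))   // 0.03
console.log(inputCost(0, 55_000))       // 0.165
```

At these assumed rates the cached run costs under a fifth of the uncached one, and the gap widens the longer the stable prefix is.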

Measuring Context Health

const result = await Runner.run(agent, { messages })

console.log(`Context usage: ${result.usage.contextTokens} / ${result.usage.contextWindow}`)
console.log(`Utilization: ${(result.usage.contextTokens / result.usage.contextWindow * 100).toFixed(1)}%`)
console.log(`Compactions: ${result.usage.compactions}`)
console.log(`Cache hit rate: ${(result.usage.cacheHitRate * 100).toFixed(1)}%`)

Anti-Patterns

Don't: Stuff everything into instructions

// Bad
const bloatedAgent = new Agent({
  instructions: `${systemPrompt}\n${allFAQs}\n${allPolicies}\n${allExamples}`, // 30K tokens
})

// Good
const agent = new Agent({
  instructions: systemPrompt, // 500 tokens
  tools: [searchFAQs, searchPolicies, searchExamples], // Agent retrieves as needed
})

Don't: Accumulate tool results indefinitely

// Bad: every tool result stays in context forever
// After 20 tool calls, context is full of old results

// Good: use sub-agents for exploration-heavy tasks
// Sub-agent accumulates results → returns summary → parent context stays clean

Don't: Load all memories

// Bad
memory: Memory.persistent({ auto: { recall: true, maxRecall: 100 } }) // 100 memories = noise

// Good
memory: Memory.persistent({ auto: { recall: true, maxRecall: 10 } }) // Top 10 = focused

Best Practices

  1. Monitor context utilization — Track how full the context window gets during typical runs. If it regularly exceeds 50%, you need to optimize.

  2. Use sub-agents for exploration — Any task that involves searching, reading, or exploring should be delegated to a sub-agent.

  3. Enable caching — If your system prompt and tools don't change between runs, cache them. This can reduce costs by 90% on the cached portion.

  4. Compact early — Don't wait until the context is 95% full. Start compacting at 70% to leave room for the model to work.

  5. Measure, then optimize — Don't guess at context usage. Measure it per run, identify the biggest consumers, and optimize those first.
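Several of these checks can be rolled into a small post-run health probe. This is a sketch; the `Usage` field names mirror the measuring snippet earlier and are assumptions about the result shape, not a guaranteed API:

```typescript
// Post-run context health check. Field names are assumptions
// mirroring the usage-measurement example, not a guaranteed SDK shape.
type Usage = { contextTokens: number; contextWindow: number; cacheHitRate: number }

function contextHealth(u: Usage): string[] {
  const warnings: string[] = []
  const utilization = u.contextTokens / u.contextWindow
  if (utilization > 0.5) warnings.push('high context utilization: consider sub-agents or JIT retrieval')
  if (utilization > 0.7) warnings.push('compaction threshold reached')
  if (u.cacheHitRate < 0.5) warnings.push('low cache hit rate: check that prompt prefixes are stable')
  return warnings
}
```

Running this after each `Runner.run` and logging the warnings gives you the per-run measurements that practice 5 calls for.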