Agent SDK

Production

Deploying agents to production with reliability and scale

Moving from prototype to production requires attention to reliability, cost, security, and operational readiness. This guide covers the key considerations.

Production Checklist

Before deploying an agent to production, verify:

  • Guardrails configured — Input and output guardrails are active
  • MaxTurns set — Every agent has a maxTurns limit
  • Tool timeouts set — All network-calling tools have timeouts
  • Error handling — onError hooks return graceful fallbacks
  • Observability — Tracing, logging, and metrics are configured
  • Evaluation passing — E2E evaluation pass rate meets threshold
  • Cost budgets — Per-user and per-run token limits are set
  • Secrets management — No API keys in agent context or logs
  • Sandboxing — Code execution tools are sandboxed
  • Human approval — Destructive tools require approval
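The checklist above can double as a startup gate that fails fast before the agent serves traffic. The sketch below is illustrative, not an SDK API: `ProductionConfig` and `validateProductionConfig` are assumed names for a shape your own config might take.

```typescript
// Illustrative pre-deploy gate (not an SDK API): collect every
// checklist item that is missing from the agent's configuration.
interface ProductionConfig {
  guardrails?: { enabled: boolean }
  maxTurns?: number
  limits?: { maxTokensPerRun?: number }
  toolTimeoutMs?: number
}

function validateProductionConfig(cfg: ProductionConfig): string[] {
  const problems: string[] = []
  if (!cfg.guardrails?.enabled) problems.push('guardrails not enabled')
  if (cfg.maxTurns === undefined) problems.push('maxTurns not set')
  if (cfg.limits?.maxTokensPerRun === undefined) problems.push('no per-run token budget')
  if (cfg.toolTimeoutMs === undefined) problems.push('no tool timeout')
  return problems
}
```

Running this at boot and refusing to start on a non-empty result turns the checklist from documentation into an enforced invariant.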

Cost Management

Token Budgets

const agent = new Agent({
  name: 'production-agent',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  limits: {
    maxTokensPerRun: 50_000,     // Hard cap per run
    maxTokensPerDay: 1_000_000,  // Daily cap per agent
    maxCostPerRun: 0.50,         // Dollar cap per run
  },
})

Cost Optimization Strategies

Strategy                     | Impact                                  | Effort
-----------------------------|-----------------------------------------|-------
Enable prompt caching        | 50-90% reduction on cached tokens       | Low
Use cheaper models for routing | 80% reduction on routing calls        | Low
Sub-agent isolation          | 30-50% reduction via focused contexts   | Medium
Context compaction           | 20-40% reduction in long conversations  | Medium
Tool result summarization    | 30% reduction in tool-heavy workflows   | Medium

Estimating Costs

import { CostEstimator } from 'assistme-agent-sdk'

const estimate = CostEstimator.estimate({
  model: claude('claude-sonnet-4-6'),
  avgInputTokens: 2000,
  avgOutputTokens: 500,
  avgToolCalls: 3,
  runsPerDay: 10_000,
  cachingEnabled: true,
  cacheHitRate: 0.7,
})

console.log(`Estimated daily cost: $${estimate.dailyCost.toFixed(2)}`)
console.log(`Estimated monthly cost: $${estimate.monthlyCost.toFixed(2)}`)
console.log(`Cost per run: $${estimate.costPerRun.toFixed(4)}`)
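It is worth being able to sanity-check the estimator's output by hand. The sketch below uses placeholder per-token prices (the `$3/M` input, `$15/M` output, and 10% cache-read figures are assumptions for illustration, not real model pricing):

```typescript
// Back-of-envelope cost arithmetic. Prices are placeholders,
// not actual model pricing.
const INPUT_PER_M = 3.0        // $ per 1M input tokens (assumed)
const OUTPUT_PER_M = 15.0      // $ per 1M output tokens (assumed)
const CACHE_READ_FACTOR = 0.1  // cached reads billed at 10% of input price (assumed)

function costPerRun(inputTokens: number, outputTokens: number, cacheHitRate: number): number {
  const cachedTokens = inputTokens * cacheHitRate
  const freshTokens = inputTokens - cachedTokens
  const inputCost =
    (freshTokens * INPUT_PER_M + cachedTokens * INPUT_PER_M * CACHE_READ_FACTOR) / 1_000_000
  const outputCost = (outputTokens * OUTPUT_PER_M) / 1_000_000
  return inputCost + outputCost
}
```

With the example numbers above (2,000 input tokens, 500 output tokens, 70% cache hit rate), 600 input tokens are billed fresh and 1,400 at the cached rate, so output tokens dominate the per-run cost.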

Reliability

Retry Logic

const result = await Runner.run(agent, {
  messages,
  retry: {
    maxRetries: 3,
    backoff: 'exponential', // 1s, 2s, 4s
    retryOn: ['rate_limit', 'server_error', 'timeout'],
  },
})
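The exponential schedule doubles the delay on each attempt, matching the 1s, 2s, 4s comment above. A one-line sketch of the computation (the 1s base delay is an assumption; real-world retry code usually adds jitter, omitted here for clarity):

```typescript
// Delay before retry attempt N under exponential backoff.
// attempt 0 -> 1000 ms, attempt 1 -> 2000 ms, attempt 2 -> 4000 ms.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt
}
```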

Fallback Models

const agent = new Agent({
  name: 'resilient',
  model: claude('claude-sonnet-4-6'),
  fallbackModels: [
    openai('gpt-4o'),           // If Claude is down
    gemini('gemini-2.5-pro'),   // If both are down
  ],
})

When the primary model fails after retries, the SDK automatically tries fallback models in order.
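Conceptually, this fallback behavior is an ordered loop over model calls. The hand-rolled sketch below shows the shape; it is not the SDK's actual internals:

```typescript
// Conceptual sketch of ordered model fallback (not the SDK internals):
// try each model in order, moving on when a call ultimately fails.
type ModelCall<T> = () => Promise<T>

async function runWithFallbacks<T>(calls: ModelCall<T>[]): Promise<T> {
  let lastError: unknown
  for (const call of calls) {
    try {
      return await call()   // first model that succeeds wins
    } catch (err) {
      lastError = err       // remember the failure, try the next model
    }
  }
  throw lastError           // every model failed
}
```

Note that each entry in the list represents a model call that has already exhausted its own retries; fallback is the layer above retry, not a replacement for it.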

Circuit Breaker

import { CircuitBreaker } from 'assistme-agent-sdk'

const breaker = CircuitBreaker.create({
  failureThreshold: 5,    // Open after 5 failures
  resetTimeout: 60_000,   // Try again after 1 minute
  halfOpenRequests: 2,    // Test with 2 requests before closing
})

const result = await Runner.run(agent, {
  messages,
  circuitBreaker: breaker,
})
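The three breaker states (closed, open, half-open) and the transitions driven by the options above can be sketched as a small state machine. This is an illustration of the pattern, not the SDK's implementation:

```typescript
// Minimal circuit-breaker state machine (illustrative only).
type BreakerState = 'closed' | 'open' | 'half-open'

class SimpleBreaker {
  private state: BreakerState = 'closed'
  private failures = 0
  private openedAt = 0

  constructor(private failureThreshold: number, private resetTimeout: number) {}

  canRequest(now: number): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.resetTimeout) {
      this.state = 'half-open'   // timeout elapsed: allow a trial request
    }
    return this.state !== 'open'
  }

  recordSuccess(): void {
    this.failures = 0
    this.state = 'closed'        // trial (or normal) request succeeded
  }

  recordFailure(now: number): void {
    this.failures++
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open'        // trip the breaker
      this.openedAt = now
    }
  }
}
```

The key property is that an open breaker rejects requests immediately instead of letting them pile up against a failing upstream, which keeps latency bounded during an outage.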

Scaling

Concurrent Runs

import { RunQueue } from 'assistme-agent-sdk'

const queue = RunQueue.create({
  concurrency: 10,         // Max 10 concurrent runs
  maxQueueSize: 1000,      // Max 1000 pending runs
  timeout: 60_000,         // Per-run timeout
  priorityField: 'priority', // Optional priority queue
})

// Submit runs
const handle = await queue.submit(agent, { messages, metadata: { priority: 'high' } })
const result = await handle.result()

Rate Limiting

import { RateLimiter } from 'assistme-agent-sdk'

const limiter = RateLimiter.create({
  perUser: { requests: 20, window: '1m' },
  perAgent: { requests: 100, window: '1m' },
  global: { requests: 1000, window: '1m' },
})

const result = await Runner.run(agent, {
  messages,
  rateLimiter: limiter,
  userId: 'user-123',
})

Deployment Patterns

Serverless

// Vercel / AWS Lambda / Cloudflare Workers
export async function POST(request: Request) {
  const { messages } = await request.json()

  const stream = Runner.stream(agent, { messages })
  return new Response(stream.toSSE(), {
    headers: { 'Content-Type': 'text/event-stream' },
  })
}

Long-Running Service

// Express / Fastify / Hono
import { Hono } from 'hono'

const app = new Hono()

app.post('/chat', async (c) => {
  const { messages } = await c.req.json()
  const stream = Runner.stream(agent, { messages })
  return c.newResponse(stream.toSSE(), {
    headers: { 'Content-Type': 'text/event-stream' },
  })
})

Worker Queue

// BullMQ / SQS / Pub/Sub
import { Worker } from 'bullmq'

const worker = new Worker('agent-tasks', async (job) => {
  const { messages, userId, taskId } = job.data

  const result = await Runner.run(agent, {
    messages,
    metadata: { userId, taskId },
  })

  await saveResult(taskId, result)
}, { concurrency: 10 })

Environment Configuration

import { Config } from 'assistme-agent-sdk'

const config = Config.create({
  // Different settings per environment
  development: {
    model: claude('claude-haiku-4-5'),    // Cheap model for dev
    logging: { level: 'debug' },
    guardrails: { enabled: false },        // Skip guardrails in dev
  },
  staging: {
    model: claude('claude-sonnet-4-6'),
    logging: { level: 'info' },
    guardrails: { enabled: true },
  },
  production: {
    model: claude('claude-sonnet-4-6'),
    logging: { level: 'warn' },
    guardrails: { enabled: true },
    retry: { maxRetries: 3 },
    circuitBreaker: { failureThreshold: 5 },
  },
})

const agent = new Agent({
  name: 'assistant',
  ...config.current(),
})

Monitoring Alerts

Set up alerts for production anomalies:

import { Alerts } from 'assistme-agent-sdk'

Alerts.configure({
  channels: [
    { type: 'slack', webhook: process.env.SLACK_WEBHOOK },
    { type: 'pagerduty', serviceKey: process.env.PD_KEY },
  ],
  rules: [
    { metric: 'agent_run_error_rate', threshold: 0.05, window: '5m', severity: 'critical' },
    { metric: 'agent_run_p99_latency', threshold: 30_000, window: '5m', severity: 'warning' },
    { metric: 'agent_daily_cost', threshold: 100, window: '24h', severity: 'warning' },
    { metric: 'guardrail_block_rate', threshold: 0.20, window: '1h', severity: 'warning' },
  ],
})

Best Practices

  1. Start with a single deployment, scale gradually — Don't over-engineer. A single serverless function handles most early workloads.

  2. Monitor costs from day one — Token costs can surprise you. Set alerts before they become problems.

  3. Use fallback models — Model APIs go down. Having a fallback prevents total outages.

  4. Test with production-like evaluation — Your E2E evals should use the same model and tools as production.

  5. Set hard limits — MaxTurns, token budgets, and rate limits are your safety net against runaway costs.

  6. Log for debugging, metric for alerting — Don't try to alert on log patterns. Use structured metrics for operational alerts.