# Production

Deploying agents to production with reliability and scale
Moving from prototype to production requires attention to reliability, cost, security, and operational readiness. This guide covers the key considerations.
## Production Checklist
Before deploying an agent to production, verify:
- **Guardrails configured** — Input and output guardrails are active
- **MaxTurns set** — Every agent has a `maxTurns` limit
- **Tool timeouts set** — All network-calling tools have timeouts
- **Error handling** — `onError` hooks return graceful fallbacks
- **Observability** — Tracing, logging, and metrics are configured
- **Evaluation passing** — E2E evaluation pass rate meets threshold
- **Cost budgets** — Per-user and per-run token limits are set
- **Secrets management** — No API keys in agent context or logs
- **Sandboxing** — Code execution tools are sandboxed
- **Human approval** — Destructive tools require approval
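Most of these items reduce to configuration. As a rough template combining options that appear later in this guide (field names are taken from the examples below, values are placeholders; adjust to your own policy):

```typescript
import { Agent, claude, openai } from 'assistme-agent-sdk'

const agent = new Agent({
  name: 'production-ready',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  maxTurns: 20,                        // hard stop on runaway loops
  limits: { maxTokensPerRun: 50_000 }, // token budget (see Cost Management)
  fallbackModels: [openai('gpt-4o')],  // resilience (see Reliability)
})
```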
## Cost Management

### Token Budgets
```typescript
import { Agent, claude } from 'assistme-agent-sdk'

const agent = new Agent({
  name: 'production-agent',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  limits: {
    maxTokensPerRun: 50_000,    // Hard cap per run
    maxTokensPerDay: 1_000_000, // Daily cap per agent
    maxCostPerRun: 0.50,        // Dollar cap per run
  },
})
```

### Cost Optimization Strategies
| Strategy | Impact | Effort |
|---|---|---|
| Enable prompt caching | 50-90% reduction on cached tokens | Low |
| Use cheaper models for routing | 80% reduction on routing calls | Low |
| Sub-agent isolation | 30-50% reduction via focused contexts | Medium |
| Context compaction | 20-40% reduction in long conversations | Medium |
| Tool result summarization | 30% reduction in tool-heavy workflows | Medium |
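To see how caching and volume interact, here is a back-of-envelope cost model in plain TypeScript. The per-million-token prices are illustrative assumptions, not actual rates:

```typescript
// Back-of-envelope cost model: fresh input tokens at full price, cached input
// tokens at a deep discount, output tokens at full price. The prices below are
// illustrative assumptions (dollars per million tokens), not actual rates.
const PRICE_PER_MTOK = { input: 3.0, cachedInput: 0.3, output: 15.0 }

function costPerRun(inputTokens: number, outputTokens: number, cacheHitRate: number): number {
  const cachedTokens = inputTokens * cacheHitRate
  const freshTokens = inputTokens - cachedTokens
  return (
    (freshTokens / 1e6) * PRICE_PER_MTOK.input +
    (cachedTokens / 1e6) * PRICE_PER_MTOK.cachedInput +
    (outputTokens / 1e6) * PRICE_PER_MTOK.output
  )
}

// 2,000 input + 500 output tokens per run, 70% cache hit rate, 10,000 runs/day
const perRun = costPerRun(2_000, 500, 0.7)
const daily = perRun * 10_000
```

Under these assumptions the traffic costs about $97/day; with caching disabled (`cacheHitRate` = 0) the same traffic costs $135/day. The point is less the exact numbers than that input-token caching dominates the savings at scale.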
### Estimating Costs
```typescript
import { claude, CostEstimator } from 'assistme-agent-sdk'

const estimate = CostEstimator.estimate({
  model: claude('claude-sonnet-4-6'),
  avgInputTokens: 2000,
  avgOutputTokens: 500,
  avgToolCalls: 3,
  runsPerDay: 10_000,
  cachingEnabled: true,
  cacheHitRate: 0.7,
})

console.log(`Estimated daily cost: $${estimate.dailyCost.toFixed(2)}`)
console.log(`Estimated monthly cost: $${estimate.monthlyCost.toFixed(2)}`)
console.log(`Cost per run: $${estimate.costPerRun.toFixed(4)}`)
```

## Reliability
### Retry Logic
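Before wiring up the retry options, it helps to see the delay schedule that exponential backoff implies. A plain-TypeScript sketch (the SDK computes this internally):

```typescript
// Delay schedule for exponential backoff: baseMs, 2x baseMs, 4x baseMs, ...
// One entry per retry attempt.
function backoffDelays(maxRetries: number, baseMs = 1_000): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt)
}

const delays = backoffDelays(3) // [1000, 2000, 4000]: the 1s, 2s, 4s in the retry example
```

Real-world schedules usually add jitter so that many clients do not retry in lockstep.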
```typescript
const result = await Runner.run(agent, {
  messages,
  retry: {
    maxRetries: 3,
    backoff: 'exponential', // 1s, 2s, 4s
    retryOn: ['rate_limit', 'server_error', 'timeout'],
  },
})
```

### Fallback Models
```typescript
import { Agent, claude, openai, gemini } from 'assistme-agent-sdk'

const agent = new Agent({
  name: 'resilient',
  model: claude('claude-sonnet-4-6'),
  fallbackModels: [
    openai('gpt-4o'),         // If Claude is down
    gemini('gemini-2.5-pro'), // If both are down
  ],
})
```

When the primary model fails after retries, the SDK automatically tries the fallback models in order.
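The fallback pattern itself is simple to reason about. A synchronous sketch of the behavior, not the SDK's implementation (which is asynchronous and retry-aware):

```typescript
// Generic fallback chain: try each provider in order, surfacing the last
// error only if every provider fails.
function withFallbacks<T>(providers: Array<() => T>): T {
  let lastError: unknown
  for (const attempt of providers) {
    try {
      return attempt()
    } catch (err) {
      lastError = err // remember the failure, move on to the next provider
    }
  }
  throw lastError
}

// The first provider throws, so the second one answers.
const answer = withFallbacks([
  () => { throw new Error('primary model down') },
  () => 'response from fallback model',
])
```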
### Circuit Breaker
```typescript
import { CircuitBreaker } from 'assistme-agent-sdk'

const breaker = CircuitBreaker.create({
  failureThreshold: 5,  // Open after 5 failures
  resetTimeout: 60_000, // Try again after 1 minute
  halfOpenRequests: 2,  // Test with 2 requests before closing
})

const result = await Runner.run(agent, {
  messages,
  circuitBreaker: breaker,
})
```

## Scaling
### Concurrent Runs
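Bounding concurrency protects both your infrastructure and provider rate limits. At its core the mechanism is a counting gate; a minimal sketch (a hypothetical helper, not the SDK's internals):

```typescript
// At most `limit` tasks run at once; extras wait in FIFO order.
class ConcurrencyGate {
  private active = 0
  private waiting: Array<() => void> = []
  constructor(private limit: number) {}

  // Returns true if the task may start now; otherwise queues the callback
  // to be invoked when a slot frees up.
  tryStart(onReady: () => void): boolean {
    if (this.active < this.limit) {
      this.active++
      return true
    }
    this.waiting.push(onReady)
    return false
  }

  // Called when a running task completes; hands its slot to the next waiter.
  finish(): void {
    this.active--
    const next = this.waiting.shift()
    if (next) {
      this.active++
      next()
    }
  }

  get running(): number { return this.active }
}
```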
```typescript
import { RunQueue } from 'assistme-agent-sdk'

const queue = RunQueue.create({
  concurrency: 10,           // Max 10 concurrent runs
  maxQueueSize: 1000,        // Max 1000 pending runs
  timeout: 60_000,           // Per-run timeout
  priorityField: 'priority', // Optional priority queue
})

// Submit runs
const handle = await queue.submit(agent, { messages, metadata: { priority: 'high' } })
const result = await handle.result()
```

### Rate Limiting
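Per-window request caps like these are commonly implemented as window counters. A fixed-window sketch with an injected clock (a simplification; production limiters often use sliding windows or token buckets to avoid bursts at window edges):

```typescript
// Fixed-window rate limiter: allow at most `limit` requests per window.
class FixedWindowLimiter {
  private windowStart = 0
  private count = 0
  constructor(private limit: number, private windowMs: number) {}

  allow(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs // new window: reset the counter
      this.count = 0
    }
    return ++this.count <= this.limit
  }
}
```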
```typescript
import { RateLimiter } from 'assistme-agent-sdk'

const limiter = RateLimiter.create({
  perUser: { requests: 20, window: '1m' },
  perAgent: { requests: 100, window: '1m' },
  global: { requests: 1000, window: '1m' },
})

const result = await Runner.run(agent, {
  messages,
  rateLimiter: limiter,
  userId: 'user-123',
})
```

## Deployment Patterns
### Serverless
```typescript
// Vercel / AWS Lambda / Cloudflare Workers
export async function POST(request: Request) {
  const { messages } = await request.json()
  const stream = Runner.stream(agent, { messages })
  return new Response(stream.toSSE(), {
    headers: { 'Content-Type': 'text/event-stream' },
  })
}
```

### Long-Running Service
```typescript
// Express / Fastify / Hono
import { Hono } from 'hono'

const app = new Hono()

app.post('/chat', async (c) => {
  const { messages } = await c.req.json()
  const stream = Runner.stream(agent, { messages })
  return c.newResponse(stream.toSSE(), {
    headers: { 'Content-Type': 'text/event-stream' },
  })
})
```

### Worker Queue
```typescript
// BullMQ / SQS / Pub/Sub
import { Worker } from 'bullmq'

const worker = new Worker('agent-tasks', async (job) => {
  const { messages, userId, taskId } = job.data
  const result = await Runner.run(agent, {
    messages,
    metadata: { userId, taskId },
  })
  await saveResult(taskId, result)
}, { concurrency: 10 })
```

## Environment Configuration
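The per-environment blocks below are selected at runtime, typically keyed off `NODE_ENV`. A minimal version of that selection logic in plain TypeScript (an assumption about how `config.current()` resolves; check the SDK for the actual lookup):

```typescript
// Pick the config block matching the environment name, defaulting to development.
type EnvName = 'development' | 'staging' | 'production'

function currentConfig<T>(blocks: Record<EnvName, T>, env = process.env.NODE_ENV): T {
  const name: EnvName =
    env === 'production' || env === 'staging' ? env : 'development'
  return blocks[name]
}
```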
```typescript
import { Agent, claude, Config } from 'assistme-agent-sdk'

const config = Config.create({
  // Different settings per environment
  development: {
    model: claude('claude-haiku-4-5'), // Cheap model for dev
    logging: { level: 'debug' },
    guardrails: { enabled: false },    // Skip guardrails in dev
  },
  staging: {
    model: claude('claude-sonnet-4-6'),
    logging: { level: 'info' },
    guardrails: { enabled: true },
  },
  production: {
    model: claude('claude-sonnet-4-6'),
    logging: { level: 'warn' },
    guardrails: { enabled: true },
    retry: { maxRetries: 3 },
    circuitBreaker: { failureThreshold: 5 },
  },
})

const agent = new Agent({
  name: 'assistant',
  ...config.current(),
})
```

## Monitoring Alerts
Set up alerts for production anomalies:
```typescript
import { Alerts } from 'assistme-agent-sdk'

Alerts.configure({
  channels: [
    { type: 'slack', webhook: process.env.SLACK_WEBHOOK },
    { type: 'pagerduty', serviceKey: process.env.PD_KEY },
  ],
  rules: [
    { metric: 'agent_run_error_rate', threshold: 0.05, window: '5m', severity: 'critical' },
    { metric: 'agent_run_p99_latency', threshold: 30_000, window: '5m', severity: 'warning' },
    { metric: 'agent_daily_cost', threshold: 100, window: '24h', severity: 'warning' },
    { metric: 'guardrail_block_rate', threshold: 0.20, window: '1h', severity: 'warning' },
  ],
})
```

## Best Practices
- **Start with a single deployment, scale gradually** — Don't over-engineer. A single serverless function handles most early workloads.
- **Monitor costs from day one** — Token costs can surprise you. Set alerts before they become problems.
- **Use fallback models** — Model APIs go down. Having a fallback prevents total outages.
- **Test with production-like evaluation** — Your E2E evals should use the same model and tools as production.
- **Set hard limits** — `maxTurns`, token budgets, and rate limits are your safety net against runaway costs.
- **Log for debugging, metric for alerting** — Don't try to alert on log patterns. Use structured metrics for operational alerts.