# Production

Deploying agents to production with reliability and scale
Moving from prototype to production requires attention to reliability, cost, security, and operational readiness. This guide covers the key considerations.
## Production Checklist
Before deploying an agent to production, verify:
- **Guardrails configured** — Input and output guardrails are active
- **MaxTurns set** — Every agent has a `maxTurns` limit
- **Tool timeouts set** — All network-calling tools have timeouts
- **Error handling** — `onError` hooks return graceful fallbacks
- **Observability** — Tracing, logging, and metrics are configured
- **Evaluation passing** — E2E evaluation pass rate meets threshold
- **Cost budgets** — Per-user and per-run token limits are set
- **Secrets management** — No API keys in agent context or logs
- **Sandboxing** — Code execution tools are sandboxed
- **Human approval** — Destructive tools require approval
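Most of these items reduce to configuration. As a rough template combining options that appear later in this guide (field names are taken from the examples below, values are placeholders; adjust to your own policy):

```typescript
import { Agent, claude, openai } from 'assistme-agent-sdk'

const agent = new Agent({
  name: 'production-ready',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  maxTurns: 20,                        // hard stop on runaway loops
  limits: { maxTokensPerRun: 50_000 }, // token budget (see Cost Management)
  fallbackModels: [openai('gpt-4o')],  // resilience (see Reliability)
})
```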
## Cost Management

### Token Budgets
```typescript
import { Agent, claude } from 'assistme-agent-sdk'

const agent = new Agent({
  name: 'production-agent',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  limits: {
    maxTokensPerRun: 50_000,    // Hard cap per run
    maxTokensPerDay: 1_000_000, // Daily cap per agent
    maxCostPerRun: 0.50,        // Dollar cap per run
  },
})
```

### Cost Optimization Strategies
| Strategy | Impact | Effort |
|---|---|---|
| Enable prompt caching | 50-90% reduction on cached tokens | Low |
| Use cheaper models for routing | 80% reduction on routing calls | Low |
| Sub-agent isolation | 30-50% reduction via focused contexts | Medium |
| Context compaction | 20-40% reduction in long conversations | Medium |
| Tool result summarization | 30% reduction in tool-heavy workflows | Medium |
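To see how caching and volume interact, here is a back-of-envelope cost model in plain TypeScript. The per-million-token prices are illustrative assumptions, not actual rates:

```typescript
// Back-of-envelope cost model: fresh input tokens at full price, cached input
// tokens at a deep discount, output tokens at full price. The prices below are
// illustrative assumptions (dollars per million tokens), not actual rates.
const PRICE_PER_MTOK = { input: 3.0, cachedInput: 0.3, output: 15.0 }

function costPerRun(inputTokens: number, outputTokens: number, cacheHitRate: number): number {
  const cachedTokens = inputTokens * cacheHitRate
  const freshTokens = inputTokens - cachedTokens
  return (
    (freshTokens / 1e6) * PRICE_PER_MTOK.input +
    (cachedTokens / 1e6) * PRICE_PER_MTOK.cachedInput +
    (outputTokens / 1e6) * PRICE_PER_MTOK.output
  )
}

// 2,000 input + 500 output tokens per run, 70% cache hit rate, 10,000 runs/day
const perRun = costPerRun(2_000, 500, 0.7)
const daily = perRun * 10_000
```

Under these assumptions the traffic costs about $97/day; with caching disabled (`cacheHitRate` = 0) the same traffic costs $135/day. The point is less the exact numbers than that input-token caching dominates the savings at scale.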
### Estimating Costs
```typescript
import { claude, CostEstimator } from 'assistme-agent-sdk'

const estimate = CostEstimator.estimate({
  model: claude('claude-sonnet-4-6'),
  avgInputTokens: 2000,
  avgOutputTokens: 500,
  avgToolCalls: 3,
  runsPerDay: 10_000,
  cachingEnabled: true,
  cacheHitRate: 0.7,
})

console.log(`Estimated daily cost: $${estimate.dailyCost.toFixed(2)}`)
console.log(`Estimated monthly cost: $${estimate.monthlyCost.toFixed(2)}`)
console.log(`Cost per run: $${estimate.costPerRun.toFixed(4)}`)
```

## Reliability
### Retry Logic
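Before wiring up the retry options, it helps to see the delay schedule that exponential backoff implies. A plain-TypeScript sketch (the SDK computes this internally):

```typescript
// Delay schedule for exponential backoff: baseMs, 2x baseMs, 4x baseMs, ...
// One entry per retry attempt.
function backoffDelays(maxRetries: number, baseMs = 1_000): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt)
}

const delays = backoffDelays(3) // [1000, 2000, 4000]: the 1s, 2s, 4s in the retry example
```

Real-world schedules usually add jitter so that many clients do not retry in lockstep.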
```typescript
const result = await Runner.run(agent, {
  messages,
  retry: {
    maxRetries: 3,
    backoff: 'exponential', // 1s, 2s, 4s
    retryOn: ['rate_limit', 'server_error', 'timeout'],
  },
})
```

### Fallback Models
```typescript
import { Agent, claude, openai, gemini } from 'assistme-agent-sdk'

const agent = new Agent({
  name: 'resilient',
  model: claude('claude-sonnet-4-6'),
  fallbackModels: [
    openai('gpt-4o'),         // If Claude is down
    gemini('gemini-2.5-pro'), // If both are down
  ],
})
```

When the primary model fails after retries, the SDK automatically tries the fallback models in order.
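The fallback pattern itself is simple to reason about. A synchronous sketch of the behavior, not the SDK's implementation (which is asynchronous and retry-aware):

```typescript
// Generic fallback chain: try each provider in order, surfacing the last
// error only if every provider fails.
function withFallbacks<T>(providers: Array<() => T>): T {
  let lastError: unknown
  for (const attempt of providers) {
    try {
      return attempt()
    } catch (err) {
      lastError = err // remember the failure, move on to the next provider
    }
  }
  throw lastError
}

// The first provider throws, so the second one answers.
const answer = withFallbacks([
  () => { throw new Error('primary model down') },
  () => 'response from fallback model',
])
```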
### Circuit Breaker
```typescript
import { CircuitBreaker } from 'assistme-agent-sdk'

const breaker = CircuitBreaker.create({
  failureThreshold: 5,  // Open after 5 failures
  resetTimeout: 60_000, // Try again after 1 minute
  halfOpenRequests: 2,  // Test with 2 requests before closing
})

const result = await Runner.run(agent, {
  messages,
  circuitBreaker: breaker,
})
```

## Scaling
### Concurrent Runs
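Bounding concurrency protects both your infrastructure and provider rate limits. At its core the mechanism is a counting gate; a minimal sketch (a hypothetical helper, not the SDK's internals):

```typescript
// At most `limit` tasks run at once; extras wait in FIFO order.
class ConcurrencyGate {
  private active = 0
  private waiting: Array<() => void> = []
  constructor(private limit: number) {}

  // Returns true if the task may start now; otherwise queues the callback
  // to be invoked when a slot frees up.
  tryStart(onReady: () => void): boolean {
    if (this.active < this.limit) {
      this.active++
      return true
    }
    this.waiting.push(onReady)
    return false
  }

  // Called when a running task completes; hands its slot to the next waiter.
  finish(): void {
    this.active--
    const next = this.waiting.shift()
    if (next) {
      this.active++
      next()
    }
  }

  get running(): number { return this.active }
}
```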
```typescript
import { RunQueue } from 'assistme-agent-sdk'

const queue = RunQueue.create({
  concurrency: 10,           // Max 10 concurrent runs
  maxQueueSize: 1000,        // Max 1000 pending runs
  timeout: 60_000,           // Per-run timeout
  priorityField: 'priority', // Optional priority queue
})

// Submit runs
const handle = await queue.submit(agent, { messages, metadata: { priority: 'high' } })
const result = await handle.result()
```

### Rate Limiting
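Per-window request caps like these are commonly implemented as window counters. A fixed-window sketch with an injected clock (a simplification; production limiters often use sliding windows or token buckets to avoid bursts at window edges):

```typescript
// Fixed-window rate limiter: allow at most `limit` requests per window.
class FixedWindowLimiter {
  private windowStart = 0
  private count = 0
  constructor(private limit: number, private windowMs: number) {}

  allow(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs // new window: reset the counter
      this.count = 0
    }
    return ++this.count <= this.limit
  }
}
```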
```typescript
import { RateLimiter } from 'assistme-agent-sdk'

const limiter = RateLimiter.create({
  perUser: { requests: 20, window: '1m' },
  perAgent: { requests: 100, window: '1m' },
  global: { requests: 1000, window: '1m' },
})

const result = await Runner.run(agent, {
  messages,
  rateLimiter: limiter,
  userId: 'user-123',
})
```

## Deployment Patterns
### Serverless
```typescript
// Vercel / AWS Lambda / Cloudflare Workers
export async function POST(request: Request) {
  const { messages } = await request.json()
  const stream = Runner.stream(agent, { messages })
  return new Response(stream.toSSE(), {
    headers: { 'Content-Type': 'text/event-stream' },
  })
}
```

### Long-Running Service
```typescript
// Express / Fastify / Hono
import { Hono } from 'hono'

const app = new Hono()

app.post('/chat', async (c) => {
  const { messages } = await c.req.json()
  const stream = Runner.stream(agent, { messages })
  return c.newResponse(stream.toSSE(), {
    headers: { 'Content-Type': 'text/event-stream' },
  })
})
```

### Worker Queue
```typescript
// BullMQ / SQS / Pub/Sub
import { Worker } from 'bullmq'

const worker = new Worker('agent-tasks', async (job) => {
  const { messages, userId, taskId } = job.data
  const result = await Runner.run(agent, {
    messages,
    metadata: { userId, taskId },
  })
  await saveResult(taskId, result)
}, { concurrency: 10 })
```

## Environment Configuration
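The per-environment blocks below are selected at runtime, typically keyed off `NODE_ENV`. A minimal version of that selection logic in plain TypeScript (an assumption about how `config.current()` resolves; check the SDK for the actual lookup):

```typescript
// Pick the config block matching the environment name, defaulting to development.
type EnvName = 'development' | 'staging' | 'production'

function currentConfig<T>(blocks: Record<EnvName, T>, env = process.env.NODE_ENV): T {
  const name: EnvName =
    env === 'production' || env === 'staging' ? env : 'development'
  return blocks[name]
}
```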
```typescript
import { Agent, claude, Config } from 'assistme-agent-sdk'

const config = Config.create({
  // Different settings per environment
  development: {
    model: claude('claude-haiku-4-5'), // Cheap model for dev
    logging: { level: 'debug' },
    guardrails: { enabled: false },    // Skip guardrails in dev
  },
  staging: {
    model: claude('claude-sonnet-4-6'),
    logging: { level: 'info' },
    guardrails: { enabled: true },
  },
  production: {
    model: claude('claude-sonnet-4-6'),
    logging: { level: 'warn' },
    guardrails: { enabled: true },
    retry: { maxRetries: 3 },
    circuitBreaker: { failureThreshold: 5 },
  },
})

const agent = new Agent({
  name: 'assistant',
  ...config.current(),
})
```

## Monitoring Alerts
Set up alerts for production anomalies:
```typescript
import { Alerts } from 'assistme-agent-sdk'

Alerts.configure({
  channels: [
    { type: 'slack', webhook: process.env.SLACK_WEBHOOK },
    { type: 'pagerduty', serviceKey: process.env.PD_KEY },
  ],
  rules: [
    { metric: 'agent_run_error_rate', threshold: 0.05, window: '5m', severity: 'critical' },
    { metric: 'agent_run_p99_latency', threshold: 30_000, window: '5m', severity: 'warning' },
    { metric: 'agent_daily_cost', threshold: 100, window: '24h', severity: 'warning' },
    { metric: 'guardrail_block_rate', threshold: 0.20, window: '1h', severity: 'warning' },
  ],
})
```

## Best Practices
- **Start with a single deployment, scale gradually** — Don't over-engineer. A single serverless function handles most early workloads.
- **Monitor costs from day one** — Token costs can surprise you. Set alerts before they become problems.
- **Use fallback models** — Model APIs go down. Having a fallback prevents total outages.
- **Test with production-like evaluation** — Your E2E evals should use the same model and tools as production.
- **Set hard limits** — `maxTurns`, token budgets, and rate limits are your safety net against runaway costs.
- **Log for debugging, metric for alerting** — Don't try to alert on log patterns. Use structured metrics for operational alerts.