Middleware
Composable wrappers that extend model behavior without modifying core logic
Middleware wraps the model provider to add cross-cutting behavior — logging, caching, rate limiting, reasoning extraction — without modifying the agent or provider code. Middleware is composable: stack multiple middlewares, and they execute in order.
Basic Middleware
import { Agent, wrapModel } from 'assistme-agent-sdk'
import { claude } from 'assistme-agent-sdk-provider-claude'
const loggedModel = wrapModel(claude('claude-sonnet-4-6'), {
name: 'logging',
transformRequest: async (request, next) => {
console.log(`[${new Date().toISOString()}] Model request: ${request.messages.length} messages`)
const response = await next(request)
console.log(`[${new Date().toISOString()}] Model response: ${response.usage.inputTokens + response.usage.outputTokens} tokens`)
return response
},
})
const agent = new Agent({
name: 'assistant',
model: loggedModel, // Uses the wrapped model
instructions: 'You are a helpful assistant.',
})
Composing Middleware
Stack multiple middlewares — they execute in order (outermost first):
import { loggingMiddleware, cachingMiddleware, rateLimitMiddleware, retryMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(
  claude('claude-sonnet-4-6'),
  loggingMiddleware(),
  cachingMiddleware({ ttl: '5m' }),
  rateLimitMiddleware({ maxRequestsPerMinute: 60 }),
  retryMiddleware({ maxRetries: 3 }),
)
// Request flow:
// logging → caching → rate limiting → retry → actual model call
// Response flow:
// actual response → retry → rate limiting → caching → logging
Built-in Middleware
Caching
Cache identical requests to avoid redundant model calls:
import { cachingMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(claude('claude-sonnet-4-6'), cachingMiddleware({
store: 'memory', // or 'redis', 'sqlite'
ttl: '10m', // Cache for 10 minutes
keyFn: (request) => {
// Custom cache key (default: hash of messages + tools)
return hash(request.messages)
},
}))
Rate Limiting
Prevent exceeding provider rate limits:
import { rateLimitMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(claude('claude-sonnet-4-6'), rateLimitMiddleware({
maxRequestsPerMinute: 60,
maxTokensPerMinute: 100_000,
strategy: 'sliding_window', // or 'fixed_window', 'token_bucket'
onLimited: (waitMs) => {
console.log(`Rate limited, waiting ${waitMs}ms`)
},
}))
Retry
Automatic retry with exponential backoff:
import { retryMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(claude('claude-sonnet-4-6'), retryMiddleware({
maxRetries: 3,
backoff: 'exponential', // 1s, 2s, 4s
retryOn: ['rate_limit', 'server_error', 'timeout'],
}))
Reasoning Extraction
Extract and expose chain-of-thought reasoning from models that support it:
import { reasoningMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(claude('claude-sonnet-4-6'), reasoningMiddleware({
// Extract <thinking> blocks from the response
tagName: 'thinking',
// Make reasoning available in the response metadata
exposeAs: 'reasoning',
}))
const result = await Runner.run(
new Agent({ model, instructions: 'Think step by step.' }),
{ messages },
)
console.log(result.metadata.reasoning) // The model's chain-of-thought
Logging
Structured logging of all model interactions:
import { loggingMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(claude('claude-sonnet-4-6'), loggingMiddleware({
level: 'info',
logRequest: true,
logResponse: true,
redact: ['authorization', 'api_key'], // Redact sensitive fields
logger: customLogger,
}))
Default Settings
Apply default model parameters:
import { defaultSettingsMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(claude('claude-sonnet-4-6'), defaultSettingsMiddleware({
temperature: 0.7,
maxTokens: 4096,
// These can be overridden per-agent or per-run
}))
Streaming Simulation
Make non-streaming models behave as if they stream:
import { simulateStreamingMiddleware } from 'assistme-agent-sdk/middleware'
const model = wrapModel(
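  // openaiCompatible: an OpenAI-compatible provider; baseUrl is your local server's URL (defined elsewhere)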
openaiCompatible('local-model', { baseUrl }),
simulateStreamingMiddleware({
chunkSize: 10, // Characters per chunk
delayMs: 20, // Delay between chunks
}),
)
// Now Runner.stream() works even if the model doesn't natively support streaming
Custom Middleware
The middleware interface:
// The ModelMiddleware type is exported from 'assistme-agent-sdk':
interface ModelMiddleware {
name: string
/** Transform the request before it reaches the model */
transformRequest?: (
request: ModelRequest,
next: (request: ModelRequest) => Promise<ModelResponse>,
) => Promise<ModelResponse>
/** Transform streaming events */
transformStream?: (
request: ModelRequest,
next: (request: ModelRequest) => AsyncGenerator<ModelStreamEvent>,
) => AsyncGenerator<ModelStreamEvent>
}
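The interface also allows stream-aware middleware via transformStream, which none of the examples below use. Here is a minimal pass-through sketch that relies only on what the interface declares: next(request) yields ModelStreamEvent values, which are forwarded unchanged and counted (the events themselves are not inspected, and the 'stream-counter' name is just illustrative).
const streamCounter: ModelMiddleware = {
  name: 'stream-counter',
  // Wrap the model's event stream: forward every event and count how many pass through
  transformStream: async function* (request, next) {
    let events = 0
    for await (const event of next(request)) {
      events += 1
      yield event
    }
    console.log(`[stream-counter] ${events} stream events for ${request.messages.length} messages`)
  },
}
Example: Cost Tracking Middleware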
const costTracker: ModelMiddleware = {
name: 'cost-tracker',
transformRequest: async (request, next) => {
const start = Date.now()
const response = await next(request)
const durationMs = Date.now() - start
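    // calculateCost and db are your own helpers (pricing lookup and persistence), not shown here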
const cost = calculateCost(response.usage, request.model)
await db.insert('model_costs', {
model: request.model,
inputTokens: response.usage.inputTokens,
outputTokens: response.usage.outputTokens,
cost,
durationMs,
timestamp: new Date(),
})
return response
},
}
Example: PII Redaction Middleware
const piiRedaction: ModelMiddleware = {
name: 'pii-redaction',
transformRequest: async (request, next) => {
// Redact PII from messages before sending to the model
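    // redactPII is your own scrubbing helper (not shown)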
const redactedMessages = request.messages.map(msg => ({
...msg,
content: typeof msg.content === 'string'
? redactPII(msg.content)
: msg.content,
}))
return next({ ...request, messages: redactedMessages })
},
}
Example: A/B Testing Middleware
const abTest: ModelMiddleware = {
name: 'ab-test',
transformRequest: async (request, next) => {
// Route 10% of traffic to the experimental model
if (Math.random() < 0.1) {
const experimentalModel = claude('claude-opus-4-6')
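      // Calling the experimental model directly skips next(), so any middleware stacked after this one is bypassed for these requests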
return experimentalModel.generate(request)
}
return next(request)
},
}
Middleware vs. Hooks
| | Middleware | Hooks |
|---|---|---|
| Scope | Model calls | Agent lifecycle |
| Modifies | Request/response data | Tool calls, run flow |
| Composable | Yes (stacked) | Yes (parallel) |
| Use case | Cross-cutting model concerns | Agent behavior observation |
Use middleware for concerns at the model level (caching, logging, rate limiting). Use hooks for concerns at the agent level (tool validation, analytics, error handling).
Best Practices
- Order matters: place caching before rate limiting so cached responses don't count against rate limits, and place logging outermost so it captures everything.
- Keep middleware stateless: middleware should not hold internal state between requests; use external stores (Redis, a database) for stateful concerns such as caching.
- Don't block in middleware: long-running work adds latency to every model call, so use fire-and-forget for analytics (see the sketch after this list).
- Use built-in middleware first: the built-in caching, rate limiting, and retry middleware are tested and optimized; build custom middleware only for requirements they don't cover.
- Test middleware independently: each middleware should be testable in isolation against mock model calls.
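A minimal sketch of the fire-and-forget pattern: it reuses the application-provided db handle and usage fields from the cost-tracking example above (assuming db.insert returns a Promise), and deliberately does not await the write so it never delays the model response. The 'model_usage' table name is a placeholder.
const analytics: ModelMiddleware = {
  name: 'analytics',
  transformRequest: async (request, next) => {
    const response = await next(request)
    // Fire-and-forget: the write is not awaited, so it adds no latency to the call
    void db
      .insert('model_usage', {
        model: request.model,
        tokens: response.usage.inputTokens + response.usage.outputTokens,
        timestamp: new Date(),
      })
      .catch((err) => console.error('analytics write failed', err))
    return response
  },
}
If the write fails, the .catch handler only logs; the model response has already been returned to the caller.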