# Guardrails

Input and output validation for safe, controlled agent behavior.
Guardrails are validation functions that run on agent inputs and outputs. They are the primary mechanism for ensuring agent behavior stays within defined boundaries. Unlike prompt instructions (which are suggestions), guardrails are enforced programmatically.
## Why Guardrails?
Prompt instructions can be circumvented through jailbreaks, prompt injection, or edge cases the instruction didn't anticipate. Guardrails provide a hard boundary:

- Prompt instruction: "Don't discuss competitors" → can be bypassed
- Output guardrail: blocks competitor mentions → cannot be bypassed

## Input Guardrails
Input guardrails validate user messages before the agent processes them:
```typescript
import { Guardrail } from 'assistme-agent-sdk'

const topicFilter = Guardrail.input({
  name: 'topic_filter',
  validate: async (input, context) => {
    // Use a classifier model or simple rules
    if (containsBlockedContent(input)) {
      return {
        allow: false,
        reason: 'This topic is outside my area of expertise.',
      }
    }
    return { allow: true }
  },
})
```

### LLM-Based Input Guardrails
Use a fast, cheap model to classify inputs:
```typescript
const intentClassifier = Guardrail.input({
  name: 'intent_classifier',
  validate: async (input, context) => {
    const classification = await Runner.run(
      new Agent({
        name: 'classifier',
        model: claude('claude-haiku-4-5'),
        instructions: 'Classify if the input is a valid customer support request.',
        output: z.object({
          isValid: z.boolean(),
          category: z.enum(['support', 'sales', 'abuse', 'off_topic']),
          reasoning: z.string(),
        }),
      }),
      { messages: [{ role: 'user', content: input }] },
    )

    if (classification.output.category === 'abuse') {
      return { allow: false, reason: 'Request flagged for review.' }
    }
    if (classification.output.category === 'off_topic') {
      return { allow: false, reason: 'Please ask customer support questions only.' }
    }

    // Attach metadata for the main agent to use
    return {
      allow: true,
      metadata: { category: classification.output.category },
    }
  },
})
```

## Output Guardrails
Output guardrails validate the agent's response before it reaches the user:
```typescript
const piiFilter = Guardrail.output({
  name: 'pii_filter',
  validate: async (output) => {
    const piiPatterns = [
      { name: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[SSN REDACTED]' },
      { name: 'Email', pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, replacement: '[EMAIL REDACTED]' },
      { name: 'Phone', pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: '[PHONE REDACTED]' },
    ]

    let modified = output
    const detected: string[] = []
    for (const { name, pattern, replacement } of piiPatterns) {
      if (pattern.test(modified)) {
        detected.push(name)
        pattern.lastIndex = 0 // Reset lastIndex after .test() to avoid skipping matches
        modified = modified.replace(pattern, replacement)
      }
    }

    if (detected.length > 0) {
      return {
        allow: true, // Allow but modify
        modified,
        metadata: { redacted: detected },
      }
    }
    return { allow: true }
  },
})
```

### Blocking vs. Modifying
Output guardrails can either block the response entirely or modify it:
```typescript
// Block: refuse to return the response
return { allow: false, reason: 'Response violates policy.' }

// Modify: clean up the response and return the modified version
return { allow: true, modified: cleanedOutput }

// Pass: return the response unchanged
return { allow: true }
```

## Guardrail Types
See the full GuardrailResult interface in the API Reference. Key fields:
- `allow: boolean` — Whether to allow the input/output
- `reason?: string` — Reason for blocking (shown to user or logged)
- `modified?: string` — Modified content (for output guardrails that clean/redact)
- `metadata?: Record<string, unknown>` — Metadata to attach to the run context
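Based on the fields above, the result type can be sketched as a plain TypeScript interface. This is an illustration of the documented shape, not the SDK's actual declaration; consult the API Reference for the real one.

```typescript
// Illustrative sketch of the documented result shape.
interface GuardrailResult {
  allow: boolean
  reason?: string
  modified?: string
  metadata?: Record<string, unknown>
}

// A guardrail that blocks, with a user-facing reason:
const blocked: GuardrailResult = { allow: false, reason: 'Out of scope.' }

// A guardrail that allows but redacts, attaching metadata for logging:
const redacted: GuardrailResult = {
  allow: true,
  modified: 'My email is [EMAIL REDACTED]',
  metadata: { redacted: ['Email'] },
}
```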
## Composing Guardrails

### Multiple Guardrails
Guardrails run in order. If any guardrail blocks, the pipeline stops:
```typescript
const agent = new Agent({
  name: 'secure-agent',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  guardrails: {
    input: [
      rateLimiter,      // Check rate limits first (fast)
      contentFilter,    // Then content moderation (fast)
      intentClassifier, // Then intent classification (LLM call)
    ],
    output: [
      piiFilter,   // Redact PII
      factChecker, // Verify claims
      brandVoice,  // Ensure brand consistency
    ],
  },
})
```

**Order matters:** place fast, cheap guardrails first to avoid unnecessary LLM calls when a simple rule would have caught the issue.
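The sequential, short-circuit behavior can be sketched without the SDK. The `runInputGuardrails` helper and the types here are hypothetical, written only to illustrate why ordering matters: once a guardrail blocks, nothing after it runs.

```typescript
// Illustrative sketch: run guardrails in order, stop at the first block.
type Result = { allow: boolean; reason?: string }
type InputGuardrail = {
  name: string
  validate: (input: string) => Promise<Result>
}

async function runInputGuardrails(
  guardrails: InputGuardrail[],
  input: string,
): Promise<Result & { blockedBy?: string }> {
  for (const g of guardrails) {
    const result = await g.validate(input)
    if (!result.allow) {
      // Later guardrails never execute -- this is why cheap checks go first
      return { ...result, blockedBy: g.name }
    }
  }
  return { allow: true }
}
```

Under these semantics, a regex filter placed before `intentClassifier` means obviously bad inputs never cost an LLM call.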
### Parallel Guardrails
For performance, run independent guardrails in parallel:
```typescript
const agent = new Agent({
  name: 'fast-guard',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  guardrails: {
    input: [
      Guardrail.parallel([
        contentFilter,
        languageDetector,
        rateLimiter,
      ]),
    ],
  },
})
```

All three run concurrently. If any one blocks, the input is rejected.
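A minimal sketch of this fan-out, using plain `Promise.all` rather than the SDK (`runParallel` and its types are hypothetical, for illustration only):

```typescript
// Illustrative sketch: validate concurrently, reject if any result blocks.
type Result = { allow: boolean; reason?: string }

async function runParallel(
  validators: Array<(input: string) => Promise<Result>>,
  input: string,
): Promise<Result> {
  const results = await Promise.all(validators.map((v) => v(input)))
  const blocked = results.find((r) => !r.allow)
  // Total latency is the slowest guardrail, not the sum of all of them
  return blocked ?? { allow: true }
}
```

The payoff is latency: three independent checks cost as much wall-clock time as the slowest one.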
## Common Guardrail Patterns

### Rate Limiting
```typescript
const rateLimiter = Guardrail.input({
  name: 'rate_limiter',
  validate: async (input, context) => {
    const key = `rate:${context.userId}`
    const count = await redis.incr(key)
    if (count === 1) {
      // Set the TTL only on the first request in the window; calling
      // expire on every request would keep pushing the reset back
      await redis.expire(key, 60)
    }
    if (count > 20) {
      return { allow: false, reason: 'Rate limit exceeded. Please wait a moment.' }
    }
    return { allow: true }
  },
})
```

### Token Budget
```typescript
const tokenBudget = Guardrail.input({
  name: 'token_budget',
  validate: async (input, context) => {
    const used = await getTokenUsage(context.userId, 'today')
    if (used > 100_000) {
      return { allow: false, reason: 'Daily token budget exceeded.' }
    }
    return { allow: true }
  },
})
```

### Hallucination Detection
```typescript
const factChecker = Guardrail.output({
  name: 'fact_checker',
  validate: async (output, context) => {
    // Use a separate model to verify claims
    const verification = await Runner.run(
      new Agent({
        name: 'verifier',
        model: claude('claude-haiku-4-5'),
        instructions: 'Check if the following response contains any clearly false claims.',
        output: z.object({
          hasFalseClaims: z.boolean(),
          claims: z.array(z.object({
            claim: z.string(),
            assessment: z.enum(['true', 'false', 'uncertain']),
          })),
        }),
      }),
      { messages: [{ role: 'user', content: output }] },
    )

    if (verification.output.hasFalseClaims) {
      const falseClaims = verification.output.claims
        .filter((c) => c.assessment === 'false')
        .map((c) => c.claim)
      return {
        allow: false,
        reason: `Response contains unverified claims: ${falseClaims.join(', ')}`,
        metadata: { falseClaims },
      }
    }
    return { allow: true }
  },
})
```

## Guardrail Events
Monitor guardrail activity through the streaming API:
```typescript
const stream = Runner.stream(agent, { messages })

for await (const event of stream) {
  if (event.type === 'guardrail_triggered') {
    console.log(`Guardrail: ${event.guardrail}`)
    console.log(`Phase: ${event.phase}`) // 'input' or 'output'
    console.log(`Allow: ${event.allow}`)
    console.log(`Reason: ${event.reason}`)
  }
}
```

## Best Practices
- **Layer your defenses** — Use both input and output guardrails. Input guardrails catch bad requests early; output guardrails catch bad responses that slipped through.
- **Fast guardrails first** — Order guardrails from cheapest/fastest to most expensive. A regex check should run before an LLM classification.
- **Use cheap models for classification** — Guardrail LLM calls should use fast, inexpensive models (like Haiku). The main agent uses the powerful model; guardrails just need to classify.
- **Don't over-guard** — Too many guardrails add latency and cost to every interaction. Focus on the guardrails that address real risks for your use case.
- **Log everything** — Every guardrail trigger should be logged for analysis. Patterns in blocked requests reveal attack vectors and false positives.
- **Test adversarially** — Guardrails should be tested with adversarial inputs specifically designed to bypass them.
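The last point can be made concrete. One way to probe a guardrail adversarially is to lift its matching logic into a standalone helper and feed it obfuscated inputs. The sketch below copies the regexes from the `piiFilter` example into a hypothetical `redact` function; the spaced-out SSN shows the kind of bypass such tests are meant to surface.

```typescript
// Standalone copy of the redaction logic from the piiFilter example,
// so it can be probed with adversarial inputs outside the agent.
const piiPatterns = [
  { name: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[SSN REDACTED]' },
  { name: 'Email', pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, replacement: '[EMAIL REDACTED]' },
  { name: 'Phone', pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: '[PHONE REDACTED]' },
]

function redact(text: string): string {
  let out = text
  for (const { pattern, replacement } of piiPatterns) {
    out = out.replace(pattern, replacement)
  }
  return out
}

// Straightforward cases are caught:
//   redact('SSN: 123-45-6789') -> 'SSN: [SSN REDACTED]'
// But an adversarially spaced SSN slips through unchanged:
//   redact('1 2 3 - 4 5 - 6 7 8 9') matches no pattern
```

Findings like the spaced-digit bypass feed back into the guardrail: tighten the pattern, add a normalization pass, or escalate to an LLM-based check.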