# Guardrails

Input and output validation for safe, controlled agent behavior.
Guardrails are validation functions that run on agent inputs and outputs. They are the primary mechanism for ensuring agent behavior stays within defined boundaries. Unlike prompt instructions (which are suggestions), guardrails are enforced programmatically.
## Why Guardrails?
Prompt instructions can be circumvented through jailbreaks, prompt injection, or edge cases the instruction didn't anticipate. Guardrails provide a hard boundary:

- Prompt instruction: "Don't discuss competitors" → can be bypassed
- Output guardrail: blocks competitor mentions → cannot be bypassed

## Input Guardrails
Input guardrails validate user messages before the agent processes them:
```typescript
import { Guardrail } from 'assistme-agent-sdk'

const topicFilter = Guardrail.input({
  name: 'topic_filter',
  validate: async (input, context) => {
    // Use a classifier model or simple rules
    if (containsBlockedContent(input)) {
      return {
        allow: false,
        reason: 'This topic is outside my area of expertise.',
      }
    }
    return { allow: true }
  },
})
```

### LLM-Based Input Guardrails
Use a fast, cheap model to classify inputs:
```typescript
const intentClassifier = Guardrail.input({
  name: 'intent_classifier',
  validate: async (input, context) => {
    const classification = await Runner.run(
      new Agent({
        name: 'classifier',
        model: claude('claude-haiku-4-5'),
        instructions: 'Classify if the input is a valid customer support request.',
        output: z.object({
          isValid: z.boolean(),
          category: z.enum(['support', 'sales', 'abuse', 'off_topic']),
          reasoning: z.string(),
        }),
      }),
      { messages: [{ role: 'user', content: input }] },
    )

    if (classification.output.category === 'abuse') {
      return { allow: false, reason: 'Request flagged for review.' }
    }
    if (classification.output.category === 'off_topic') {
      return { allow: false, reason: 'Please ask customer support questions only.' }
    }

    // Attach metadata for the main agent to use
    return {
      allow: true,
      metadata: { category: classification.output.category },
    }
  },
})
```

## Output Guardrails
Output guardrails validate the agent's response before it reaches the user:
```typescript
const piiFilter = Guardrail.output({
  name: 'pii_filter',
  validate: async (output) => {
    const piiPatterns = [
      { name: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[SSN REDACTED]' },
      { name: 'Email', pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, replacement: '[EMAIL REDACTED]' },
      { name: 'Phone', pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: '[PHONE REDACTED]' },
    ]

    let modified = output
    const detected: string[] = []
    for (const { name, pattern, replacement } of piiPatterns) {
      if (pattern.test(modified)) {
        detected.push(name)
        pattern.lastIndex = 0 // Reset lastIndex after .test() to avoid skipping matches
        modified = modified.replace(pattern, replacement)
      }
    }

    if (detected.length > 0) {
      return {
        allow: true, // Allow but modify
        modified,
        metadata: { redacted: detected },
      }
    }
    return { allow: true }
  },
})
```

### Blocking vs. Modifying
Output guardrails can either block the response entirely or modify it:
```typescript
// Block: refuse to return the response
return { allow: false, reason: 'Response violates policy.' }

// Modify: clean up the response and return the modified version
return { allow: true, modified: cleanedOutput }

// Pass: return the response unchanged
return { allow: true }
```

## Guardrail Types
See the full GuardrailResult interface in the API Reference. Key fields:
- `allow: boolean` — Whether to allow the input/output
- `reason?: string` — Reason for blocking (shown to user or logged)
- `modified?: string` — Modified content (for output guardrails that clean/redact)
- `metadata?: Record<string, unknown>` — Metadata to attach to the run context
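Based on the fields above, the result type can be sketched as a plain TypeScript interface. This is an illustration of the documented shape, not the SDK's actual declaration; consult the API Reference for the real one.

```typescript
// Illustrative sketch of the documented result shape.
interface GuardrailResult {
  allow: boolean
  reason?: string
  modified?: string
  metadata?: Record<string, unknown>
}

// A guardrail that blocks, with a user-facing reason:
const blocked: GuardrailResult = { allow: false, reason: 'Out of scope.' }

// A guardrail that allows but redacts, attaching metadata for logging:
const redacted: GuardrailResult = {
  allow: true,
  modified: 'My email is [EMAIL REDACTED]',
  metadata: { redacted: ['Email'] },
}
```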
## Composing Guardrails

### Multiple Guardrails
Guardrails run in order. If any guardrail blocks, the pipeline stops:
```typescript
const agent = new Agent({
  name: 'secure-agent',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  guardrails: {
    input: [
      rateLimiter,      // Check rate limits first (fast)
      contentFilter,    // Then content moderation (fast)
      intentClassifier, // Then intent classification (LLM call)
    ],
    output: [
      piiFilter,   // Redact PII
      factChecker, // Verify claims
      brandVoice,  // Ensure brand consistency
    ],
  },
})
```

**Order matters:** place fast, cheap guardrails first to avoid unnecessary LLM calls when a simple rule would have caught the issue.
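The sequential, short-circuit behavior can be sketched without the SDK. The `runInputGuardrails` helper and the types here are hypothetical, written only to illustrate why ordering matters: once a guardrail blocks, nothing after it runs.

```typescript
// Illustrative sketch: run guardrails in order, stop at the first block.
type Result = { allow: boolean; reason?: string }
type InputGuardrail = {
  name: string
  validate: (input: string) => Promise<Result>
}

async function runInputGuardrails(
  guardrails: InputGuardrail[],
  input: string,
): Promise<Result & { blockedBy?: string }> {
  for (const g of guardrails) {
    const result = await g.validate(input)
    if (!result.allow) {
      // Later guardrails never execute -- this is why cheap checks go first
      return { ...result, blockedBy: g.name }
    }
  }
  return { allow: true }
}
```

Under these semantics, a regex filter placed before `intentClassifier` means obviously bad inputs never cost an LLM call.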
### Parallel Guardrails
For performance, run independent guardrails in parallel:
```typescript
const agent = new Agent({
  name: 'fast-guard',
  model: claude('claude-sonnet-4-6'),
  instructions: 'You are a helpful assistant.',
  guardrails: {
    input: [
      Guardrail.parallel([
        contentFilter,
        languageDetector,
        rateLimiter,
      ]),
    ],
  },
})
```

All three run concurrently. If any one blocks, the input is rejected.
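A minimal sketch of this fan-out, using plain `Promise.all` rather than the SDK (`runParallel` and its types are hypothetical, for illustration only):

```typescript
// Illustrative sketch: validate concurrently, reject if any result blocks.
type Result = { allow: boolean; reason?: string }

async function runParallel(
  validators: Array<(input: string) => Promise<Result>>,
  input: string,
): Promise<Result> {
  const results = await Promise.all(validators.map((v) => v(input)))
  const blocked = results.find((r) => !r.allow)
  // Total latency is the slowest guardrail, not the sum of all of them
  return blocked ?? { allow: true }
}
```

The payoff is latency: three independent checks cost as much wall-clock time as the slowest one.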
## Common Guardrail Patterns

### Rate Limiting
```typescript
const rateLimiter = Guardrail.input({
  name: 'rate_limiter',
  validate: async (input, context) => {
    const key = `rate:${context.userId}`
    const count = await redis.incr(key)
    if (count === 1) {
      // Set the TTL only on the first request in the window; calling
      // expire on every request would keep pushing the reset back
      await redis.expire(key, 60)
    }
    if (count > 20) {
      return { allow: false, reason: 'Rate limit exceeded. Please wait a moment.' }
    }
    return { allow: true }
  },
})
```

### Token Budget
```typescript
const tokenBudget = Guardrail.input({
  name: 'token_budget',
  validate: async (input, context) => {
    const used = await getTokenUsage(context.userId, 'today')
    if (used > 100_000) {
      return { allow: false, reason: 'Daily token budget exceeded.' }
    }
    return { allow: true }
  },
})
```

### Hallucination Detection
```typescript
const factChecker = Guardrail.output({
  name: 'fact_checker',
  validate: async (output, context) => {
    // Use a separate model to verify claims
    const verification = await Runner.run(
      new Agent({
        name: 'verifier',
        model: claude('claude-haiku-4-5'),
        instructions: 'Check if the following response contains any clearly false claims.',
        output: z.object({
          hasFalseClaims: z.boolean(),
          claims: z.array(z.object({
            claim: z.string(),
            assessment: z.enum(['true', 'false', 'uncertain']),
          })),
        }),
      }),
      { messages: [{ role: 'user', content: output }] },
    )

    if (verification.output.hasFalseClaims) {
      const falseClaims = verification.output.claims
        .filter((c) => c.assessment === 'false')
        .map((c) => c.claim)
      return {
        allow: false,
        reason: `Response contains unverified claims: ${falseClaims.join(', ')}`,
        metadata: { falseClaims },
      }
    }
    return { allow: true }
  },
})
```

## Guardrail Events
Monitor guardrail activity through the streaming API:
```typescript
const stream = Runner.stream(agent, { messages })

for await (const event of stream) {
  if (event.type === 'guardrail_triggered') {
    console.log(`Guardrail: ${event.guardrail}`)
    console.log(`Phase: ${event.phase}`) // 'input' or 'output'
    console.log(`Allow: ${event.allow}`)
    console.log(`Reason: ${event.reason}`)
  }
}
```

## Best Practices
- **Layer your defenses** — Use both input and output guardrails. Input guardrails catch bad requests early; output guardrails catch bad responses that slipped through.
- **Fast guardrails first** — Order guardrails from cheapest/fastest to most expensive. A regex check should run before an LLM classification.
- **Use cheap models for classification** — Guardrail LLM calls should use fast, inexpensive models (like Haiku). The main agent uses the powerful model; guardrails just need to classify.
- **Don't over-guard** — Too many guardrails add latency and cost to every interaction. Focus on the guardrails that address real risks for your use case.
- **Log everything** — Every guardrail trigger should be logged for analysis. Patterns in blocked requests reveal attack vectors and false positives.
- **Test adversarially** — Guardrails should be tested with adversarial inputs specifically designed to bypass them.
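The last point can be made concrete. One way to probe a guardrail adversarially is to lift its matching logic into a standalone helper and feed it obfuscated inputs. The sketch below copies the regexes from the `piiFilter` example into a hypothetical `redact` function; the spaced-out SSN shows the kind of bypass such tests are meant to surface.

```typescript
// Standalone copy of the redaction logic from the piiFilter example,
// so it can be probed with adversarial inputs outside the agent.
const piiPatterns = [
  { name: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[SSN REDACTED]' },
  { name: 'Email', pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, replacement: '[EMAIL REDACTED]' },
  { name: 'Phone', pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: '[PHONE REDACTED]' },
]

function redact(text: string): string {
  let out = text
  for (const { pattern, replacement } of piiPatterns) {
    out = out.replace(pattern, replacement)
  }
  return out
}

// Straightforward cases are caught:
//   redact('SSN: 123-45-6789') -> 'SSN: [SSN REDACTED]'
// But an adversarially spaced SSN slips through unchanged:
//   redact('1 2 3 - 4 5 - 6 7 8 9') matches no pattern
```

Findings like the spaced-digit bypass feed back into the guardrail: tighten the pattern, add a normalization pass, or escalate to an LLM-based check.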