Guardrails & Safety
Guardrails are the firewall of an AI system. They sit on both sides of the agent: an Input Guardrail screens what the user sends in, and an Output Guardrail screens what the agent sends back. They enforce policy, security, and tone. Unlike the main agent, which tries to be helpful, the guardrail tries to be safe and compliant.
When to Use
- Jailbreak Prevention: Stopping users from tricking the model ("Ignore previous instructions...").
- PII Protection: Detecting and redacting phone numbers, emails, or credit cards.
- Topic Adherence: Ensuring a customer support bot doesn't discuss politics or religion.
- Brand Safety: Preventing the model from generating offensive or competitor-promoting content.
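The PII-protection case above can be sketched with a simple redaction pass. A minimal sketch: the patterns and placeholder labels below are illustrative assumptions, not a production detector; real systems typically use dedicated PII-detection models rather than hand-rolled regexes.

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```

Running the raw model output through a pass like this before returning it is the "Output Guardrail" position described above.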
Use Cases
- Input Filter: Blocking prompts that violate usage policies.
- Output Filter: Blocking model responses that contain hate speech or hallucinations.
- Sandboxing: Ensuring code generated by the agent acts within safe bounds (e.g., no network access).
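The sandboxing case can include a static pre-check before any generated code runs. The denylist below is an assumed example; a static check like this complements, but does not replace, OS-level isolation (containers, seccomp, resource limits).

```python
import ast

# Example denylist; a real policy would be far more thorough.
BANNED_MODULES = {"socket", "subprocess", "os", "requests"}

def static_sandbox_check(code: str) -> bool:
    """Reject generated code that imports networking/OS modules.

    Static analysis only: it cannot catch dynamic imports or
    obfuscation, so treat it as a first layer, not the sandbox.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BANNED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_MODULES:
                return False
    return True
```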
Implementation Pattern
```python
def guarded_execution(user_input):
    # Layer 1: Input Guardrail
    # Check for prompt injection or policy violations
    if not safety_agent.check_input(user_input).safe:
        return "I cannot answer that request."

    # Layer 2: Main Execution
    response = main_agent.run(user_input)

    # Layer 3: Output Guardrail
    # Check for PII or harmful content in the response
    if not safety_agent.check_output(response).safe:
        log_violation(user_input, response)
        return "Response withheld due to safety policy."

    return response
```
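The `safety_agent` and `main_agent` objects in the pattern are assumed interfaces. A minimal runnable sketch of the same three-layer flow, with keyword-matching stubs standing in for real classifier-based safety checks:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    safe: bool
    reason: str = ""

# Illustrative stubs: real guardrails use trained classifiers,
# not substring matching.
BLOCKED_PHRASES = ("ignore previous instructions",)

class StubSafetyAgent:
    def check_input(self, text: str) -> Verdict:
        if any(p in text.lower() for p in BLOCKED_PHRASES):
            return Verdict(False, "prompt injection")
        return Verdict(True)

    def check_output(self, text: str) -> Verdict:
        # Placeholder output policy for the sketch.
        return Verdict("SECRET" not in text, "leaked secret")

class StubMainAgent:
    def run(self, text: str) -> str:
        return f"Echo: {text}"

safety_agent = StubSafetyAgent()
main_agent = StubMainAgent()

def guarded_execution(user_input: str) -> str:
    if not safety_agent.check_input(user_input).safe:   # Layer 1
        return "I cannot answer that request."
    response = main_agent.run(user_input)               # Layer 2
    if not safety_agent.check_output(response).safe:    # Layer 3
        return "Response withheld due to safety policy."
    return response
```

Swapping the stubs for model-backed checkers leaves the control flow unchanged, which is the point of the layered pattern.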