Guardrails & Safety
Guardrails are the firewall of an AI system. They sit on both sides of the agent: an Input Guardrail screens what the user sends in, and an Output Guardrail screens what the agent sends back. They enforce policy, security, and tone. Unlike the main agent, which tries to be helpful, the guardrail tries to be safe and compliant.
When to Use
- Jailbreak Prevention: Stopping users from tricking the model ("Ignore previous instructions...").
- PII Protection: Detecting and redacting phone numbers, emails, or credit cards.
- Topic Adherence: Ensuring a customer support bot doesn't discuss politics or religion.
- Brand Safety: Preventing the model from generating offensive or competitor-promoting content.
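The PII-protection case above can be sketched with a simple redaction pass. A minimal sketch: the patterns and placeholder labels below are illustrative assumptions, not a production detector; real systems typically use dedicated PII-detection models rather than hand-rolled regexes.

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```

Running the raw model output through a pass like this before returning it is the "Output Guardrail" position described above.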
Use Cases
- Input Filter: Blocking prompts that violate usage policies.
- Output Filter: Blocking model responses that contain hate speech or hallucinations.
- Sandboxing: Ensuring code generated by the agent acts within safe bounds (e.g., no network access).
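The sandboxing case can include a static pre-check before any generated code runs. The denylist below is an assumed example; a static check like this complements, but does not replace, OS-level isolation (containers, seccomp, resource limits).

```python
import ast

# Example denylist; a real policy would be far more thorough.
BANNED_MODULES = {"socket", "subprocess", "os", "requests"}

def static_sandbox_check(code: str) -> bool:
    """Reject generated code that imports networking/OS modules.

    Static analysis only: it cannot catch dynamic imports or
    obfuscation, so treat it as a first layer, not the sandbox.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BANNED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_MODULES:
                return False
    return True
```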
Implementation Pattern
```python
def guarded_execution(user_input):
    # Layer 1: Input Guardrail
    # Check for prompt injection or policy violations
    if not safety_agent.check_input(user_input).safe:
        return "I cannot answer that request."

    # Layer 2: Main Execution
    response = main_agent.run(user_input)

    # Layer 3: Output Guardrail
    # Check for PII or harmful content in the response
    if not safety_agent.check_output(response).safe:
        log_violation(user_input, response)
        return "Response withheld due to safety policy."

    return response
```
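The `safety_agent` and `main_agent` objects in the pattern are assumed interfaces. A minimal runnable sketch of the same three-layer flow, with keyword-matching stubs standing in for real classifier-based safety checks:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    safe: bool
    reason: str = ""

# Illustrative stubs: real guardrails use trained classifiers,
# not substring matching.
BLOCKED_PHRASES = ("ignore previous instructions",)

class StubSafetyAgent:
    def check_input(self, text: str) -> Verdict:
        if any(p in text.lower() for p in BLOCKED_PHRASES):
            return Verdict(False, "prompt injection")
        return Verdict(True)

    def check_output(self, text: str) -> Verdict:
        # Placeholder output policy for the sketch.
        return Verdict("SECRET" not in text, "leaked secret")

class StubMainAgent:
    def run(self, text: str) -> str:
        return f"Echo: {text}"

safety_agent = StubSafetyAgent()
main_agent = StubMainAgent()

def guarded_execution(user_input: str) -> str:
    if not safety_agent.check_input(user_input).safe:   # Layer 1
        return "I cannot answer that request."
    response = main_agent.run(user_input)               # Layer 2
    if not safety_agent.check_output(response).safe:    # Layer 3
        return "Response withheld due to safety policy."
    return response
```

Swapping the stubs for model-backed checkers leaves the control flow unchanged, which is the point of the layered pattern.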