Advanced Guardrails
Production LLM safety using NeMo Guardrails, Guardrails AI, and OpenAI moderation with red-teaming validation.
**NeMo Guardrails**: LangChain 1.x compatible, parallel rails execution, OpenTelemetry tracing. **DeepTeam**: 40+ vulnerabilities, OWASP Top 10 alignment.
Overview

- Implementing input/output validation for LLM applications
- Preventing hallucinations and enforcing factuality
- Detecting and filtering toxic, harmful, or off-topic content
- Restricting LLM responses to specific domains/topics
- PII detection and redaction in LLM outputs
- Red-teaming and adversarial testing of LLM systems
- OWASP Top 10 for LLMs compliance
Framework Comparison

| Framework | Best For | Key Features |
|---|---|---|
| NeMo Guardrails | Programmable flows, Colang 2.0 | Input/output rails, fact-checking, dialog control |
| Guardrails AI | Validator-based, modular | 100+ validators, PII, toxicity, structured output |
| OpenAI Guardrails | Drop-in wrapper | Simple integration, moderation API |
| DeepTeam | Red teaming, adversarial | GOAT attacks, multi-turn jailbreaking, vulnerability scanning |

Quick Reference
NeMo Guardrails with Guardrails AI Integration
config.yml

    models:
      - type: main
        engine: openai
        model: gpt-5.2

    rails:
      config:
        guardrails_ai:
          validators:
            - name: toxic_language
              parameters:
                threshold: 0.5
                validation_method: "sentence"
            - name: guardrails_pii
              parameters:
                entities: ["phone_number", "email", "ssn", "credit_card"]
            - name: restricttotopic
              parameters:
                valid_topics: ["technology", "support"]
            - name: valid_length
              parameters:
                min: 10
                max: 500
      input:
        flows:
          - guardrailsai check input $validator="guardrails_pii"
          - guardrailsai check input $validator="competitor_check"
      output:
        flows:
          - guardrailsai check output $validator="toxic_language"
          - guardrailsai check output $validator="restricttotopic"
          - guardrailsai check output $validator="valid_length"
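
With this config saved in a directory (assumed here to be `./config`), a minimal sketch of loading and invoking the rails with the NeMo Guardrails Python API looks like this; the user message is illustrative:

    from nemoguardrails import LLMRails, RailsConfig

    # Load config.yml (plus any Colang files) from the config directory
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    # Input and output rails run automatically around the LLM call
    response = rails.generate(messages=[
        {"role": "user", "content": "My email is jane@example.com, can you help with my router?"}
    ])
    print(response["content"])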
Colang 2.0 Fact-Checking Rails

    define flow answer question with facts
      """Enable fact-checking for RAG responses."""
      user ...
      $answer = execute rag()
      $check_facts = True  # Enables fact-checking rail
      bot $answer

    define flow check hallucination
      """Block responses about people without verification."""
      user ask about people
      $check_hallucination = True  # Blocking mode
      bot respond about people

    define flow restrict competitor mentions
      """Prevent discussing competitor products."""
      user ask about $competitor
      if $competitor in ["CompetitorA", "CompetitorB"]
        bot "I can only discuss our products."
      else
        bot respond normally
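
The `execute rag()` call in the first flow assumes a custom action has been registered with the rails app. A minimal sketch, with the retrieval step stubbed out as a placeholder:

    from nemoguardrails import LLMRails, RailsConfig

    async def rag(context: dict) -> str:
        # Placeholder retrieval; swap in a real vector-store lookup.
        question = context.get("last_user_message", "")
        passages = ["Doc A snippet...", "Doc B snippet..."]
        return f"Answer to '{question}', grounded in {len(passages)} retrieved passages."

    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)
    rails.register_action(rag, name="rag")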
Guardrails AI Validators

    import openai  # used below when wrapping the chat completion call

    from guardrails import Guard
    from guardrails.hub import (
        ToxicLanguage,
        DetectPII,
        RestrictToTopic,
        ValidLength,
        ResponseEvaluator,
    )
Create guard with multiple validators

    guard = Guard().use_many(
        ToxicLanguage(threshold=0.5, on_fail="filter"),
        DetectPII(
            pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "SSN"],
            on_fail="fix",  # Redacts PII
        ),
        RestrictToTopic(
            valid_topics=["technology", "customer support"],
            invalid_topics=["politics", "religion"],
            on_fail="refrain",
        ),
        ValidLength(min=10, max=500, on_fail="reask"),
    )
Validate LLM output

    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-5.2",
        messages=[{"role": "user", "content": user_input}],
    )

    if result.validation_passed:
        return result.validated_output
    else:
        return "I cannot respond to that request."
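
Beyond wrapping the LLM call, the same guard can be run against text you already have (for example a cached or streamed response); a short sketch using the guard defined above:

    # Validate an existing string instead of wrapping the LLM call
    outcome = guard.validate("Contact me at jane.doe@example.com about the outage.")

    if outcome.validation_passed:
        print(outcome.validated_output)  # PII redacted by DetectPII(on_fail="fix")
    else:
        print("Response blocked by guardrails")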
DeepTeam Red Teaming

    from deepteam import red_team
    from deepteam.vulnerabilities import (
        Bias,
        Toxicity,
        PIILeakage,
        PromptInjection,
        Jailbreaking,
        Misinformation,
        CompetitorEndorsement,
    )

    async def run_red_team_audit(
        target_model: callable,
        attacks_per_vulnerability: int = 10,
    ) -> dict:
        """Run comprehensive red team audit against target LLM."""
        results = await red_team(
            model=target_model,
            vulnerabilities=[
                Bias(categories=["gender", "race", "religion", "age"]),
                Toxicity(threshold=0.7),
                PIILeakage(types=["email", "phone", "ssn", "credit_card"]),
                PromptInjection(techniques=["direct", "indirect", "context"]),
                Jailbreaking(
                    multi_turn=True,  # GOAT-style multi-turn attacks
                    techniques=["dan", "roleplay", "context_manipulation"],
                ),
                Misinformation(domains=["health", "finance", "legal"]),
                CompetitorEndorsement(competitors=["competitor_list"]),
            ],
            attacks_per_vulnerability=attacks_per_vulnerability,
        )

        return {
            "total_attacks": results.total_attacks,
            "successful_attacks": results.successful_attacks,
            "attack_success_rate": results.successful_attacks / results.total_attacks,
            "vulnerabilities": [
                {
                    "type": v.type,
                    "severity": v.severity,
                    "successful_prompts": v.successful_prompts[:3],
                    "mitigation": v.suggested_mitigation,
                }
                for v in results.vulnerabilities
            ],
        }
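
A sketch of invoking the audit, assuming a simple async callback that wraps the system under test (the callback shape is an assumption here; check DeepTeam's docs for the exact model interface it expects):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()

    async def target_model(prompt: str) -> str:
        # Illustrative wrapper around the deployed application/model under test
        response = await client.chat.completions.create(
            model="gpt-5.2",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    report = asyncio.run(run_red_team_audit(target_model, attacks_per_vulnerability=5))
    print(f"Attack success rate: {report['attack_success_rate']:.1%}")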
OWASP Top 10 for LLMs Mapping

| OWASP LLM Risk | Guardrail Solution |
|---|---|
| LLM01: Prompt Injection | NeMo input rails, Guardrails AI validators |
| LLM02: Insecure Output Handling | Output rails, structured validation, sanitization |
| LLM03: Training Data Poisoning | N/A (training-time concern) |
| LLM04: Model Denial of Service | Rate limiting, token budgets, timeout rails |
| LLM05: Supply Chain Vulnerabilities | Dependency scanning, model provenance |
| LLM06: Sensitive Info Disclosure | PII detection, context separation, output filtering |
| LLM07: Insecure Plugin Design | Tool validation, permission boundaries |
| LLM08: Excessive Agency | Human-in-the-loop rails, action confirmation |
| LLM09: Overreliance | Factuality checking, confidence thresholds |
| LLM10: Model Theft | N/A (infrastructure concern) |

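The mapping above leans on NeMo and Guardrails AI rails; provider-side screening via the OpenAI Moderation API can serve as one extra input/output layer for harmful content (a minimal sketch; the example input is illustrative):

    from openai import OpenAI

    client = OpenAI()

    def input_is_safe(user_input: str) -> bool:
        """Return True if the moderation endpoint does not flag the input."""
        result = client.moderations.create(
            model="omni-moderation-latest",
            input=user_input,
        )
        return not result.results[0].flagged

    print(input_is_safe("How do I reset my router password?"))  # expected: True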
Anti-Patterns (FORBIDDEN)
NEVER trust LLM output without validation

    response = llm.generate(prompt)
    return response  # Raw, unvalidated output!
NEVER skip input sanitization

    user_input = request.json["message"]
    llm.generate(user_input)  # Prompt injection risk!
NEVER use single validation layer

    if not is_toxic(output):  # Only one check
        return output
ALWAYS use layered validation

    guard = Guard().use_many(
        ToxicLanguage(threshold=0.5),
        DetectPII(on_fail="fix"),
        ValidLength(max=500),
    )
ALWAYS validate both input and output

    input_result = input_guard.validate(user_input)
    if not input_result.validation_passed:
        return "Invalid input"

    llm_output = llm.generate(input_result.validated_output)

    output_result = output_guard.validate(llm_output)
    return output_result.validated_output
Key Decisions

| Decision | Recommendation |
|---|---|
| Framework choice | NeMo for flows, Guardrails AI for validators |
| Toxicity threshold | 0.5 for content apps, 0.3 for children's apps |
| PII handling | Redact for logs, block for outputs |
| Topic restriction | Allowlist preferred over blocklist |
| Fact-checking | Required for factual domains (health, finance, legal) |
| Red-teaming frequency | Pre-release + quarterly |

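The PII-handling row above (redact for logs, block for outputs) maps to different `on_fail` actions in Guardrails AI; a sketch assuming the DetectPII validator is installed from the hub:

    from guardrails import Guard
    from guardrails.hub import DetectPII

    # Redact: safe to persist in logs and analytics
    log_guard = Guard().use(
        DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix")
    )

    # Block: fail hard rather than let PII reach the end user
    output_guard = Guard().use(
        DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="exception")
    )

    print(log_guard.validate("Reach me at jane@example.com").validated_output)  # redacted
    output_guard.validate("Reach me at jane@example.com")  # raises on detected PII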
Detailed Documentation

| Resource | Description |
|---|---|
| references/nemo-guardrails.md | NeMo Guardrails with Colang 2.0 |
| references/guardrails-ai.md | Guardrails AI validators and patterns |
| references/openai-guardrails.md | OpenAI Moderation API integration |
| references/factuality-checking.md | Hallucination detection and grounding |
| references/red-teaming.md | DeepTeam and adversarial testing |
| scripts/nemo-config.yaml | Production NeMo configuration |
| scripts/rails-pipeline.py | Complete guardrails pipeline |

Related Skills

- llm-safety-patterns: Context separation and attribution
- llm-evaluation: Quality assessment and hallucination detection
- input-validation: Request sanitization patterns
- owasp-top-10: Web security fundamentals
Capability Details
nemo-guardrails
Keywords: NeMo, guardrails, rails, Colang, dialog flow, input rails, output rails

Solves:

- Configure NeMo Guardrails for LLM safety
- Implement Colang 2.0 dialog flows
- Create input/output validation rails
guardrails-ai-validators
Keywords: Guardrails AI, validator, PII, toxicity, topic restriction, structured output

Solves:

- Use Guardrails AI validators for output validation
- Detect and redact PII from LLM responses
- Restrict LLM to specific topics
factuality-checking
Keywords: fact-check, hallucination, grounding, RAG verification, NLI

Solves:

- Verify LLM claims against source documents
- Detect hallucinations in generated content
- Implement grounding checks for RAG
red-teaming
Keywords: red team, adversarial, jailbreak, GOAT, prompt injection, DeepTeam

Solves:

- Run adversarial testing on LLM systems
- Detect jailbreaking vulnerabilities
- Test prompt injection resistance
owasp-llm-compliance
Keywords: OWASP LLM, LLM security, LLM vulnerabilities, LLM Top 10

Solves:

- Implement OWASP Top 10 for LLMs mitigations
- Audit LLM systems for security compliance
- Design secure LLM architectures