AI Security Hardening
Secure LLM and AI systems against prompt injection, jailbreaks, data leakage, and supply chain threats in production environments.
When to Use This Skill
Use this skill when:
-
Deploying an LLM-powered application handling sensitive user data
-
Protecting against prompt injection attacks in AI agents
-
Implementing output filtering and content moderation
-
Securing model weights and API endpoints from theft
-
Achieving SOC2 or ISO 27001 compliance for AI systems
AI-Specific Threat Model
Threat Risk Control ───────────────────────────────────────────────────────────────────── Prompt injection System prompt override Input sanitization, separate context Data exfiltration PII in model outputs Output filtering, DLP scanning Jailbreaking Policy bypass Content moderation, guardrails Model theft Weight extraction via API Rate limiting, access controls Training data poisoning Backdoored fine-tuned model Dataset validation, provenance Supply chain attack Malicious model weights Signature verification, scanning Insecure output XSS/SQLi from LLM response Output encoding, parameterized queries
Prompt Injection Defense
import re from typing import Optional
INJECTION_PATTERNS = [ r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", r"you\s+are\s+now\s+", r"new\s+instructions?:", r"system\s+prompt", r"forget\s+everything", r"act\s+as\s+", r"jailbreak", r"dan\s+mode", r"<\ssystem\s>", r"[INST]", ]
def detect_prompt_injection(user_input: str) -> tuple[bool, Optional[str]]: """Return (is_suspicious, matched_pattern).""" normalized = user_input.lower().strip() for pattern in INJECTION_PATTERNS: if re.search(pattern, normalized, re.IGNORECASE): return True, pattern return False, None
def sanitize_user_input(user_input: str, max_length: int = 4000) -> str: """Sanitize input before passing to LLM.""" # Truncate user_input = user_input[:max_length]
# Remove null bytes and control characters
user_input = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', user_input)
# Check for injection
suspicious, pattern = detect_prompt_injection(user_input)
if suspicious:
raise ValueError(f"Potential prompt injection detected: {pattern}")
return user_input
Guardrails with NeMo Guardrails
guardrails.yaml
from nemoguardrails import RailsConfig, LLMRails
config = RailsConfig.from_path("./guardrails-config") rails = LLMRails(config)
async def safe_llm_call(user_message: str) -> str: response = await rails.generate_async( messages=[{"role": "user", "content": user_message}] ) return response["content"]
guardrails-config/config.yml
models:
- type: main engine: openai model: gpt-4o-mini
rails: input: flows: - check jailbreak - check sensitive data output: flows: - check output for PII - check output for harmful content
Output Filtering & PII Scrubbing
import re from presidio_analyzer import AnalyzerEngine from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine() anonymizer = AnonymizerEngine()
PII_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD", "US_SSN", "IBAN_CODE", "IP_ADDRESS", "LOCATION"]
def scrub_pii_from_output(text: str) -> str: """Remove PII from LLM output before returning to user.""" results = analyzer.analyze(text=text, entities=PII_ENTITIES, language="en") if not results: return text anonymized = anonymizer.anonymize(text=text, analyzer_results=results) return anonymized.text
def validate_output_safety(output: str) -> bool:
"""Check output doesn't contain prompt injection artifacts."""
dangerous_patterns = [
r"<\sscript\s>", # XSS
r"javascript:", # XSS
r";\s*(DROP|DELETE|INSERT)",# SQLi
r"${.*}", # template injection
r".*", # command injection in some contexts
]
for pattern in dangerous_patterns:
if re.search(pattern, output, re.IGNORECASE):
return False
return True
API Security for LLM Endpoints
from fastapi import FastAPI, HTTPException, Depends, Request from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials import jwt import time from collections import defaultdict
app = FastAPI() security = HTTPBearer()
Rate limiting (per API key)
request_counts = defaultdict(list)
def rate_limit(api_key: str, max_requests: int = 100, window_seconds: int = 60): now = time.time() requests = request_counts[api_key] # Remove old requests outside window request_counts[api_key] = [t for t in requests if now - t < window_seconds] if len(request_counts[api_key]) >= max_requests: raise HTTPException(status_code=429, detail="Rate limit exceeded") request_counts[api_key].append(now)
async def verify_token( credentials: HTTPAuthorizationCredentials = Depends(security) ) -> dict: try: payload = jwt.decode(credentials.credentials, SECRET_KEY, algorithms=["HS256"]) rate_limit(payload["sub"]) return payload except jwt.ExpiredSignatureError: raise HTTPException(status_code=401, detail="Token expired") except jwt.InvalidTokenError: raise HTTPException(status_code=401, detail="Invalid token")
@app.post("/v1/chat/completions") async def chat(request: Request, token: dict = Depends(verify_token)): body = await request.json()
# Input validation
user_msg = body.get("messages", [{}])[-1].get("content", "")
try:
safe_input = sanitize_user_input(user_msg)
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
# Call LLM and scrub output
response = await call_llm(safe_input, token["scope"])
response["choices"][0]["message"]["content"] = scrub_pii_from_output(
response["choices"][0]["message"]["content"]
)
return response
Model Weight Security
Verify model weights with SHA-256 hash before loading
MODEL_DIR="./models/llama-3.1-8b" EXPECTED_HASH="sha256:abc123..."
Generate hash of downloaded model
actual_hash=$(find "$MODEL_DIR" -name "*.safetensors" | sort | xargs sha256sum | sha256sum) echo "Model hash: $actual_hash"
Compare (automate in CI/CD)
if [ "$actual_hash" != "$EXPECTED_HASH" ]; then echo "ERROR: Model hash mismatch — possible tampering!" exit 1 fi
Scan model files for embedded malware (ModelScan)
pip install modelscan modelscan scan -p "$MODEL_DIR"
Network Isolation for AI Services
Kubernetes NetworkPolicy — isolate LLM API
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: llm-api-isolation namespace: ai-services spec: podSelector: matchLabels: app: vllm policyTypes:
- Ingress
- Egress ingress:
- from:
- namespaceSelector: matchLabels: name: backend # only backend can call LLM ports:
- protocol: TCP port: 8000 egress:
- to:
- namespaceSelector: matchLabels: name: monitoring # metrics only ports:
- protocol: TCP port: 9090
Block egress to internet — prevent data exfiltration
(allow only internal cluster traffic)
Audit Logging
import structlog from datetime import datetime, timezone
audit_log = structlog.get_logger("ai.audit")
def log_llm_interaction( user_id: str, session_id: str, model: str, prompt_tokens: int, completion_tokens: int, was_filtered: bool, injection_detected: bool, ): audit_log.info( "llm_interaction", timestamp=datetime.now(timezone.utc).isoformat(), user_id=user_id, session_id=session_id, model=model, prompt_tokens=prompt_tokens, completion_tokens=completion_tokens, was_filtered=was_filtered, injection_detected=injection_detected, # DO NOT log prompt/completion content — PII risk )
Common Issues
Issue Cause Fix
False positive injection blocks Overly broad regex Tune patterns; use ML-based classifier for high-traffic
PII in model outputs Model trained on PII data Add Presidio scrubbing to output layer
API key leakage Keys in logs or responses Mask keys in logging; use vault for key storage
Model weight tampering Unverified downloads Always verify SHA-256; use modelscan
Rate limit bypass Per-IP not per-user Rate limit on authenticated user ID, not IP
Best Practices
-
Never log raw prompts or completions — they may contain PII or sensitive data.
-
Treat LLM output as untrusted input — always encode before rendering in HTML.
-
Use network policies to prevent LLM pods from making outbound internet calls.
-
Rotate API keys quarterly; use short-lived JWT tokens for service-to-service auth.
-
Run modelscan on any model downloaded from the internet before serving.
Related Skills
-
hashicorp-vault - Secrets management for API keys
-
network-security - Network-level controls
-
linux-hardening - Host hardening
-
agent-observability - AI audit logging
-
llm-gateway - Centralized access control