LlamaGuard - AI Content Moderation

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "llamaguard" with this command: npx skills add davila7/claude-code-templates/davila7-claude-code-templates-llamaguard

LlamaGuard - AI Content Moderation

Quick start

LlamaGuard is a 7B-8B parameter model (depending on the version) specialized for classifying the safety of LLM prompts and responses.

Installation:

pip install transformers torch

Login to HuggingFace (required)

huggingface-cli login

Basic usage:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    prompt_len = input_ids.shape[-1]
    # Decode only the generated verdict, not the echoed prompt
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

Check user input

result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)

Output: "unsafe\nS3" (Criminal Planning)

Common workflows

Workflow 1: Input filtering (prompt moderation)

Check user prompts before sending them to the LLM:

def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category  # Blocked
    else:
        return True, None  # Safe

Example

user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")  # Return an error to the user
else:
    response = llm.generate(user_message)  # Safe: forward to the LLM

Safety categories:

  • S1: Violence & Hate

  • S2: Sexual Content

  • S3: Guns & Illegal Weapons

  • S4: Regulated Substances

  • S5: Suicide & Self-Harm

  • S6: Criminal Planning
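If blocked requests should show a human-readable reason, a small lookup table built from the category list above works. A minimal sketch (the dict name is illustrative):

# Category codes -> labels, taken from the list above
CATEGORY_LABELS = {
    "S1": "Violence & Hate",
    "S2": "Sexual Content",
    "S3": "Guns & Illegal Weapons",
    "S4": "Regulated Substances",
    "S5": "Suicide & Self-Harm",
    "S6": "Criminal Planning",
}

safe, category = check_input("How do I hack a website?")
if not safe:
    print(f"Request blocked: {CATEGORY_LABELS.get(category, category)}")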

Workflow 2: Output filtering (response moderation)

Check LLM responses before showing to user:

def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]

    result = moderate(conversation)

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category
    else:
        return True, None

Example

user_msg = "Tell me about harmful substances" bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg) if not safe: print(f"Response blocked: {category}") # Return generic response return "I cannot provide that information." else: return bot_msg
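Input and output filtering are usually combined into one guarded generation step. A minimal sketch, assuming the check_input and check_output helpers above, with llm.generate standing in for your main model call:

def guarded_generate(user_message):
    # 1. Moderate the prompt
    safe, category = check_input(user_message)
    if not safe:
        return f"Request blocked ({category})."

    # 2. Generate with the main LLM (placeholder call)
    bot_response = llm.generate(user_message)

    # 3. Moderate the response before returning it
    safe, category = check_output(user_message, bot_response)
    if not safe:
        return "I cannot provide that information."
    return bot_response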

Workflow 3: vLLM deployment (fast inference)

Production-ready serving:

from vllm import LLM, SamplingParams

Initialize vLLM

llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

Sampling params

sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt (uses the HuggingFace tokenizer loaded earlier)
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)

    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text

Batch moderation

chats = [ [{"role": "user", "content": "How to make bombs?"}], [{"role": "user", "content": "What's the weather?"}], [{"role": "user", "content": "Tell me about drugs"}] ]

prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)

for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text}")

Throughput: ~50-100 requests/sec on single A100
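To check throughput on your own hardware, time the batch call above (a sketch; numbers will vary with GPU and prompt length):

import time

start = time.perf_counter()
results = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start
print(f"{len(prompts) / elapsed:.1f} requests/sec")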

Workflow 4: API endpoint (FastAPI)

Serve as moderation API:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/LlamaGuard-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")  # used for chat templating
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate") def moderate_endpoint(request: ModerationRequest): prompt = tokenizer.apply_chat_template(request.messages, tokenize=False) output = llm.generate([prompt], sampling_params)[0]

result = output.outputs[0].text
is_safe = result.startswith("safe")
category = None if is_safe else result.split("\n")[1] if "\n" in result else None

return {
    "safe": is_safe,
    "category": category,
    "full_output": result
}

Run: uvicorn api:app --host 0.0.0.0 --port 8000

Usage:

curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'

Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}

Workflow 5: NeMo Guardrails integration

Use with NVIDIA Guardrails:

from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard

Configure NeMo Guardrails

config = RailsConfig.from_content("""
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")

Add LlamaGuard integration

llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")

Use with automatic moderation

response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])

Automatically blocked by LlamaGuard

When to use vs alternatives

Use LlamaGuard when:

  • Need pre-trained moderation model

  • Want high accuracy (94-95%)

  • Have GPU resources (7-8B model)

  • Need detailed safety categories

  • Building production LLM apps

Model versions:

  • LlamaGuard 1 (7B): Original, 6 categories

  • LlamaGuard 2 (8B): MLCommons hazard taxonomy, 11 categories

  • LlamaGuard 3 (8B): Latest (2024), 14 categories, multilingual

Use alternatives instead:

  • OpenAI Moderation API: Simpler, API-based, free

  • Perspective API: Google's toxicity detection

  • NeMo Guardrails: More comprehensive safety framework

  • Constitutional AI: Training-time safety

Common issues

Issue: Model access denied

Login to HuggingFace:

huggingface-cli login

Enter your token

Accept license on model page: https://huggingface.co/meta-llama/LlamaGuard-7b

Issue: High latency (>500ms)

Use vLLM for 10× speedup:

from vllm import LLM

llm = LLM(model="meta-llama/LlamaGuard-7b")

Latency: 500ms → 50ms

Enable tensor parallelism:

llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)

2× faster on 2 GPUs

Issue: False positives

Use threshold-based filtering:

Get probability of "unsafe" token

outputs = model.generate(..., return_dict_in_generate=True, output_scores=True)
unsafe_prob = torch.softmax(outputs.scores[0][0], dim=-1)[unsafe_token_id]

if unsafe_prob > 0.9:  # High-confidence threshold
    verdict = "unsafe"
else:
    verdict = "safe"
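unsafe_token_id has to be looked up from the tokenizer. One way to get it, as a sketch: how "unsafe" tokenizes depends on the tokenizer, so verify the id against a real model output before relying on it.

# First token id of the string "unsafe" in the LlamaGuard tokenizer (verify for your tokenizer)
unsafe_token_id = tokenizer("unsafe", add_special_tokens=False).input_ids[0]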

Issue: OOM on GPU

Use 8-bit quantization:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

Memory: 14GB → 7GB
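The Hardware requirements section below also lists an INT4 footprint; a 4-bit load follows the same pattern (a sketch using the bitsandbytes 4-bit options in transformers):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store weights in 4-bit, compute in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b",
    quantization_config=quantization_config,
    device_map="auto",
)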

Advanced topics

Custom categories: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.

Performance benchmarks: See references/benchmarks.md for accuracy comparison with other moderation APIs and latency optimization.

Deployment guide: See references/deployment.md for Sagemaker, Kubernetes, and scaling strategies.

Hardware requirements

  • GPU: NVIDIA T4/A10/A100

  • VRAM:
      • FP16: 14GB (7B model)
      • INT8: 7GB (quantized)
      • INT4: 4GB (QLoRA)

  • CPU: Possible but slow (10× latency)

  • Throughput: 50-100 req/sec (A100)

Latency (single GPU):

  • HuggingFace Transformers: 300-500ms

  • vLLM: 50-100ms

  • Batched (vLLM): 20-50ms per request
