ROCm vLLM Deployment Skill
Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.
Features
- Environment Auto-Check - Detects and repairs missing dependencies
- Model Parameter Detection - Auto-reads config.json for optimal settings
- VRAM Estimation - Calculates memory requirements before deployment
- Secure Token Handling - Never writes tokens to compose files
- Structured Output - All logs and test results saved per-model
- Deployment Reports - Human-readable summary for each deployment
- Health Verification - Automated health checks and functional tests
- Troubleshooting Guide - Common issues and solutions
Environment Prerequisites
Recommended (for production): Add to ~/.bash_profile:
# HuggingFace authentication token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Model cache directory (optional)
export HF_HOME="$HOME/models"
# Apply changes
source ~/.bash_profile
Not required for testing: The skill will proceed without these set:
- HF_TOKEN: Optional — public models work without it; gated models fail at download with clear error
- HF_HOME: Optional — defaults to /root/.cache/huggingface/hub
Environment Variable Detection
Priority Order:
- Explicit parameter (highest) — Provided in task/request (e.g., hf_token: "xxx")
- Environment variable — Already set in shell or from parent process
- ~/.bash_profile — Source to load variables
- Default value (lowest) — HF_HOME defaults to /root/.cache/huggingface/hub
| Variable | Required | If Missing |
|---|---|---|
| HF_TOKEN | Conditional | Continue without token (public models work; gated models fail at download with clear error) |
| HF_HOME | No | Warning + Default — Use /root/.cache/huggingface/hub |
Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.
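The resolution order can be sketched in shell as follows (a minimal sketch of the priority order above, not the actual check-env.sh implementation; resolve_env is an illustrative name):

```bash
# Sketch of the documented priority order (illustrative; not the actual
# check-env.sh implementation).
resolve_env() {
  # 1-2. An explicit parameter or an already-exported variable wins.
  # 3. Otherwise try to load values from ~/.bash_profile.
  if [ -z "${HF_HOME:-}" ] && [ -f "$HOME/.bash_profile" ]; then
    # shellcheck disable=SC1090
    source "$HOME/.bash_profile"
  fi
  # 4. Fall back to the documented default for HF_HOME; HF_TOKEN stays
  #    unset (public models work; gated models fail at download time).
  export HF_HOME="${HF_HOME:-/root/.cache/huggingface/hub}"
}
```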
Helper Scripts
Location: <skill-dir>/scripts/
check-env.sh
Validate and load environment variables before deployment.
Usage:
# Basic check (HF_TOKEN optional, HF_HOME optional with default)
./scripts/check-env.sh
# Strict mode (HF_HOME required, fails if not set)
./scripts/check-env.sh --strict
# Quiet mode (minimal output, for automation)
./scripts/check-env.sh --quiet
# Test with environment variables
HF_TOKEN="hf_xxx" HF_HOME="/models" ./scripts/check-env.sh
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source ~/.bash_profile) |
Note: This script is optional. You can also directly run source ~/.bash_profile.
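For automation, the exit codes can be consumed like this (a sketch using the documented --quiet flag and exit codes):

```bash
# Gate a pipeline on the environment check.
if ./scripts/check-env.sh --quiet; then
  echo "Environment loaded or defaulted; continuing"
else
  echo "Critical: environment check failed (exit 2)" >&2
  exit 2
fi
```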
generate-report.sh
Generate human-readable deployment report after successful deployment.
Usage:
./scripts/generate-report.sh <model-id> <container-name> <port> <status> [model-load-time] [memory-used]
# Example:
./scripts/generate-report.sh \
"Qwen-Qwen3-0.6B" \
"vllm-qwen3-0-6b" \
"8001" \
"✅ Success" \
"3.6" \
"1.2"
Parameters:
| Parameter | Required | Description |
|---|---|---|
| model-id | Yes | Model ID (with / replaced by -) |
| container-name | Yes | Docker container name |
| port | Yes | Host port for API endpoint |
| status | Yes | Deployment status (e.g., "✅ Success") |
| model-load-time | No | Model loading time in seconds |
| memory-used | No | Memory consumption in GiB |
Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |
Integration: This script is automatically called in Phase 7 of the deployment workflow.
Input Schema
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_id | String | Yes | - | HuggingFace model ID |
| docker_image | String | No | rocm/vllm-dev:nightly | vLLM Docker image |
| tensor_parallel_size | Integer | No | 1 | Number of GPUs |
| port | Integer | No | 9999 | API server port |
| hf_home | String | No | ${HF_HOME} or /root/.cache/huggingface/hub | Model cache directory |
| hf_token | Secret | Conditional | ${HF_TOKEN} | HuggingFace token (optional for public models, required for gated models) |
| max_model_len | Integer | No | Auto-detect | Maximum sequence length |
| gpu_memory_utilization | Float | No | 0.85 | GPU memory utilization |
| auto_install | Boolean | No | true | Auto-install dependencies |
| log_level | String | No | INFO | Logging verbosity |
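For illustration, a payload matching this schema might look like the following (hypothetical: the actual invocation format depends on how the skill is driven, and params.json is an illustrative file name):

```bash
# Hypothetical parameter payload matching the Input Schema above.
cat > params.json <<'EOF'
{
  "model_id": "Qwen/Qwen3-0.6B",
  "docker_image": "rocm/vllm-dev:nightly",
  "tensor_parallel_size": 1,
  "port": 8001,
  "gpu_memory_utilization": 0.85,
  "log_level": "INFO"
}
EOF
```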
Output Structure
All deployment artifacts MUST be saved to:
$HOME/vllm-compose/<model-id-slash-to-dash>/
Convert model ID to directory name by replacing / with -:
- openai/gpt-oss-20b → $HOME/vllm-compose/openai-gpt-oss-20b/
- Qwen/Qwen3-Coder-Next-FP8 → $HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/
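In shell, the conversion is a single substitution:

```bash
MODEL_ID="openai/gpt-oss-20b"
MODEL_DIR="${MODEL_ID//\//-}"            # replaces every "/" with "-"
OUT_DIR="$HOME/vllm-compose/$MODEL_DIR"  # openai-gpt-oss-20b
mkdir -p "$OUT_DIR"
```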
Per-model directory structure:
$HOME/vllm-compose/<model-id>/
├── deployment.log # Full deployment logs (stdout + stderr)
├── test-results.json # Functional test results (JSON format)
├── docker-compose.yml # Generated Docker Compose file
├── .env # HF_TOKEN environment (chmod 600, optional)
└── DEPLOYMENT_REPORT.md # Human-readable deployment summary
File requirements:
- deployment.log — Capture ALL container logs during deployment
- test-results.json — Save API response from functional test request
- DEPLOYMENT_REPORT.md — Generated in Phase 7
- All three files MUST exist before marking deployment as complete
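A completeness check before marking the deployment done might look like this (sketch; <model-id> is the converted directory name):

```bash
# Verify the required artifacts exist before declaring success.
OUT_DIR="$HOME/vllm-compose/<model-id>"
for f in deployment.log test-results.json DEPLOYMENT_REPORT.md; do
  [ -f "$OUT_DIR/$f" ] || { echo "Missing artifact: $OUT_DIR/$f" >&2; exit 1; }
done
echo "All required artifacts present"
```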
Execution Workflow
Phase 0: Environment Check & Auto-Repair
Step 0.1: Load Environment Variables
# Source ~/.bash_profile to load HF_HOME and HF_TOKEN
source ~/.bash_profile
# If HF_HOME is not defined, it defaults to /root/.cache/huggingface/hub
Step 0.2: Create Output Directory
- Create: $HOME/vllm-compose/<model-id>/
Step 0.3: Initialize Logging
- All output → $HOME/vllm-compose/<model-id>/deployment.log
Step 0.4: System Checks
- Detect OS and package manager
- Check Python, pip, huggingface_hub
- Check Docker, docker compose
- Check ROCm tools (rocm-smi/amd-smi)
- Check GPU access (/dev/kfd, /dev/dri)
- Check disk space (20GB minimum)
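A condensed version of these checks (a sketch; the real Phase 0 also attempts auto-repair):

```bash
# Condensed system checks (sketch).
command -v docker >/dev/null      || { echo "Docker missing" >&2; exit 1; }
docker compose version >/dev/null || { echo "docker compose missing" >&2; exit 1; }
command -v rocm-smi >/dev/null || command -v amd-smi >/dev/null \
                                  || { echo "ROCm tools missing" >&2; exit 1; }
[ -e /dev/kfd ] && [ -e /dev/dri ] \
                                  || { echo "GPU devices not accessible" >&2; exit 1; }
# Python dependency (auto_install behavior)
python3 -c "import huggingface_hub" 2>/dev/null || pip install huggingface_hub
# 20GB minimum free disk (KiB blocks)
[ "$(df --output=avail -k "$HOME" | tail -1)" -ge $((20 * 1024 * 1024)) ] \
                                  || { echo "Need >= 20GB free disk" >&2; exit 1; }
```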
Phase 1: Model Download
Use HF_HOME from Phase 0 (environment variable or default):
# Download model into the HF cache (files land under $HF_HOME/hub/models--<org>--<model>)
HF_HOME="$HF_HOME" huggingface-cli download <model_id>
# Or use snapshot_download via Python (cache_dir expects the hub directory):
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<model_id>', cache_dir='$HF_HOME/hub')"
Authentication Handling:
| Scenario | Behavior |
|---|---|
| Public model + no token | ✅ Download succeeds |
| Public model + token provided | ✅ Download succeeds |
| Gated model + no token | ❌ Download fails with "authentication required" error |
| Gated model + invalid token | ❌ Download fails with "invalid token" error |
| Gated model + valid token | ✅ Download succeeds |
On Authentication Failure:
echo "ERROR: Model download failed - authentication required"
echo "This model requires a valid HF_TOKEN."
echo ""
echo "Please add to ~/.bash_profile:"
echo " export HF_TOKEN=\"hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\""
echo "Then run: source ~/.bash_profile"
exit 1
- Locate model path in HF cache: $HF_HOME/hub/models--<org>--<model-name>/
- Log download progress to deployment.log
Phase 2: Model Parameter Detection
- Read config.json from model
- Auto-detect: max_model_len, hidden_size, num_attention_heads, num_hidden_layers, vocab_size, dtype
- Validate TP size divides attention heads
- Estimate VRAM requirement
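A rough sketch of this step (assumes a standard decoder-only config.json; the 12·L·h² weight-count heuristic is approximate and ignores KV cache and activations; <rev> is the snapshot revision placeholder):

```bash
# Read config.json, validate TP, and estimate weight VRAM (sketch).
MODEL_PATH="$HF_HOME/hub/models--<org>--<model>/snapshots/<rev>"
TP=1
python3 - "$MODEL_PATH/config.json" "$TP" <<'PY'
import json, sys
cfg = json.load(open(sys.argv[1])); tp = int(sys.argv[2])
h, L = cfg["hidden_size"], cfg["num_hidden_layers"]
heads, vocab = cfg["num_attention_heads"], cfg["vocab_size"]
assert heads % tp == 0, f"tensor_parallel_size {tp} must divide {heads} heads"
params = 12 * L * h * h + vocab * h    # rough decoder-only estimate
print(f"~{params * 2 / 2**30:.1f} GiB weights at fp16/bf16 "
      "(KV cache and activations come on top)")
PY
```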
Phase 3: Docker Compose Configuration
Generate files in output directory:
- docker-compose.yml → $HOME/vllm-compose/<model-id>/docker-compose.yml
  - Mount HF_HOME as volume (read-only for models)
  - NO hardcoded tokens in compose file
- .env → $HOME/vllm-compose/<model-id>/.env (optional)
  - Contains: HF_TOKEN=<value>
  - Permissions: chmod 600
  - Only created if user explicitly requests persistent token storage
Volume and device mapping example:
volumes:
  - ${HF_HOME}:/root/.cache/huggingface/hub:ro
devices:
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri
(Device nodes are passed via devices: so the container is granted device-cgroup access; bind-mounting them as volumes is not sufficient.)
Important: Docker Compose reads ${HF_HOME} from the host environment at runtime. Run source ~/.bash_profile before running docker compose.
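A minimal generated compose file might look like the following (a sketch; the service and container names are illustrative, and the command assumes vLLM's OpenAI-compatible server entrypoint):

```bash
# Sketch of a generated docker-compose.yml (illustrative names/values).
cat > "$HOME/vllm-compose/<model-id>/docker-compose.yml" <<'EOF'
services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: vllm-<model-id>
    command: vllm serve <model_id> --port 9999 --gpu-memory-utilization 0.85
    environment:
      - HF_TOKEN=${HF_TOKEN}   # resolved from the host env at runtime, never hardcoded
    ports:
      - "9999:9999"
    ipc: host
    group_add:
      - video
    volumes:
      - ${HF_HOME}:/root/.cache/huggingface/hub:ro
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
EOF
```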
Phase 4: Container Launch
Important: Before deploying, pull the latest image to ensure updates:
docker pull rocm/vllm-dev:nightly
Note: Default port is 9999. Before running docker compose, check if port is available: ss -tlnp | grep :<port>. If port is in use, specify a different port in docker-compose.yml.
- Pass HF_TOKEN at runtime: HF_TOKEN=$HF_TOKEN docker compose up -d
- Wait for container initialization
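For example (polling the health endpoint until the server answers; the 5-minute timeout is an arbitrary choice):

```bash
cd "$HOME/vllm-compose/<model-id>"
HF_TOKEN=$HF_TOKEN docker compose up -d

# Poll the health endpoint until the server is up (up to ~5 minutes).
for i in $(seq 1 60); do
  if curl -sf "http://localhost:<port>/health" >/dev/null; then
    echo "Server is up"; break
  fi
  [ "$i" -eq 60 ] && { echo "Timed out waiting for server" >&2; exit 1; }
  sleep 5
done
```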
Phase 5: Health Verification
- Check container status
- Test /health endpoint
- Test /v1/models endpoint
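For example:

```bash
# Container status
docker ps --filter "name=<container-name>"
# Liveness: returns HTTP 200 when the engine is ready
curl -sf http://localhost:<port>/health && echo "healthy"
# Served models: should list <model_id>
curl -s http://localhost:<port>/v1/models | python3 -m json.tool
```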
Phase 6: Functional Testing
- Run completion test via /v1/chat/completions API (see the example below)
- Save response to: $HOME/vllm-compose/<model-id>/test-results.json
- Verify response contains valid completion
- Log deployment complete → Append to deployment.log
- Deployment is complete only when both files exist: deployment.log and test-results.json
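A minimal functional test that saves the raw response as required (sketch; max_tokens is kept small so the check stays fast):

```bash
# Send one chat completion and persist the response for the report.
curl -s http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model_id>",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }' | tee "$HOME/vllm-compose/<model-id>/test-results.json"
```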
Phase 7: Deployment Report
Generate human-readable deployment report using the helper script.
Step 7.1: Extract Deployment Metrics
# Parse deployment.log for metrics
MODEL_LOAD_TIME=$(grep -o "model loading took [0-9.]* seconds" deployment.log | grep -o '[0-9.]*' || echo "N/A")
MEMORY_USED=$(grep -o "took [0-9.]* GiB memory" deployment.log | grep -o '[0-9.]*' || echo "N/A")
Step 7.2: Generate Report
# Execute the report generation script
<skill-dir>/scripts/generate-report.sh \
"<model-id>" \
"<container-name>" \
"<port>" \
"<status>" \
"$MODEL_LOAD_TIME" \
"$MEMORY_USED"
# Example:
./scripts/generate-report.sh \
"Qwen-Qwen3-0.6B" \
"vllm-qwen3-0-6b" \
"8001" \
"✅ Success" \
"3.6" \
"1.2"
Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md
Report Contents:
- Output structure verification (file checklist)
- Deployment summary table (health, test, metrics)
- Test results (request/response preview)
- Environment configuration
- Quick commands for operations
Completion Criteria:
- DEPLOYMENT_REPORT.md exists in output directory
- Report contains all required sections
- All file checks show ✅
Security Best Practices
- Never commit tokens to version control — Add .env to .gitignore
- Use .env files with chmod 600 — Restrict access to owner only
- Mask tokens in logs — Show only first 10 chars: ${TOKEN:0:10}...
- Pass tokens at runtime — HF_TOKEN=$HF_TOKEN docker compose up -d
- Store tokens in ~/.bash_profile — For production environments, set HF_TOKEN in the user's shell config
- Set token for gated models — HF_TOKEN is validated at download time; set it in ~/.bash_profile for production
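Creating the optional .env safely might look like this (sketch):

```bash
# Create .env with owner-only permissions before writing the token.
cd "$HOME/vllm-compose/<model-id>"
touch .env && chmod 600 .env
printf 'HF_TOKEN=%s\n' "$HF_TOKEN" > .env
echo ".env" >> .gitignore            # never commit the token
echo "Token stored: ${HF_TOKEN:0:10}..."
```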
Troubleshooting
Environment Variables
| Issue | Solution |
|---|---|
| HF_TOKEN not set | Add export HF_TOKEN="hf_xxx" to ~/.bash_profile, then source ~/.bash_profile. Or provide via parameter. |
| HF_HOME not set | Defaults to /root/.cache/huggingface/hub. For production, add export HF_HOME="/path" to ~/.bash_profile. |
| ~/.bash_profile not found | Create ~/.bash_profile and add environment variables. |
| Changes not taking effect | Run source ~/.bash_profile or restart terminal. |
| HF_TOKEN provided but download still fails | Token may be invalid or lack access to the model. Verify token at https://huggingface.co/settings/tokens |
Model Download
| Issue | Solution |
|---|---|
| Authentication required (gated model) | Set HF_TOKEN in ~/.bash_profile or provide via parameter. Ensure token has access to the model. |
| Model not found | Verify model ID is correct (case-sensitive). Check model exists on HuggingFace. |
| Download timeout | Check network connection. Large models may take time. |
Deployment
| Issue | Solution |
|---|---|
| huggingface-cli not found | pip install huggingface_hub |
| Docker Compose fails | Use docker compose (no hyphen) |
| GPU access fails | Add user to render group: sudo usermod -aG render $USER |
| Port in use | Change port parameter |
| OOM | Reduce gpu_memory_utilization |
Cleanup
cd $HOME/vllm-compose/<model-id>
docker compose down
Status Check
Check deployment status and logs:
# View deployment directory
ls -la $HOME/vllm-compose/<model-id>/
# View live logs
tail -f $HOME/vllm-compose/<model-id>/deployment.log
# View test results
cat $HOME/vllm-compose/<model-id>/test-results.json
# Check container status
docker ps | grep <model-id>
# Verify environment variables
echo "HF_TOKEN: ${HF_TOKEN:0:10}..."
echo "HF_HOME: $HF_HOME"
Quick Start (Production)
Step 1: Add environment variables to ~/.bash_profile
# Required: HuggingFace token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Recommended: Custom model storage path (production)
export HF_HOME="/data/models/huggingface"
# Apply changes
source ~/.bash_profile
Step 2: Verify environment is ready
# Run the environment check (sources ~/.bash_profile if needed)
./scripts/check-env.sh
# Expected output:
# === Environment Ready ===
# Summary:
# HF_TOKEN: hf_xxxxxx...
# HF_HOME: /data/models/huggingface
Step 3: Run deployment
# The skill will automatically:
# 1. Source ~/.bash_profile to load HF_HOME and HF_TOKEN
# 2. Use HF_TOKEN and HF_HOME from environment (or ~/.bash_profile, or defaults)
# 3. Proceed without token for public models
# 4. Fail at download time with clear error if gated model requires token
Version History
| Version | Changes |
|---|---|
| 1.0.0 | Initial release |