rocm_vllm_deployment

Production-ready vLLM deployment on AMD ROCm GPUs. Combines environment auto-check, model parameter detection, Docker Compose deployment, health verification, and functional testing with comprehensive logging and security best practices.


Install the skill: npx skills add alexhegit/rocm-vllm-deployment

ROCm vLLM Deployment Skill

Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.

Features

  • Environment Auto-Check - Detects and repairs missing dependencies
  • Model Parameter Detection - Auto-reads config.json for optimal settings
  • VRAM Estimation - Calculates memory requirements before deployment
  • Secure Token Handling - Never writes tokens to compose files
  • Structured Output - All logs and test results saved per-model
  • Deployment Reports - Human-readable summary for each deployment
  • Health Verification - Automated health checks and functional tests
  • Troubleshooting Guide - Common issues and solutions

Environment Prerequisites

Recommended (for production): Add to ~/.bash_profile:

# HuggingFace authentication token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Model cache directory (optional)
export HF_HOME="$HOME/models"

# Apply changes
source ~/.bash_profile

Not required for testing: The skill will proceed without these set:

  • HF_TOKEN: Optional — public models work without it; gated models fail at download with clear error
  • HF_HOME: Optional — defaults to /root/.cache/huggingface/hub

Environment Variable Detection

Priority Order:

  1. Explicit parameter (highest) — Provided in task/request (e.g., hf_token: "xxx")
  2. Environment variable — Already set in shell or from parent process
  3. ~/.bash_profile — Source to load variables
  4. Default value (lowest) — HF_HOME defaults to /root/.cache/huggingface/hub

| Variable | Required | If Missing |
|---|---|---|
| HF_TOKEN | Conditional | Continue without token (public models work; gated models fail at download with a clear error) |
| HF_HOME | No | Warning + default: /root/.cache/huggingface/hub |

Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.
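
For concreteness, a minimal shell sketch of this resolution order for HF_HOME (PARAM_HF_HOME stands in for an explicitly passed parameter and is hypothetical; the skill's actual implementation may differ):

# 1. explicit parameter, 2. current environment, 3. ~/.bash_profile, 4. default
if [ -n "${PARAM_HF_HOME:-}" ]; then
  export HF_HOME="$PARAM_HF_HOME"
elif [ -z "${HF_HOME:-}" ] && [ -f "$HOME/.bash_profile" ]; then
  source "$HOME/.bash_profile"
fi
export HF_HOME="${HF_HOME:-/root/.cache/huggingface/hub}"   # lowest-priority default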


Helper Scripts

Location: <skill-dir>/scripts/

check-env.sh

Validate and load environment variables before deployment.

Usage:

# Basic check (HF_TOKEN optional, HF_HOME optional with default)
./scripts/check-env.sh

# Strict mode (HF_HOME required, fails if not set)
./scripts/check-env.sh --strict

# Quiet mode (minimal output, for automation)
./scripts/check-env.sh --quiet

# Test with environment variables
HF_TOKEN="hf_xxx" HF_HOME="/models" ./scripts/check-env.sh

Exit Codes:

| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source ~/.bash_profile) |

Note: This script is optional. You can also directly run source ~/.bash_profile.
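
In automation, the exit code can gate the rest of the pipeline; a minimal sketch:

# Abort early if the environment check reports a critical error
./scripts/check-env.sh --quiet
if [ $? -eq 2 ]; then
  echo "FATAL: environment check failed" >&2
  exit 2
fi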


generate-report.sh

Generate human-readable deployment report after successful deployment.

Usage:

./scripts/generate-report.sh <model-id> <container-name> <port> <status> [model-load-time] [memory-used]

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"

Parameters:

| Parameter | Required | Description |
|---|---|---|
| model-id | Yes | Model ID (with / replaced by -) |
| container-name | Yes | Docker container name |
| port | Yes | Host port for API endpoint |
| status | Yes | Deployment status (e.g., "✅ Success") |
| model-load-time | No | Model loading time in seconds |
| memory-used | No | Memory consumption in GiB |

Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md

Exit Codes:

| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |

Integration: This script is automatically called in Phase 7 of the deployment workflow.


Input Schema

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_id | String | Yes | - | HuggingFace model ID |
| docker_image | String | No | rocm/vllm-dev:nightly | vLLM Docker image |
| tensor_parallel_size | Integer | No | 1 | Number of GPUs |
| port | Integer | No | 9999 | API server port |
| hf_home | String | No | ${HF_HOME} or /root/.cache/huggingface/hub | Model cache directory |
| hf_token | Secret | Conditional | ${HF_TOKEN} | HuggingFace token (optional for public models, required for gated models) |
| max_model_len | Integer | No | Auto-detect | Maximum sequence length |
| gpu_memory_utilization | Float | No | 0.85 | GPU memory utilization fraction |
| auto_install | Boolean | No | true | Auto-install dependencies |
| log_level | String | No | INFO | Logging verbosity |

Output Structure

All deployment artifacts MUST be saved to:

$HOME/vllm-compose/<model-id-slash-to-dash>/

Convert model ID to directory name by replacing / with -:

  • openai/gpt-oss-20b → $HOME/vllm-compose/openai-gpt-oss-20b/
  • Qwen/Qwen3-Coder-Next-FP8 → $HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/
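
In shell, the conversion is a single substitution (a minimal sketch):

# Replace every "/" with "-" to form the per-model directory name
MODEL_ID="openai/gpt-oss-20b"
OUT_DIR="$HOME/vllm-compose/${MODEL_ID//\//-}"
mkdir -p "$OUT_DIR"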

Per-model directory structure:

$HOME/vllm-compose/<model-id>/
├── deployment.log          # Full deployment logs (stdout + stderr)
├── test-results.json       # Functional test results (JSON format)
├── docker-compose.yml      # Generated Docker Compose file
├── .env                    # HF_TOKEN environment (chmod 600, optional)
└── DEPLOYMENT_REPORT.md    # Human-readable deployment summary

File requirements:

  • deployment.log — Capture ALL container logs during deployment
  • test-results.json — Save API response from functional test request
  • DEPLOYMENT_REPORT.md — Generated in Phase 7
  • All three files MUST exist before marking deployment as complete

Execution Workflow

Phase 0: Environment Check & Auto-Repair

Step 0.1: Load Environment Variables

# Source ~/.bash_profile to load HF_HOME and HF_TOKEN
source ~/.bash_profile

# Fall back to the default cache path if HF_HOME is still unset
export HF_HOME="${HF_HOME:-/root/.cache/huggingface/hub}"

If HF_HOME is not defined in ~/.bash_profile, it defaults to /root/.cache/huggingface/hub.

Step 0.2: Create Output Directory

  • Create: $HOME/vllm-compose/<model-id>/

Step 0.3: Initialize Logging

  • All output → $HOME/vllm-compose/<model-id>/deployment.log

Step 0.4: System Checks

  • Detect OS and package manager
  • Check Python, pip, huggingface_hub
  • Check Docker, docker compose
  • Check ROCm tools (rocm-smi/amd-smi)
  • Check GPU access (/dev/kfd, /dev/dri)
  • Check disk space (20GB minimum)
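
A sketch of representative checks (exact commands are assumptions; the skill may implement them differently):

command -v docker >/dev/null || echo "MISSING: docker"
docker compose version >/dev/null 2>&1 || echo "MISSING: docker compose plugin"
command -v rocm-smi >/dev/null 2>&1 || command -v amd-smi >/dev/null 2>&1 || echo "MISSING: ROCm SMI tools"
[ -e /dev/kfd ] && [ -d /dev/dri ] || echo "MISSING: GPU device nodes (/dev/kfd, /dev/dri)"
# Require at least 20 GB free on the filesystem holding $HOME
[ "$(df -BG --output=avail "$HOME" | tail -1 | tr -dc 0-9)" -ge 20 ] || echo "LOW DISK: need 20GB+"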

Phase 1: Model Download

Use HF_HOME from Phase 0 (environment variable or default):

# Download model into the HF cache (huggingface-cli honors HF_HOME)
HF_HOME="$HF_HOME" huggingface-cli download <model_id>

# Or use snapshot_download via Python:
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<model_id>', cache_dir='$HF_HOME')"

Authentication Handling:

| Scenario | Behavior |
|---|---|
| Public model + no token | ✅ Download succeeds |
| Public model + token provided | ✅ Download succeeds |
| Gated model + no token | ❌ Download fails with "authentication required" error |
| Gated model + invalid token | ❌ Download fails with "invalid token" error |
| Gated model + valid token | ✅ Download succeeds |

On Authentication Failure:

echo "ERROR: Model download failed - authentication required"
echo "This model requires a valid HF_TOKEN."
echo ""
echo "Please add to ~/.bash_profile:"
echo "  export HF_TOKEN=\"hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\""
echo "Then run: source ~/.bash_profile"
exit 1
  • Locate model path in HF cache: $HF_HOME/hub/models--<org>--<model-name>/
  • Log download progress to deployment.log

Phase 2: Model Parameter Detection

  • Read config.json from model
  • Auto-detect: max_model_len, hidden_size, num_attention_heads, num_hidden_layers, vocab_size, dtype
  • Validate TP size divides attention heads
  • Estimate VRAM requirement
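
A rough sketch of the last two steps in shell, assuming jq is available; the 12*L*H^2 + 2*V*H parameter count is a standard dense-transformer approximation and covers weights only (KV cache and activations come on top):

TP=1                                   # tensor_parallel_size
H=$(jq .hidden_size config.json)
L=$(jq .num_hidden_layers config.json)
V=$(jq .vocab_size config.json)
HEADS=$(jq .num_attention_heads config.json)

# TP must evenly divide the attention heads
[ $(( HEADS % TP )) -eq 0 ] || { echo "ERROR: TP=$TP does not divide $HEADS heads"; exit 1; }

# Weight-memory floor at 2 bytes/param (fp16/bf16), split across TP GPUs
PARAMS=$(( 12 * L * H * H + 2 * V * H ))
echo "~$(( PARAMS * 2 / TP / 1024 / 1024 / 1024 )) GiB of weights per GPU"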

Phase 3: Docker Compose Configuration

Generate files in output directory:

  • docker-compose.yml → $HOME/vllm-compose/<model-id>/docker-compose.yml

    • Mount HF_HOME as volume (read-only for models)
    • NO hardcoded tokens in compose file
  • .env → $HOME/vllm-compose/<model-id>/.env (optional)

    • Contains: HF_TOKEN=<value>
    • Permissions: chmod 600
    • Only created if user explicitly requests persistent token storage

Volume mount example:

volumes:
  - ${HF_HOME}:/root/.cache/huggingface/hub:ro
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri

Important: Docker Compose reads ${HF_HOME} from the host environment at runtime, so run source ~/.bash_profile before running docker compose.
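
For reference, a hypothetical generated docker-compose.yml (service name, port mapping, ROCm device/group settings, and the serve command are all assumptions; the skill generates its own file):

services:
  vllm:
    image: rocm/vllm-dev:nightly
    ports:
      - "9999:8000"                     # host:container API port
    environment:
      - HF_TOKEN=${HF_TOKEN}            # injected from the host env or .env, never hardcoded
    volumes:
      - ${HF_HOME}:/root/.cache/huggingface/hub:ro
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    command: vllm serve <model_id> --port 8000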

Phase 4: Container Launch

Important: Before deploying, pull the latest image to ensure updates:

docker pull rocm/vllm-dev:nightly

Note: Default port is 9999. Before running docker compose, check that the port is free: ss -tlnp | grep :<port> (no output means the port is available). If the port is in use, specify a different port in docker-compose.yml.

  • Pass HF_TOKEN at runtime: HF_TOKEN=$HF_TOKEN docker compose up -d
  • Wait for container initialization
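
Put together, a minimal launch sequence (using the default port from the input schema):

PORT=9999
# Fail early if the port is taken
if ss -tlnp | grep -q ":$PORT "; then
  echo "ERROR: port $PORT already in use; set a different port in docker-compose.yml" >&2
  exit 1
fi
docker pull rocm/vllm-dev:nightly
HF_TOKEN="$HF_TOKEN" docker compose up -d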

Phase 5: Health Verification

  • Check container status
  • Test /health endpoint
  • Test /v1/models endpoint
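
A sketch of these checks with curl (/health and /v1/models are the standard routes of vLLM's OpenAI-compatible server; PORT and the container name are placeholders):

PORT=9999
docker ps --filter "name=<container-name>" --format '{{.Names}}: {{.Status}}'
curl -sf "http://localhost:$PORT/health" && echo "health: OK"
curl -s "http://localhost:$PORT/v1/models" | jq -r '.data[].id'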

Phase 6: Functional Testing

  • Run completion test via /v1/chat/completions API
  • Save response to: $HOME/vllm-compose/<model-id>/test-results.json
  • Verify response contains valid completion
  • Log deployment complete → Append to deployment.log
  • Deployment is complete only when both files exist:
    • deployment.log
    • test-results.json
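
A sketch of the functional test (the model value must match an ID returned by /v1/models; jq -e fails the step if no completion came back):

curl -s "http://localhost:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model_id>", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}' \
  | tee "$HOME/vllm-compose/<model-id>/test-results.json" \
  | jq -e '.choices[0].message.content'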

Phase 7: Deployment Report

Generate human-readable deployment report using the helper script.

Step 7.1: Extract Deployment Metrics

# Parse deployment.log for metrics
MODEL_LOAD_TIME=$(grep -o "model loading took [0-9.]* seconds" deployment.log | grep -o '[0-9.]*' || echo "N/A")
MEMORY_USED=$(grep -o "took [0-9.]* GiB memory" deployment.log | grep -o '[0-9.]*' || echo "N/A")

Step 7.2: Generate Report

# Execute the report generation script
<skill-dir>/scripts/generate-report.sh \
  "<model-id>" \
  "<container-name>" \
  "<port>" \
  "<status>" \
  "$MODEL_LOAD_TIME" \
  "$MEMORY_USED"

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"

Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md

Report Contents:

  • Output structure verification (file checklist)
  • Deployment summary table (health, test, metrics)
  • Test results (request/response preview)
  • Environment configuration
  • Quick commands for operations

Completion Criteria:

  • DEPLOYMENT_REPORT.md exists in output directory
  • Report contains all required sections
  • All file checks show ✅

Security Best Practices

  1. Never commit tokens to version control — Add .env to .gitignore
  2. Use .env files with chmod 600 — Restrict access to owner only
  3. Mask tokens in logs — Show only first 10 chars: ${TOKEN:0:10}...
  4. Pass tokens at runtime — HF_TOKEN=$HF_TOKEN docker compose up -d
  5. Store tokens in ~/.bash_profile — For production environments, set HF_TOKEN in user's shell config
  6. Set token for gated models — HF_TOKEN is validated at download time; set in ~/.bash_profile for production
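
A sketch combining several of these practices (OUT_DIR is the hypothetical per-model output directory from the Output Structure section):

# Owner-only .env, kept out of version control
umask 177
printf 'HF_TOKEN=%s\n' "$HF_TOKEN" > "$OUT_DIR/.env"
chmod 600 "$OUT_DIR/.env"
grep -qx '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore

# Log only a masked prefix of the token
echo "Using token: ${HF_TOKEN:0:10}..."

# Pass the token at runtime instead of baking it into the compose file
HF_TOKEN="$HF_TOKEN" docker compose up -d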

Troubleshooting

Environment Variables

| Issue | Solution |
|---|---|
| HF_TOKEN not set | Add export HF_TOKEN="hf_xxx" to ~/.bash_profile, then source ~/.bash_profile. Or provide it via parameter. |
| HF_HOME not set | Defaults to /root/.cache/huggingface/hub. For production, add export HF_HOME="/path" to ~/.bash_profile. |
| ~/.bash_profile not found | Create ~/.bash_profile and add the environment variables. |
| Changes not taking effect | Run source ~/.bash_profile or restart the terminal. |
| HF_TOKEN provided but download still fails | Token may be invalid or lack access to the model. Verify the token at https://huggingface.co/settings/tokens. |

Model Download

| Issue | Solution |
|---|---|
| Authentication required (gated model) | Set HF_TOKEN in ~/.bash_profile or provide it via parameter. Ensure the token has access to the model. |
| Model not found | Verify the model ID is correct (case-sensitive) and the model exists on HuggingFace. |
| Download timeout | Check the network connection. Large models may take time. |

Deployment

| Issue | Solution |
|---|---|
| hf CLI not found | pip install huggingface_hub |
| Docker Compose fails | Use docker compose (no hyphen) |
| GPU access fails | Add user to render group: sudo usermod -aG render $USER |
| Port in use | Change the port parameter |
| OOM | Reduce gpu_memory_utilization |

Cleanup

cd $HOME/vllm-compose/<model-id>
docker compose down

Status Check

Check deployment status and logs:

# View deployment directory
ls -la $HOME/vllm-compose/<model-id>/

# View live logs
tail -f $HOME/vllm-compose/<model-id>/deployment.log

# View test results
cat $HOME/vllm-compose/<model-id>/test-results.json

# Check container status
docker ps | grep <model-id>

# Verify environment variables
echo "HF_TOKEN: ${HF_TOKEN:0:10}..."
echo "HF_HOME: $HF_HOME"

Quick Start (Production)

Step 1: Add environment variables to ~/.bash_profile

# HuggingFace token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Recommended: Custom model storage path (production)
export HF_HOME="/data/models/huggingface"

# Apply changes
source ~/.bash_profile

Step 2: Verify environment is ready

# Run the environment check script to confirm variables are loaded
./scripts/check-env.sh

# Expected output:
# === Environment Ready ===
# Summary:
#   HF_TOKEN: hf_xxxxxx...
#   HF_HOME:  /data/models/huggingface

Step 3: Run deployment

# The skill will automatically:
# 1. Source ~/.bash_profile to load HF_HOME and HF_TOKEN
# 2. Use HF_TOKEN and HF_HOME from environment (or ~/.bash_profile, or defaults)
# 3. Proceed without token for public models
# 4. Fail at download time with clear error if gated model requires token

Version History

| Version | Changes |
|---|---|
| 1.0.0 | Initial release |
