GPU Server Management

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "gpu-server-management" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-gpu-server-management

Provision, configure, and monitor NVIDIA GPU servers for AI inference and training workloads.

When to Use This Skill

Use this skill when:

  • Setting up a new GPU server for LLM inference or model training

  • Installing or upgrading NVIDIA drivers and CUDA toolkit

  • Configuring Docker with NVIDIA Container Toolkit for GPU workloads

  • Partitioning A100/H100 GPUs with MIG for multi-tenant workloads

  • Troubleshooting GPU errors, driver issues, or thermal throttling

Prerequisites

  • Ubuntu 22.04 LTS (recommended) or RHEL 8/9

  • NVIDIA GPU (A10G, A100, H100, RTX 4090, or L40S recommended)

  • Root or sudo access

  • Internet access for package downloads

Driver Installation (Ubuntu)

Remove old drivers

sudo apt purge -y 'nvidia*' 'cuda*' 'libcuda*'
sudo apt autoremove -y

Add the NVIDIA Container Toolkit package repository

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
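
The nvidia-driver-560 and cuda-toolkit-12-6 packages installed below are published in NVIDIA's CUDA apt repository, which is usually added alongside. A minimal sketch of the documented Ubuntu 22.04 setup (verify the current keyring URL in NVIDIA's install guide before running):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb   # installs NVIDIA's signing key and repo entry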

sudo apt update
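
If the driver modules are built through DKMS, headers for the running kernel need to be present first; on a fresh server the usual prerequisites are:

sudo apt install -y linux-headers-$(uname -r) build-essential dkms   # needed for DKMS module builds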

Install latest driver (560.x as of 2025)

sudo apt install -y nvidia-driver-560 cuda-toolkit-12-6
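
Optionally, hold the packages so unattended upgrades do not replace the driver underneath a pinned CUDA stack (package names follow whatever versions were installed above):

sudo apt-mark hold nvidia-driver-560 cuda-toolkit-12-6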

Install NVIDIA Container Toolkit (Docker GPU support)

sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify

nvidia-smi
nvcc --version
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

Post-Install Configuration

Enable persistence mode (reduces driver initialization latency)

sudo nvidia-smi -pm 1
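
Note that nvidia-smi -pm 1 does not survive a reboot on its own. Where the driver ships the persistence daemon, enabling its service is the more durable option (the service name can vary by distribution):

sudo systemctl enable --now nvidia-persistenced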

Set power limit (reduce heat/noise on inference servers)

sudo nvidia-smi -pl 350 # watts; check TDP for your GPU model
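
To find the valid range before choosing a value, query the power section for the device:

nvidia-smi -q -d POWER | grep -i 'power limit'   # shows default, min, max and enforced limits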

Disable ECC on inference servers (frees ~6% VRAM, less safe)

sudo nvidia-smi --ecc-config=0 # requires reboot

Check P2P/NVLink topology for multi-GPU training

sudo nvidia-smi topo -m # check NVLink topology

GPU Health Monitoring

Real-time monitoring (like htop for GPUs)

watch -n 1 nvidia-smi
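
For a compact rolling view of power, utilization, clocks, and memory per device, nvidia-smi also has a device-monitor mode:

nvidia-smi dmon -s pucm   # one row per GPU per sample interval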

Detailed stats

nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free,power.draw,clocks.current.graphics \
  --format=csv --loop=1

DCGM — production monitoring daemon (for clusters)

sudo apt install -y datacenter-gpu-manager
sudo systemctl start dcgm
dcgmi discovery -l   # list GPUs
dcgmi diag -r 1      # quick health check
dcgmi diag -r 3      # full diagnostic (takes ~20 min)

Check GPU errors (XID errors — important for stability)

sudo dmesg | grep -iE "NVRM|nvidia|xid"
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total --format=csv,noheader

Prometheus GPU Metrics (DCGM Exporter)

Deploy DCGM Exporter for Prometheus scraping

docker run -d \
  --name dcgm-exporter \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  --restart unless-stopped \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest

Key metrics exposed:

DCGM_FI_DEV_GPU_UTIL - GPU utilization %

DCGM_FI_DEV_MEM_COPY_UTIL - Memory bandwidth utilization

DCGM_FI_DEV_FB_USED - Framebuffer memory used (MB)

DCGM_FI_DEV_SM_CLOCK - SM clock speed (MHz)

DCGM_FI_DEV_GPU_TEMP - Temperature (°C)

DCGM_FI_DEV_POWER_USAGE - Power draw (W)

DCGM_FI_DEV_XID_ERRORS - XID error count (0 = healthy)
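
A minimal Prometheus scrape job for the exporter above could look like the following sketch. It assumes a Prometheus config at /etc/prometheus/prometheus.yml with scrape_configs as its last section and dcgm-exporter reachable on localhost; otherwise paste the job under scrape_configs by hand:

cat <<'EOF' | sudo tee -a /etc/prometheus/prometheus.yml
  - job_name: dcgm-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9400']
EOF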

MIG Partitioning (A100/H100)

MIG (Multi-Instance GPU) allows slicing one GPU into isolated smaller GPUs.

Enable MIG mode (requires reboot or restart of all processes)

sudo nvidia-smi -mig 1
sudo systemctl restart nvidia-persistenced

List available MIG profiles (A100 80GB example)

nvidia-smi mig -lgip

1g.10gb — 1 slice, 10GB (max 7 instances)

2g.20gb — 2 slices, 20GB (max 3 instances)

3g.40gb — 3 slices, 40GB (max 2 instances)

7g.80gb — full GPU, 80GB (max 1 instance)

Create MIG instances (e.g., 3× 2g.20gb for three isolated tenants; 2g.20gb allows at most 3 instances)

sudo nvidia-smi mig -cgi 2g.20gb,2g.20gb,2g.20gb -C

List created instances

nvidia-smi mig -lgi
nvidia-smi mig -lcgi

Use in Docker

docker run --gpus '"device=MIG-GPU-xxx/0/0"' ...
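
On current drivers each MIG instance also gets its own UUID, which can be passed to --gpus instead of the path form above. List them with nvidia-smi (the output below is illustrative, with placeholder UUIDs):

nvidia-smi -L
# GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
#   MIG 2g.20gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)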

Disable MIG

sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi
sudo nvidia-smi -mig 0

Kernel & OS Tuning for GPU Servers

Increase file descriptor limits

echo '* soft nofile 1048576' | sudo tee -a /etc/security/limits.conf
echo '* hard nofile 1048576' | sudo tee -a /etc/security/limits.conf

Disable transparent huge pages (reduces latency jitter)

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Persist via rc.local or systemd unit:

cat <<'EOF' | sudo tee /etc/rc.local
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
nvidia-smi -pm 1
exit 0
EOF
sudo chmod +x /etc/rc.local
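
The systemd alternative mentioned above can be a oneshot unit; a minimal sketch (the unit name and the commands it runs are illustrative, adjust to your tuning):

cat <<'EOF' | sudo tee /etc/systemd/system/gpu-tuning.service
[Unit]
Description=GPU server tuning (disable THP, enable persistence mode)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/bash -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled; echo never > /sys/kernel/mm/transparent_hugepage/defrag; nvidia-smi -pm 1'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-tuning.service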

Disable auto boost for consistent clocks (supported on older GPU generations; newer GPUs manage boost automatically)

sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi --auto-boost-permission=0

Multi-GPU Topology Check

Check NVLink and PCIe topology

nvidia-smi topo -m

Output shows the interconnect type between each GPU pair:

NV# = connection over a bonded set of # NVLink links (e.g., NV12 on A100 SXM, NV18 on H100 SXM)

PHB = PCIe plus a host bridge (slower; avoid for tensor-parallel training)

PIX = at most one PCIe bridge (fast)

Bandwidth test between GPUs (p2pBandwidthLatencyTest from the CUDA samples; since CUDA 11.6 the samples ship separately at github.com/NVIDIA/cuda-samples, but older toolkit installs bundle the binary)

/usr/local/cuda/samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
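
NVLink link state and per-link speed can also be checked directly, without building the samples:

nvidia-smi nvlink -s   # per-GPU NVLink status and link speeds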

Common Issues

Issue | Cause | Fix

nvidia-smi: command not found | Driver not installed | Follow the driver installation steps above

Driver version mismatch | CUDA/driver incompatibility | Check the compatibility matrix at developer.nvidia.com

GPU temperature >85°C | Poor airflow or fan failure | Check nvidia-smi -q -d TEMPERATURE; reseat the cooler

XID 79 errors | GPU hardware error | Run dcgmi diag -r 3; may need GPU replacement

failed to open device in container | Container toolkit not configured | Run nvidia-ctk runtime configure --runtime=docker

Low PCIe bandwidth | Wrong slot or power limit | Check nvidia-smi -q

Best Practices

  • Always enable persistence mode (nvidia-smi -pm 1 ) — reduces first-request latency.

  • Monitor XID errors; persistent XID 79/94 indicates hardware failure.

  • For training: use NVLink-connected GPUs; for inference: PCIe is usually fine.

  • Set up DCGM alerts on temperature >80°C and power draw near TDP (an example alert rule follows this list).

  • Use MIG for multi-tenant inference to provide GPU isolation between models.
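
A Prometheus alerting rule for the temperature threshold above might look like the following sketch. The rule file path and 80°C threshold are illustrative, and it assumes the DCGM exporter metrics from the monitoring section; reference the file from rule_files in prometheus.yml:

cat <<'EOF' | sudo tee /etc/prometheus/rules/gpu-temperature.yml
groups:
  - name: gpu
    rules:
      - alert: GpuTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} above 80C for 5 minutes"
EOF

Validate the file with promtool check rules before reloading Prometheus.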

Related Skills

  • vllm-server - LLM inference on GPUs

  • llm-fine-tuning - GPU training setup

  • linux-hardening - Secure the host OS

  • prometheus-grafana - Metrics dashboards
