gpu-kubernetes-operations | V50.AI

GPU Kubernetes Operations

Safety Notice

This listing is imported from skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.

Install skill "gpu-kubernetes-operations" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-gpu-kubernetes-operations

Run resilient and cost-efficient GPU clusters for production AI workloads.

Key Capabilities

NVIDIA device plugin and GPU Operator lifecycle
MIG partitioning for multi-workload efficiency
GPU-aware autoscaling (KEDA, Cluster Autoscaler)
Node health checks and proactive remediation

Cluster Baseline

Dedicated GPU node pools with taints and tolerations
Runtime class and driver/toolkit compatibility checks
Local SSD or high-throughput network storage for model weights
DCGM metrics exported to Prometheus
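
As a rough illustration of the taint-and-toleration baseline, the sketch below builds a Pod manifest as a plain Python dict. The resource name `nvidia.com/gpu` is the one advertised by the NVIDIA device plugin; the taint key, the `nvidia` runtime class name, the pod name, and the image URL are assumptions chosen for the example and should be replaced with whatever the cluster actually uses.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, runtime_class="nvidia"):
    """Build a Pod manifest (plain dict) that can land on a tainted GPU node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed RuntimeClass name; check what is installed in the cluster.
            "runtimeClassName": runtime_class,
            # Assumed taint key on the GPU pool: nvidia.com/gpu with effect NoSchedule.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through limits on the extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical workload name and image, for illustration only.
manifest = gpu_pod_manifest("llm-inference", "registry.example.com/llm-server:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.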

Scheduling Patterns

Use node affinity by GPU type (A10/L4/A100/H100).
Separate latency-critical inference from batch training.
Spread model replicas across nodes with pod anti-affinity for availability.
Reserve headroom for failover and rolling updates.
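
One way to express the affinity patterns above is an `affinity` stanza for a pod template, sketched here as a Python dict. It assumes nodes carry the `nvidia.com/gpu.product` label (normally set by GPU Feature Discovery) and that replicas share an `app` label; the product strings and the app name are placeholders, so check the real values with `kubectl get nodes --show-labels`.

```python
def gpu_affinity(gpu_products, app_label):
    """Return a pod-spec `affinity` block: require a GPU type, spread replicas by host."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                # Assumed node label; verify it exists on your nodes.
                                "key": "nvidia.com/gpu.product",
                                "operator": "In",
                                "values": list(gpu_products),
                            }
                        ]
                    }
                ]
            }
        },
        "podAntiAffinity": {
            # No two pods with the same app label on the same node.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        },
    }


# Placeholder product strings and app name, for illustration only.
affinity = gpu_affinity(["NVIDIA-L4", "NVIDIA-A10"], "chat-inference")
```

Using the `required...` form makes both rules hard constraints; the `preferredDuringSchedulingIgnoredDuringExecution` variants are a softer option when capacity is tight.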

Autoscaling Strategy

Scale on queue depth and GPU utilization, not CPU alone.
Keep warm spare replicas to mitigate cold starts for large models.
Cap burst scaling to avoid quota exhaustion.
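
The strategy above can be sketched as a small replica calculation. This is not KEDA's or the HPA's actual implementation, only an illustration in the same spirit: compute a replica count per signal, take the larger, then cap the step size and the total. All thresholds and defaults are hypothetical.

```python
import math


def desired_replicas(
    current,
    queue_depth,
    gpu_util,
    *,
    target_queue_per_replica=10,  # hypothetical: requests each replica should hold
    target_gpu_util=0.7,          # hypothetical: target average GPU utilization
    min_replicas=2,               # floor doubles as the warm-spare pool
    max_replicas=20,              # hard cap to stay inside GPU quota
    max_step_up=4,                # burst cap per scaling decision
):
    """Pick a replica count from queue depth and GPU utilization, with burst caps."""
    by_queue = math.ceil(queue_depth / target_queue_per_replica)
    # Proportional rule: scale current replicas by observed/target utilization.
    by_gpu = math.ceil(current * gpu_util / target_gpu_util)
    wanted = max(by_queue, by_gpu)
    # Limit how fast we grow so a spike cannot exhaust quota in one step.
    wanted = min(wanted, current + max_step_up)
    return max(min_replicas, min(wanted, max_replicas))


print(desired_replicas(4, queue_depth=200, gpu_util=0.9))   # → 8 (burst-capped)
print(desired_replicas(4, queue_depth=5, gpu_util=0.2))     # → 2 (warm floor)
print(desired_replicas(18, queue_depth=500, gpu_util=0.95)) # → 20 (quota cap)
```

CPU does not appear in the calculation at all, which reflects the point that GPU-bound services can be saturated while CPU usage stays low.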

Reliability Checks

ECC error and Xid monitoring
GPU memory pressure alerts
Driver mismatch detection during upgrades
Pod preemption impact analysis
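
For Xid monitoring, a minimal sketch is to scan kernel log lines for the `NVRM: Xid` marker that the NVIDIA driver emits. The regular expression below assumes the common `NVRM: Xid (<device>): <code>, ...` layout, the sample lines are made-up approximations rather than captured logs, and the set of codes treated as critical (48 for double-bit ECC errors, 79 for a GPU that has fallen off the bus) is an illustrative choice, not a complete policy.

```python
import re

# Assumed log layout: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+),")

# Illustrative subset of Xid codes to page on; extend per NVIDIA's Xid documentation.
CRITICAL_XIDS = {48, 79}


def parse_xid_events(log_lines):
    """Extract Xid events from kernel log lines and flag the critical ones."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match is None:
            continue  # not an Xid line
        code = int(match.group("code"))
        events.append(
            {
                "device": match.group("device"),
                "xid": code,
                "critical": code in CRITICAL_XIDS,
            }
        )
    return events


# Synthetic sample lines, for illustration only.
sample = [
    "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
    "kernel: usb 1-1: new high-speed USB device",
    "kernel: NVRM: Xid (PCI:0000:5e:00): 31, pid=4321, Ch 00000008, MMU Fault",
]
events = parse_xid_events(sample)
for event in events:
    print(event)
```

In a cluster with DCGM metrics already exported to Prometheus, alerting on the exporter's Xid metric is usually simpler than parsing logs; the parser is mainly useful for node-level health checks.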

Cost Optimization

Prefer MIG slices for smaller inference services.
Schedule batch jobs in off-peak windows.
Route low-priority traffic to cheaper model tiers.
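
A back-of-the-envelope calculation shows why MIG slices help small inference services. The sketch assumes each service fits in one slice and uses seven slices per GPU, which is the maximum an A100 supports with its smallest profile; the hourly price is a hypothetical figure, not a quote from any provider.

```python
import math


def gpus_needed(num_services, slices_per_gpu=1):
    """GPUs required when each service occupies one slice (or one whole GPU)."""
    return math.ceil(num_services / slices_per_gpu)


def hourly_cost(num_services, gpu_hourly_price, slices_per_gpu=1):
    """Total hourly GPU spend for the given packing density."""
    return gpus_needed(num_services, slices_per_gpu) * gpu_hourly_price


services = 10
price = 3.0  # hypothetical price per GPU-hour

whole_gpu = hourly_cost(services, price)                  # one service per GPU
mig_sliced = hourly_cost(services, price, slices_per_gpu=7)  # seven slices per GPU

print(whole_gpu)   # → 30.0
print(mig_sliced)  # → 6.0
```

The saving only holds if each service's memory and compute needs fit within a slice; larger models still need whole GPUs.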

Related Skills

llm-inference-scaling - Autoscale inference workloads
model-serving-kubernetes - Production model serving patterns
gpu-server-management - Host-level GPU management fundamentals

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
