GPU Kubernetes Operations

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "gpu-kubernetes-operations" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-gpu-kubernetes-operations

Run resilient and cost-efficient GPU clusters for production AI workloads.

Key Capabilities

  • NVIDIA device plugin and GPU operator lifecycle

  • MIG partitioning for multi-workload efficiency

  • GPU-aware autoscaling (KEDA, Cluster Autoscaler)

  • Node health checks and proactive remediation

Cluster Baseline

  • Dedicated GPU node pools with taints and tolerations

  • Runtime class and driver/toolkit compatibility checks

  • Local SSD or high-throughput network storage for model weights

  • DCGM metrics exported to Prometheus
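
A minimal sketch of how a workload might target such a node pool, written as a Python function that builds a Pod manifest. The taint key `nvidia.com/gpu`, the RuntimeClass name `nvidia`, the node label `pool: gpu`, and the image name are assumptions for illustration; substitute whatever your cluster actually uses. The `nvidia.com/gpu` resource name is the one advertised by the NVIDIA device plugin.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest that lands on a tainted, dedicated GPU node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed RuntimeClass name; it must exist in the cluster.
            "runtimeClassName": "nvidia",
            # Assumed node pool label.
            "nodeSelector": {"pool": "gpu"},
            # Tolerate the (assumed) taint that keeps non-GPU pods off the pool.
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through limits as an extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("inference", "example.com/model-server:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.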

Scheduling Patterns

  • Use node affinity by GPU type (A10/L4/A100/H100).

  • Separate latency-critical inference from batch training.

  • Spread model replicas across nodes with pod anti-affinity for availability.

  • Reserve headroom for failover and rolling updates.
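
The affinity rules above could be expressed roughly as follows. This sketch assumes nodes carry the `nvidia.com/gpu.product` label (published by GPU Feature Discovery when it is deployed) and that replicas share an `app` label; the product value in the usage example is illustrative, so check the real values with `kubectl get nodes --show-labels`.

```python
def scheduling_affinity(gpu_products, app_label):
    """Return a Pod `affinity` block: select by GPU type, spread replicas by node."""
    return {
        "nodeAffinity": {
            # Hard requirement: only schedule onto nodes with one of these GPU types.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "nvidia.com/gpu.product",
                                "operator": "In",
                                "values": list(gpu_products),
                            }
                        ]
                    }
                ]
            }
        },
        "podAntiAffinity": {
            # Hard requirement: no two replicas of the same app on one node.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        },
    }


if __name__ == "__main__":
    # "NVIDIA-L4" is an assumed label value for illustration.
    print(scheduling_affinity(["NVIDIA-L4"], "llm-inference"))
```

A required anti-affinity rule leaves replicas pending when there are fewer eligible nodes than replicas; `preferredDuringSchedulingIgnoredDuringExecution` is the softer alternative when that trade-off matters.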

Autoscaling Strategy

  • Scale on queue depth and GPU utilization, not on CPU alone.

  • Keep warm spare replicas to mitigate cold starts for large models.

  • Cap burst scaling to avoid quota exhaustion.
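
One way to combine these three rules is a small replica calculator. It is a sketch, not the behaviour of KEDA or the Cluster Autoscaler: every parameter name and default (`target_queue_per_replica`, `target_gpu_util`, `min_replicas`, `max_replicas`, `max_step`) is a hypothetical chosen for illustration. The utilization term follows the proportional rule used by the Horizontal Pod Autoscaler, `ceil(current * metric / target)`.

```python
import math


def desired_replicas(
    current,
    queue_depth,
    gpu_util,
    *,
    target_queue_per_replica=4,
    target_gpu_util=0.7,
    min_replicas=1,
    max_replicas=8,
    max_step=2,
):
    """Pick a replica count from queue depth and GPU utilization (0.0 to 1.0)."""
    # Replicas needed to drain the queue at the target backlog per replica.
    by_queue = math.ceil(queue_depth / target_queue_per_replica)
    # Replicas needed to bring average GPU utilization back to target.
    by_util = math.ceil(current * gpu_util / target_gpu_util)
    # Take the more demanding signal; min_replicas acts as the warm-spare floor.
    wanted = max(by_queue, by_util, min_replicas)
    # Cap the burst: grow by at most max_step per evaluation.
    wanted = min(wanted, current + max_step)
    # Never exceed the quota-derived ceiling.
    return min(wanted, max_replicas)


if __name__ == "__main__":
    print(desired_replicas(current=2, queue_depth=40, gpu_util=0.5))
```

With the defaults, a backlog of 40 requests on 2 replicas asks for 10 replicas by queue depth, but the step cap limits the result to 4 for that evaluation.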

Reliability Checks

  • ECC error and Xid monitoring

  • GPU memory pressure alerts

  • Driver mismatch detection during upgrades

  • Pod preemption impact analysis
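
The first three checks can be sketched as a function over one per-GPU metrics sample. The `DCGM_FI_DEV_*` names are DCGM field identifiers as exposed by dcgm-exporter (availability depends on which fields your exporter is configured to collect); the `driver_version` key, the 0.9 threshold, and the alert strings are hypothetical. Used plus free framebuffer is treated as total memory, which is an approximation because it ignores reserved memory.

```python
def evaluate_gpu_health(sample, *, fb_used_ratio_limit=0.9, expected_driver=None):
    """Return a list of alert names for one GPU's metric sample (a dict)."""
    alerts = []

    # DCGM reports the most recent Xid code; 0 means none has been seen.
    xid = int(sample.get("DCGM_FI_DEV_XID_ERRORS", 0))
    if xid != 0:
        alerts.append(f"xid:{xid}")

    # Any volatile double-bit ECC error is worth an alert.
    if sample.get("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 0) > 0:
        alerts.append("ecc_double_bit")

    # Framebuffer usage ratio (values are in MiB).
    used = sample.get("DCGM_FI_DEV_FB_USED", 0.0)
    free = sample.get("DCGM_FI_DEV_FB_FREE", 0.0)
    total = used + free
    if total > 0 and used / total >= fb_used_ratio_limit:
        alerts.append("memory_pressure")

    # Hypothetical key: compare the node's driver against the expected version.
    if expected_driver is not None and sample.get("driver_version") != expected_driver:
        alerts.append("driver_mismatch")

    return alerts


if __name__ == "__main__":
    sample = {
        "DCGM_FI_DEV_XID_ERRORS": 79,
        "DCGM_FI_DEV_FB_USED": 38000,
        "DCGM_FI_DEV_FB_FREE": 2000,
    }
    print(evaluate_gpu_health(sample))
```

In production the same conditions would more likely live in Prometheus alerting rules; the function form is only meant to make the logic explicit.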

Cost Optimization

  • Prefer MIG slices for smaller inference services.

  • Schedule batch jobs in off-peak windows.

  • Route low-priority traffic to cheaper model tiers.
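
To illustrate the MIG recommendation, the sketch below picks the smallest slice that fits a service's memory need and returns the extended resource name to request. It assumes an A100 40GB and the device plugin's "mixed" MIG strategy, under which slices are advertised as `nvidia.com/mig-<profile>`. The sizes are nominal; usable memory per slice is somewhat lower, so leave a margin.

```python
# Nominal memory (GB) of common A100 40GB MIG profiles, smallest first.
A100_40GB_PROFILES = [
    ("1g.5gb", 5),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("7g.40gb", 40),
]


def smallest_mig_resource(required_gb):
    """Return the resource name of the smallest MIG slice that fits required_gb."""
    for profile, memory_gb in A100_40GB_PROFILES:
        if required_gb <= memory_gb:
            return f"nvidia.com/mig-{profile}"
    # Nothing fits in a single slice: fall back to a whole GPU.
    return "nvidia.com/gpu"


if __name__ == "__main__":
    print(smallest_mig_resource(4))
    print(smallest_mig_resource(12))
```

The returned name goes under `resources.limits` in the container spec, in place of `nvidia.com/gpu`.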

Related Skills

  • llm-inference-scaling - Autoscale inference workloads

  • model-serving-kubernetes - Production model serving patterns

  • gpu-server-management - Host-level GPU management fundamentals

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
