airunway-aks-setup

Set up AI Runway on AKS — from bare cluster to running model. Covers cluster verification, controller install, GPU assessment, provider setup, and first deployment. WHEN: "setup AI Runway", "onboard AKS cluster", "install AI Runway", "airunway setup", "deploy model to AKS", "GPU inference on AKS", "KAITO setup on AKS", "run LLM on AKS", "vLLM on AKS", "set up model serving on AKS", "AI Runway controller".

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "airunway-aks-setup" with this command: npx skills add microsoft/azure-skills/microsoft-azure-skills-airunway-aks-setup

AI Runway AKS Setup

This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides skip-to-step N to resume from a specific phase.

Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.

Prerequisites

This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the azure-kubernetes skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.

Quick Reference

PropertyValue
Best forEnd-to-end AI Runway onboarding on AKS
CLI toolskubectl, make, curl
MCP toolsNone
Related skillsazure-kubernetes (cluster setup), azure-diagnostics (troubleshooting)

When to Use This Skill

Use this skill when the user wants to:

  • Set up AI Runway on an existing AKS cluster from scratch
  • Install the AI Runway controller and CRDs
  • Assess GPU hardware compatibility for model deployment
  • Choose and install an inference provider (KAITO, Dynamo, KubeRay)
  • Deploy their first AI model to AKS via AI Runway
  • Resume a partially-complete AI Runway setup from a specific step

MCP Tools

This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.

Rules

  1. Execute steps in sequence — load the reference for each step as you reach it
  2. Report cluster state at each step: ✓ healthy, ✗ missing/failed
  3. Ask for user confirmation before any install or deployment action
  4. If a step is already complete, report status and skip to the next step
  5. If the user provides skip-to-step N, start at step N; assume prior steps are complete

Steps

#StepReference
1Cluster Verification — context check, node inventory, GPU detectionstep-1-verify.md
2Controller Installation — CRD + controller deploymentstep-2-controller.md
3GPU Assessment — detect GPU models, flag dtype/attention constraintsstep-3-gpu.md
4Provider Setup — recommend and install inference providerstep-4-provider.md
5First Deployment — pick a model, deploy, verify Readystep-5-deploy.md
6Summary — recap, smoke test, next stepsstep-6-summary.md

Error Handling

Error / SymptomLikely CauseRemediation
No kubeconfig contextNot connected to a clusterRun az aks get-credentials or equivalent
Controller in CrashLoopBackOffConfig or RBAC issuekubectl logs -n airunway-system -l control-plane=controller-manager --previous
Provider not readyImage pull or RBAC issuekubectl logs <pod-name> -n <namespace> for the provider pod
ModelDeployment stuck in PendingGPU scheduling failure or provider not readykubectl describe modeldeployment <name> -n <namespace> events
bfloat16 errors at inferenceT4 or V100 lacks bfloat16 supportAdd --dtype float16 to serving args

For full error handling and rollback procedures, see troubleshooting.md.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

azure-ai

Use for Azure AI: Search, Speech, OpenAI, Document Intelligence. Helps with search, vector/hybrid search, speech-to-text, text-to-speech, transcription, OCR. WHEN: AI Search, query search, vector search, hybrid search, semantic search, speech-to-text, text-to-speech, transcribe, OCR, convert text to speech.

Repository Source
220K651Microsoft
General

azure-deploy

Execute Azure deployments for ALREADY-PREPARED applications that have existing .azure/deployment-plan.md and infrastructure files. DO NOT use this skill when the user asks to CREATE a new application — use azure-prepare instead. This skill runs azd up, azd deploy, terraform apply, and az deployment commands with built-in error recovery. Requires .azure/deployment-plan.md from azure-prepare and validated status from azure-validate. WHEN: "run azd up", "run azd deploy", "execute deployment", "push to production", "push to cloud", "go live", "ship it", "bicep deploy", "terraform apply", "publish to Azure", "launch on Azure". DO NOT USE WHEN: "create and deploy", "build and deploy", "create a new app", "set up infrastructure", "create and deploy to Azure using Terraform" — use azure-prepare for these.

Repository Source
219.9K651Microsoft
General

azure-prepare

Prepare Azure apps for deployment (infra Bicep/Terraform, azure.yaml, Dockerfiles). Use for create/modernize or create+deploy; not cross-cloud migration (use azure-cloud-migrate). WHEN: "create app", "build web app", "create API", "create serverless HTTP API", "create frontend", "create back end", "build a service", "modernize application", "update application", "add authentication", "add caching", "host on Azure", "create and deploy", "deploy to Azure", "deploy to Azure using Terraform", "deploy to Azure App Service", "deploy to Azure App Service using Terraform", "deploy to Azure Container Apps", "deploy to Azure Container Apps using Terraform", "generate Terraform", "generate Bicep", "function app", "timer trigger", "service bus trigger", "event-driven function", "containerized Node.js app", "social media app", "static portfolio website", "todo list with frontend and API", "prepare my Azure application to use Key Vault", "managed identity".

Repository Source
219.8K651Microsoft
General

azure-diagnostics

Debug Azure production issues on Azure using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors.

Repository Source
219.8K651Microsoft