Vast.ai Workflow
Overview
Use this skill for provider-level Vast.ai automation only. Keep it project-agnostic and parameter-driven.
Scope Boundaries
- Handle only Vast API workflow logic: offers, create, poll, SSH attach, lifecycle, billing.
- Do not embed project-specific repo paths, branch names, labels, or training commands.
- If the user asks for workload-specific execution (for example Reparo training), switch to the corresponding workload skill after infrastructure provisioning.
Dialog-First Required Fields
Before any create/order call, confirm required runtime fields in dialog. If the user does not provide them, propose defaults and ask for confirmation.
Required fields and default suggestions:
api_key_source: defaultVAST_API_KEYenvironment variable.- If missing, suggest loading from a local file path the user confirms (for example
keys/.vast_env).
- If missing, suggest loading from a local file path the user confirms (for example
instance_type(offer filter): defaultgpu_name="RTX 4090".- Ask whether to lock by exact GPU, VRAM minimum, region, max price, and reliability.
image: defaultpytorch/pytorch:latest.disk_gb: default64.count: default1.label: do not ask by default.- Derive from Vast account identity when possible (nickname or email local-part).
- Fallback label:
vast-user.
Suggested minimal question set:
- Which action now (
offers,create,status,ssh,destroy,billing)? - Confirm runtime fields (
image,disk_gb, instance filter). - Confirm API key source (
VAST_API_KEYenv vs file path).
If the user approves defaults, proceed without extra questions.
Workflow
1) Preflight
- Check active instances first.
- If any instances are unintentionally active, ask whether to stop/destroy before creating a new one.
- Ensure API key is loaded and never echo it.
2) Find offers
- Query offers (
/bundlesis commonly reliable in practice). - Apply user filters plus mandatory
rentable=trueandrented=false. - Sort by user objective (price, performance, reliability).
- Use user policy for selection:
- Default: choose cheapest valid offer.
- Optional conservative mode: choose second cheapest.
3) Create instance
- Create from a single selected offer (
PUT /asks/{id}). - Use confirmed
image, resolvedlabel, anddisk_gb. - Parse and store the returned instance ID.
- If create fails (
no_such_askor already taken), re-query and retry once with next candidate.
Label resolution order:
- Load profile/account endpoint data and use nickname when present.
- If nickname missing, use the email local-part when available.
- If neither field is available, use
vast-user.
4) Readiness and access
- Poll instance status until running.
- Add SSH key to account and attach to instance.
- Retry SSH readiness for up to 2 minutes before declaring failure.
5) Lifecycle and billing
- Support start/stop/restart when available.
- Use destroy promptly when requested to prevent costs.
- Verify final state and check usage/invoices when needed.
6) Runtime preflight (optional, when user plans to run workloads)
- Validate runtime inside the instance instead of trusting image names/tags.
- Confirm GPU visibility and basic CUDA readiness before starting long jobs.
- Verify required Python package versions explicitly and report them in logs.
- If workload depends on optional CUDA extensions (for example
flash_attnorxformers), check availability first and fail fast on mismatch. - Prefer explicit environment checks over silent fallbacks to slower backends.
Request Templates
Use explicit, reproducible requests and validate JSON before chaining calls.
curl -sS -L -G "https://console.vast.ai/api/v0/<endpoint>/" \
--data-urlencode "api_key=$VAST_API_KEY" \
--data-urlencode "<param>=<value>"
curl -sS -L -X PUT "https://console.vast.ai/api/v0/asks/$OFFER_ID/?api_key=$VAST_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"image\":\"$IMAGE\",\"disk\":$DISK_GB,\"label\":\"$LABEL\"}"
Error Handling
401/403: API key missing, invalid, or not authorized for requested org/action.429: rate limit; retry with backoff.4xx: invalid endpoint or params; re-check request shape and required fields.5xx: provider-side issue; retry with backoff and re-validate state.- Empty/parse failures: retry once, then save response to a temp file and parse from file.
Resources
references/api.md: concise endpoint map and safe calling checklist.- Treat
references/api.mdas the working source for endpoint details and refresh it against official docs regularly.