jarvislabs-gpu

Agent guide for running and monitoring GPU experiments with the jl CLI on JarvisLabs.ai.

JarvisLabs GPU CLI (jl) — Agent Guide

Install the CLI

Install the JarvisLabs CLI as a tool:

uv tool install jarvislabs

Alternative Python install:

pip install jarvislabs

Getting Started

Verify auth with jl status --json before doing anything. If not logged in, use jl setup --token <token> --yes. You can also authenticate via export JL_API_KEY="...".
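
A minimal pre-flight sketch, assuming jl status exits non-zero when not authenticated (verify against your CLI version); $JL_TOKEN is a placeholder:

# Check auth; fall back to token setup if the status call fails
jl status --json || jl setup --token "$JL_TOKEN" --yes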

Use --help on any command to discover flags (e.g., jl run --help, jl create --help). If something goes wrong, use jl run logs, jl run status, and jl exec to diagnose — don't guess.

Mental Model

  • Machine commands (jl create/list/get/pause/resume/destroy/rename/ssh/exec/upload/download) = GPU instance lifecycle and access.
  • jl run = managed job on an instance. Uploads code, sets up a Python environment, runs your script in the background with log tracking.
  • jl exec = run any command on an instance. Use for system checks (nvidia-smi, ps, df), debugging failed runs, inspecting files, or any raw shell access. No environment setup, no tracking. This is your escape hatch when jl run doesn't cover your use case.

Instances

Creating

jl create --gpu L4 --storage 40 --yes --json
jl create --gpu L4 --spot --yes --json
jl create --vm --cpu --yes --json

--gpu is required for GPU instances. Use --spot only for GPU containers, not GPU VMs or CPU VMs. CPU VMs are created with --vm --cpu; omit --vcpus/--ram to use the smallest available CPU plan from the backend. Run jl create --help for all available flags.
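
A sketch for capturing the new instance ID from the create response. It assumes jq is installed locally and that the create JSON exposes a machine_id field; inspect the real payload before relying on it:

# Create an instance and keep the returned ID for later commands
MACHINE_ID=$(jl create --gpu L4 --storage 40 --yes --json | jq -r '.machine_id')
jl get "$MACHINE_ID" --json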

Lifecycle

Instances have three states that matter: Running (billing active), Paused (compute billing stopped, storage billing continues, data persists), Destroyed (everything deleted).

jl pause <id> --yes --json      # stop compute billing, keep data
jl resume <id> --yes --json     # restart a paused instance
jl destroy <id> --yes --json    # permanently delete

Resume caveats:

  • Resume is region-locked — an instance always resumes in its original region. GPU swaps are only possible within that region.
  • Resume may return a new machine_id. Always use the returned ID for subsequent operations (a capture sketch follows this list).
  • Spot resume is explicit: pass --spot when you want a paused GPU container to resume as spot. Without --spot, resume is on-demand.
  • CPU VM resume uses the CPU VM backend path. Pass both --vcpus and --ram if you want to change CPU size on resume.
  • Run jl resume --help for all available flags (GPU swap, storage expansion, rename, etc.).
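
A capture sketch for the new-ID caveat, assuming jq and that the resume response exposes the (possibly new) ID as machine_id; verify the field name against real output:

# Resume and immediately adopt the returned ID; the old one may be stale
MACHINE_ID=$(jl resume "$MACHINE_ID" --yes --json | jq -r '.machine_id')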

SSH, exec, upload, and download only work on Running instances.

Regions & GPUs

Valid region codes for new instances: IN2, EU1.

IN1 is winding down. New instances and filesystems can no longer be created in IN1. Existing IN1 instances can still be resumed, paused, destroyed, and renamed; existing IN1 filesystems can still be listed, resized, and removed. Guide users with IN1 resources to the migration doc: https://docs.jarvislabs.ai/in1-migration.

If --region is omitted, the CLI picks a region based on GPU availability.

  • EU1: H100 and H200 only, single-GPU launches only right now, 100 GB minimum storage (auto-bumped)
  • VM template: IN2 and EU1 only, requires at least one SSH key, 100 GB minimum storage

Run jl gpus to check GPU availability and pricing. Output shows GPU Containers and GPU VMs tables with separate availability for each. Spot prices are shown only for GPU containers.

Run jl resources when you also need CPU VM availability and pricing. It shows GPU containers, GPU VMs, and CPU VMs, with one shared available/unavailable legend at the end.

How to read jl gpus --json availability (a parsing sketch follows the list):

  • num_free_devices: free GPUs on that server. These can be used for normal creates, and also for spot creates when spot_price is present.
  • effective_num_free_devices: GPUs available for on-demand creates on that server, including GPUs currently used by spot instances that can be preempted.
  • These counts are per server. They are not the complete regional GPU capacity.
  • workload_type tells which launch type the row belongs to:
    • "container" means use it for normal GPU container creates.
    • "vm" means use it for GPU VM creates.
    • null means the same row applies to both containers and VMs.
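
A parsing sketch, assuming jq and that jl gpus --json returns an array of per-server rows carrying the fields above; adjust the paths if the payload nests them differently:

# Servers with on-demand capacity usable for GPU container creates
jl gpus --json | jq '[.[] | select(.effective_num_free_devices > 0
  and (.workload_type == "container" or .workload_type == null))]'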

Ports & Services

Container instances expose default HTTP ports (each gets its own HTTPS URL):

  • 8889: JupyterLab (url field)
  • 7007: IDE (vs_url field)
  • 6006: Available on generic templates like pytorch (endpoints[0])

VM instances (jl create --gpu ... --vm) get SSH-only access. VMs require at least one SSH key registered (jl ssh-key add). Use ssh_command from jl get <id> --json.

To expose a service (FastAPI, Gradio, etc.), bind to 0.0.0.0:6006 — it's accessible via endpoints[0] on generic templates. Use --http-ports "7860,8080" at creation or resume to expose custom ports. Custom port URLs appear in endpoints after the default 6006 entry.

Run jl get <id> --json to find all service URLs (url, vs_url, endpoints).
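
A lookup sketch, assuming jq and that the jl get payload exposes url, vs_url, and endpoints at the top level; inspect the real JSON if the shape differs:

# Pull every service URL for an instance in one shot
jl get "$MACHINE_ID" --json | jq '{jupyter: .url, ide: .vs_url, endpoints: .endpoints}'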

Managed Runs

How jl run works

jl run uploads your code to an instance, sets up a Python environment, and runs your script in the background with log and exit code tracking. You need either --on <machine_id> (existing instance) or --gpu <type> (creates a fresh instance).

run_id is tracked locally under ~/.jl/runs/. All run management commands (logs, status, stop, list) depend on these local records. Start and monitor runs from the same machine.

Run targets

  • train.py: Uploads to <home>/train.py, runs in <home>/ with shared venv at $HOME/.venv
  • . or ./project with --script train.py: Rsyncs the directory to <home>/<dirname>/, runs inside it with project venv at <home>/<dirname>/.venv
  • No target, command after --: No upload. Runs from ~. If $HOME/.venv exists (from a previous file run), its bin/ is prepended to PATH so python and pip resolve to venv versions. Otherwise uses system Python.

Only .py and .sh file targets are supported. For other file types, use a directory target or jl upload + jl exec. Directory targets require rsync installed locally.

Note: File targets with the same basename overwrite each other on the remote (e.g., foo/train.py and bar/train.py both land at /home/train.py). Use directory targets for projects with nested structure.
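
A sketch of the jl upload route for a file jl run will not accept as a target; here the uploaded file is consumed by a separate tracked run (the paths and --config flag are illustrative):

# Push a non-.py file, then reference it from a tracked file run
jl upload <id> ./config.yaml /home/config.yaml
jl run train.py --on <id> --json --yes -- --config /home/config.yaml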

Pass script arguments after --:

jl run train.py --on <id> --json --yes -- --epochs 50 --lr 0.001

Environment & setup

jl run manages a Python venv on the remote instance. Template packages (torch, etc.) are inherited via --system-site-packages — no need to install them. Venvs persist under the remote home directory across pause/resume.

Venv locations:

  • File targets: shared instance-level venv at $HOME/.venv. All file runs share it — deps installed for one script are available to all.
  • Directory targets: per-project venv at <home>/<dirname>/.venv. Isolated per project.
  • Command mode: no venv is created. If $HOME/.venv exists from a previous file run, python and pip automatically resolve to it via PATH prepend.

How dependencies get installed:

  • Directory targets — auto-detected. If your directory has requirements.txt or pyproject.toml (with [project]), deps are installed automatically. No flag needed.
  • File targets — no auto-detection. Pass --requirements requirements.txt if you need extra packages.
  • --requirements <file> — overrides auto-detection. Uploads and installs the specified file instead.
  • --setup <command> — runs a shell command before your script (e.g., --setup "pip install flash-attn"). Runs inside the venv for file/dir targets, raw for command mode.

# Directory — auto-detects requirements.txt
jl run . --script train.py --on <id> --json --yes

# Single file — pass requirements explicitly
jl run train.py --on <id> --requirements requirements.txt --json --yes

# Extra setup command
jl run . --script train.py --on <id> --setup "pip install flash-attn" --json --yes

Command mode — when you pass a raw command after -- with no file or directory target. Useful when code already exists on the instance (e.g., uploaded via jl upload, written via jl exec, or left by a previous run). If $HOME/.venv exists from a prior file run, its bin/ is prepended to PATH so python and pip resolve to venv versions. You still get jl run log tracking (logs, status, stop), which is the main advantage over jl exec. --requirements is not supported in command mode.

Important: Command mode runs from ~ (the remote shell home). Use absolute paths or cd explicitly for scripts in specific directories.

jl run --on <id> --json --yes -- python3 /home/train.py
jl run --on <id> --json --yes -- sh -lc 'cd /home && torchrun --nproc_per_node=2 train.py'

Running on an existing instance

jl run train.py --on <machine_id> --json --yes
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes

Lifecycle flags (--keep, --pause, --destroy) are not allowed with --on — the instance is not touched after the run.

Running on a fresh instance

jl run . --script train.py --gpu L4 --keep --json --yes
jl run . --script train.py --gpu L4 --spot --keep --json --yes

Creates a new instance, uploads code, runs the script. Additional flags: --spot (fresh GPU containers only), --vm (VM instead of container, auto-bumps storage to 100GB, disallows --template and --http-ports), --template (default: pytorch; run jl templates --json to list available), --storage (default: 40GB), --num-gpus (default: 1), --region, --http-ports.

Lifecycle rules for fresh instances:

  • With --json or --no-follow: --keep is required. The CLI rejects --pause and --destroy because it returns immediately and cannot apply lifecycle actions later. Use --keep and have the agent pause or destroy the instance after the run completes.
  • Without --json or --no-follow (human mode): the CLI stays attached, streams logs, and applies lifecycle when the run finishes. Default lifecycle is --pause.

Use separate jl create when you need to inspect GPU availability, reuse machines across runs, or attach filesystems/scripts.

Monitoring & Control

Reading logs

The primary monitoring command:

jl run logs <run_id> --tail 50

Always use --tail N — without it, the entire log file is returned and can be enormous.

The output includes a header and footer with run state (in non-follow, non-JSON mode):

--- run r_abc | machine 123 | running ---

step=100 loss=2.31
step=200 loss=2.11

--- still running | log: /home/jl-runs/r_abc/output.log ---

When done, the footer shows the final state:

--- succeeded | exit code: 0 | log: /home/jl-runs/r_abc/output.log ---

Or on failure:

--- failed | exit code: 1 | log: /home/jl-runs/r_abc/output.log ---

If the instance is paused, missing, or SSH is unavailable, jl run logs fails before printing any output. Use jl run status <run_id> --json to check those states.

The agent monitoring loop

  1. Start detached: jl run ... --json --yes — extract run_id and machine_id from JSON
  2. Early check (catch fast failures): sleep 15 && jl run logs <run_id> --tail 30 — if footer says failed, fix and retry immediately
  3. Steady-state polling: sleep 120 && jl run logs <run_id> --tail 50
  4. Read log body for loss values, errors, or progress
  5. Check footer:
    • still running → repeat step 3
    • succeeded | exit code: 0 → download results
    • failed | exit code: N → read error, fix, start a new run

Cadence: 60-120s (short experiments), 180-300s (long training), 300-600s (very long runs).
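
The loop above as a shell sketch. It assumes jq, and the footer strings it greps for come from the log format shown under Reading logs; confirm the exact wording against real output:

RUN_ID=$(jl run . --script train.py --on "$MACHINE_ID" --json --yes | jq -r '.run_id')
sleep 15 && jl run logs "$RUN_ID" --tail 30       # early check for fast failures
while true; do
  sleep 120
  OUT=$(jl run logs "$RUN_ID" --tail 50)
  echo "$OUT"
  echo "$OUT" | grep -q 'still running' || break  # footer flips when the run ends
done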

Checking status

jl run status <run_id> --json

Returns run state, machine_id, exit_code, lifecycle_policy, launch_command, and more. Without --refresh, jl run list shows state as "saved" (a sentinel, not a real run state). Use --refresh or --status to get live state.
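
A field-extraction sketch, assuming jq; the field names are the ones listed above, but verify them against a real payload:

# Distill the run record to the fields an agent usually branches on
jl run status "$RUN_ID" --json | jq '{state, machine_id, exit_code}'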

Stopping a run

jl run stop <run_id> --json

Kills the entire process group (training script + all child processes). Escalates to SIGKILL if the process doesn't exit after TERM.

System checks via exec

jl exec <id> -- nvidia-smi
jl exec <id> -- ps -ef
jl exec <id> -- df -h

Prefer raw output for jl exec and jl run logs — easier to read and parse. Use --json when you need machine-readable state: create, get, list, run start, run status.

Exit code of the remote command is propagated. For pipes or shell syntax, wrap in sh -lc:

jl exec <id> -- sh -lc 'grep "loss" /path/to/log | tail -5'

File Transfer & Persistence

Upload and download

jl upload <id> ./local /remote          # upload file or directory
jl download <id> /remote ./local         # download file
jl download <id> /remote ./local -r      # download directory

Default destinations: upload without dest → remote home directory. Download without dest → ./<basename> in current local directory.

What persists across pause/resume

The remote home directory (/home/ on containers, /home/<user>/ on VMs) persists. Everything else is ephemeral.

Persists:

  • Files and directories under the home directory
  • $HOME/.venv (shared venv for file runs) and <project>/.venv (per-project venv for directory runs)
  • Attached filesystems (mounted at /home/jl_fs/)
  • Run metadata under <home>/jl-runs/<run_id>/

Lost on pause:

  • System-level installs (apt-get, global pip packages outside the home directory)
  • Files outside the home directory (/tmp, /root, etc.)

Use --setup for system-level reinstalls (e.g., apt-get). Python packages in the venv persist across pause/resume. For recurring system setup, use startup scripts (jl scripts add).
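
A sketch of a recurring system-level reinstall via --setup; the apt package is illustrative:

# Reinstall an apt dependency lost on pause, then run as usual
jl run train.py --on <id> --setup "apt-get update && apt-get install -y ffmpeg" --json --yes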

Remote file paths

<home> is /home/ on containers, /home/<user>/ on VMs.

  • Uploaded files (via jl run): <home>/<filename> (e.g., train.py lands at /home/train.py)
  • Uploaded directories (via jl run): <home>/<directory_name>/
  • Uploaded files (via jl upload): <home>/<filename>
  • Shared venv (file runs): <home>/.venv/
  • Project venv (directory runs): <home>/<directory_name>/.venv/
  • Run metadata: <home>/jl-runs/<run_id>/

Filesystems & supporting commands

Attach a filesystem at creation with --fs-id <id>. Attach a startup script with --script-id <id> (and --script-args). These flags work on both jl create and jl resume.

jl templates --json                    # list available templates
jl ssh-key list --json                 # list registered SSH keys
jl ssh-key add <pubkey-file> --name x  # add SSH key (required for VMs)
jl scripts list --json                 # list startup scripts
jl filesystem list --json              # list filesystems
jl filesystem create --name x --storage 100 --json  # create filesystem
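
A create-and-attach sketch, assuming jq and that the filesystem create response exposes the ID as fs_id; capture the returned value rather than reusing an old one, and make sure regions match yourself:

# Create a filesystem and attach it at instance creation (mounts at /home/jl_fs/)
FS_ID=$(jl filesystem create --name data --storage 100 --json | jq -r '.fs_id')
jl create --gpu L4 --fs-id "$FS_ID" --yes --json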

Filesystem caveats:

  • Region-bound: A filesystem created in IN2 is only visible to IN2 instances.
  • ID changes on edit: Expanding a filesystem (jl filesystem edit) may return a new fs_id. Always use the returned ID.
  • The CLI validates that fs_id exists before creating/resuming, but does not validate region match. Ensure they match yourself.

Agent Workflow (End-to-End)

# 1. Check GPUs and create instance
jl gpus --json
jl create --gpu L4 --storage 50 --yes --json

# 2. Start detached run
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes

# 3. Early check (catch import/syntax/pip failures fast)
sleep 15 && jl run logs <run_id> --tail 30

# 4. Steady-state monitoring (repeat until footer shows succeeded or failed)
sleep 120 && jl run logs <run_id> --tail 50

# 5. Download results (use /home/<user>/ for VMs instead of /home/)
jl download <machine_id> /home/results ./results -r

# 6. Cleanup
jl pause <machine_id> --yes --json

For fresh instances without a pre-created instance:

# Creates instance inline, runs detached — agent must clean up after
jl run . --script train.py --gpu L4 --keep --json --yes
# ... monitor with jl run logs ...
jl pause <machine_id> --yes --json

Error Handling

When --json is active, CLI validation and API failures are emitted as {"error": "..."} to stdout.

Not all non-zero exits use that shape. jl exec --json returns its own structured payload with stdout, stderr, and exit_code fields.

Agent rule:

  • First inspect the JSON shape
  • If it has an error key, treat it as a CLI failure
  • Otherwise inspect command-specific fields (exit_code, state, run_exit_code)
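
The rule above as a shell sketch, assuming jq and a JSON-object payload; the fallback fields are illustrative:

OUT=$(jl run status "$RUN_ID" --json)
if echo "$OUT" | jq -e 'has("error")' >/dev/null; then
  echo "CLI failure: $(echo "$OUT" | jq -r '.error')"   # CLI/API failure shape
else
  echo "$OUT" | jq '{state, exit_code}'                 # command-specific fields
fi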

Anti-Patterns

  • Do not use jl run logs --follow — it blocks forever and will time out. --json is also incompatible with --follow.
  • Always use --json when starting runs — it returns immediately. Without --json, the CLI streams logs and blocks.
  • Do not read full logs without --tail N — can return megabytes of output.
  • Do not poll every few seconds — use 60-600s intervals based on expected run duration.
  • Do not use lifecycle flags (--keep, --pause, --destroy) with --on — they are rejected. Only for fresh instances.
  • Do not use --pause or --destroy with --json for fresh instances — rejected. Use --keep --json and clean up yourself.
  • Do not use jl exec for long-running tasks — it blocks until the command finishes. Use jl run, which runs in the background with log tracking.
  • Do not trust jl run list without --refresh — state shows as "saved" (stale). Use --refresh or --status for live state.
  • Do not assume machine_id is stable after jl resume — it may return a new ID. Always use the returned ID.
  • Do not forget to pause/destroy instances after experiments — they cost money.

Command Discovery

Every command supports --help for full flag details:

jl create --help       jl run --help       jl ssh-key --help
jl resume --help       jl run logs --help  jl filesystem --help
