JarvisLabs GPU CLI (jl) — Agent Guide
Install The CLI
Install the JarvisLabs CLI as a tool:
uv tool install jarvislabs
Alternative Python install:
pip install jarvislabs
Getting Started
Verify auth with jl status --json before doing anything. If not logged in, use jl setup --token <token> --yes. You can also authenticate via export JL_API_KEY="...".
Use --help on any command to discover flags (e.g., jl run --help, jl create --help). If something goes wrong, use jl run logs, jl run status, and jl exec to diagnose — don't guess.
Mental Model
- Machine commands (jl create/list/get/pause/resume/destroy/rename/ssh/exec/upload/download) = GPU instance lifecycle and access.
- jl run = managed job on an instance. Uploads code, sets up a Python environment, runs your script in the background with log tracking.
- jl exec = run any command on an instance. Use for system checks (nvidia-smi, ps, df), debugging failed runs, inspecting files, or any raw shell access. No environment setup, no tracking. This is your escape hatch when jl run doesn't cover your use case.
Instances
Creating
jl create --gpu L4 --storage 40 --yes --json
jl create --gpu L4 --spot --yes --json
jl create --vm --cpu --yes --json
--gpu is required for GPU instances. Use --spot only for GPU containers, not GPU VMs or CPU VMs. CPU VMs are created with --vm --cpu; omit --vcpus/--ram to use the smallest available CPU plan from the backend. Run jl create --help for all available flags.
Lifecycle
Instances have three states that matter: Running (billing active), Paused (compute billing stopped, storage billing continues, data persists), Destroyed (everything deleted).
jl pause <id> --yes --json # stop compute billing, keep data
jl resume <id> --yes --json # restart a paused instance
jl destroy <id> --yes --json # permanently delete
Resume caveats:
- Resume is region-locked — an instance always resumes in its original region. GPU swaps are only possible within that region.
- Resume may return a new machine_id. Always use the returned ID for subsequent operations.
- Spot resume is explicit: pass --spot when you want a paused GPU container to resume as spot. Without --spot, resume is on-demand.
- CPU VM resume uses the CPU VM backend path. Pass both --vcpus and --ram if you want to change CPU size on resume.
- Run jl resume --help for all available flags (GPU swap, storage expansion, rename, etc.).
SSH, exec, upload, and download only work on Running instances.
Regions & GPUs
Valid region codes for new instances: IN2, EU1.
IN1 is winding down. New instances and filesystems can no longer be created in IN1. Existing IN1 instances can still be resumed, paused, destroyed, and renamed; existing IN1 filesystems can still be listed, resized, and removed. Guide users with IN1 resources to the migration doc: https://docs.jarvislabs.ai/in1-migration.
If --region is omitted, the CLI picks a region based on GPU availability.
| Constraint | Detail |
|---|---|
| EU1 | H100 and H200 only, single-GPU launches only right now, 100 GB minimum storage (auto-bumped) |
| VM template | IN2 and EU1 only, requires at least one SSH key, 100 GB minimum storage |
Run jl gpus to check GPU availability and pricing. Output shows GPU Containers and GPU VMs tables with separate availability for each. Spot prices are shown only for GPU containers.
Run jl resources when you also need CPU VM availability and pricing. It shows GPU containers, GPU VMs, and CPU VMs, with one shared available/unavailable legend at the end.
How to read jl gpus --json availability:
- num_free_devices: free GPUs on that server. These can be used for normal creates, and also for spot creates when spot_price is present.
- effective_num_free_devices: GPUs available for on-demand creates on that server, including GPUs currently used by spot instances that can be preempted.
- These counts are per server. They are not the complete regional GPU capacity.
- workload_type tells which launch type the row belongs to: "container" means use it for normal GPU container creates. "vm" means use it for GPU VM creates. null means the same row applies to both containers and VMs.
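The field rules above can be collapsed into one helper. This is a minimal sketch: the field names (num_free_devices, effective_num_free_devices, spot_price, workload_type) come from this guide, but the surrounding row/payload shape of jl gpus --json is not shown here, so the sample rows are hypothetical.

```python
# Sketch: how many GPUs on a server row are usable for a given launch type.
# Field names are documented; the example row dicts are assumptions.

def usable_for(row, launch="container", spot=False):
    """Return the number of GPUs on this server usable for the requested launch."""
    wt = row.get("workload_type")
    if wt is not None and wt != launch:
        return 0  # row belongs to the other launch type; null applies to both
    if spot:
        # spot creates need free devices AND a spot price on the row
        return row["num_free_devices"] if row.get("spot_price") is not None else 0
    # on-demand creates may also preempt GPUs currently used by spot instances
    return row["effective_num_free_devices"]

rows = [
    {"num_free_devices": 2, "effective_num_free_devices": 4,
     "spot_price": 0.19, "workload_type": "container"},
    {"num_free_devices": 1, "effective_num_free_devices": 1,
     "spot_price": None, "workload_type": "vm"},
]

print(usable_for(rows[0], launch="container", spot=True))  # only truly free GPUs
print(usable_for(rows[0], launch="container"))             # includes preemptible spot
print(usable_for(rows[1], launch="container"))             # wrong workload type
```

Remember these counts are per server, so sum across rows if you want a rough regional picture.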
Ports & Services
Container instances expose default HTTP ports (each gets its own HTTPS URL):
| Port | Service |
|---|---|
| 8889 | JupyterLab (url field) |
| 7007 | IDE (vs_url field) |
| 6006 | Available on generic templates like pytorch (endpoints[0]) |
VM instances (jl create --gpu ... --vm) get SSH-only access. VMs require at least one SSH key registered (jl ssh-key add). Use ssh_command from jl get <id> --json.
To expose a service (FastAPI, Gradio, etc.), bind to 0.0.0.0:6006 — it's accessible via endpoints[0] on generic templates. Use --http-ports "7860,8080" at creation or resume to expose custom ports. Custom port URLs appear in endpoints after the default 6006 entry.
Run jl get <id> --json to find all service URLs (url, vs_url, endpoints).
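An agent can gather every service URL from one jl get call. A minimal sketch, assuming the documented fields (url, vs_url, endpoints) and that endpoints is a list of URL strings; the sample payload is hypothetical.

```python
# Sketch: collect all service URLs from a jl get <id> --json payload.

def service_urls(info):
    urls = {"jupyter": info.get("url"), "ide": info.get("vs_url")}
    # endpoints[0] is the default 6006 entry; custom --http-ports URLs follow it
    for i, ep in enumerate(info.get("endpoints", [])):
        urls[f"endpoint_{i}"] = ep
    return {k: v for k, v in urls.items() if v}

info = {
    "url": "https://jupyter.example/abc",       # hypothetical URLs
    "vs_url": "https://ide.example/abc",
    "endpoints": ["https://6006.example/abc", "https://7860.example/abc"],
}
print(service_urls(info))
```

On VMs, none of these HTTP fields apply; use ssh_command instead.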
Managed Runs
How jl run works
jl run uploads your code to an instance, sets up a Python environment, and runs your script in the background with log and exit code tracking. You need either --on <machine_id> (existing instance) or --gpu <type> (creates a fresh instance).
run_id is tracked locally under ~/.jl/runs/. All run management commands (logs, status, stop, list) depend on these local records. Start and monitor runs from the same machine.
Run targets
| Target | What happens |
|---|---|
train.py | Uploads to <home>/train.py, runs in <home>/ with shared venv at $HOME/.venv |
. or ./project with --script train.py | Rsyncs the directory to <home>/<dirname>/, runs inside it with project venv at <home>/<dirname>/.venv |
No target, command after -- | No upload. Runs from ~. If $HOME/.venv exists (from a previous file run), its bin/ is prepended to PATH so python and pip resolve to venv versions. Otherwise uses system Python. |
Only .py and .sh file targets are supported. For other file types, use a directory target or jl upload + jl exec. Directory targets require rsync installed locally.
Note: File targets with the same basename overwrite each other on the remote (e.g., foo/train.py and bar/train.py both land at /home/train.py). Use directory targets for projects with nested structure.
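The table above reduces to simple path logic. A sketch under stated assumptions: <home> is taken as /home (containers), and the bare `.` target (which uses the local directory's own name) is not handled.

```python
import posixpath

def remote_dest(target, home="/home"):
    """Return (upload destination, working directory) for a jl run target.

    Mirrors the run-targets table: file targets land flat in <home>/,
    directory targets are rsynced to <home>/<dirname>/.
    """
    name = posixpath.basename(target.rstrip("/"))
    if target.endswith((".py", ".sh")):
        # file targets keep only the basename -> same-name files collide
        return posixpath.join(home, name), home
    dest = posixpath.join(home, name) + "/"
    return dest, dest

# foo/train.py and bar/train.py both land at /home/train.py (overwrite)
assert remote_dest("foo/train.py") == ("/home/train.py", "/home")
assert remote_dest("bar/train.py") == ("/home/train.py", "/home")
assert remote_dest("./project") == ("/home/project/", "/home/project/")
```

The assertions make the overwrite hazard concrete: two file targets with the same basename map to the same remote path.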
Pass script arguments after --:
jl run train.py --on <id> --json --yes -- --epochs 50 --lr 0.001
Environment & setup
jl run manages a Python venv on the remote instance. Template packages (torch, etc.) are inherited via --system-site-packages — no need to install them. Venvs persist under the remote home directory across pause/resume.
Venv locations:
- File targets: shared instance-level venv at $HOME/.venv. All file runs share it — deps installed for one script are available to all.
- Directory targets: per-project venv at <home>/<dirname>/.venv. Isolated per project.
- Command mode: no venv is created. If $HOME/.venv exists from a previous file run, python and pip automatically resolve to it via PATH prepend.
How dependencies get installed:
- Directory targets — auto-detected. If your directory has requirements.txt or pyproject.toml (with [project]), deps are installed automatically. No flag needed.
- File targets — no auto-detection. Pass --requirements requirements.txt if you need extra packages.
- --requirements <file> — overrides auto-detection. Uploads and installs the specified file instead.
- --setup <command> — runs a shell command before your script (e.g., --setup "pip install flash-attn"). Runs inside the venv for file/dir targets, raw for command mode.
# Directory — auto-detects requirements.txt
jl run . --script train.py --on <id> --json --yes
# Single file — pass requirements explicitly
jl run train.py --on <id> --requirements requirements.txt --json --yes
# Extra setup command
jl run . --script train.py --on <id> --setup "pip install flash-attn" --json --yes
Command mode — when you pass a raw command after -- with no file or directory target. Useful when code already exists on the instance (e.g., uploaded via jl upload, written via jl exec, or left by a previous run). If $HOME/.venv exists from a prior file run, its bin/ is prepended to PATH so python and pip resolve to venv versions. You still get jl run log tracking (logs, status, stop), which is the main advantage over jl exec. --requirements is not supported in command mode.
Important: Command mode runs from ~ (the remote shell home). Use absolute paths or cd explicitly for scripts in specific directories.
jl run --on <id> --json --yes -- python3 /home/train.py
jl run --on <id> --json --yes -- sh -lc 'cd /home && torchrun --nproc_per_node=2 train.py'
Running on an existing instance
jl run train.py --on <machine_id> --json --yes
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes
Lifecycle flags (--keep, --pause, --destroy) are not allowed with --on — the instance is not touched after the run.
Running on a fresh instance
jl run . --script train.py --gpu L4 --keep --json --yes
jl run . --script train.py --gpu L4 --spot --keep --json --yes
Creates a new instance, uploads code, runs the script. Additional flags: --spot (fresh GPU containers only), --vm (VM instead of container, auto-bumps storage to 100GB, disallows --template and --http-ports), --template (default: pytorch; run jl templates --json to list available), --storage (default: 40GB), --num-gpus (default: 1), --region, --http-ports.
Lifecycle rules for fresh instances:
- With --json or --no-follow: --keep is required. The CLI rejects --pause and --destroy because it returns immediately and cannot apply lifecycle actions later. Use --keep and have the agent pause or destroy the instance after the run completes.
- Without --json or --no-follow (human mode): the CLI stays attached, streams logs, and applies lifecycle when the run finishes. Default lifecycle is --pause.
Use separate jl create when you need to inspect GPU availability, reuse machines across runs, or attach filesystems/scripts.
Monitoring & Control
Reading logs
The primary monitoring command:
jl run logs <run_id> --tail 50
Always use --tail N — without it, the entire log file is returned and can be enormous.
The output includes a header and footer with run state (in non-follow, non-JSON mode):
--- run r_abc | machine 123 | running ---
step=100 loss=2.31
step=200 loss=2.11
--- still running | log: /home/jl-runs/r_abc/output.log ---
When done, the footer shows the final state:
--- succeeded | exit code: 0 | log: /home/jl-runs/r_abc/output.log ---
Or on failure:
--- failed | exit code: 1 | log: /home/jl-runs/r_abc/output.log ---
If the instance is paused, missing, or SSH is unavailable, jl run logs fails before printing any output. Use jl run status <run_id> --json to check those states.
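The footer lines above are regular enough to parse mechanically. A minimal sketch of a parser for the three footer states; the format is taken directly from the examples in this section.

```python
import re

# Matches footers like:
#   --- still running | log: /home/jl-runs/r_abc/output.log ---
#   --- succeeded | exit code: 0 | log: /home/jl-runs/r_abc/output.log ---
#   --- failed | exit code: 1 | log: /home/jl-runs/r_abc/output.log ---
FOOTER = re.compile(
    r"^--- (?P<state>still running|succeeded|failed)"
    r"(?: \| exit code: (?P<code>\d+))?"
    r" \| log: (?P<log>\S+) ---$"
)

def parse_footer(line):
    """Return (state, exit_code or None, log_path), or None if not a footer."""
    m = FOOTER.match(line.strip())
    if not m:
        return None
    code = m.group("code")
    return m.group("state"), (int(code) if code is not None else None), m.group("log")

assert parse_footer(
    "--- still running | log: /home/jl-runs/r_abc/output.log ---"
) == ("still running", None, "/home/jl-runs/r_abc/output.log")
assert parse_footer(
    "--- failed | exit code: 1 | log: /home/jl-runs/r_abc/output.log ---"
) == ("failed", 1, "/home/jl-runs/r_abc/output.log")
```

An agent can run this over the last line of each jl run logs poll instead of substring-matching the whole tail.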
The agent monitoring loop
1. Start detached: jl run ... --json --yes — extract run_id and machine_id from JSON
2. Early check (catch fast failures): sleep 15 && jl run logs <run_id> --tail 30 — if the footer says failed, fix and retry immediately
3. Steady-state polling: sleep 120 && jl run logs <run_id> --tail 50
4. Read the log body for loss values, errors, or progress
5. Check the footer: still running → repeat step 3; succeeded | exit code: 0 → download results; failed | exit code: N → read the error, fix, start a new run
Cadence: 60-120s (short experiments), 180-300s (long training), 300-600s (very long runs).
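The monitoring loop can be sketched with the CLI call injected as a function, so the control flow is testable without the CLI. In real use, fetch_tail would shell out to jl run logs <run_id> --tail 50 via subprocess; the fake backend below is for illustration.

```python
import time

def monitor(run_id, fetch_tail, interval=120, sleep=time.sleep):
    """Poll a run's log tail until its footer reports a terminal state."""
    while True:
        tail = fetch_tail(run_id)              # e.g. last 50 log lines + footer
        footer = tail.strip().splitlines()[-1]
        if "succeeded" in footer:
            return "succeeded"
        if "failed" in footer:
            return "failed"
        sleep(interval)                        # 60-600s depending on run length

# Exercise with a fake backend that "finishes" on the second poll.
polls = iter([
    "step=100 loss=2.31\n--- still running | log: /home/jl-runs/r_abc/output.log ---",
    "step=200 loss=2.11\n--- succeeded | exit code: 0 | log: /home/jl-runs/r_abc/output.log ---",
])
assert monitor("r_abc", lambda _rid: next(polls), sleep=lambda _s: None) == "succeeded"
```

Passing sleep and fetch_tail in makes the cadence configurable per the guidance above and keeps the loop unit-testable.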
Checking status
jl run status <run_id> --json
Returns run state, machine_id, exit_code, lifecycle_policy, launch_command, and more. Without --refresh, jl run list shows state as "saved" (a sentinel, not a real run state). Use --refresh or --status to get live state.
Stopping a run
jl run stop <run_id> --json
Kills the entire process group (training script + all child processes). Escalates to SIGKILL if the process doesn't exit after TERM.
System checks via exec
jl exec <id> -- nvidia-smi
jl exec <id> -- ps -ef
jl exec <id> -- df -h
Prefer raw output for jl exec and jl run logs — easier to read and parse. Use --json when you need machine-readable state: create, get, list, run start, run status.
Exit code of the remote command is propagated. For pipes or shell syntax, wrap in sh -lc:
jl exec <id> -- sh -lc 'grep "loss" /path/to/log | tail -5'
File Transfer & Persistence
Upload and download
jl upload <id> ./local /remote # upload file or directory
jl download <id> /remote ./local # download file
jl download <id> /remote ./local -r # download directory
Default destinations: upload without dest → remote home directory. Download without dest → ./<basename> in current local directory.
What persists across pause/resume
The remote home directory (/home/ on containers, /home/<user>/ on VMs) persists. Everything else is ephemeral.
Persists:
- Files and directories under the home directory
- $HOME/.venv (shared venv for file runs) and <project>/.venv (per-project venv for directory runs)
- Attached filesystems (mounted at /home/jl_fs/)
- Run metadata under <home>/jl-runs/<run_id>/
Lost on pause:
- System-level installs (apt-get, global pip packages outside the home directory)
- Files outside the home directory (/tmp, /root, etc.)
Use --setup for system-level reinstalls (e.g., apt-get). Python packages in the venv persist across pause/resume. For recurring system setup, use startup scripts (jl scripts add).
Remote file paths
<home> is /home/ on containers, /home/<user>/ on VMs.
- Uploaded files (via jl run): <home>/<filename> (e.g., train.py → /home/train.py)
- Uploaded directories (via jl run): <home>/<directory_name>/
- Uploaded files (via jl upload): <home>/<filename>
- Shared venv (file runs): <home>/.venv/
- Project venv (directory runs): <home>/<directory_name>/.venv/
- Run metadata: <home>/jl-runs/<run_id>/
Filesystems & supporting commands
Attach a filesystem at creation with --fs-id <id>. Attach a startup script with --script-id <id> (and --script-args). These flags work on both jl create and jl resume.
jl templates --json # list available templates
jl ssh-key list --json # list registered SSH keys
jl ssh-key add <pubkey-file> --name x # add SSH key (required for VMs)
jl scripts list --json # list startup scripts
jl filesystem list --json # list filesystems
jl filesystem create --name x --storage 100 --json # create filesystem
Filesystem caveats:
- Region-bound: A filesystem created in IN2 is only visible to IN2 instances.
- ID changes on edit: Expanding a filesystem (jl filesystem edit) may return a new fs_id. Always use the returned ID.
- The CLI validates that fs_id exists before creating/resuming, but does not validate region match. Ensure they match yourself.
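Since the CLI does not check the region match for you, an agent can guard it explicitly. This is a heavily hedged sketch: the "region" field name on the jl filesystem list and jl get payloads is an assumption, not documented here.

```python
def check_fs_region(fs, instance):
    """Raise if a filesystem and an instance are in different regions.

    The "region" key on both payloads is a hypothetical field name;
    adjust to whatever the --json output actually exposes.
    """
    fs_region = fs.get("region")
    inst_region = instance.get("region")
    if fs_region and inst_region and fs_region != inst_region:
        raise ValueError(
            f"filesystem is in {fs_region} but instance is in {inst_region}"
        )

check_fs_region({"region": "IN2"}, {"region": "IN2"})  # same region: no error
try:
    check_fs_region({"region": "IN2"}, {"region": "EU1"})
except ValueError as e:
    print(e)
```

Run this before passing --fs-id to jl create or jl resume.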
Agent Workflow (End-to-End)
# 1. Check GPUs and create instance
jl gpus --json
jl create --gpu L4 --storage 50 --yes --json
# 2. Start detached run
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes
# 3. Early check (catch import/syntax/pip failures fast)
sleep 15 && jl run logs <run_id> --tail 30
# 4. Steady-state monitoring (repeat until footer shows succeeded or failed)
sleep 120 && jl run logs <run_id> --tail 50
# 5. Download results (use /home/<user>/ for VMs instead of /home/)
jl download <machine_id> /home/results ./results -r
# 6. Cleanup
jl pause <machine_id> --yes --json
For fresh instances without a pre-created instance:
# Creates instance inline, runs detached — agent must clean up after
jl run . --script train.py --gpu L4 --keep --json --yes
# ... monitor with jl run logs ...
jl pause <machine_id> --yes --json
Error Handling
When --json is active, CLI validation and API failures are emitted as {"error": "..."} to stdout.
Not all non-zero exits use that shape. jl exec --json returns its own structured payload with stdout, stderr, and exit_code fields.
Agent rule:
- First inspect the JSON shape
- If it has an error key, treat it as a CLI failure
- Otherwise inspect command-specific fields (exit_code, state, run_exit_code)
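The rule above can be sketched as a small classifier. Field names ({"error": ...} for CLI failures; stdout, stderr, exit_code for jl exec --json) come from this section; the fallthrough category is an assumption for anything else.

```python
import json

def classify(payload_text):
    """Classify a --json payload per the agent rule above."""
    data = json.loads(payload_text)
    if "error" in data:
        return ("cli_error", data["error"])       # CLI validation / API failure
    if "exit_code" in data:
        return ("exec_result", data["exit_code"]) # jl exec's structured payload
    return ("other", data)                        # inspect command-specific fields

assert classify('{"error": "not logged in"}') == ("cli_error", "not logged in")
assert classify('{"stdout": "ok", "stderr": "", "exit_code": 0}') == ("exec_result", 0)
```

Check the shape first; a non-zero process exit alone does not tell you which failure mode you hit.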
Anti-Patterns
- Do not use jl run logs --follow — blocks forever, will time out. --json is also incompatible with --follow.
- Always use --json when starting runs — it returns immediately. Without --json, the CLI streams logs and blocks.
- Do not read full logs without --tail N — it can return megabytes of output.
- Do not poll every few seconds — use 60-600s intervals based on expected run duration.
- Do not use lifecycle flags (--keep, --pause, --destroy) with --on — they are rejected. They are only for fresh instances.
- Do not use --pause or --destroy with --json for fresh instances — rejected. Use --keep --json and clean up yourself.
- Do not use jl exec for long-running tasks — it blocks until the command finishes. Use jl run, which runs in the background with log tracking.
- Do not trust jl run list without --refresh — state shows as "saved" (stale). Use --refresh or --status for live state.
- Do not assume machine_id is stable after jl resume — it may return a new ID. Always use the returned ID.
- Do not forget to pause or destroy instances after experiments — they cost money.
Command Discovery
Every command supports --help for full flag details:
jl create --help jl run --help jl ssh-key --help
jl resume --help jl run logs --help jl filesystem --help