ssh-lab

# ssh-lab — Remote Server SSH Workbench

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "ssh-lab" with this command: npx skills add Enderfga/ssh-lab

ssh-lab — Remote Server SSH Workbench

Structured SSH operations for managing remote GPU servers. Standardizes what's universal (GPU/disk/process), delegates interpretation to the LLM.

When to Use

  • User asks about GPU status, server health, remote processes
  • Need to check if training is running, disk is full, GPU is idle
  • Running commands on remote servers
  • Tailing remote log files, listing checkpoints
  • File transfer between local and remote (rsync)
  • Comparing hosts side-by-side (GPU, disk, processes)
  • Setting up alerts for GPU idle, disk full, process died
  • Any mention of: ssh-lab, gpu status, server status, nvidia-smi, remote server

Setup

cd /path/to/ssh-lab && npm run build

After building, use the CLI via any of:

ssh-lab status all          # if npm link'd or installed globally
npx ssh-lab status all      # via npx
node dist/cli.js status all # direct invocation

The CLI reads ~/.ssh/config automatically — zero configuration needed for existing SSH hosts.

Commands

List hosts

ssh-lab hosts [--json]

Server status (GPU + disk + process)

ssh-lab status [host|all] [--json] [--timeout ms] [--heartbeat] [--quiet]

Returns structured probe data: GPU utilization/VRAM/temp, disk usage, active processes. Includes automatic alerts: GPU idle, disk full (>90%), high temp (>85°C). --heartbeat gives compact one-liner output for agent integration. --quiet suppresses all-clear output (for cron).

Run remote command

ssh-lab run <host|all> <command...> [--json] [--timeout ms]

Supports all for parallel execution across all hosts.

Tail remote log

ssh-lab tail <host> <path> [-n lines] [--json]

Default: last 50 lines. Smart truncation at 50KB.

List remote directory

ssh-lab ls <host> <path> [--sort time|size|name] [--json]

Lists files with size, date, permissions. Great for checkpoint directories.

Disk usage

ssh-lab df <host|all> [path] [--json]

Shows disk usage for all filesystems. Optional path runs du -sh on a specific directory.

File sync (rsync)

ssh-lab sync <host> <src> <dst> [--direction up|down] [--dry-run] [--exclude pat1,pat2]

Direction: down (remote→local, default), up (local→remote). Always use --dry-run first!

Watch file (snapshot for polling)

ssh-lab watch <host> <path> [-n lines] [--prev-hash hash] [--json]

Captures file state + metadata. Returns changed: true/false when --prev-hash is provided. Agent calls this repeatedly via cron/heartbeat — the tool itself is stateless.

Compare hosts (side-by-side)

ssh-lab compare <host1> <host2> [--probes gpu,disk,process] [--json]
ssh-lab compare all [--probes gpu]
ssh-lab compare GMI4 GMI5 GMI6 --probes gpu,disk

Side-by-side comparison of GPU, disk, and process data across hosts. Probes default to gpu,disk,process. Supports any number of hosts. Uses bounded concurrency pool for parallel data collection.

Doctor (diagnostics)

ssh-lab doctor <host> [--json] [--timeout ms]

Runs connectivity and health diagnostics on a host. Checks SSH reachability, authentication, command execution, and basic system health.

Alert management

# List configured rules
ssh-lab alert list [--json]

# Add a rule
ssh-lab alert add <kind> <host> [--threshold N] [--process-pattern regex]

# Remove a rule
ssh-lab alert remove <rule-id>

# Evaluate alerts against live status
ssh-lab alert check [host|all] [--json] [--quiet]

Alert kinds: gpu_idle, disk_full, process_died, ssh_unreachable, oom_detected, high_temp

Examples:

ssh-lab alert add disk_full all --threshold 90
ssh-lab alert add process_died GMI4 --process-pattern "torchrun|deepspeed"
ssh-lab alert check all --json

Add custom host

ssh-lab add <name> <user@host> [--port N] [--tags a,b] [--notes "desc"]

Output Modes

All commands support three output modes:

  • summary (default): Human-readable text with formatting
  • json (--json or -j): Structured JSON for programmatic use
  • raw (--raw or -r): Raw command output only

Timeout Tiers

Commands use per-type timeout defaults (user --timeout always overrides):

  • quick (5s): hosts, add, doctor — connectivity checks
  • standard (15s): status, run, ls, df, alert, compare — normal probes
  • stream (30s): tail, watch — streaming/log tailing
  • transfer (60s): sync — rsync bulk transfers

Architecture

  • Probe system: Pluggable probes (gpu, disk, process) collect structured data
  • SSH via child_process: Uses native ssh — inherits user's config, keys, ProxyJump
  • ControlMaster reuse: Persistent SSH connections via /tmp/ssh-lab-%r@%h:%p sockets; auto-cleanup detects stale sockets (ssh -O check), gracefully exits (ssh -O exit), then removes dead socket files before retry
  • Retry + error classification: Transient failures retried with exponential backoff; errors classified (AUTH_FAILED/TIMEOUT/etc.) with hints
  • Tiered timeouts: Per-command defaults — quick (5s) for doctor/add, standard (15s) for status/run/ls/df/compare/alert, stream (30s) for tail/watch, transfer (60s) for sync. --timeout always overrides
  • Parallel execution: status all / run all queries all hosts concurrently via withPool (default: 5)
  • Alert system: 6 built-in alert rules with configurable thresholds
  • Zero npm dependencies: Only devDependencies (typescript, @types/node)
  • SSH config hosts use alias: For hosts from ~/.ssh/config, SSH alias is used directly so all ProxyJump/Match/Include rules are respected

Heartbeat Integration

In HEARTBEAT.md:

- [ ] Run: ssh-lab status all --heartbeat
  Check for: GPU idle, disk >90%, processes died
- [ ] Run: ssh-lab alert check all --json
  Act on any critical firings

Config

Custom hosts stored in ~/.config/ssh-lab/config.json (auto-created, XDG-compliant). SSH hosts from ~/.ssh/config discovered automatically via ssh -G. Override config path with SSH_LAB_CONFIG env var.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Ai Competitor Analyzer

提供AI驱动的竞争对手分析,支持批量自动处理,提升企业和专业团队分析效率与专业度。

Registry SourceRecently Updated
General

Ai Data Visualization

提供自动化AI分析与多格式批量处理,显著提升数据可视化效率,节省成本,适用企业和个人用户。

Registry SourceRecently Updated
General

Ai Cost Optimizer

提供基于预算和任务需求的AI模型成本优化方案,计算节省并指导OpenClaw配置与模型切换策略。

Registry SourceRecently Updated