Run Experiment
Deploy and run ML experiment: $ARGUMENTS
Workflow
Step 1: Detect Environment
Read the project's CLAUDE.md to determine the experiment environment:
- Local GPU: Look for local CUDA/MPS setup info
- Remote server: Look for SSH alias, conda env, code directory
If no server info is found in CLAUDE.md, ask the user.
Step 2: Pre-flight Check
Check GPU availability on the target machine:
Remote:
ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
Local:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
# or for Mac MPS:
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
A GPU counts as free when memory.used is below 500 MiB.
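The free-GPU rule can be applied mechanically to the CSV that the query above prints. The following is a minimal sketch, assuming each output line has the form `index, used MiB, total MiB`; the function name `free_gpus` and the `threshold_mib` parameter are hypothetical, not part of any existing tool.

```python
def free_gpus(csv_text, threshold_mib=500):
    """Return indices of GPUs whose used memory is below threshold_mib."""
    free = []
    for line in csv_text.strip().splitlines():
        # Expected line shape (assumption): "0, 1234 MiB, 81920 MiB"
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, used, _total = fields
        try:
            used_mib = int(used.split()[0])  # drop the " MiB" unit
            gpu_index = int(index)
        except (ValueError, IndexError):
            continue  # skip values that cannot be parsed as integers
        if used_mib < threshold_mib:
            free.append(gpu_index)
    return free


sample = (
    "0, 3 MiB, 81920 MiB\n"
    "1, 40210 MiB, 81920 MiB\n"
    "2, 499 MiB, 81920 MiB\n"
    "3, 500 MiB, 81920 MiB\n"
)
print(free_gpus(sample))  # [0, 2]
```

GPU 3 is excluded because the rule is a strict "less than 500 MiB".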
Step 3: Sync Code (Remote Only)
Check the project's CLAUDE.md for a code_sync setting. If not specified, default to rsync.
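As a rough illustration of the default-to-rsync rule, the setting could be read from the CLAUDE.md text like this. This is a sketch only: it assumes the setting appears on its own line as `code_sync: <value>`, optionally as a list item, and the helper name `code_sync_mode` is hypothetical.

```python
import re


def code_sync_mode(claude_md_text):
    """Return 'git' only when a code_sync line says so; otherwise 'rsync'."""
    match = re.search(
        r'^[ \t]*-?[ \t]*code_sync:[ \t]*"?(\w+)"?',
        claude_md_text,
        re.MULTILINE,
    )
    if match and match.group(1).lower() == "git":
        return "git"
    return "rsync"  # default when the setting is missing or has another value


print(code_sync_mode("- SSH: `ssh my-gpu-server`\n- code_sync: git\n"))  # git
print(code_sync_mode("- SSH: `ssh my-gpu-server`\n"))  # rsync
```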
Option A: rsync (default)
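Only sync necessary files, NOT data, checkpoints, or large files. The --include='*/' rule lets rsync descend into subdirectories (without it, --exclude='*' also excludes every directory, so only top-level .py files are copied), and --prune-empty-dirs skips directories that end up with nothing to transfer:
rsync -avz --prune-empty-dirs --include='*/' --include='*.py' --exclude='*' <local_src>/ <server>:<remote_dst>/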
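If the sync is driven from a script, the filter rules can be assembled programmatically so that their order stays correct: rsync applies the first matching rule, so the directory include must come before the catch-all exclude. The sketch below is an assumption-laden illustration; `build_rsync_cmd` and its `patterns` parameter are hypothetical names.

```python
def build_rsync_cmd(local_src, server, remote_dst, patterns=("*.py",)):
    """Build an rsync argv that copies only files matching the given patterns."""
    cmd = ["rsync", "-avz", "--prune-empty-dirs"]
    cmd.append("--include=*/")  # let rsync descend into subdirectories
    cmd += [f"--include={p}" for p in patterns]  # files to keep
    cmd.append("--exclude=*")  # everything else is skipped
    # Trailing slashes mean "copy the directory contents", as in the command above.
    cmd.append(f"{local_src.rstrip('/')}/")
    cmd.append(f"{server}:{remote_dst.rstrip('/')}/")
    return cmd


print(build_rsync_cmd("src", "my-gpu-server", "/home/user/experiments"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; because no shell is involved, the patterns need no extra quoting.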
Option B: git (when code_sync: git is set in CLAUDE.md)
Push local changes to remote repo, then pull on the server:
# 1. Push from local
git add -A && git commit -m "sync: experiment deployment" && git push
# 2. Pull on server
ssh <server> "cd <remote_dst> && git pull"
Benefits: version-tracked, multi-server sync with one push, no rsync include/exclude rules needed.
Step 4: Deploy
Remote (via SSH + screen)
For each experiment, create a dedicated screen session with GPU binding:
ssh <server> "screen -dmS <exp_name> bash -c '\
eval \"\$(<conda_path>/conda shell.bash hook)\" && \
conda activate <env> && \
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>'"
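The nested quoting in the command above is easy to get wrong by hand. One way to sidestep it, sketched here under the assumption that the launch is scripted in Python, is to let `shlex.quote` produce each layer. The function name `build_screen_launch` and all of its parameters are hypothetical.

```python
import shlex


def build_screen_launch(server, exp_name, conda_path, env, gpu_id,
                        script, args, log_file):
    """Return an ssh argv that starts a detached screen session on the server."""
    conda_bin = shlex.quote(conda_path.rstrip("/") + "/conda")
    python_cmd = " ".join(shlex.quote(a) for a in ["python", script, *args])
    # Command line executed by bash inside the screen session.
    inner = (
        f'eval "$({conda_bin} shell.bash hook)" && '
        f"conda activate {shlex.quote(env)} && "
        f"CUDA_VISIBLE_DEVICES={int(gpu_id)} {python_cmd} 2>&1 "
        f"| tee {shlex.quote(log_file)}"
    )
    # Quote the whole inner command once so the remote shell hands it to
    # `bash -c` as a single argument.
    remote = f"screen -dmS {shlex.quote(exp_name)} bash -c {shlex.quote(inner)}"
    return ["ssh", server, remote]


argv = build_screen_launch(
    "my-gpu-server", "exp1", "/opt/conda/bin", "research", 0,
    "train.py", ["--lr", "1e-3"], "exp1.log",
)
print(argv[2])
```

Passing `argv` to `subprocess.run(argv, check=True)` avoids a local shell entirely, so only the remote quoting layer has to be handled.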
Local
# Linux with CUDA
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>
# Mac with MPS (the script must select the "mps" device itself; PyTorch defaults to CPU)
python <script> <args> 2>&1 | tee <log_file>
For local long-running jobs, use run_in_background: true to keep the conversation responsive.
Step 5: Verify Launch
Remote:
ssh <server> "screen -ls"
Local: Check that the process is running and that a GPU is allocated to it.
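For the remote check, the text printed by `screen -ls` can be compared against the session names that were just launched. This is a minimal sketch, assuming session lines are indented and have the form `<pid>.<name>` followed by status fields; both helper names are hypothetical.

```python
import re


def screen_session_names(screen_ls_output):
    """Extract session names from the text printed by `screen -ls`."""
    # Session lines are indented and look like "\t12345.exp_name\t(Detached)".
    return re.findall(r"^[ \t]+\d+\.(\S+)", screen_ls_output, re.MULTILINE)


def missing_sessions(screen_ls_output, expected):
    """Return the expected session names that are not listed as running."""
    running = set(screen_session_names(screen_ls_output))
    return [name for name in expected if name not in running]


sample_ls = (
    "There are screens on:\n"
    "\t4242.exp_a\t(Detached)\n"
    "\t4250.exp_b\t(Detached)\n"
    "2 Sockets in /run/screen/S-user.\n"
)
print(missing_sessions(sample_ls, ["exp_a", "exp_c"]))  # ['exp_c']
```

A session that is missing right after launch usually means the command inside it exited immediately, so the log file written by tee is the next place to look.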
Step 6: Feishu Notification (if configured)
After deployment is verified, check ~/.claude/feishu.json:
- Send experiment_done notification: which experiments launched, which GPUs, estimated time
- If config absent or mode "off": skip entirely (no-op)
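The skip rule can be expressed as a small guard. The sketch below only decides whether to notify; it does not send anything. It assumes the config is a JSON object with a `mode` key, and it treats an unreadable file or a missing `mode` as "off", which is an assumption rather than documented behavior. The name `feishu_enabled` is hypothetical.

```python
import json
from pathlib import Path


def feishu_enabled(config_path=None):
    """Return True only if the config file exists and its mode is not "off"."""
    if config_path is None:
        config_path = Path.home() / ".claude" / "feishu.json"
    path = Path(config_path)
    if not path.is_file():
        return False  # config absent: no-op
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # assumption: an unreadable config is treated as absent
    if not isinstance(config, dict):
        return False
    # Assumption: a missing "mode" key is treated the same as "off".
    return config.get("mode", "off") != "off"
```

Calling `feishu_enabled()` before building the notification keeps the no-op path free of any network or formatting work.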
Key Rules
- ALWAYS check GPU availability first — never blindly assign GPUs
- Each experiment gets its own screen session + GPU (remote) or background process (local)
- Use tee to save logs for later inspection
- Run deployment commands with run_in_background: true to keep conversation responsive
- Report back: which GPU, which screen/process, what command, estimated time
- If multiple experiments, launch them in parallel on different GPUs
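The one-experiment-per-GPU rule amounts to pairing experiment names with distinct free GPU indices. A minimal sketch, assuming experiment names are unique and that experiments without a free GPU simply wait; `assign_gpus` is a hypothetical helper.

```python
def assign_gpus(experiments, free_gpu_ids):
    """Pair each experiment with a distinct free GPU; return (assigned, waiting)."""
    experiments = list(experiments)
    # zip stops at the shorter input, so no GPU is ever assigned twice.
    assigned = dict(zip(experiments, free_gpu_ids))
    waiting = experiments[len(assigned):]  # no free GPU left for these
    return assigned, waiting


assigned, waiting = assign_gpus(["exp_a", "exp_b", "exp_c"], [0, 2])
print(assigned)  # {'exp_a': 0, 'exp_b': 2}
print(waiting)   # ['exp_c']
```

Anything left in `waiting` should be reported back rather than placed on a GPU that is already in use.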
CLAUDE.md Example
Users should add their server info to their project's CLAUDE.md:
## Remote Server
- SSH: `ssh my-gpu-server`
- GPU: 4x A100 (80GB each)
- Conda: `eval "$(/opt/conda/bin/conda shell.bash hook)" && conda activate research`
- Code dir: `/home/user/experiments/`
- code_sync: rsync # default. Or set to "git" for git push/pull workflow
## Local Environment
- Mac MPS / Linux CUDA
- Conda env: `ml` (Python 3.10 + PyTorch)