sglang-jax SkyPilot Dev Skill
This skill handles running tests and development tasks for the sgl-jax project on remote TPU clusters via SkyPilot.
Note: This project requires TPU for all JAX tests. Never run JAX/TPU tests locally.
- Prerequisites
  - The TPU cluster must already be provisioned. Check that .cluster_name_tpu exists and is non-empty in the project root.
  - If the file does not exist or is empty, provision a cluster first (see Cluster Management).
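The prerequisite check above can be sketched as a small shell function. The function name and the optional path argument are illustrative and not part of the project; the only thing taken from this skill is the file name .cluster_name_tpu.

```shell
# Sketch: print the recorded cluster name, or fail if the file is missing or blank.
# cluster_name_or_fail is a hypothetical helper, not a project script.
cluster_name_or_fail() {
    file="${1:-.cluster_name_tpu}"
    if [ ! -s "$file" ]; then
        echo "No cluster recorded in $file; provision one first." >&2
        return 1
    fi
    # Strip whitespace so a file holding only a newline counts as empty.
    name=$(tr -d '[:space:]' < "$file")
    if [ -z "$name" ]; then
        echo "$file is blank; provision a cluster first." >&2
        return 1
    fi
    printf '%s\n' "$name"
}
```

Run from the project root, `cluster_name_or_fail` prints the name on success and returns non-zero otherwise, so it can gate any later step.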
Execution Instructions: Before running the launch script, find its absolute path in the scripts/ directory alongside this skill definition. Use file search tools (e.g., glob) to locate launch_tpu.sh before executing it.
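Where a file search tool is not available, the same lookup can be approximated with find. This is a sketch only: the argument skill_dir (the directory holding this skill definition) is an assumption, since its location is not stated here.

```shell
# Sketch: print the absolute path of launch_tpu.sh under <skill_dir>/scripts.
# find_launch_script and its skill_dir argument are hypothetical.
find_launch_script() {
    skill_dir="$1"
    # Resolve to an absolute directory first so the printed path is absolute.
    abs_dir=$(cd "$skill_dir" && pwd) || return 1
    script=$(find "$abs_dir/scripts" -type f -name launch_tpu.sh 2>/dev/null | head -n 1)
    if [ -z "$script" ]; then
        echo "launch_tpu.sh not found under $abs_dir/scripts" >&2
        return 1
    fi
    printf '%s\n' "$script"
}
```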
- Project Layout
Component          Path
Main source        python/sgl_jax/
SRT module         python/sgl_jax/srt/
Tests              test/
Test suite runner  test/srt/run_suite.py
- Cluster Management
Provisioning
Common TPU types for this project: tpu-v6e-4, tpu-v6e-8, tpu-v4-8
bash <absolute_path_to_launch_tpu.sh> <accelerator_type> <experiment_name>
Example:
bash <absolute_path_to_launch_tpu.sh> tpu-v6e-4 dp-test
The launch script automatically writes the cluster name to .cluster_name_tpu.
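A typo in the accelerator type is easier to catch before provisioning than after. The sketch below checks the argument against the three types listed above and only warns on anything else, because other types may also be valid. Both function names are hypothetical, and launch_command prints the command rather than running it.

```shell
# The three types listed in this skill; other accelerator types may also be valid.
COMMON_TPU_TYPES="tpu-v6e-4 tpu-v6e-8 tpu-v4-8"

# Sketch: succeed if the argument is one of the common types.
is_common_tpu_type() {
    for t in $COMMON_TPU_TYPES; do
        [ "$t" = "$1" ] && return 0
    done
    return 1
}

# Sketch: print the launch command, warning on stderr for unfamiliar types.
launch_command() {
    script="$1"; accel="$2"; experiment="$3"
    if ! is_common_tpu_type "$accel"; then
        echo "warning: $accel is not one of: $COMMON_TPU_TYPES" >&2
    fi
    printf 'bash %s %s %s\n' "$script" "$accel" "$experiment"
}
```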
Teardown
sky down $(cat .cluster_name_tpu) -y
- Running Tests
All test execution uses SSH + tmux for reliable process management and log inspection. Never use sky exec for running tests.
General Workflow
- SSH into the cluster
- Update code on the remote via git
- Start test tasks in named tmux sessions
- Monitor and retrieve results
Step 1: SSH into the cluster
Get the cluster IP
CLUSTER_IP=$(sky status --ip $(cat .cluster_name_tpu))
SSH directly to the remote
ssh -i ~/.ssh/sky-key gcpuser@$CLUSTER_IP
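If the cluster is stopped or the name file is stale, the IP lookup may come back empty and the ssh command then fails with a confusing error. A minimal sketch that stops early in that case follows. resolve_cluster_ip is a hypothetical helper; it uses only the sky status --ip call shown above.

```shell
# Sketch: resolve the cluster IP and refuse to continue if it comes back empty.
# resolve_cluster_ip is a hypothetical helper; the path argument is optional.
resolve_cluster_ip() {
    cluster=$(cat "${1:-.cluster_name_tpu}") || return 1
    ip=$(sky status --ip "$cluster") || return 1
    if [ -z "$ip" ]; then
        echo "No IP returned for cluster $cluster" >&2
        return 1
    fi
    printf '%s\n' "$ip"
}
```

Usage, from the project root: `CLUSTER_IP=$(resolve_cluster_ip) && ssh -i ~/.ssh/sky-key gcpuser@"$CLUSTER_IP"`.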
Step 2: Update code on the remote via git
On the remote machine, use git to pull the latest changes:
cd ~/sky_workdir
Pull latest changes from the remote repository
git fetch origin
git pull origin <branch-name>
Or if you need to switch branches
git checkout <branch-name>
git pull origin <branch-name>
Install/update dependencies after pulling
uv sync --extra tpu
Important: Always commit and push your local changes before running tests on the remote. The remote should pull from the git repository, not sync from your local filesystem.
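Because the remote only sees what has been pushed, it can help to check the local repository before connecting. The following is a sketch under the assumption that the current branch tracks an upstream branch; both function names are hypothetical.

```shell
# Sketch: succeed if the working tree has uncommitted or untracked changes.
has_uncommitted_changes() {
    [ -n "$(git -C "${1:-.}" status --porcelain)" ]
}

# Sketch: succeed if HEAD has commits that its upstream branch lacks.
# Returns 2 when no upstream is configured for the current branch.
has_unpushed_commits() {
    count=$(git -C "${1:-.}" rev-list --count '@{u}..HEAD' 2>/dev/null) || return 2
    [ "$count" -gt 0 ]
}
```

If either function succeeds, commit and push before pulling on the remote.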
Step 3: Run tests in tmux sessions
Mode A: Test File Testing
Run test files directly under the test/ directory. Use this for unit tests, integration tests, and regression tests that don't require a live server.
Run the full test suite:
On the remote machine:
tmux new-session -d -s test-suite -c ~/sky_workdir
tmux send-keys -t test-suite \
  "uv run --extra tpu python test/srt/run_suite.py" Enter
Follow the output
tmux attach -t test-suite
Run a single test file:
On the remote machine:
tmux new-session -d -s test -c ~/sky_workdir
tmux send-keys -t test \
  "uv run --extra tpu python -m pytest test/srt/<test_file.py> -v" Enter
Follow the output
tmux attach -t test
Run tests matching a pattern:
On the remote machine:
tmux new-session -d -s test -c ~/sky_workdir
tmux send-keys -t test \
  "uv run --extra tpu python -m pytest test/srt/ -k <pattern> -v" Enter
Follow the output
tmux attach -t test
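The commands above reuse the session name test, and tmux new-session fails if a session with that name is still alive. A sketch of a guard follows. start_in_tmux is a hypothetical helper. It uses the = prefix on the target because tmux otherwise falls back to prefix matching when no exact name exists, so a check for test would also match a live test-suite session.

```shell
# Sketch: run a command in a new named tmux session, refusing to reuse a live one.
# start_in_tmux is a hypothetical helper, not a project script.
start_in_tmux() {
    session="$1"; shift
    # "=name" asks tmux for an exact session-name match.
    if tmux has-session -t "=$session" 2>/dev/null; then
        echo "tmux session '$session' already exists; kill it or pick another name." >&2
        return 1
    fi
    tmux new-session -d -s "$session" -c "$HOME/sky_workdir" || return 1
    tmux send-keys -t "$session" "$*" Enter
}
```

Usage: `start_in_tmux test-suite "uv run --extra tpu python test/srt/run_suite.py"`.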
Common pytest flags:
Flag          Purpose
-v            Verbose output
-k <pattern>  Run tests matching pattern
-x            Stop on first failure
--tb=short    Short traceback
-s            Show stdout (useful for JAX logs)
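The flags in the table can be combined in one invocation. As a sketch, the hypothetical helper below prints the full command for a target plus any extra flags, so the resulting string can be passed to tmux send-keys.

```shell
# Sketch: print the pytest command for a target, always verbose, plus extra flags.
# pytest_command is a hypothetical helper; it prints the command and does not run it.
pytest_command() {
    target="$1"; shift
    printf 'uv run --extra tpu python -m pytest %s -v' "$target"
    for flag in "$@"; do
        printf ' %s' "$flag"
    done
    printf '\n'
}
```

For example, `tmux send-keys -t test "$(pytest_command test/srt/ -x --tb=short)" Enter` stops on the first failure and shortens tracebacks.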
Mode B: Service Debug Testing
Use this when you need to start a server process, wait for it to be ready, then run a second process (debug script, benchmark, or accuracy test) against the live server.
Step 1: Start the server in a named tmux session
On the remote machine:
tmux new-session -d -s server -c ~/sky_workdir
tmux send-keys -t server \
  "uv run --extra tpu python <SERVER_COMMAND> [ARGS]" Enter
Step 2: Wait for the server to be ready
On the remote machine — poll until the service is ready:
until <READINESS_CHECK>; do
  echo "Waiting for server..."
  sleep 5
done
echo "Server is ready."
Common readiness checks:
Method                Command
HTTP health endpoint  curl -sf http://localhost:<PORT>/health
Log line appeared     tmux capture-pane -pt server | grep -q "server started"
Port is open          nc -z localhost <PORT>
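The polling loop above waits forever if the server crashes during startup. A sketch of the same loop with a deadline follows. wait_until_ready and its argument order are hypothetical, and elapsed time is approximated by summing the sleep intervals.

```shell
# Sketch: retry a readiness check until it passes or a deadline expires.
# Usage: wait_until_ready <timeout_seconds> <interval_seconds> <check command...>
wait_until_ready() {
    timeout="$1"; interval="$2"; shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "Server not ready after ${timeout}s." >&2
            return 1
        fi
        echo "Waiting for server..."
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "Server is ready."
}
```

Usage: `wait_until_ready 600 5 curl -sf "http://localhost:<PORT>/health"`. A check that contains a pipe, such as the log-line check, needs to be wrapped, for example in `sh -c '...'`.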
Step 3: Run the client in a separate tmux session
On the remote machine:
tmux new-session -d -s client -c ~/sky_workdir
tmux send-keys -t client \
  "uv run --extra tpu python <CLIENT_COMMAND> [ARGS]" Enter
Follow client output
tmux attach -t client
Step 4: Inspect and clean up
List all sessions
tmux ls
View server logs (without attaching)
tmux capture-pane -pt server -S -1000
Kill a session when done
tmux kill-session -t server
tmux kill-session -t client
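Killing a session discards its scrollback, so it can be worth saving the pane contents first. The sketch below uses a hypothetical helper name and a caller-chosen output path. What it can capture is bounded by tmux's history-limit setting.

```shell
# Sketch: dump a session's available scrollback to a file before killing it.
# save_session_log is a hypothetical helper, not a project script.
save_session_log() {
    session="$1"; out="$2"
    # -p prints to stdout; "-S -" starts at the beginning of the history.
    tmux capture-pane -p -t "$session" -S - > "$out"
}
```

Usage: `save_session_log server ~/server.log && tmux kill-session -t server`.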
Tmux session naming convention:
Session name      Purpose
server            The main inference/serving process
client            Debug / benchmark / accuracy test runner
<feature>-server  Use feature-prefixed names when running multiple experiments simultaneously
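A small helper can keep feature-prefixed names consistent. This is a sketch with a hypothetical function name. It replaces '.' and ':' because tmux uses those characters as separators in targets.

```shell
# Sketch: build a feature-prefixed session name such as "dp-server".
# session_name is a hypothetical helper following the naming convention above.
session_name() {
    feature="$1"; role="$2"
    # '.' and ':' are separators in tmux targets, so map them to '_'.
    printf '%s-%s\n' "$feature" "$role" | tr '.:' '__'
}
```

Usage: `tmux new-session -d -s "$(session_name dp server)" -c ~/sky_workdir`.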
- Operational Notes
  - SSH Access: Use direct SSH with ssh -i ~/.ssh/sky-key gcpuser@$CLUSTER_IP to connect to the cluster
  - Code Updates: Always use git pull on the remote to update code; commit and push local changes first
  - Tmux Sessions: All test execution happens in named tmux sessions for reliability and log inspection
  - Workdir: The remote workdir is at ~/sky_workdir/ on the TPU VM
  - Logs: View logs using tmux capture-pane or by attaching to the session
  - Interruption: Detach from tmux with Ctrl+B D; sessions persist after disconnection
  - agent.md: Check the project's agent.md for the active cluster name when working on parallel development tasks
- Functional-Point-Specific Testing
(To be populated with data-parallelism feature-specific test commands and validation steps.)