# Slurm Cluster Management

Help developers submit, manage, and troubleshoot GPU-accelerated workloads on SRP's Slurm clusters. Supports training, inference, and data-processing jobs using Apptainer containers.
## When to Use This Skill

Use this skill when:

- Submitting GPU training or inference jobs to Slurm clusters
- Managing running or queued jobs
- Monitoring cluster resources and job status
- Debugging job failures or performance issues
- Writing Slurm job scripts with Apptainer containers
- Checking GPU availability and utilization
## SRP Slurm Clusters

### Oracle OKE Cluster (H100 GPUs)

SSH access:

```bash
ssh -p 2222 <your-ldap-username>@129.80.180.16

# Example:
ssh -p 2222 zhuguangbin@129.80.180.16
```

- GPU type: H100
- Partition: `h100` (must be specified in job scripts)
- Use cases: large model training, high-performance inference

### DO DOKS Cluster (H200 GPUs)

SSH access:

```bash
ssh -p 2222 <your-ldap-username>@129.212.240.50

# Example:
ssh -p 2222 zhuguangbin@129.212.240.50
```

- GPU type: H200
- Partition: specify in job scripts
- Use cases: latest GPU workloads, large-scale training
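If you connect to these clusters often, an OpenSSH client config entry saves retyping the port and address. The host aliases below (`slurm-oke`, `slurm-doks`) are illustrative names, not an SRP convention:

```
# ~/.ssh/config
Host slurm-oke
    HostName 129.80.180.16
    Port 2222
    User <your-ldap-username>

Host slurm-doks
    HostName 129.212.240.50
    Port 2222
    User <your-ldap-username>
```

With this in place, `ssh slurm-oke` is equivalent to the full command above.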
## Data Access

Both clusters use JuiceFS for unified data access:

- Path: `/data0/` or `/data/srp/`
- Same permissions and directory structure as the development machines
- Shared across all cluster nodes and with the A10 dev machines
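Before launching a long job, it can be worth verifying that the shared filesystem is actually mounted on the node. A minimal sketch, assuming the paths listed above (the helper name is illustrative, not an SRP utility):

```python
import os

def find_data_root(candidates=("/data0", "/data/srp")):
    """Return the first existing data root, or None if the mount is missing."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

# Usage at the top of a training script:
# if find_data_root() is None:
#     raise SystemExit("JuiceFS mount not found; refusing to start")
```

Failing fast here is cheaper than discovering a missing mount an hour into a training run.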
## Monitoring

Oracle OKE cluster dashboards:

- Cluster overview: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload monitoring: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
- Job-level stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats

DO DOKS cluster dashboards:

- Cluster overview: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload monitoring: https://grafana.g2.yesy.site/d/workload/workload
- Job-level stats: https://grafana.g2.yesy.site/d/slurm/slurm

Available metrics:

- Cluster resource utilization
- GPU availability and usage
- Job queue status
- Per-job resource consumption
- Historical workload patterns
## Essential Slurm Commands

### Job Submission

```bash
# Submit a batch job script
sbatch job_script.sh

# Submit with the ssubmit wrapper (recommended)
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 -cmd "python train.py"

# Interactive job allocation
salloc --partition=h100 --gres=gpu:1 --time=01:00:00

# Run a command directly
srun --partition=h100 --gres=gpu:1 python test.py
```

### Job Management

```bash
# View your jobs
squeue -u $USER

# View all jobs
squeue

# View specific job details
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# Cancel jobs by name
scancel --name=job_name
```

### Cluster Information

```bash
# View partitions and nodes
sinfo

# View detailed node info
sinfo -N -l

# Check GPU availability
sinfo -o "%20N %10c %10m %25f %10G"

# View a specific partition
sinfo -p h100
```

### Job History

```bash
# View completed jobs
sacct

# View specific job details
sacct -j <job_id> --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode

# View jobs from the last week
sacct --starttime=now-7days --format=JobID,JobName,Elapsed,State,ExitCode
```
## Job Script Structure

### Modern Slurm Script (Simplified)

The new Slinky Slurm clusters use prolog/epilog hooks for notifications, so scripts are much simpler:

```bash
#!/bin/bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --job-name=my-training-job
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=32GB
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=slurm-notification@srp.one

set -x

#==============================
# Environment Setup
#==============================
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(shuf -i 1000-65535 -n 1)

export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# Set your tokens (replace with actual values)
export HF_TOKEN=your_huggingface_token_here
export WANDB_API_KEY=your_wandb_api_key_here
export WANDB_PROJECT=${SLURM_JOB_NAME}
export WANDB_NAME=${SLURM_JOB_NAME}-$(date +%Y%m%d%H%M%S)

#==============================
# Pre-task initialization
#==============================
echo "Running pre-task initialization..."
# Your setup commands here

#==============================
# Main Job Execution
#==============================
echo "Starting main task..."

srun -v -l --jobid $SLURM_JOBID --job-name=${SLURM_JOB_NAME} \
    --output $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.out \
    --error $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.err \
    apptainer run --fakeroot --writable-tmpfs --nv \
    /data0/apptainer/pytorch_24.01-py3.sif bash -ex << 'EOF'
# ==== YOUR JOB COMMANDS START ====
echo "Training started at $(date)"

python train.py \
    --model gpt2 \
    --batch-size 32 \
    --epochs 10 \
    --output-dir /data0/models/

nvidia-smi
echo "Training completed at $(date)"
# ==== YOUR JOB COMMANDS END ====
EOF
```
### Key SBATCH Parameters

| Parameter | Description | Example |
|---|---|---|
| `--job-name` | Job name (shows in `squeue`) | `my-training` |
| `--partition` | Cluster partition | `h100` |
| `--gres` | GPU resources | `gpu:H100:1` (1 GPU), `gpu:H100:2` (2 GPUs) |
| `--nodes` | Number of nodes | `1` (single node), `2` (distributed) |
| `--cpus-per-task` | CPUs per task | `10` |
| `--mem` | Memory per node | `32GB` |
| `--time` | Max runtime | `02:00:00` (2 hours) |
| `--output` | stdout log file | `logs/%x_%j.out` |
| `--error` | stderr log file | `logs/%x_%j.err` |
| `--mail-type` | Email notification | `ALL`, `FAIL`, `END` |
Log file placeholders:

- `%x` - Job name
- `%j` - Job ID
- `%s` - Step ID
- `%t` - Task ID
- `%N` - Node name
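To predict where a job's logs will land, the pattern can be expanded by hand. A small illustrative helper (not part of Slurm; the function name and mapping are my own, covering only the placeholders listed above):

```python
def expand_log_pattern(pattern, job_name, job_id, step_id=0, task_id=0, node="node01"):
    """Expand the common Slurm filename placeholders the way sbatch/srun would."""
    return (pattern
            .replace("%x", job_name)
            .replace("%j", str(job_id))
            .replace("%s", str(step_id))
            .replace("%t", str(task_id))
            .replace("%N", node))

print(expand_log_pattern("logs/%x_%j.out", "my-training-job", 12345))
# logs/my-training-job_12345.out
```

This is handy when scripting log collection for many jobs from `sacct` output.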
### Multi-Node Distributed Training

```bash
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=h100
#SBATCH --nodes=2
#SBATCH --gres=gpu:H100:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --mem=64GB
#SBATCH --time=04:00:00

set -x

# Distributed training setup
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=12345
export WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python -m torch.distributed.launch \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_distributed.py
```

Note: `torch.distributed.launch` is deprecated in recent PyTorch releases; `torchrun` is the recommended replacement and accepts equivalent flags.
## Using Apptainer Containers

### Available Container Images

Location: `/data0/apptainer/`

Common images:

- `pytorch_24.01-py3.sif` - PyTorch 24.01 with Python 3
- `ray_2.52.0-py310-gpu.sif` - Ray 2.52.0 with Python 3.10
- Custom images built for specific projects

### Apptainer Command Patterns

```bash
# Run a container with GPU support
apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif python script.py

# Shell into a container
apptainer shell --nv /data0/apptainer/pytorch_24.01-py3.sif

# Execute a single command
apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif nvidia-smi

# With additional flags
apptainer run --fakeroot --writable-tmpfs --nv <image.sif> <command>
```

Common flags:

- `--nv` - Enable NVIDIA GPU support
- `--fakeroot` - Fake root-user privileges (for installing packages)
- `--writable-tmpfs` - Create a writable temporary filesystem
- `--bind <src>:<dst>` - Mount additional directories
### Interactive Container Session

```bash
# Start an interactive job with Apptainer
sapptainer -c 20 -m 200G -g 1 -p h100 -i /data0/apptainer/pytorch_24.01-py3.sif
```

Parameters:

- `-c`: CPUs
- `-m`: Memory
- `-g`: GPUs
- `-p`: Partition
- `-i`: Container image
## Using the ssubmit Wrapper

SRP provides the ssubmit wrapper for simplified job submission:

```bash
# Basic usage
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 \
    -cmd "python train.py"

# With a custom script
ssubmit -j my-job -p h100 -g 2 -s job_script.sh

# Interactive mode
ssubmit -j interactive -p h100 -g 1 -i
```

Parameters:

- `-j` - Job name
- `-p` - Partition (`h100`, `compute`)
- `-g` - Number of GPUs
- `-c` - Number of CPUs
- `-m` - Memory (e.g., `32G`)
- `-t` - Time limit (HH:MM:SS)
- `-cmd` - Command to run
- `-s` - Script file to execute
- `-i` - Interactive mode

Reference: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
## Feishu Notifications

The Slurm clusters automatically send Feishu notifications for job events via prolog/epilog hooks.

Notification types:

- ✅ Job started
- ✅ Job completed successfully
- ❌ Job failed with error code
- ⏱️ Job timeout
- 🛑 Job cancelled

Notification channel: slurm-notification@srp.one

What's included:

- Job ID, name, partition
- Node allocation
- Start and end time
- Exit status
- Resource usage summary
- Log file locations

No action needed: notifications are automatic, so there is no need to add notification code to your scripts.
## Best Practices

### Resource Allocation

Request what you need:

- Don't over-request CPUs or memory - it delays scheduling
- Start with minimal resources and scale up if needed

GPU utilization:

- Use `nvidia-smi` to verify the GPU is being used
- Monitor GPU memory with `nvidia-smi dmon`

Time limits:

- Set realistic time limits (slightly above the expected runtime)
- Jobs exceeding their time limit are killed

Partitions:

- Always specify the partition explicitly
- Use `h100` on Oracle, and the appropriate partition on DO

### Job Organization

Note that Slurm's filename patterns do not expand date formats such as `%Y%m%d`; to group logs by date, create a dated directory at submission time (e.g. `mkdir -p logs/$(date +%Y%m%d)` before `sbatch`). To organize by job name, use the `%x` placeholder:

```bash
#SBATCH --output=logs/%x/%j.out
#SBATCH --error=logs/%x/%j.err
```
### Checkpoint and Resume

Save checkpoints periodically:

```python
import os

import torch

checkpoint_dir = "/data0/checkpoints"
checkpoint_path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")

torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, checkpoint_path)

# Resume from a checkpoint
if os.path.exists(checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1
```

### Error Handling

```bash
# Set bash options for safety
set -e           # Exit on error
set -u           # Error on undefined variables
set -x           # Print commands (useful for debugging)
set -o pipefail  # Fail a pipeline if any command in it fails

# Add an error trap
trap 'echo "Error on line $LINENO"; exit 1' ERR
```
## Monitoring and Debugging

### Check Job Status

```bash
# Detailed job info
scontrol show job <job_id>

# Watch the job queue
watch -n 5 squeue -u $USER

# Check why a job is pending
squeue -j <job_id> --start
```

### View Logs

```bash
# Tail logs while the job runs
tail -f logs/job_name_12345.out

# View the last 100 lines
tail -n 100 logs/job_name_12345.out

# Search for errors
grep -i error logs/job_name_12345.err
```

### GPU Monitoring

```bash
# Inside a running job container
nvidia-smi

# Continuous monitoring
nvidia-smi dmon

# Detailed GPU utilization, sampled every 5 seconds
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 5
```

### Resource Usage

```bash
# Check job efficiency
seff <job_id>

# Detailed accounting
sacct -j <job_id> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State
```
## Common Issues and Solutions

| Issue | Cause | Solution |
|---|---|---|
| Job pending forever | No available resources | Check `sinfo` for available GPUs; adjust resource requests |
| "Out of memory" error | Insufficient memory request | Increase `--mem` in the job script |
| GPU not detected | Missing `--gres` or `--nv` | Add `--gres=gpu:X` to sbatch and `--nv` to apptainer |
| Container not found | Wrong image path | Verify the path under `/data0/apptainer/` |
| Permission denied | File permissions issue | Check file ownership and permissions |
| Module not found | Missing Python packages | Install in the container or use a different image |
| NCCL timeout | Network issues in distributed training | Check NCCL env vars; verify nodes can communicate |
| Killed job (OOM) | Memory exceeded | Reduce the batch size or increase `--mem` |
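When triaging many jobs at once, parsing machine-readable `sacct` output is quicker than eyeballing tables. A hedged sketch: the parser below assumes output from `sacct --parsable2 --noheader --format=JobID,State,ExitCode` (pipe-delimited, one job per line), and the helper name is illustrative:

```python
def failed_jobs(sacct_output):
    """Return (job_id, state, exit_code) for jobs that did not finish cleanly."""
    bad = []
    for line in sacct_output.strip().splitlines():
        job_id, state, exit_code = line.split("|")
        # Anything outside the healthy states is worth a look
        if state not in ("COMPLETED", "RUNNING", "PENDING"):
            bad.append((job_id, state, exit_code))
    return bad

sample = "101|COMPLETED|0:0\n102|OUT_OF_MEMORY|0:125\n103|TIMEOUT|0:0"
print(failed_jobs(sample))
# [('102', 'OUT_OF_MEMORY', '0:125'), ('103', 'TIMEOUT', '0:0')]
```

States such as `OUT_OF_MEMORY` and `TIMEOUT` map directly onto the rows of the table above.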
## Quick Reference

### Essential Commands

```bash
# Submit a job
sbatch job.sh

# Check the queue
squeue -u $USER

# Job details
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

# View logs
tail -f logs/job_*.out

# Cluster info
sinfo -p h100

# Job history
sacct --starttime=today
```
## Example Workflows

### Quick GPU Test

```bash
# Submit a test job
sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=h100
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=test_%j.out

srun apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif \
    nvidia-smi
EOF
```
### Training with Checkpoints

```bash
#!/bin/bash
#SBATCH --job-name=training-with-checkpoint
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60

checkpoint_handler() {
    echo "Received signal, saving checkpoint..."
    # Signal the Python process to save a checkpoint
    pkill -USR1 -f train.py
}

trap checkpoint_handler USR1

# Run in the background and wait, so bash can deliver the USR1 trap
# while srun is still running (a foreground child would defer the trap)
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python train.py \
    --checkpoint-dir /data0/checkpoints \
    --resume-if-exists &
wait
```
### Batch Processing

```bash
#!/bin/bash
#SBATCH --job-name=batch-inference
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --array=0-9
#SBATCH --time=01:00:00

# Process 10 shards in parallel
SHARD_ID=$SLURM_ARRAY_TASK_ID

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python inference.py \
    --input /data0/input/shard_${SHARD_ID}.json \
    --output /data0/output/shard_${SHARD_ID}.json
```
## Resources

### Official Documentation

- Slurm commands: https://slurm.schedmd.com/man_index.html
- Slurm quick start: https://slurm.schedmd.com/quickstart.html
- Apptainer user guide: https://apptainer.org/docs/user/latest/

### SRP Resources

- Deployment guide: https://starquest.feishu.cn/wiki/TZASwm86nivXLTkMV6kcoJF4n2I
- Oracle OKE Grafana:
  - Cluster: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
  - Workload: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
  - Job stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats
- DO DOKS Grafana:
  - Cluster: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
  - Job stats: https://grafana.g2.yesy.site/d/slurm/slurm
- ssubmit examples: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
## Implementation Steps

When helping users with Slurm jobs:

1. Understand requirements:
   - What workload type? (training, inference, data processing)
   - GPU requirements (quantity, memory)
   - Expected runtime
   - Data input/output locations
2. Choose a cluster:
   - Oracle OKE (H100) for most workloads
   - DO DOKS (H200) for cutting-edge GPU needs
3. Write the job script:
   - Use the modern simplified template (no notification code)
   - Specify appropriate resources
   - Use an Apptainer container with the `--nv` flag
   - Set up proper logging
4. Submit and monitor:
   - Submit with `sbatch` or `ssubmit`
   - Monitor with `squeue` and Grafana
   - Check logs for errors
   - Verify GPU utilization
5. Debug issues:
   - Check Feishu notifications for failure reasons
   - Review log files
   - Use `scontrol` for detailed job info
   - Consult the troubleshooting table
6. Optimize:
   - Adjust batch sizes based on GPU memory
   - Use job arrays for parallel processing
   - Implement checkpointing for long runs
   - Monitor resource usage with `sacct` and `seff`
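The script-writing step above can be sketched as a small generator that turns user requirements into an sbatch script. This is an illustrative helper, not an SRP tool; the defaults mirror the simplified template earlier in this document:

```python
def build_sbatch_script(job_name, partition="h100", gpus=1, cpus=10,
                        mem="32GB", time_limit="02:00:00",
                        image="/data0/apptainer/pytorch_24.01-py3.sif",
                        command="python train.py"):
    """Render a minimal sbatch script matching the simplified template."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=logs/%x_%j.out",
        "#SBATCH --error=logs/%x_%j.err",
        "",
        f"srun apptainer run --nv {image} {command}",
    ])

# Write the rendered script, then submit it with: sbatch gpu-test.sh
print(build_sbatch_script("gpu-test", time_limit="00:10:00"))
```

Generating scripts this way keeps resource requests and log paths consistent across jobs instead of hand-editing copies of the template.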