# Slurm Cluster Management

Help developers submit, manage, and troubleshoot GPU-accelerated workloads on SRP's Slurm clusters. Supports training, inference, and data-processing jobs using Apptainer containers.
## When to Use This Skill

Use this skill when:

- Submitting GPU training or inference jobs to Slurm clusters
- Managing running or queued jobs
- Monitoring cluster resources and job status
- Debugging job failures or performance issues
- Writing Slurm job scripts with Apptainer containers
- Checking GPU availability and utilization
## SRP Slurm Clusters

### Oracle OKE Cluster (H100 GPUs)

SSH access:

```bash
ssh -p 2222 <your-ldap-username>@129.80.180.16

# Example:
ssh -p 2222 zhuguangbin@129.80.180.16
```

- GPU type: H100
- Partition: `h100` (must be specified in job scripts)
- Use cases: large model training, high-performance inference

### DO DOKS Cluster (H200 GPUs)

SSH access:

```bash
ssh -p 2222 <your-ldap-username>@129.212.240.50

# Example:
ssh -p 2222 zhuguangbin@129.212.240.50
```

- GPU type: H200
- Partition: specify in job scripts
- Use cases: latest GPU workloads, large-scale training
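If you connect to these clusters often, an OpenSSH client config entry saves retyping the port and address. The host aliases below (`slurm-oke`, `slurm-doks`) are illustrative names, not an SRP convention:

```
# ~/.ssh/config
Host slurm-oke
    HostName 129.80.180.16
    Port 2222
    User <your-ldap-username>

Host slurm-doks
    HostName 129.212.240.50
    Port 2222
    User <your-ldap-username>
```

With this in place, `ssh slurm-oke` is equivalent to the full command above.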
## Data Access

Both clusters use JuiceFS for unified data access:

- Path: `/data0/` or `/data/srp/`
- Same permissions and directory structure as the development machines
- Shared across all cluster nodes and with the A10 dev machines
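Before launching a long job, it can be worth verifying that the shared filesystem is actually mounted on the node. A minimal sketch, assuming the paths listed above (the helper name is illustrative, not an SRP utility):

```python
import os

def find_data_root(candidates=("/data0", "/data/srp")):
    """Return the first existing data root, or None if the mount is missing."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

# Usage at the top of a training script:
# if find_data_root() is None:
#     raise SystemExit("JuiceFS mount not found; refusing to start")
```

Failing fast here is cheaper than discovering a missing mount an hour into a training run.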
## Monitoring

Oracle OKE cluster dashboards:

- Cluster overview: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload monitoring: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
- Job-level stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats

DO DOKS cluster dashboards:

- Cluster overview: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload monitoring: https://grafana.g2.yesy.site/d/workload/workload
- Job-level stats: https://grafana.g2.yesy.site/d/slurm/slurm

Available metrics:

- Cluster resource utilization
- GPU availability and usage
- Job queue status
- Per-job resource consumption
- Historical workload patterns
## Essential Slurm Commands

### Job Submission

```bash
# Submit a batch job script
sbatch job_script.sh

# Submit with the ssubmit wrapper (recommended)
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 -cmd "python train.py"

# Interactive job allocation
salloc --partition=h100 --gres=gpu:1 --time=01:00:00

# Run a command directly
srun --partition=h100 --gres=gpu:1 python test.py
```

### Job Management

```bash
# View your jobs
squeue -u $USER

# View all jobs
squeue

# View specific job details
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# Cancel jobs by name
scancel --name=job_name
```

### Cluster Information

```bash
# View partitions and nodes
sinfo

# View detailed node info
sinfo -N -l

# Check GPU availability
sinfo -o "%20N %10c %10m %25f %10G"

# View a specific partition
sinfo -p h100
```

### Job History

```bash
# View completed jobs
sacct

# View specific job details
sacct -j <job_id> --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode

# View jobs from the last week
sacct --starttime=now-7days --format=JobID,JobName,Elapsed,State,ExitCode
```
## Job Script Structure

### Modern Slurm Script (Simplified)

The new Slinky Slurm clusters use prolog/epilog hooks for notifications, so scripts are much simpler:

```bash
#!/bin/bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --job-name=my-training-job
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=32GB
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=slurm-notification@srp.one

set -x

#==============================
# Environment Setup
#==============================
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(shuf -i 1000-65535 -n 1)

export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# Set your tokens (replace with actual values)
export HF_TOKEN=your_huggingface_token_here
export WANDB_API_KEY=your_wandb_api_key_here
export WANDB_PROJECT=${SLURM_JOB_NAME}
export WANDB_NAME=${SLURM_JOB_NAME}-$(date +%Y%m%d%H%M%S)

#==============================
# Pre-task initialization
#==============================
echo "Running pre-task initialization..."
# Your setup commands here

#==============================
# Main Job Execution
#==============================
echo "Starting main task..."

srun -v -l --jobid $SLURM_JOBID --job-name=${SLURM_JOB_NAME} \
    --output $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.out \
    --error $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.err \
    apptainer run --fakeroot --writable-tmpfs --nv \
    /data0/apptainer/pytorch_24.01-py3.sif bash -ex << 'EOF'
# ==== YOUR JOB COMMANDS START ====
echo "Training started at $(date)"

python train.py \
    --model gpt2 \
    --batch-size 32 \
    --epochs 10 \
    --output-dir /data0/models/

nvidia-smi
echo "Training completed at $(date)"
# ==== YOUR JOB COMMANDS END ====
EOF
```
### Key SBATCH Parameters

| Parameter | Description | Example |
|---|---|---|
| `--job-name` | Job name (shows in `squeue`) | `my-training` |
| `--partition` | Cluster partition | `h100` |
| `--gres` | GPU resources | `gpu:H100:1` (1 GPU), `gpu:H100:2` (2 GPUs) |
| `--nodes` | Number of nodes | `1` (single node), `2` (distributed) |
| `--cpus-per-task` | CPUs per task | `10` |
| `--mem` | Memory per node | `32GB` |
| `--time` | Max runtime | `02:00:00` (2 hours) |
| `--output` | stdout log file | `logs/%x_%j.out` |
| `--error` | stderr log file | `logs/%x_%j.err` |
| `--mail-type` | Email notification | `ALL`, `FAIL`, `END` |
Log file placeholders:

- `%x` - Job name
- `%j` - Job ID
- `%s` - Step ID
- `%t` - Task ID
- `%N` - Node name
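To predict where a job's logs will land, the pattern can be expanded by hand. A small illustrative helper (not part of Slurm; the function name and mapping are my own, covering only the placeholders listed above):

```python
def expand_log_pattern(pattern, job_name, job_id, step_id=0, task_id=0, node="node01"):
    """Expand the common Slurm filename placeholders the way sbatch/srun would."""
    return (pattern
            .replace("%x", job_name)
            .replace("%j", str(job_id))
            .replace("%s", str(step_id))
            .replace("%t", str(task_id))
            .replace("%N", node))

print(expand_log_pattern("logs/%x_%j.out", "my-training-job", 12345))
# logs/my-training-job_12345.out
```

This is handy when scripting log collection for many jobs from `sacct` output.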
### Multi-Node Distributed Training

```bash
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=h100
#SBATCH --nodes=2
#SBATCH --gres=gpu:H100:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --mem=64GB
#SBATCH --time=04:00:00

set -x

# Distributed training setup
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=12345
export WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python -m torch.distributed.launch \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_distributed.py
```

Note: `torch.distributed.launch` is deprecated in recent PyTorch releases; `torchrun` is the recommended replacement and accepts equivalent flags.
## Using Apptainer Containers

### Available Container Images

Location: `/data0/apptainer/`

Common images:

- `pytorch_24.01-py3.sif` - PyTorch 24.01 with Python 3
- `ray_2.52.0-py310-gpu.sif` - Ray 2.52.0 with Python 3.10
- Custom images built for specific projects

### Apptainer Command Patterns

```bash
# Run a container with GPU support
apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif python script.py

# Shell into a container
apptainer shell --nv /data0/apptainer/pytorch_24.01-py3.sif

# Execute a single command
apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif nvidia-smi

# With additional flags
apptainer run --fakeroot --writable-tmpfs --nv <image.sif> <command>
```

Common flags:

- `--nv` - Enable NVIDIA GPU support
- `--fakeroot` - Fake root-user privileges (for installing packages)
- `--writable-tmpfs` - Create a writable temporary filesystem
- `--bind <src>:<dst>` - Mount additional directories
### Interactive Container Session

```bash
# Start an interactive job with Apptainer
sapptainer -c 20 -m 200G -g 1 -p h100 -i /data0/apptainer/pytorch_24.01-py3.sif
```

Parameters:

- `-c`: CPUs
- `-m`: Memory
- `-g`: GPUs
- `-p`: Partition
- `-i`: Container image
## Using the ssubmit Wrapper

SRP provides the ssubmit wrapper for simplified job submission:

```bash
# Basic usage
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 \
    -cmd "python train.py"

# With a custom script
ssubmit -j my-job -p h100 -g 2 -s job_script.sh

# Interactive mode
ssubmit -j interactive -p h100 -g 1 -i
```

Parameters:

- `-j` - Job name
- `-p` - Partition (`h100`, `compute`)
- `-g` - Number of GPUs
- `-c` - Number of CPUs
- `-m` - Memory (e.g., `32G`)
- `-t` - Time limit (HH:MM:SS)
- `-cmd` - Command to run
- `-s` - Script file to execute
- `-i` - Interactive mode

Reference: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
## Feishu Notifications

The Slurm clusters automatically send Feishu notifications for job events via prolog/epilog hooks.

Notification types:

- ✅ Job started
- ✅ Job completed successfully
- ❌ Job failed with error code
- ⏱️ Job timeout
- 🛑 Job cancelled

Notification channel: slurm-notification@srp.one

What's included:

- Job ID, name, partition
- Node allocation
- Start and end time
- Exit status
- Resource usage summary
- Log file locations

No action needed: notifications are automatic, so there is no need to add notification code to your scripts.
## Best Practices

### Resource Allocation

Request what you need:

- Don't over-request CPUs or memory - it delays scheduling
- Start with minimal resources and scale up if needed

GPU utilization:

- Use `nvidia-smi` to verify the GPU is being used
- Monitor GPU memory with `nvidia-smi dmon`

Time limits:

- Set realistic time limits (slightly above the expected runtime)
- Jobs exceeding their time limit are killed

Partitions:

- Always specify the partition explicitly
- Use `h100` on Oracle, and the appropriate partition on DO

### Job Organization

Note that Slurm's filename patterns do not expand date formats such as `%Y%m%d`; to group logs by date, create a dated directory at submission time (e.g. `mkdir -p logs/$(date +%Y%m%d)` before `sbatch`). To organize by job name, use the `%x` placeholder:

```bash
#SBATCH --output=logs/%x/%j.out
#SBATCH --error=logs/%x/%j.err
```
### Checkpoint and Resume

Save checkpoints periodically:

```python
import os

import torch

checkpoint_dir = "/data0/checkpoints"
checkpoint_path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")

torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, checkpoint_path)

# Resume from a checkpoint
if os.path.exists(checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1
```

### Error Handling

```bash
# Set bash options for safety
set -e           # Exit on error
set -u           # Error on undefined variables
set -x           # Print commands (useful for debugging)
set -o pipefail  # Fail a pipeline if any command in it fails

# Add an error trap
trap 'echo "Error on line $LINENO"; exit 1' ERR
```
## Monitoring and Debugging

### Check Job Status

```bash
# Detailed job info
scontrol show job <job_id>

# Watch the job queue
watch -n 5 squeue -u $USER

# Check why a job is pending
squeue -j <job_id> --start
```

### View Logs

```bash
# Tail logs while the job runs
tail -f logs/job_name_12345.out

# View the last 100 lines
tail -n 100 logs/job_name_12345.out

# Search for errors
grep -i error logs/job_name_12345.err
```

### GPU Monitoring

```bash
# Inside a running job container
nvidia-smi

# Continuous monitoring
nvidia-smi dmon

# Detailed GPU utilization, sampled every 5 seconds
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 5
```

### Resource Usage

```bash
# Check job efficiency
seff <job_id>

# Detailed accounting
sacct -j <job_id> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State
```
## Common Issues and Solutions

| Issue | Cause | Solution |
|---|---|---|
| Job pending forever | No available resources | Check `sinfo` for available GPUs; adjust resource requests |
| "Out of memory" error | Insufficient memory request | Increase `--mem` in the job script |
| GPU not detected | Missing `--gres` or `--nv` | Add `--gres=gpu:X` to sbatch and `--nv` to apptainer |
| Container not found | Wrong image path | Verify the path under `/data0/apptainer/` |
| Permission denied | File permissions issue | Check file ownership and permissions |
| Module not found | Missing Python packages | Install in the container or use a different image |
| NCCL timeout | Network issues in distributed training | Check NCCL env vars; verify nodes can communicate |
| Killed job (OOM) | Memory exceeded | Reduce the batch size or increase `--mem` |
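When triaging many jobs at once, parsing machine-readable `sacct` output is quicker than eyeballing tables. A hedged sketch: the parser below assumes output from `sacct --parsable2 --noheader --format=JobID,State,ExitCode` (pipe-delimited, one job per line), and the helper name is illustrative:

```python
def failed_jobs(sacct_output):
    """Return (job_id, state, exit_code) for jobs that did not finish cleanly."""
    bad = []
    for line in sacct_output.strip().splitlines():
        job_id, state, exit_code = line.split("|")
        # Anything outside the healthy states is worth a look
        if state not in ("COMPLETED", "RUNNING", "PENDING"):
            bad.append((job_id, state, exit_code))
    return bad

sample = "101|COMPLETED|0:0\n102|OUT_OF_MEMORY|0:125\n103|TIMEOUT|0:0"
print(failed_jobs(sample))
# [('102', 'OUT_OF_MEMORY', '0:125'), ('103', 'TIMEOUT', '0:0')]
```

States such as `OUT_OF_MEMORY` and `TIMEOUT` map directly onto the rows of the table above.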
## Quick Reference

### Essential Commands

```bash
# Submit a job
sbatch job.sh

# Check the queue
squeue -u $USER

# Job details
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

# View logs
tail -f logs/job_*.out

# Cluster info
sinfo -p h100

# Job history
sacct --starttime=today
```
## Example Workflows

### Quick GPU Test

```bash
# Submit a test job
sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=h100
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=test_%j.out

srun apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif \
    nvidia-smi
EOF
```
### Training with Checkpoints

```bash
#!/bin/bash
#SBATCH --job-name=training-with-checkpoint
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60

checkpoint_handler() {
    echo "Received signal, saving checkpoint..."
    # Signal the Python process to save a checkpoint
    pkill -USR1 -f train.py
}

trap checkpoint_handler USR1

# Run in the background and wait, so bash can deliver the USR1 trap
# while srun is still running (a foreground child would defer the trap)
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python train.py \
    --checkpoint-dir /data0/checkpoints \
    --resume-if-exists &
wait
```
### Batch Processing

```bash
#!/bin/bash
#SBATCH --job-name=batch-inference
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --array=0-9
#SBATCH --time=01:00:00

# Process 10 shards in parallel
SHARD_ID=$SLURM_ARRAY_TASK_ID

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python inference.py \
    --input /data0/input/shard_${SHARD_ID}.json \
    --output /data0/output/shard_${SHARD_ID}.json
```
## Resources

### Official Documentation

- Slurm commands: https://slurm.schedmd.com/man_index.html
- Slurm quick start: https://slurm.schedmd.com/quickstart.html
- Apptainer user guide: https://apptainer.org/docs/user/latest/

### SRP Resources

- Deployment guide: https://starquest.feishu.cn/wiki/TZASwm86nivXLTkMV6kcoJF4n2I
- Oracle OKE Grafana:
  - Cluster: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
  - Workload: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
  - Job stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats
- DO DOKS Grafana:
  - Cluster: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
  - Job stats: https://grafana.g2.yesy.site/d/slurm/slurm
- ssubmit examples: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
## Implementation Steps

When helping users with Slurm jobs:

1. Understand requirements:
   - What workload type? (training, inference, data processing)
   - GPU requirements (quantity, memory)
   - Expected runtime
   - Data input/output locations
2. Choose a cluster:
   - Oracle OKE (H100) for most workloads
   - DO DOKS (H200) for cutting-edge GPU needs
3. Write the job script:
   - Use the modern simplified template (no notification code)
   - Specify appropriate resources
   - Use an Apptainer container with the `--nv` flag
   - Set up proper logging
4. Submit and monitor:
   - Submit with `sbatch` or `ssubmit`
   - Monitor with `squeue` and Grafana
   - Check logs for errors
   - Verify GPU utilization
5. Debug issues:
   - Check Feishu notifications for failure reasons
   - Review log files
   - Use `scontrol` for detailed job info
   - Consult the troubleshooting table
6. Optimize:
   - Adjust batch sizes based on GPU memory
   - Use job arrays for parallel processing
   - Implement checkpointing for long runs
   - Monitor resource usage with `sacct` and `seff`
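The script-writing step above can be sketched as a small generator that turns user requirements into an sbatch script. This is an illustrative helper, not an SRP tool; the defaults mirror the simplified template earlier in this document:

```python
def build_sbatch_script(job_name, partition="h100", gpus=1, cpus=10,
                        mem="32GB", time_limit="02:00:00",
                        image="/data0/apptainer/pytorch_24.01-py3.sif",
                        command="python train.py"):
    """Render a minimal sbatch script matching the simplified template."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=logs/%x_%j.out",
        "#SBATCH --error=logs/%x_%j.err",
        "",
        f"srun apptainer run --nv {image} {command}",
    ])

# Write the rendered script, then submit it with: sbatch gpu-test.sh
print(build_sbatch_script("gpu-test", time_limit="00:10:00"))
```

Generating scripts this way keeps resource requests and log paths consistent across jobs instead of hand-editing copies of the template.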