Slurm Cluster Management

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "slurm" with this command: npx skills add serendipityoneinc/srp-claude-code-marketplace/serendipityoneinc-srp-claude-code-marketplace-slurm

Help developers submit, manage, and troubleshoot GPU-accelerated workloads on SRP's Slurm clusters. Supports training, inference, and data processing jobs using Apptainer containers.

When to Use This Skill

Use this skill when:

  • Submitting GPU training or inference jobs to Slurm clusters

  • Managing running or queued jobs

  • Monitoring cluster resources and job status

  • Debugging job failures or performance issues

  • Writing Slurm job scripts with Apptainer containers

  • Checking GPU availability and utilization

SRP Slurm Clusters

Oracle OKE Cluster (H100 GPUs)

SSH Access:

ssh -p 2222 <your-ldap-username>@129.80.180.16

Example:

ssh -p 2222 zhuguangbin@129.80.180.16

GPU Type: H100
Partition: h100 (must specify in job scripts)
Use Cases: Large model training, high-performance inference
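To avoid retyping the port and address on every login, an entry can be added to ~/.ssh/config (the alias name srp-oke below is an arbitrary choice for this example, not an official name):

```
Host srp-oke
    HostName 129.80.180.16
    Port 2222
    User <your-ldap-username>
```

After that, ssh srp-oke is equivalent to the full command above; a similar entry can be made for the DO cluster.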

DO DOKS Cluster (H200 GPUs)

SSH Access:

ssh -p 2222 <your-ldap-username>@129.212.240.50

Example:

ssh -p 2222 zhuguangbin@129.212.240.50

GPU Type: H200
Partition: Specify in job scripts
Use Cases: Latest GPU workloads, large-scale training

Data Access

Both clusters use JuiceFS for unified data access:

  • Path: /data0/ or /data/srp/

  • Same permissions and directory structure as development machines

  • Shared across all cluster nodes and with A10 dev machines

Monitoring

Oracle OKE Cluster Dashboards:

DO DOKS Cluster Dashboards:

Metrics Available:

  • Cluster resource utilization

  • GPU availability and usage

  • Job queue status

  • Per-job resource consumption

  • Historical workload patterns

Essential Slurm Commands

Job Submission

Submit batch job script

sbatch job_script.sh

Submit with ssubmit wrapper (recommended)

ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 -cmd "python train.py"

Interactive job allocation

salloc --partition=h100 --gres=gpu:1 --time=01:00:00

Run command directly

srun --partition=h100 --gres=gpu:1 python test.py

Job Management

View your jobs

squeue -u $USER

View all jobs

squeue

View specific job details

scontrol show job <job_id>

Cancel job

scancel <job_id>

Cancel all your jobs

scancel -u $USER

Cancel jobs by name

scancel --name=job_name

Cluster Information

View partitions and nodes

sinfo

View detailed node info

sinfo -N -l

Check GPU availability

sinfo -o "%20N %10c %10m %25f %10G"

View specific partition

sinfo -p h100

Job History

View completed jobs

sacct

View specific job details

sacct -j <job_id> --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode

View jobs from last week

sacct --starttime=now-7days --format=JobID,JobName,Elapsed,State,ExitCode
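Elapsed values from sacct are formatted as [DD-]HH:MM:SS. When post-processing accounting output it can help to convert them to seconds; a small helper function (a convenience sketch, not part of Slurm):

```shell
# Convert a Slurm elapsed string ([DD-]HH:MM:SS) to seconds.
to_seconds() {
  local t=$1 d=0 h m s
  case $t in
    *-*) d=${t%%-*}; t=${t#*-} ;;   # split off the day count, if present
  esac
  IFS=: read -r h m s <<< "$t"
  # 10# forces base-10 so leading zeros (e.g. 08) are not read as octal.
  echo $(( (d * 24 + 10#$h) * 3600 + 10#$m * 60 + 10#$s ))
}

to_seconds 02:30:00    # 2.5 hours -> 9000
to_seconds 1-00:00:10  # 1 day + 10 seconds -> 86410
```

This pairs well with sacct --format=JobID,Elapsed when summing time across jobs.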

Job Script Structure

Modern Slurm Script (Simplified)

The new Slinky Slurm clusters use prolog/epilog for notifications, so scripts are much simpler:

#!/bin/bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --job-name=my-training-job
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=32GB
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=slurm-notification@srp.one

set -x

#==============================
# Environment Setup
#==============================
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(shuf -i 1000-65535 -n 1)

export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# Set your tokens (replace with actual values)
export HF_TOKEN=your_huggingface_token_here
export WANDB_API_KEY=your_wandb_api_key_here
export WANDB_PROJECT=${SLURM_JOB_NAME}
export WANDB_NAME=${SLURM_JOB_NAME}-$(date +%Y%m%d%H%M%S)

#==============================
# Pre-task initialization
#==============================
echo "Running pre-task initialization..."
# Your setup commands here

#==============================
# Main Job Execution
#==============================
echo "Starting main task..."
srun -v -l --jobid $SLURM_JOBID --job-name=${SLURM_JOB_NAME} \
  --output $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.out \
  --error $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.err \
  apptainer run --fakeroot --writable-tmpfs --nv \
  /data0/apptainer/pytorch_24.01-py3.sif bash -ex << 'EOF'

# ==== YOUR JOB COMMANDS START ====
echo "Training started at $(date)"

python train.py \
  --model gpt2 \
  --batch-size 32 \
  --epochs 10 \
  --output-dir /data0/models/

nvidia-smi

echo "Training completed at $(date)"
# ==== YOUR JOB COMMANDS END ====
EOF
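The environment-setup lines can be exercised without a cluster by substituting what scontrol show hostnames would print (the node names below are made up):

```shell
# Stand-in for `scontrol show hostnames $SLURM_JOB_NODELIST` output:
nodelist=$'gpu-node-01\ngpu-node-02'

MASTER_ADDR=$(head -n 1 <<< "$nodelist")   # first node acts as the rendezvous host
MASTER_PORT=$(shuf -i 1000-65535 -n 1)     # random port, same range as the script

echo "master: $MASTER_ADDR"
```

Every task in the job computes the same MASTER_ADDR because the nodelist ordering is identical across tasks; the port only needs to be consistent because it is exported once and inherited by srun tasks.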

Key SBATCH Parameters

| Parameter | Description | Example |
|---|---|---|
| --job-name | Job name (shows in squeue) | my-training |
| --partition | Cluster partition | h100 |
| --gres | GPU resources | gpu:H100:1 (1 GPU); gpu:H100:2 (2 GPUs) |
| --nodes | Number of nodes | 1 (single node); 2 (distributed) |
| --cpus-per-task | CPUs per task | 10 |
| --mem | Memory per node | 32GB |
| --time | Max runtime | 02:00:00 (2 hours) |
| --output | stdout log file | logs/%x_%j.out |
| --error | stderr log file | logs/%x_%j.err |
| --mail-type | Email notification | ALL, FAIL, END |

Log File Placeholders:

  • %x: Job name

  • %j: Job ID

  • %s: Step ID

  • %t: Task ID

  • %N: Node name
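Slurm performs this substitution itself when it opens the log files; to see what the per-task srun pattern above expands to, the same string can be built by hand (the values below are hypothetical):

```shell
job_name="my-training-job"  # %x
job_id=12345                # %j
step_id=0                   # %s
task_id=3                   # %t
node="gpu-node-01"          # %N

log="logs/${job_name}_${job_id}_${step_id}_${task_id}_${node}.out"
echo "$log"
```

Including %t and %N gives each task on each node its own file, which keeps multi-task output from interleaving.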

Multi-Node Distributed Training

#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=h100
#SBATCH --nodes=2
#SBATCH --gres=gpu:H100:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --mem=64GB
#SBATCH --time=04:00:00

set -x

# Distributed training setup
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=12345
export WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
  python -m torch.distributed.launch \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_distributed.py
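The WORLD_SIZE computation, and the global rank each launcher derives from it, can be checked with plain arithmetic (the SLURM_* values below are stand-ins for what Slurm would set):

```shell
SLURM_NNODES=2
SLURM_NTASKS_PER_NODE=2

WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))
echo "world size: $WORLD_SIZE"   # 2 nodes x 2 tasks each = 4 ranks

# Global rank of local rank 0 on the second node (node_rank=1):
node_rank=1
local_rank=0
echo "global rank: $((node_rank * SLURM_NTASKS_PER_NODE + local_rank))"
```

If WORLD_SIZE disagrees with nodes x GPUs requested, NCCL initialization will hang waiting for the missing ranks, so this is worth sanity-checking before long runs.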

Using Apptainer Containers

Available Container Images

Location: /data0/apptainer/

Common Images:

  • pytorch_24.01-py3.sif: PyTorch 24.01 with Python 3

  • ray_2.52.0-py310-gpu.sif: Ray 2.52.0 with Python 3.10

  • Custom images built for specific projects

Apptainer Command Patterns

Run container with GPU support

apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif python script.py

Shell into container

apptainer shell --nv /data0/apptainer/pytorch_24.01-py3.sif

Execute single command

apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif nvidia-smi

With additional flags

apptainer run --fakeroot --writable-tmpfs --nv <image.sif> <command>

Common Flags:

  • --nv: Enable NVIDIA GPU support

  • --fakeroot: Fake root user privileges (for installing packages)

  • --writable-tmpfs: Create writable temporary filesystem

  • --bind <src>:<dst>: Mount additional directories

Interactive Container Session

Start interactive job with Apptainer

sapptainer -c 20 -m 200G -g 1 -p h100 -i /data0/apptainer/pytorch_24.01-py3.sif

Parameters:

  • -c: CPUs

  • -m: Memory

  • -g: GPUs

  • -p: Partition

  • -i: Container image

Using ssubmit Wrapper

SRP provides the ssubmit wrapper to simplify job submission:

Basic usage

ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 \
  -cmd "python train.py"

With custom script

ssubmit -j my-job -p h100 -g 2 -s job_script.sh

Interactive mode

ssubmit -j interactive -p h100 -g 1 -i

Parameters:

  • -j: Job name

  • -p: Partition (h100, compute)

  • -g: Number of GPUs

  • -c: Number of CPUs

  • -m: Memory (e.g., 32G)

  • -t: Time limit (HH:MM:SS)

  • -cmd: Command to run

  • -s: Script file to execute

  • -i: Interactive mode

Reference: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
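Conceptually, a wrapper like ssubmit maps its short flags onto the equivalent sbatch options. The mapping below is an illustrative assumption, not ssubmit's actual implementation (see the linked README for the real behavior):

```shell
# Hypothetical translation of ssubmit-style flags into an sbatch command line.
build_sbatch() {
  local job=$1 part=$2 gpus=$3 cpus=$4 mem=$5 tlimit=$6 cmd=$7
  printf 'sbatch --job-name=%s --partition=%s --gres=gpu:%s --cpus-per-task=%s --mem=%s --time=%s --wrap=%q\n' \
    "$job" "$part" "$gpus" "$cpus" "$mem" "$tlimit" "$cmd"
}

build_sbatch job_name h100 1 10 32G 2:00:00 "python train.py"
```

Thinking of the wrapper this way makes it easy to fall back to raw sbatch when a flag you need is not exposed.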

Feishu Notifications

Slurm clusters automatically send Feishu notifications for job events via prolog/epilog:

Notification Types:

  • ✅ Job started

  • ✅ Job completed successfully

  • ❌ Job failed with error code

  • ⏱️ Job timeout

  • 🛑 Job cancelled

Notification Channel: slurm-notification@srp.one

What's Included:

  • Job ID, name, partition

  • Node allocation

  • Start and end time

  • Exit status

  • Resource usage summary

  • Log file locations

No Action Needed: Notifications are automatic; there is no need to add notification code to your scripts.

Best Practices

Resource Allocation

Request What You Need:

  • Don't over-request CPUs/memory - it delays scheduling

  • Start with minimal resources, scale up if needed

GPU Utilization:

  • Use nvidia-smi to verify GPU is being used

  • Monitor GPU memory with nvidia-smi dmon

Time Limits:

  • Set realistic time limits (slightly above expected)

  • Jobs exceeding time limit are killed

Partitions:

  • Always specify partition explicitly

  • Use h100 on the Oracle cluster; choose the appropriate partition on the DO cluster

Job Organization

Organize logs by date (note: sbatch filename patterns do not expand date codes such as %Y%m%d, so build the dated path at submission time and create the directory first — Slurm does not create missing log directories):

mkdir -p logs/$(date +%Y%m%d)
sbatch --output=logs/$(date +%Y%m%d)/%x_%j.out --error=logs/$(date +%Y%m%d)/%x_%j.err job.sh

Or by job name (create logs/<job-name>/ before submitting):

#SBATCH --output=logs/%x/%j.out
#SBATCH --error=logs/%x/%j.err

Checkpoint and Resume

import os

import torch

checkpoint_dir = "/data0/checkpoints"
checkpoint_path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")

# Save checkpoints periodically
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, checkpoint_path)

# Resume from checkpoint
if os.path.exists(checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    start_epoch = checkpoint['epoch'] + 1

Error Handling

Set bash options for safety

set -e           # Exit on error
set -u           # Error on undefined variable
set -x           # Print commands (useful for debugging)
set -o pipefail  # Exit on pipe failure

Add error traps

trap 'echo "Error on line $LINENO"; exit 1' ERR
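The ERR trap can be seen firing in an isolated shell, with no Slurm involved:

```shell
bash -c '
trap "echo trap fired, continuing cleanup" ERR
false   # a failing simple command triggers the ERR trap
echo "reached the end"
'
```

Note that without set -e the script keeps running after the trap fires, which is why the trap above exits explicitly when used together with the options from the previous block.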

Monitoring and Debugging

Check Job Status

Detailed job info

scontrol show job <job_id>

Watch job queue

watch -n 5 squeue -u $USER

Check why job is pending

squeue -j <job_id> --start

View Logs

Tail logs while job runs

tail -f logs/job_name_12345.out

View last 100 lines

tail -n 100 logs/job_name_12345.out

Search for errors

grep -i error logs/job_name_12345.err

GPU Monitoring

Inside running job container

nvidia-smi

Continuous monitoring

nvidia-smi dmon

Detailed GPU utilization

nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 5

Resource Usage

Check job efficiency

seff <job_id>

Detailed accounting

sacct -j <job_id> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State

Common Issues and Solutions

| Issue | Cause | Solution |
|---|---|---|
| Job pending forever | No available resources | Check sinfo for available GPUs; adjust resource requests |
| "Out of memory" error | Insufficient memory request | Increase --mem in job script |
| GPU not detected | Missing --gres or --nv | Add --gres=gpu:X to sbatch, --nv to apptainer |
| Container not found | Wrong image path | Verify path in /data0/apptainer/ |
| Permission denied | File permissions issue | Check file ownership and permissions |
| Module not found | Missing Python packages | Install in container or use different image |
| NCCL timeout | Network issues in distributed training | Check NCCL env vars; verify nodes can communicate |
| Killed job (OOM) | Memory exceeded | Reduce batch size or increase --mem |

Quick Reference

Essential Commands

Submit job

sbatch job.sh

Check queue

squeue -u $USER

Job details

scontrol show job <job_id>

Cancel job

scancel <job_id>

View logs

tail -f logs/job_*.out

Cluster info

sinfo -p h100

Job history

sacct --starttime=today

Example Workflows

  1. Quick GPU Test

Submit test job

sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=h100
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=test_%j.out

srun apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif \
  nvidia-smi
EOF

  2. Training with Checkpoints

#!/bin/bash
#SBATCH --job-name=training-with-checkpoint
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60

checkpoint_handler() {
  echo "Received signal, saving checkpoint..."
  # Signal Python process to save checkpoint
  pkill -USR1 -f train.py
}

trap checkpoint_handler USR1

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
  python train.py \
    --checkpoint-dir /data0/checkpoints \
    --resume-if-exists

  3. Batch Processing

#!/bin/bash
#SBATCH --job-name=batch-inference
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --array=0-9
#SBATCH --time=01:00:00

# Process 10 shards in parallel
SHARD_ID=$SLURM_ARRAY_TASK_ID

srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
  python inference.py \
    --input /data0/input/shard_${SHARD_ID}.json \
    --output /data0/output/shard_${SHARD_ID}.json
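Each array task sees a different SLURM_ARRAY_TASK_ID, so --array=0-9 fans the same script out over ten shard files. The index-to-file mapping can be previewed locally (the loop variable stands in for what Slurm sets per task):

```shell
# Preview which input file each array task would read.
for SLURM_ARRAY_TASK_ID in 0 1 9; do
  echo "/data0/input/shard_${SLURM_ARRAY_TASK_ID}.json"
done
```

The same pattern scales to any shard count by changing --array and the file naming together.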

Resources

Official Documentation

SRP Resources

Implementation Steps

When helping users with Slurm jobs:

Understand Requirements:

  • What workload type? (training, inference, data processing)

  • GPU requirements (quantity, memory)

  • Expected runtime

  • Data input/output locations

Choose Cluster:

  • Oracle OKE (H100) for most workloads

  • DO DOKS (H200) for cutting-edge GPU needs

Write Job Script:

  • Use modern simplified template (no notification code)

  • Specify appropriate resources

  • Use Apptainer container with --nv flag

  • Set up proper logging

Submit and Monitor:

  • Submit with sbatch or ssubmit

  • Monitor with squeue and Grafana

  • Check logs for errors

  • Verify GPU utilization

Debug Issues:

  • Check Feishu notifications for failure reasons

  • Review log files

  • Use scontrol for detailed job info

  • Consult troubleshooting table

Optimize:

  • Adjust batch sizes based on GPU memory

  • Use job arrays for parallel processing

  • Implement checkpointing for long runs

  • Monitor resource usage with sacct and seff

