sagemaker-training-job

Submit ML training jobs to AWS SageMaker — package code, upload to S3, launch on GPU/CPU instances, poll status, download artifacts. Use when training machine learning models that need more compute than the local machine (GPU training, large datasets, parallel experiments). Supports PyTorch, TensorFlow, scikit-learn, XGBoost/LightGBM. Handles spot instances for cost savings. Triggers on "train on SageMaker", "GPU training", "submit training job", "cloud training", "SageMaker", "remote training".

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "sagemaker-training-job" with this command: npx skills add zyyhhxx/sagemaker-training-job

SageMaker Training

Submit ML training jobs to AWS SageMaker from the command line. Supports PyTorch, TensorFlow, scikit-learn, and XGBoost with managed spot training for cost savings.

Prerequisites

  • boto3 Python package installed (pip install boto3); the sagemaker package is recommended as well.
  • AWS credentials available — EC2 instance profile (recommended), or aws configure / env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  • S3 bucket for training artifacts
  • Two IAM roles configured — see references/setup.md for exact policies:
    • Role A (Caller): SageMaker job management + S3 access + ECR image pull
    • Role B (Execution): S3 data access + CloudWatch logs + ECR images

Security Notes

  • AWS credentials are never logged, embedded in scripts, or uploaded to S3. boto3 resolves credentials from the standard chain (env vars → shared config files → instance profile).
  • Source packaging excludes .git, .env, venv, __pycache__, and other non-essential files. Use --source-dir to explicitly scope what gets packaged. Always review --dry-run output before submitting to production.
  • IAM scope: Both caller and execution role policies should be scoped to your specific S3 bucket and SageMaker execution role ARN. See references/setup.md.
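The packaging exclusions above can be sketched with the standard tarfile filter. This is illustrative only; the skill's actual exclude list and packaging logic live in scripts/sagemaker_train.py.

```python
import tarfile
from pathlib import Path

# Assumed exclude list; the skill may exclude more.
EXCLUDES = {".git", ".env", "venv", "__pycache__", ".DS_Store"}


def _filter(info: tarfile.TarInfo):
    # Returning None drops the member; tarfile also skips
    # recursing into a dropped directory.
    parts = Path(info.name).parts
    return None if EXCLUDES.intersection(parts) else info


def package_source(source_dir: str, out_path: str) -> None:
    """Create sourcedir.tar.gz from source_dir, skipping excluded entries."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".", filter=_filter)
```

Reviewing the members of the resulting archive is the manual equivalent of checking --dry-run output.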

Quick Start

1. Write a training script

Follow the SageMaker training script contract: read data from SM_CHANNEL_TRAIN, save model to SM_MODEL_DIR. See references/training-scripts.md for templates.
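A minimal script honoring that contract might look like the sketch below. SM_CHANNEL_TRAIN, SM_MODEL_DIR, and SM_HPS are the standard SageMaker training environment variables; the local fallbacks are an assumption added so the script also runs outside a container.

```python
import json
import os


def train() -> str:
    # Paths injected by the SageMaker training container; local fallbacks
    # make the script runnable on a workstation too.
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "data/train")
    model_dir = os.environ.get("SM_MODEL_DIR", "model")
    # Framework containers expose hyperparameters as JSON in SM_HPS
    # (they also arrive as CLI arguments); all values are strings.
    hps = json.loads(os.environ.get("SM_HPS", "{}"))
    epochs = int(hps.get("epochs", "1"))

    # ... load data from train_dir and fit a real model here ...

    # Everything written to model_dir is tarred into model.tar.gz
    # and uploaded to S3 when the job finishes.
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump({"trained_for_epochs": epochs}, f)
    return model_path


if __name__ == "__main__":
    train()
```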

2. Submit a training job

python3 scripts/sagemaker_train.py \
  --job-name my-experiment-001 \
  --script ./train.py \
  --role arn:aws:iam::ACCOUNT:role/SageMakerRole \
  --bucket my-sagemaker-bucket \
  --instance-type ml.g5.xlarge \
  --spot \
  --framework pytorch \
  --input-data s3://my-bucket/data/train/ \
  --hyperparameters '{"epochs":"50","lr":"0.001"}' \
  --output-dir ./results

The script packages your code, uploads to S3, submits the job, polls until complete, and downloads model artifacts to --output-dir.
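Under the hood, submission boils down to a single CreateTrainingJob API call. The sketch below builds that request; the field names are the real SageMaker API parameters, but the helper itself is hypothetical, not the skill's actual code.

```python
def build_training_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    bucket: str,
    instance_type: str = "ml.g5.xlarge",
    max_runtime: int = 3600,
    spot: bool = False,
    hyperparameters: dict = None,
) -> dict:
    """Assemble the request dict for sagemaker.create_training_job."""
    request = {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime},
        # The API requires all hyperparameter values to be strings.
        "HyperParameters": {k: str(v) for k, v in (hyperparameters or {}).items()},
    }
    if spot:
        request["EnableManagedSpotTraining"] = True
        # Spot jobs must also set a max wait time >= max runtime.
        request["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_runtime
    return request


# Submission (requires AWS credentials):
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_training_job(**build_training_job_request(...))
```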

3. Check cost

# Estimate before running
python3 scripts/sagemaker_cost.py --instance-type ml.g5.xlarge --duration 3600 --spot

# Check actual cost after job completes
python3 scripts/sagemaker_cost.py --job-name my-experiment-001
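The pre-run estimate is simple arithmetic: hourly rate times hours, with a discount when --spot is set. A sketch, assuming a flat 70% spot discount (actual spot savings vary by instance and region) and the on-demand rates listed in the Instance Selection section; check current SageMaker pricing before relying on these numbers.

```python
# Assumed on-demand rates (USD/hr); verify against current pricing.
ON_DEMAND_RATES = {
    "ml.m5.2xlarge": 0.54,
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.41,
    "ml.p3.2xlarge": 4.28,
}


def estimate_cost(instance_type: str, duration_s: int, spot: bool = False) -> float:
    """Rough job cost: hourly rate x hours, with an assumed 70% spot discount."""
    cost = ON_DEMAND_RATES[instance_type] * duration_s / 3600
    return round(cost * 0.3 if spot else cost, 4)
```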

4. List recent jobs

python3 scripts/sagemaker_list.py --max 5
python3 scripts/sagemaker_list.py --status Failed
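These flags map naturally onto the ListTrainingJobs API. The sketch below shows that mapping; the parameter names are the real boto3 ones, while the helper itself is illustrative rather than the skill's code.

```python
def build_list_kwargs(max_results: int = 5, status: str = None) -> dict:
    """Map CLI-style flags onto sagemaker.list_training_jobs parameters."""
    kwargs = {
        "MaxResults": max_results,
        "SortBy": "CreationTime",
        "SortOrder": "Descending",
    }
    if status:  # One of InProgress, Completed, Failed, Stopping, Stopped
        kwargs["StatusEquals"] = status
    return kwargs


# import boto3
# sm = boto3.client("sagemaker")
# resp = sm.list_training_jobs(**build_list_kwargs(5, "Failed"))
# for job in resp["TrainingJobSummaries"]:
#     print(job["TrainingJobName"], job["TrainingJobStatus"])
```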

Key Options

Flag | Purpose | Default
--- | --- | ---
--spot | Managed spot training (up to 70% savings) | off
--instance-type | Compute instance | ml.g5.xlarge
--max-runtime | Kill job after N seconds | 3600
--framework | pytorch, tensorflow, sklearn, xgboost | pytorch
--image-uri | Custom Docker image (overrides framework) | auto
--requirements | requirements.txt for extra deps | none
--dry-run | Print config without submitting | off
--no-wait | Submit and exit without polling | off
--resume JOB | Reconnect to a running/completed job (skip submission) | 
--source-dir | Directory with all training code | script's parent
--input-data | S3 input(s), format: channel:s3://... | none
--env | JSON environment variables | {}

Instance Selection

For tabular/Kaggle workloads:

  • Gradient boosting (LightGBM/XGBoost): ml.m5.2xlarge (CPU, $0.54/hr)
  • Small neural nets: ml.g4dn.xlarge (T4, $0.74/hr) — cheapest GPU
  • Standard deep learning: ml.g5.xlarge (A10G, $1.41/hr) — best price/performance
  • Heavy training: ml.p3.2xlarge (V100, $4.28/hr)

Always use --spot for non-urgent training — typical savings of 30-70%.

Workflow Integration

For autonomous agents running training jobs in a loop:

  1. Prepare data locally or upload to S3
  2. Write training script following the contract in references/training-scripts.md
  3. Use --dry-run first to validate config
  4. Submit with sagemaker_train.py — it blocks until completion by default
  5. Results download automatically to --output-dir
  6. Parse metrics from the output for experiment tracking

For parallel experiments, use --no-wait and poll with sagemaker_list.py.
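Such a polling loop only needs to watch for terminal statuses. A sketch (the TrainingJobStatus values are the real DescribeTrainingJob ones; the helper is hypothetical):

```python
import time

TERMINAL = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(client, job_names, poll_s: int = 30) -> dict:
    """Poll DescribeTrainingJob until every job reaches a terminal status."""
    statuses = {}
    pending = set(job_names)
    while pending:
        for name in list(pending):
            status = client.describe_training_job(TrainingJobName=name)[
                "TrainingJobStatus"
            ]
            if status in TERMINAL:
                statuses[name] = status
                pending.discard(name)
        if pending:
            time.sleep(poll_s)
    return statuses
```

Passing the boto3 SageMaker client in as an argument keeps the loop testable without AWS access.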

Smoke Test

Verify the entire pipeline works end-to-end (~$0.01, takes ~3 min):

python3 scripts/sagemaker_smoke_test.py \
  --role arn:aws:iam::ACCOUNT:role/SageMakerTrainingExecutionRole \
  --bucket my-sagemaker-bucket

This runs a local pre-flight, submits a minimal job to SageMaker, verifies the downloaded model artifact, and checks cost. Use --keep to preserve output files.
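The artifact-verification step amounts to extracting the downloaded model.tar.gz and checking that an expected file is inside. A hypothetical helper; the smoke test's real checks may differ.

```python
import tarfile


def verify_artifact(tar_path: str, expected: str, dest: str) -> bool:
    """Extract a downloaded model.tar.gz and confirm the expected file exists."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)  # artifact produced by our own job
    return any(
        n.lstrip("./") == expected or n.endswith("/" + expected) for n in names
    )
```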

