SageMaker Training
Submit ML training jobs to AWS SageMaker from the command line. Supports PyTorch, TensorFlow, scikit-learn, and XGBoost with managed spot training for cost savings.
Prerequisites
- boto3 Python package installed (pip install boto3). The sagemaker package is recommended.
- AWS credentials available: EC2 instance profile (recommended), or aws configure / env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- S3 bucket for training artifacts
- Two IAM roles configured (see references/setup.md for exact policies):
  - Role A (Caller): SageMaker job management + S3 access + ECR image pull
  - Role B (Execution): S3 data access + CloudWatch logs + ECR images
Security Notes
- AWS credentials are never logged, embedded in scripts, or uploaded to S3. boto3 resolves credentials from its standard chain (env vars → shared credentials/config files → instance profile).
- Source packaging excludes .git, .env, venv, __pycache__, and other non-essential files. Use --source-dir to explicitly scope what gets packaged. Always review --dry-run output before submitting to production.
- IAM scope: Both caller and execution role policies should be scoped to your specific S3 bucket and SageMaker execution role ARN. See references/setup.md.
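The packaging exclusions above can be sketched roughly as follows. This is an illustration only, not the code in scripts/sagemaker_train.py: the names EXCLUDED_NAMES, is_excluded, and package_source are hypothetical, and the exclusion set lists only the four entries named above.

```python
import tarfile
from pathlib import Path

# Assumed exclusion list: only the entries named in this README.
# The real script may exclude more.
EXCLUDED_NAMES = {".git", ".env", "venv", "__pycache__"}


def is_excluded(relative_path: Path) -> bool:
    """Return True if any component of the path matches an excluded name."""
    return any(part in EXCLUDED_NAMES for part in relative_path.parts)


def package_source(source_dir: str, archive_path: str) -> list[str]:
    """Write a gzip tarball of source_dir, skipping excluded entries.

    Returns the sorted relative paths that were packaged so the caller can
    print them for review before anything is uploaded.
    """
    root = Path(source_dir).resolve()
    archive = Path(archive_path).resolve()
    packaged = []
    with tarfile.open(str(archive), "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root)
            # Skip directories, excluded entries, and the archive itself
            # (in case it is being written inside source_dir).
            if not path.is_file() or is_excluded(rel) or path == archive:
                continue
            tar.add(str(path), arcname=rel.as_posix())
            packaged.append(rel.as_posix())
    return packaged
```

Printing the returned list gives a quick way to confirm that no secrets or virtual environments ended up in the archive.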
Quick Start
1. Write a training script
Follow the SageMaker training script contract: read data from SM_CHANNEL_TRAIN,
save model to SM_MODEL_DIR. See references/training-scripts.md for templates.
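As a minimal sketch of that contract (not one of the templates in references/training-scripts.md), the script below reads headerless numeric CSV files from the train channel and writes a JSON file to the model directory. The "training" is a placeholder that averages the last column, and the file name model.json is an assumption made for illustration.

```python
import csv
import json
import os
from pathlib import Path


def train(train_dir: str, model_dir: str) -> dict:
    """Placeholder training: average the last column of every CSV file."""
    values = []
    for csv_path in sorted(Path(train_dir).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            for row in csv.reader(handle):
                if row:  # skip blank lines
                    values.append(float(row[-1]))
    if not values:
        raise ValueError(f"no training rows found in {train_dir}")
    model = {"mean_target": sum(values) / len(values), "rows": len(values)}

    # Everything written to the model directory is collected by SageMaker
    # as the job's model artifact.
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(model_dir) / "model.json", "w") as handle:
        json.dump(model, handle)
    return model


def main() -> dict:
    # SageMaker sets these variables inside the training container; the
    # fallbacks are the conventional container paths.
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    return train(train_dir, model_dir)

# In a real script, finish with:
#     if __name__ == "__main__":
#         main()
```

Setting both environment variables to local directories lets the same script run on a laptop before it is submitted.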
2. Submit a training job
python3 scripts/sagemaker_train.py \
--job-name my-experiment-001 \
--script ./train.py \
--role arn:aws:iam::ACCOUNT:role/SageMakerRole \
--bucket my-sagemaker-bucket \
--instance-type ml.g5.xlarge \
--spot \
--framework pytorch \
--input-data s3://my-bucket/data/train/ \
--hyperparameters '{"epochs":"50","lr":"0.001"}' \
--output-dir ./results
The script packages your code, uploads to S3, submits the job, polls until
complete, and downloads model artifacts to --output-dir.
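For orientation, the submit step presumably ends in a SageMaker CreateTrainingJob call. The sketch below builds one plausible request for that API; it is not taken from sagemaker_train.py. The helper name build_training_request, the output/ prefix, the 30 GB volume size, and the choice of twice the runtime as the spot wait limit are all assumptions.

```python
def build_training_request(job_name, image_uri, role_arn, bucket, instance_type,
                           inputs, hyperparameters, max_runtime=3600, spot=False):
    """Build keyword arguments for the SageMaker CreateTrainingJob API."""
    request = {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # One channel per input; each channel is mounted in the container.
        "InputDataConfig": [
            {
                "ChannelName": channel,
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
            for channel, uri in inputs.items()
        ],
        # Assumed layout: artifacts go under an output/ prefix in the bucket.
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,  # assumed size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime},
        # The API only accepts string values for hyperparameters.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
        "EnableManagedSpotTraining": spot,
    }
    if spot:
        # Spot jobs need a wait limit no smaller than the runtime limit;
        # doubling the runtime is an arbitrary choice for this sketch.
        request["StoppingCondition"]["MaxWaitTimeInSeconds"] = 2 * max_runtime
    return request
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("sagemaker").create_training_job(**request).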
3. Check cost
# Estimate before running
python3 scripts/sagemaker_cost.py --instance-type ml.g5.xlarge --duration 3600 --spot
# Check actual cost after job completes
python3 scripts/sagemaker_cost.py --job-name my-experiment-001
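The arithmetic behind both checks is simple enough to sketch. The functions below are illustrative and are not the implementation of sagemaker_cost.py. The spot discount is an input you supply rather than a value AWS guarantees. The savings formula uses the TrainingTimeInSeconds and BillableTimeInSeconds fields that DescribeTrainingJob returns for a finished job.

```python
def estimate_cost(hourly_rate_usd: float, duration_seconds: float,
                  spot_discount: float = 0.0) -> float:
    """Estimate job cost in USD from an hourly rate and a duration.

    spot_discount is the assumed fraction saved by spot (0.0 = on-demand).
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return hourly_rate_usd * (duration_seconds / 3600) * (1 - spot_discount)


def spot_savings_percent(training_seconds: float, billable_seconds: float) -> float:
    """Savings of a finished spot job: (1 - billable / training) * 100."""
    if training_seconds <= 0:
        return 0.0
    return (1 - billable_seconds / training_seconds) * 100


# One hour on ml.g5.xlarge at the $1.41/hr rate quoted under Instance Selection,
# first on-demand, then with an assumed 50% spot discount.
on_demand = estimate_cost(1.41, 3600)
with_spot = estimate_cost(1.41, 3600, spot_discount=0.5)
```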
4. List recent jobs
python3 scripts/sagemaker_list.py --max 5
python3 scripts/sagemaker_list.py --status Failed
Key Options
| Flag | Purpose | Default |
|---|---|---|
| --spot | Managed spot training (up to 70% savings) | off |
| --instance-type | Compute instance | ml.g5.xlarge |
| --max-runtime | Kill job after N seconds | 3600 |
| --framework | pytorch, tensorflow, sklearn, xgboost | pytorch |
| --image-uri | Custom Docker image (overrides framework) | auto |
| --requirements | requirements.txt for extra deps | none |
| --dry-run | Print config without submitting | off |
| --no-wait | Submit and exit without polling | off |
| --resume JOB | Reconnect to a running/completed job (skip submission) | none |
| --source-dir | Directory with all training code | script's parent |
| --input-data | S3 input(s), format: channel:s3://... | none |
| --env | JSON environment variables | {} |
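
To make the --input-data and JSON flag formats concrete, here is a hedged sketch of how such values could be parsed. The helper names are hypothetical. Treating a bare s3:// URI as the train channel is an assumption based on the Quick Start command, not confirmed behavior of the script.

```python
import json


def parse_input_data(value: str, default_channel: str = "train") -> tuple:
    """Split 'channel:s3://bucket/prefix' into (channel, uri).

    A bare 's3://...' value is assigned to default_channel (an assumption).
    """
    if value.startswith("s3://"):
        return default_channel, value
    channel, sep, uri = value.partition(":")
    if not sep or not channel or not uri.startswith("s3://"):
        raise ValueError(f"expected channel:s3://..., got {value!r}")
    return channel, uri


def parse_json_mapping(value: str) -> dict:
    """Parse a JSON object flag (--env, --hyperparameters) into str -> str."""
    parsed = json.loads(value)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return {str(key): str(val) for key, val in parsed.items()}
```

Values are converted to strings because both container environment variables and SageMaker hyperparameters are passed as strings.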
Instance Selection
For tabular/Kaggle workloads:
- Gradient boosting (LightGBM/XGBoost): ml.m5.2xlarge (CPU, $0.54/hr)
- Small neural nets: ml.g4dn.xlarge (T4, $0.74/hr), cheapest GPU
- Standard deep learning: ml.g5.xlarge (A10G, $1.41/hr), best price/performance
- Heavy training: ml.p3.2xlarge (V100, $4.28/hr)
Always use --spot for non-urgent training — typical savings of 30-70%.
Workflow Integration
For autonomous agents running training jobs in a loop:
- Prepare data locally or upload to S3
- Write training script following the contract in references/training-scripts.md
- Use --dry-run first to validate config
- Submit with sagemaker_train.py, which blocks until completion by default
- Results download automatically to --output-dir
- Parse metrics from the output for experiment tracking
For parallel experiments, use --no-wait and poll with sagemaker_list.py.
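An agent that prefers to poll in-process rather than shell out to sagemaker_list.py could use a loop like the one below. This is a sketch under assumptions: wait_for_jobs and its get_status callback are hypothetical names, and the status lookup is injected so the loop can be exercised without AWS access.

```python
import time

# Training job statuses after which the job will not change again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(job_names, get_status, poll_seconds=30.0, sleep=time.sleep):
    """Poll until every job reaches a terminal status.

    get_status(name) must return the job's current status string.
    Returns a dict mapping each job name to its final status.
    """
    pending = list(job_names)
    final = {}
    while pending:
        still_running = []
        for name in pending:
            status = get_status(name)
            if status in TERMINAL_STATUSES:
                final[name] = status
            else:
                still_running.append(name)
        pending = still_running
        if pending:
            # Only sleep when at least one job is still running.
            sleep(poll_seconds)
    return final
```

With boto3, get_status could be `lambda name: client.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]`, where client is `boto3.client("sagemaker")`.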
Smoke Test
Verify the entire pipeline works end-to-end (~$0.01, takes ~3 min):
python3 scripts/sagemaker_smoke_test.py \
--role arn:aws:iam::ACCOUNT:role/SageMakerTrainingExecutionRole \
--bucket my-sagemaker-bucket
This runs a local pre-flight, submits a minimal job to SageMaker, verifies
the downloaded model artifact, and checks cost. Use --keep to preserve output files.