ml-model-training

Train machine learning models with proper data handling and evaluation.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "ml-model-training" with this command: npx skills add secondsky/claude-skills/secondsky-claude-skills-ml-model-training

ML Model Training

Train machine learning models with proper data handling and evaluation.

Training Workflow

  1. Data Preparation → 2. Feature Engineering → 3. Model Selection → 4. Training → 5. Evaluation
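The five stages above can be sketched end to end with scikit-learn's Pipeline. This is a minimal illustration on synthetic data; the dataset and model choice are placeholders, not part of the skill itself:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A Pipeline bundles preprocessing and model, so the scaler is
# fit only on training data (helps avoid data leakage)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```

Bundling the scaler into the pipeline also means cross-validation refits it per fold, which the Known Issues section below discusses in more detail.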

Data Preparation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

Load and clean data

df = pd.read_csv('data.csv')
df = df.dropna()

Encode categorical variables

le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

Split data (70/15/15)

X = df.drop('target', axis=1)
y = df['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Scale features

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

Scikit-learn Training

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

PyTorch Training

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)

model = Model(X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

# Convert the scaled arrays to tensors (BCELoss expects float targets
# with the same shape as the model output)
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()

Evaluation Metrics

Task           | Metrics
Classification | Accuracy, Precision, Recall, F1, AUC-ROC
Regression     | MSE, RMSE, MAE, R²
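As a quick illustration, the classification metrics in the table can all be computed with sklearn.metrics. The labels and probabilities below are invented purely for the example:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Toy ground-truth labels, hard predictions, and predicted P(class=1)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8, 0.6, 0.7, 0.95]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # AUC-ROC needs probabilities, not labels
```

Note that AUC-ROC is the one metric here that requires scores or probabilities rather than hard class labels.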

Complete Framework Examples

PyTorch: See references/pytorch-training.md for complete training with:

  • Custom model classes with BatchNorm and Dropout

  • Training/validation loops with early stopping

  • Learning rate scheduling

  • Model checkpointing

  • Full evaluation with classification report

TensorFlow/Keras: See references/tensorflow-keras.md for:

  • Sequential model architecture

  • Callbacks (EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard)

  • Training history visualization

  • TFLite conversion for mobile deployment

  • Custom training loops

Best Practices

Do:

  • Use cross-validation for robust evaluation

  • Track experiments with MLflow

  • Save model checkpoints regularly

  • Monitor for overfitting

  • Document hyperparameters

  • Use 70/15/15 train/val/test split
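The cross-validation point above can be sketched with cross_val_score, which reports a score per fold instead of a single, possibly lucky, split (synthetic data as a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# 5-fold cross-validation: each sample is held out exactly once
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
mean_acc = scores.mean()
```

Reporting the mean and standard deviation of the fold scores gives a more honest picture of model quality than one split.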

Don't:

  • Train without a validation set

  • Ignore class imbalance

  • Skip feature scaling

  • Use test set for hyperparameter tuning

  • Forget to set random seeds

Known Issues Prevention

  1. Data Leakage

Problem: Scaling or transforming data before splitting leads to test set information leaking into training.

Solution: Always split data first, then fit transformers only on training data:

✅ Correct: Fit on train, transform train/val/test

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)    # Only transform
X_test = scaler.transform(X_test)  # Only transform

❌ Wrong: Fitting on all data

X_all = scaler.fit_transform(X) # Leaks test info!

  2. Class Imbalance Ignored

Problem: Training on imbalanced datasets (e.g., 95% class A, 5% class B) leads to models that predict only the majority class.

Solution: Use class weights or resampling:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

Compute class weights

class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# Pass explicit weights, or use the 'balanced' shortcut directly:
model = RandomForestClassifier(class_weight='balanced', random_state=42)

Or use SMOTE for oversampling minority class

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

  3. Overfitting Due to No Regularization

Problem: Complex models memorize training data, perform poorly on validation/test sets.

Solution: Add regularization techniques:

Dropout in PyTorch

nn.Dropout(0.3)

Limit model complexity in scikit-learn trees

RandomForestClassifier(max_depth=10, min_samples_split=20)
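Tree-based models are regularized by limiting depth and split size, as above; for models that support it, L2 (weight) regularization is controlled via a penalty strength instead. A minimal sketch with LogisticRegression on synthetic data (parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# penalty='l2' is the default; smaller C means stronger regularization
clf = LogisticRegression(penalty='l2', C=0.1, max_iter=1000)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Tuning C on the validation set (never the test set) balances underfitting against overfitting.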

Early stopping in Keras

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])

  4. Not Setting Random Seeds

Problem: Results are not reproducible across runs, making debugging and comparison impossible.

Solution: Set all random seeds:

import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

  5. Using Test Set for Hyperparameter Tuning

Problem: Optimizing hyperparameters on test set leads to overfitting to test data.

Solution: Use validation set for tuning, test set only for final evaluation:

from sklearn.model_selection import GridSearchCV

✅ Correct: Tune on train+val, evaluate on test

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Cross-validation on training set
best_model = grid_search.best_estimator_

Final evaluation on held-out test set

final_score = best_model.score(X_test, y_test)

When to Load References

Load reference files when you need:

  • PyTorch implementation details: Load references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing

  • TensorFlow/Keras patterns: Load references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • tailwind-v4-shadcn (no summary provided by upstream source)

  • aceternity-ui (no summary provided by upstream source)

  • playwright (no summary provided by upstream source)