Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

Write concise, technical responses with accurate Python examples
Prioritize reproducibility in machine learning workflows
Use functional programming for data pipelines
Use object-oriented programming for custom estimators
Prefer vectorized operations over explicit loops
Follow PEP 8 style guidelines
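
As a minimal sketch of the object-oriented approach to custom estimators, the transformer below clips each feature to quantiles learned during fit. The class name QuantileClipper and its parameters lower and upper are hypothetical, chosen only for illustration; the base classes and validation helpers are scikit-learn's own.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip features to quantiles learned in fit."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = check_array(X)
        # Learned attributes end with an underscore by convention.
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Vectorized clipping, no explicit Python loop over rows.
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


X = np.arange(101, dtype=float).reshape(-1, 1)
clipped = QuantileClipper().fit_transform(X)
```

Because the bounds are learned in fit, the transformer can be placed inside a Pipeline and will only ever see training data when fitted.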

Machine Learning Workflow

Data Preparation

Split data into train/validation/test sets before fitting any preprocessing step
Use train_test_split() with random_state for reproducibility
Stratify splits for imbalanced classification: stratify=y
Keep test set completely separate until final evaluation
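
The splitting guidance above can be sketched as two calls to train_test_split. The synthetic dataset and the 60/20/20 proportions are assumptions made for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% / 10%), used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# First carve off the test set, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
```

Stratifying both splits keeps the class ratio close to the original in all three subsets, and the fixed random_state makes the split repeatable.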

Feature Engineering

Scale features appropriately for distance-based algorithms
Use StandardScaler for normally distributed features
Use MinMaxScaler for bounded features
Use RobustScaler for data with outliers
Encode categorical features with OneHotEncoder or OrdinalEncoder; reserve LabelEncoder for the target variable
Handle missing values: SimpleImputer, KNNImputer
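
A small sketch of imputation, scaling, and encoding, assuming tiny hand-made arrays. The point it illustrates is that each transformer is fitted on training data and only applied to test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numeric feature with missing values (illustrative data only).
X_train_num = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test_num = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit on training data, then reuse the learned statistics on test data.
X_train_scaled = scaler.fit_transform(imputer.fit_transform(X_train_num))
X_test_scaled = scaler.transform(imputer.transform(X_test_num))

# Toy categorical feature; "purple" never appears in training.
X_train_cat = np.array([["red"], ["blue"], ["red"], ["green"]])
X_test_cat = np.array([["blue"], ["purple"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train_cat).toarray()
X_test_encoded = encoder.transform(X_test_cat).toarray()
```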

Pipelines

Always use Pipeline to chain preprocessing and modeling
Prevents data leakage by fitting transformers only on training data
Makes code cleaner and more reproducible
Enables easy deployment and serialization

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Each step is a (name, estimator) pair; the final step is the model.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

Column Transformers

Use ColumnTransformer for different preprocessing per feature type
Combine numeric and categorical preprocessing in single pipeline
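
One way to combine numeric and categorical preprocessing is sketched below. The DataFrame, its column names (age, income, city), and the choice of LogisticRegression are all hypothetical and serve only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "income": [40_000, 52_000, 88_000, 61_000, np.nan, 45_000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
})
y = np.array([0, 0, 1, 1, 1, 0])

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric columns: impute missing values, then scale.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route each group of columns to its own preprocessing.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(df, y)
predictions = model.predict(df)
```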

Model Selection and Tuning

Cross-Validation

Use cross-validation for reliable performance estimates
cross_val_score() for quick evaluation
cross_validate() for multiple metrics
Use appropriate CV strategy:
- KFold for regression
- StratifiedKFold for classification
- TimeSeriesSplit for temporal data
- GroupKFold for grouped data
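
The strategies listed above can be sketched as follows, assuming synthetic data and hypothetical group ids. The preprocessing sits inside the pipeline so that it is refitted on each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GroupKFold,
    StratifiedKFold,
    TimeSeriesSplit,
    cross_validate,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])
mean_accuracy = results["test_accuracy"].mean()
mean_f1 = results["test_f1"].mean()

# TimeSeriesSplit: every training index precedes every test index.
X_time = np.arange(12).reshape(-1, 1)
time_splits = list(TimeSeriesSplit(n_splits=3).split(X_time))

# GroupKFold: samples sharing a group id never straddle train and test.
groups = np.repeat(np.arange(6), 2)  # hypothetical ids, e.g. one per patient
group_splits = list(GroupKFold(n_splits=3).split(X_time, groups=groups))
```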

Hyperparameter Tuning

Use GridSearchCV for exhaustive search
Use RandomizedSearchCV for large parameter spaces
Always tune on training/validation data, never test data
Set n_jobs=-1 for parallel processing
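
A minimal sketch of both search strategies on a held-out training split. The dataset, the parameter ranges, and the N_JOBS constant are assumptions for illustration; N_JOBS is set to 1 here so the sketch runs anywhere, and -1 would use all available cores.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

N_JOBS = 1  # set to -1 to run the search on all available cores

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Exhaustive search over a small grid; step parameters use "<step>__<param>".
grid_search = GridSearchCV(
    pipeline,
    param_grid={"classifier__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
    n_jobs=N_JOBS,
)
grid_search.fit(X_train, y_train)

# Randomized search samples from a distribution instead of a fixed grid.
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions={"classifier__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=0,
    n_jobs=N_JOBS,
)
random_search.fit(X_train, y_train)

# The test set is touched once, after tuning; score() uses the F1 scorer here.
test_f1 = grid_search.score(X_test, y_test)
```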

Model Evaluation

Classification Metrics

Use appropriate metrics for your problem:
- accuracy_score for balanced classes
- precision_score, recall_score, f1_score for imbalanced
- roc_auc_score for ranking ability
Use classification_report() for comprehensive overview
Examine confusion_matrix() for error analysis
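
The metrics above can be computed as in this sketch, which assumes small hand-written label and score lists rather than real model output.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Illustrative labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.9, 0.8, 0.45, 0.35]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC AUC needs scores or probabilities, not hard class predictions.
roc_auc = roc_auc_score(y_true, y_score)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

report = classification_report(y_true, y_pred)
print(report)
```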

Regression Metrics

mean_squared_error (MSE) for general use
mean_absolute_error (MAE) for interpretability
r2_score for the proportion of variance explained (coefficient of determination)
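
A short sketch of the three regression metrics on made-up values. RMSE is derived from MSE with a square root so that it shares the target's units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative targets and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors more
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # 1.0 is a perfect fit
```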

Evaluation Best Practices

Report confidence intervals, not just point estimates
Use multiple metrics to understand model behavior
Compare against meaningful baselines
Evaluate on held-out test set only once, at the end
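
One possible way to follow these practices is to compare against a DummyClassifier baseline and attach a percentile bootstrap interval to the test score. The dataset, the 1000 resamples, and the 95% level are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
y_pred = model.predict(X_test)
model_accuracy = accuracy_score(y_test, y_pred)

# Percentile bootstrap: resample test rows with replacement and rescore.
rng = np.random.default_rng(0)
bootstrap_scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), size=len(y_test))
    bootstrap_scores.append(accuracy_score(y_test[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(bootstrap_scores, [2.5, 97.5])
```

The interval reflects only test-set sampling variability; it does not account for variation from retraining the model.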

Handling Imbalanced Data

Use stratified splitting and cross-validation
Consider class weights: class_weight='balanced'
Use appropriate metrics (F1, AUC-PR, not accuracy)
Adjust decision threshold based on business needs
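
The points above can be sketched with class weights, a precision-recall summary, and a custom threshold. The value 0.3 is a hypothetical threshold; in practice it would be chosen on validation data according to the cost of each error type.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]

# Average precision summarizes the precision-recall curve (AUC-PR).
pr_auc = average_precision_score(y_test, proba)

# Lowering the threshold trades precision for recall.
threshold = 0.3  # hypothetical value
pred_default = (proba >= 0.5).astype(int)
pred_lowered = (proba >= threshold).astype(int)
recall_default = recall_score(y_test, pred_default)
recall_lowered = recall_score(y_test, pred_lowered)
```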

Feature Selection

Use SelectKBest with statistical tests
Use RFE (Recursive Feature Elimination)
Use model-based selection: SelectFromModel
Examine feature importances from tree-based models
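
A brief sketch of the three selectors on synthetic data. The number of features to keep (5) and the median threshold are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Univariate selection with an ANOVA F-test.
kbest = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Recursive feature elimination around a linear model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Model-based selection: keep features at or above the median importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median").fit(X, y)
importances = selector.estimator_.feature_importances_
X_selected = selector.transform(X)
```

In a real workflow the selector would normally be a pipeline step, so that it is fitted only on training folds.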

Model Persistence

Use joblib for saving and loading models
Save entire pipelines, not just models
Version control model artifacts
Document model metadata
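
A minimal sketch of saving a whole pipeline with joblib and recording basic metadata. The file name pipeline.joblib and the metadata keys are hypothetical; a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "pipeline.joblib")
    # Persist preprocessing and model together as one artifact.
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# Hypothetical metadata record to store next to the artifact.
metadata = {
    "sklearn_version": sklearn.__version__,
    "n_features": X.shape[1],
    "params": pipeline.get_params()["classifier"].get_params(),
}
```

Recording the scikit-learn version matters because a pickled model is not guaranteed to load correctly under a different version, and joblib files should only be loaded from trusted sources.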

Performance Optimization

Use n_jobs=-1 for parallel processing where available
Consider warm_start=True for iterative training
Use sparse matrices for high-dimensional sparse data
Consider incremental learning with partial_fit() for large data
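
Incremental learning with partial_fit() can be sketched as below, assuming an in-memory array stands in for data that would normally be streamed from disk in chunks. SGDClassifier is one of the estimators that supports partial_fit().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# All class labels must be declared on the first partial_fit call.
classes = np.unique(y)

model = SGDClassifier(random_state=0)
batch_size = 100  # illustrative chunk size
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Each call updates the existing weights instead of refitting from scratch.
    model.partial_fit(X_batch, y_batch, classes=classes)

train_accuracy = model.score(X, y)
```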

Key Conventions

Import from submodules: from sklearn.ensemble import RandomForestClassifier
Set random_state for reproducibility
Use pipelines to prevent data leakage
Document model choices and hyperparameters
