scikit-learn

Scikit-learn Machine Learning

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "scikit-learn" with this command: npx skills add eyadsibai/ltk/eyadsibai-ltk-scikit-learn

Industry-standard Python library for classical machine learning.

When to Use

  • Classification or regression tasks

  • Clustering or dimensionality reduction

  • Preprocessing and feature engineering

  • Model evaluation and cross-validation

  • Hyperparameter tuning

  • Building ML pipelines

Algorithm Selection

Classification

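| Algorithm | Best For | Strengths |
| --- | --- | --- |
| Logistic Regression | Baseline, interpretable | Fast, probabilistic outputs |
| Random Forest | General purpose | Handles non-linear relationships, exposes feature importance |
| Gradient Boosting | Highest accuracy on many tabular problems | State of the art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No explicit training phase (instance-based), but prediction is slow on large datasets |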
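A minimal sketch of how the classifiers listed above might be compared with cross-validation. The synthetic dataset, the hyperparameters, and the `candidates` dictionary are assumptions made for illustration; rankings on real data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for a real problem (assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean 5-fold accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Starting from the logistic regression baseline and only moving to heavier models when they clearly win keeps the comparison honest.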

Regression

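| Algorithm | Best For | Notes |
| --- | --- | --- |
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | Ridge uses an L2 penalty; Lasso uses an L1 penalty and can zero out coefficients |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Highest accuracy on many tabular problems | Built in as GradientBoostingRegressor and HistGradientBoostingRegressor; XGBoost and LightGBM are separate libraries with scikit-learn-compatible wrappers |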
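To illustrate the practical difference between the L2 and L1 penalties, the sketch below fits Ridge and Lasso on synthetic data where only a few features are informative. The dataset and the `alpha` values are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target (synthetic assumption).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=5.0).fit(X, y)   # L1: can set coefficients exactly to zero

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("coefficients exactly zero, Ridge:", n_zero_ridge)
print("coefficients exactly zero, Lasso:", n_zero_lasso)
```

Lasso's sparsity makes it useful as a rough feature selector, while Ridge tends to be the safer choice when many features carry a little signal each.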

Clustering

| Algorithm | Best For | Key Parameter |
| --- | --- | --- |
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (neighborhood radius that sets the density threshold) |
| Agglomerative | Hierarchical | n_clusters or distance_threshold |
| Gaussian Mixture | Soft clustering | n_components |
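The contrast between spherical and arbitrary-shape clustering can be sketched on the two-moons toy dataset. The `eps` and `min_samples` values are assumptions tuned to this synthetic data; real datasets need their own values.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaving half-circles: non-spherical clusters.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# KMeans needs the number of clusters up front and assumes roughly round clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Adjusted Rand index against the known generating labels (1.0 is a perfect match).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI: {ari_dbscan:.2f}")
```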

Dimensionality Reduction

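| Method | Preserves | Use Case |
| --- | --- | --- |
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Local structure and some global structure | Visualization and downstream models; not part of scikit-learn (provided by the separate umap-learn package) |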
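As a small illustration of PCA for feature reduction, the sketch below projects the bundled Iris dataset onto two components after scaling. The choice of dataset and of two components is an assumption for demonstration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is variance-based, so features are standardized first.
pca = PCA(n_components=2)
X_2d = make_pipeline(StandardScaler(), pca).fit_transform(X)

# Fraction of the total variance kept by the two components.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape)
print(f"variance explained by 2 components: {explained:.3f}")
```

Checking `explained_variance_ratio_` before settling on `n_components` is a cheap way to see how much information the projection discards.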

Pipeline Concepts

Key concept: Pipelines help prevent data leakage. When the whole pipeline is passed to cross-validation or a hyperparameter search, every transformer is fit only on the training portion of each split.

| Component | Purpose |
| --- | --- |
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |

Common preprocessing flow:

  • Impute missing values (SimpleImputer)

  • Scale numeric features (StandardScaler, MinMaxScaler)

  • Encode categoricals (OneHotEncoder, OrdinalEncoder)

  • Optional: feature selection or polynomial features
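The preprocessing flow above can be sketched as a single pipeline. The toy DataFrame, its column names (`age`, `income`, `city`), and the target rule are hypothetical and exist only to make the example runnable; pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in one numeric column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[df.sample(frac=0.1, random_state=0).index, "age"] = np.nan
y = (df["income"] > 50_000).astype(int)  # hypothetical target

# Numeric columns: impute, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Route each column group to its own transformer.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Split first; the pipeline then learns imputation and scaling from train only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the imputer and scaler live inside the pipeline, calling `model.predict` on new rows reuses the statistics learned from the training split.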

Model Evaluation

Cross-Validation Strategies

| Strategy | Use Case |
| --- | --- |
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
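A short sketch of what two of these splitters guarantee, using a hypothetical 100-sample dataset with 10 positives. StratifiedKFold keeps the class ratio in every fold, and TimeSeriesSplit never trains on samples that come after the test window.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Each of the 5 test folds receives the same share of positives.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test].sum()) for _, test in skf.split(X, y)]
print(positives_per_fold)  # [2, 2, 2, 2, 2]

# Training indices always precede test indices, so no future data leaks in.
tscv = TimeSeriesSplit(n_splits=4)
boundaries = [(int(train.max()), int(test.min())) for train, test in tscv.split(X)]
print(boundaries)
```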

Metrics

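| Task | Metric | When to Use |
| --- | --- | --- |
| Classification | Accuracy | Balanced classes |
| Classification | F1-score | Imbalanced classes |
| Classification | ROC-AUC | Ranking quality across all decision thresholds |
| Classification | Precision/Recall | Domain-specific costs of false positives vs. false negatives |
| Regression | RMSE | Penalize large errors |
| Regression | MAE | Less sensitive to outliers than RMSE |
| Regression | R² | Fraction of variance explained |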
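The following sketch uses made-up predictions to show why the metric choice matters: a classifier that always predicts the majority class scores high accuracy but zero F1, and a single large regression error moves RMSE far more than MAE. All arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
)

# Imbalanced labels and a model that always predicts the majority class.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)                 # 0.95
f1 = f1_score(y_true, y_pred, zero_division=0)       # 0.0
bal_acc = balanced_accuracy_score(y_true, y_pred)    # 0.5
print(acc, f1, bal_acc)

# Regression targets with one large miss on the last sample.
y_reg_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_reg_pred = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

mae = mean_absolute_error(y_reg_true, y_reg_pred)            # 10.0
rmse = np.sqrt(mean_squared_error(y_reg_true, y_reg_pred))   # about 22.36
print(mae, rmse)
```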

Hyperparameter Tuning

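| Method | Pros | Cons |
| --- | --- | --- |
| GridSearchCV | Exhaustive | Slow when the grid has many parameters |
| RandomizedSearchCV | Faster | May miss the optimum |
| HalvingGridSearchCV | Efficient (successive halving) | Requires scikit-learn 0.24+; experimental, so it must be enabled with "from sklearn.experimental import enable_halving_search_cv" before import |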

Key concept: Tune hyperparameters with cross-validation or a separate validation set drawn from the training data, then evaluate the final model once on a held-out test set.
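One way to follow this rule is to hold out a test set first, run the search on the training portion only, and score the refit best model once at the end. The dataset, the small parameter grid, and the F1 scoring choice below are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Step parameters are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

# 5-fold cross-validation on the training data plays the role of the validation set.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# The best pipeline is refit on all training data; score it once on the test set.
test_score = search.score(X_test, y_test)
print(search.best_params_)
print(f"held-out F1: {test_score:.3f}")
```

Swapping `GridSearchCV` for `RandomizedSearchCV` keeps the same structure when the grid grows too large to search exhaustively.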

Best Practices

| Practice | Why |
| --- | --- |
| Split data first | Prevent leakage |
| Use pipelines | Reproducible, no leakage |
| Scale features for scale-sensitive methods | KNN, SVM, and PCA depend on feature magnitudes |
| Stratify splits for imbalanced classes | Preserve class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over/underfitting |

Common Pitfalls

| Pitfall | Solution |
| --- | --- |
| Fitting the scaler on all data | Use a pipeline, or fit only on the training set |
| Using accuracy on imbalanced classes | Use F1, ROC-AUC, or balanced accuracy |
| Tuning too many hyperparameters at once | Start simple, add complexity |
| Ignoring feature importance | Use feature_importances_ or permutation importance |
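As a sketch of the last row, the example below reads both the impurity-based `feature_importances_` of a random forest and the permutation importance measured on held-out data. The synthetic dataset is an assumption; with `shuffle=False`, `make_classification` places the three informative features in columns 0 to 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0-2 are informative; columns 3-7 are pure noise (synthetic assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances, computed from the training data.
impurity_importance = forest.feature_importances_

# Permutation importance: drop in test score when one column is shuffled.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
top_feature = int(result.importances_mean.argmax())
print("impurity-based:", impurity_importance.round(3))
print("permutation:   ", result.importances_mean.round(3))
print("top feature by permutation importance:", top_feature)
```

Permutation importance is computed on held-out data, so it reflects what the model relies on for generalization rather than what it memorized during training.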

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
