ai-ml-data-science

Data Science Engineering Suite - Quick Reference

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "ai-ml-data-science" with this command: npx skills add vasilyu1983/ai-agents-public/vasilyu1983-ai-agents-public-ai-ml-data-science


This skill turns raw data and questions into validated, documented models ready for production:

  • EDA workflows: Structured exploration with drift detection

  • Feature engineering: Reproducible feature pipelines with leakage prevention and train/serve parity

  • Model selection: Baselines first; strong tabular defaults; escalate complexity only when justified

  • Evaluation & reporting: Slice analysis, uncertainty, model cards, production metrics

  • SQL transformation: SQLMesh for staging/intermediate/marts layers

  • MLOps: CI/CD, CT (continuous training), CM (continuous monitoring)

  • Production patterns: Data contracts, lineage, feedback loops, streaming features

Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.

Quick Reference

| Task | Tool/Framework | Command | When to Use |
| --- | --- | --- | --- |
| EDA & Profiling | Pandas, Great Expectations | `df.describe()`, `ge.validate()` | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | `df.transform()`, Feast materialization | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | `lgb.train()`, `model.fit()` | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | `optuna.create_study()`, `tune.run()` | Optimizing model parameters |
| SQL Transformation | SQLMesh | `sqlmesh plan`, `sqlmesh run` | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | `mlflow.log_metric()`, `wandb.log()` | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | `metrics.roc_auc_score()`, slice analysis | Validating model performance |
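The baselines-first guidance can be made concrete with a minimal scikit-learn sketch; the synthetic dataset and split settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Baseline first: any real model must beat this to justify its complexity.
baseline = DummyClassifier(strategy="prior").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
auc_model = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"baseline AUC={auc_base:.3f}, model AUC={auc_model:.3f}")
```

If the gap between the two numbers is small, that usually points to a data or leakage problem, not a modeling one.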

Data Lake & Lakehouse

For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:

  • Table formats: Apache Iceberg, Delta Lake, Apache Hudi

  • Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks

  • Alternative transformation: dbt (instead of SQLMesh)

  • Ingestion: dlt, Airbyte (connectors)

  • Streaming: Apache Kafka patterns

  • Orchestration: Dagster, Airflow

This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.

Related Skills

For adjacent topics, reference:

  • ai-mlops - APIs, batch jobs, monitoring, drift, data ingestion (dlt)

  • ai-llm - LLM prompting, fine-tuning, evaluation

  • ai-rag - RAG pipelines, chunking, retrieval

  • ai-llm-inference - LLM inference optimization, quantization

  • ai-ml-timeseries - Time series forecasting, backtesting

  • qa-testing-strategy - Test-driven development, coverage

  • data-sql-optimization - SQL optimization, index patterns (complements SQLMesh)

  • data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)

Decision Tree: Choosing Data Science Approach

User needs ML for: [Problem Type]

  • Tabular data?

    • Small-medium (<1M rows)? -> LightGBM (fast, efficient)
    • Large and complex (>1M rows)? -> LightGBM first, then NN if needed
    • High-dim sparse (text, counts)? -> Linear models, then shallow NN
  • Time series?

    • Seasonality? -> LightGBM, then see ai-ml-timeseries
    • Long-term dependencies? -> Transformers (see ai-ml-timeseries)
  • Text or mixed modalities?

    • LLMs/Transformers -> See ai-llm
  • SQL transformations?

    • SQLMesh (staging/intermediate/marts layers)

Rule of thumb: For tabular data, tree-based gradient boosting is a strong baseline, but must be validated against alternatives and constraints.

Core Concepts (Vendor-Agnostic)

  • Problem framing: define success metrics, baselines, and decision thresholds before modeling.

  • Leakage prevention: ensure all features are available at prediction time; split by time/group when appropriate.

  • Uncertainty: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics.

  • Reproducibility: version code/data/features, fix seeds, and record the environment.

  • Operational handoff: define monitoring, retraining triggers, and rollback criteria with MLOps.
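The leakage-prevention bullet (split by time or group) can be sketched with scikit-learn's splitters; the arrays here are placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)

# Time-based split: every validation index comes strictly after training,
# so the model never sees the future.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < val_idx.min()

# Group-based split: all rows of a user land on one side only, so
# per-user correlations cannot leak across the split.
groups = np.repeat(np.arange(5), 4)  # 5 users, 4 rows each
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```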

Implementation Practices (Tooling Examples)

  • Track experiments and artifacts (run id, commit hash, data version).

  • Add data validation gates in pipelines (schema + distribution + freshness).

  • Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).

  • Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).
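A tool-agnostic sketch of the tracking practice above; in real pipelines MLflow or W&B stores this record, and the commit hash would come from `git rev-parse HEAD` (all values here are placeholders):

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def data_version(raw_bytes: bytes) -> str:
    """Content hash of the training data: cheap, deterministic version id."""
    return hashlib.sha256(raw_bytes).hexdigest()[:12]

run_record = {
    "run_id": uuid.uuid4().hex,
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "git_commit": "deadbeef",  # placeholder: `git rev-parse HEAD` in practice
    "data_version": data_version(b"...raw training file bytes..."),
    "params": {"learning_rate": 0.05, "num_leaves": 31},
    "metrics": {"val_auc": 0.87},
}
print(json.dumps(run_record, indent=2))
```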

Do / Avoid

Do

  • Do start with baselines and a simple model to expose leakage and data issues early.

  • Do run slice analysis and document failure modes before recommending deployment.

  • Do keep an immutable eval set; refresh training data without contaminating evaluation.

Avoid

  • Avoid random splits for temporal or user-correlated data.

  • Avoid "metric gaming" (optimizing the number without validating business impact).

  • Avoid training on labels created after the prediction timestamp (silent future leakage).

Core Patterns (Overview)

Pattern 1: End-to-End DS Project Lifecycle

Use when: Starting or restructuring any DS/ML project.

Stages:

  • Problem framing - Business objective, success metrics, baseline

  • Data & feasibility - Sources, coverage, granularity, label quality

  • EDA & data quality - Schema, missingness, outliers, leakage checks

  • Feature engineering - Per data type with feature store integration

  • Modelling - Baselines first, then LightGBM, then complexity as needed

  • Evaluation - Offline metrics, slice analysis, error analysis

  • Reporting - Model evaluation report + model card

  • MLOps - CI/CD, CT (continuous training), CM (continuous monitoring)

Detailed guide: EDA Best Practices

Pattern 2: Feature Engineering

Use when: Designing features before modelling or during model improvement.

By data type:

  • Numeric: Standardize, handle outliers, transform skew, scale

  • Categorical: One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)

  • Feature Store Integration: Store encoders, mappings, statistics centrally

  • Text: Cleaning, TF-IDF, embeddings, simple stats

  • Time: Calendar features, recency, rolling/lag features

Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.

Detailed guide: Feature Engineering Patterns
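A minimal pandas sketch of the lag/rolling/calendar bullets, with `shift(1)` enforcing point-in-time correctness (column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-01", "2025-01-02"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
}).sort_values(["user_id", "date"])

g = df.groupby("user_id")["amount"]
# shift(1) makes every feature use only *past* rows for its own user.
df["amount_lag_1"] = g.shift(1)
df["amount_roll_mean_2"] = g.transform(lambda s: s.shift(1).rolling(2).mean())
df["dow"] = df["date"].dt.dayofweek  # calendar feature
print(df)
```

The same `shift`-before-`rolling` discipline is what a feature store's point-in-time join automates at scale.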

Pattern 3: Data Contracts & Lineage

Use when: Building production ML systems with data quality requirements.

Components:

  • Contracts: Schema + ranges/nullability + freshness SLAs

  • Lineage: Track source -> feature store -> train -> serve

  • Feature store hygiene: Materialization cadence, backfill/replay, encoder versioning

  • Schema evolution: Backward/forward-compatible migrations with shadow runs

Detailed guide: Data Contracts & Lineage
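A minimal, library-free sketch of a contract check; Great Expectations or Pandera express the same rules declaratively, and every column name and threshold here is an assumption:

```python
import pandas as pd

CONTRACT = {
    "required_columns": {"user_id": "int64", "amount": "float64", "ts": "datetime64[ns]"},
    "ranges": {"amount": (0.0, 10_000.0)},
    "max_staleness": pd.Timedelta(hours=24),  # freshness SLA
}

def check_contract(df: pd.DataFrame, now: pd.Timestamp) -> list[str]:
    violations = []
    for col, dtype in CONTRACT["required_columns"].items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col, (lo, hi) in CONTRACT["ranges"].items():
        if col in df.columns and not df[col].between(lo, hi).all():
            violations.append(f"{col}: values outside [{lo}, {hi}]")
    if "ts" in df.columns and now - df["ts"].max() > CONTRACT["max_staleness"]:
        violations.append("freshness SLA breached")
    return violations

df = pd.DataFrame({
    "user_id": [1, 2],
    "amount": [10.0, -5.0],  # -5.0 violates the declared range
    "ts": pd.to_datetime(["2025-01-01", "2025-01-02"]),
})
print(check_contract(df, now=pd.Timestamp("2025-01-02 12:00")))
```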

Pattern 4: Model Selection & Training

Use when: Picking model families and starting experiments.

Decision guide (modern benchmarks):

  • Tabular: Start with a strong baseline (linear/logistic, then gradient boosting) and iterate based on error analysis

  • Baselines: Always implement simple baselines first (majority class, mean, naive forecast)

  • Train/val/test splits: Time-based (forecasting), group-based (user/item leakage), or random (IID)

  • Hyperparameter tuning: Start manual, then Bayesian optimization (Optuna, Ray Tune)

  • Overfitting control: Regularization, early stopping, cross-validation

Detailed guide: Modelling Patterns

Pattern 5: Evaluation & Reporting

Use when: Finalizing a model candidate or handing over to production.

Key components:

  • Metric selection: Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)

  • Threshold selection: ROC/PR curves, cost-sensitive, F1 maximization

  • Slice analysis: Performance by geography, user segments, product categories

  • Error analysis: Collect high-error examples, cluster by error type, identify systematic failures

  • Uncertainty: Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks

  • Evaluation report: 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)

  • Model card: Documentation for stakeholders (intended use, data, performance, ethics, operations)

Detailed guide: Evaluation Patterns
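Slice analysis in miniature: compute the primary metric per segment and compare. The segments and the deliberately noisy APAC scores below are synthetic, chosen to show how slicing surfaces a weak segment that the aggregate metric hides:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "segment": rng.choice(["EU", "US", "APAC"], size=n),
    "y_true": rng.integers(0, 2, size=n),
})
# Hypothetical model scores: informative everywhere except APAC.
noise = np.where(df["segment"] == "APAC", 1.0, 0.3)
df["y_score"] = df["y_true"] + rng.normal(0, noise)

slices = {
    seg: roc_auc_score(g["y_true"], g["y_score"])
    for seg, g in df.groupby("segment")
}
print(slices)
```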

Pattern 6: Reproducibility & MLOps

Use when: Ensuring experiments are reproducible and production-ready.

Modern MLOps (CI/CD/CT/CM):

  • CI (Continuous Integration): Automated testing, data validation, code quality

  • CD (Continuous Delivery): Environment-specific promotion (dev -> staging -> prod), canary deployment

  • CT (Continuous Training): Drift-triggered and scheduled retraining

  • CM (Continuous Monitoring): Real-time data drift, performance, system health

Versioning:

  • Code (git commit), data (DVC, lakeFS), features (feature store), models (MLflow Registry)

  • Seeds (reproducibility), hyperparameters (experiment tracker)

Detailed guide: Reproducibility Checklist

Pattern 7: Feature Freshness & Streaming

Use when: Managing real-time features and streaming pipelines.

Components:

  • Freshness contracts: Define freshness SLAs per feature, monitor lag, alert on breaches

  • Batch + stream parity: Same feature logic across batch/stream, idempotent upserts

  • Schema evolution: Version schemas, add forward/backward-compatible parsers, backfill with rollback

  • Data quality gates: PII/format checks, range checks, distribution drift (KL, KS, PSI)

Detailed guide: Feature Freshness & Streaming
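A from-scratch sketch of the PSI drift check above (Evidently computes this for you; the 0.1/0.25 thresholds in the comment are a common rule of thumb, not a standard):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    # Bin edges come from the reference (training) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
print(psi(train, rng.normal(0.0, 1, 10_000)))  # same distribution -> near 0
print(psi(train, rng.normal(0.5, 1, 10_000)))  # shifted mean -> much larger
# Rule of thumb (assumption): < 0.1 stable, 0.1-0.25 watch, > 0.25 drift
```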

Pattern 8: Production Feedback Loops

Use when: Capturing production signals and implementing continuous improvement.

Components:

  • Signal capture: Log predictions + user edits/acceptance/abandonment (scrub PII)

  • Labeling: Route failures/edge cases to human review, create balanced sets

  • Dataset refresh: Periodic refresh (weekly/monthly) with lineage, protect eval set

  • Online eval: Shadow/canary new models, track solve rate, calibration, cost, latency

Detailed guide: Production Feedback Loops

Resources (Detailed Guides)

For comprehensive operational patterns and checklists, see:

  • EDA Best Practices - Structured workflow for exploratory data analysis

  • Feature Engineering Patterns - Operational patterns by data type

  • Data Contracts & Lineage - Data quality, versioning, feature store ops

  • Modelling Patterns - Model selection, hyperparameter tuning, train/test splits

  • Evaluation Patterns - Metrics, slice analysis, evaluation reports, model cards

  • Reproducibility Checklist - Experiment tracking, MLOps (CI/CD/CT/CM)

  • Feature Freshness & Streaming - Real-time features, schema evolution

  • Production Feedback Loops - Online learning, labeling, canary deployment

  • Class Imbalance Patterns - Resampling, cost-sensitive learning, threshold tuning, evaluation for skewed datasets

  • Hyperparameter Optimization - Bayesian optimization, early stopping, search strategies, budget allocation

  • Interpretability & Explainability - SHAP, LIME, feature importance, model cards for regulated domains

Templates

Use these as copy-paste starting points:

Project & Workflow Templates

  • Standard DS project template: assets/project/template-standard.md

  • Quick DS experiment template: assets/project/template-quick.md

Feature Engineering & EDA

  • Feature engineering template: assets/features/template-feature-engineering.md

  • EDA checklist & notebook template: assets/eda/template-eda.md

Evaluation & Reporting

  • Model evaluation report: assets/evaluation/template-evaluation-report.md

  • Model card: assets/evaluation/template-model-card.md

  • ML experiment review: assets/review/experiment-review-template.md

SQL Transformation (SQLMesh)

For SQL-based data transformation and feature engineering:

  • SQLMesh project setup: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md

  • SQLMesh model types: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md (FULL, INCREMENTAL, VIEW)

  • Incremental models: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md

  • DAG and dependencies: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md

  • Testing and data quality: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md

Use SQLMesh when:

  • Building SQL-based feature pipelines

  • Managing incremental data transformations

  • Creating staging/intermediate/marts layers

  • Testing SQL logic with unit tests and audits

For data ingestion (loading raw data), use:

  • ai-mlops skill (dlt templates for REST APIs, databases, warehouses)

Navigation

Resources

  • references/reproducibility-checklist.md

  • references/evaluation-patterns.md

  • references/feature-engineering-patterns.md

  • references/modelling-patterns.md

  • references/feature-freshness-streaming.md

  • references/eda-best-practices.md

  • references/data-contracts-lineage.md

  • references/production-feedback-loops.md

  • references/class-imbalance-patterns.md

  • references/hyperparameter-optimization.md

  • references/interpretability-explainability.md

Templates

  • assets/project/template-standard.md

  • assets/project/template-quick.md

  • assets/features/template-feature-engineering.md

  • assets/eda/template-eda.md

  • assets/evaluation/template-evaluation-report.md

  • assets/evaluation/template-model-card.md

  • assets/review/experiment-review-template.md

  • template-sqlmesh-project.md

  • template-sqlmesh-model.md

  • template-sqlmesh-incremental.md

  • template-sqlmesh-dag.md

  • template-sqlmesh-testing.md

Data

  • data/sources.json - Curated external references

External Resources

See data/sources.json for curated foundational and implementation references:

  • Core ML/DL: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX

  • Data processing: pandas, NumPy, Polars, DuckDB, Spark, Dask

  • SQL transformation: SQLMesh, dbt (staging/marts/incremental patterns)

  • Feature stores: Feast, Tecton, Databricks Feature Store (centralized feature management)

  • Data validation: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)

  • Visualization: Matplotlib, Seaborn, Plotly, Streamlit, Dash

  • MLOps: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)

  • Hyperparameter tuning: Optuna, Ray Tune, Hyperopt

  • Model serving: BentoML, FastAPI, TorchServe, Seldon, Ray Serve

  • Orchestration: Kubeflow, Metaflow, Prefect, Airflow, ZenML

  • Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake

Use this skill to execute data science projects end-to-end: concrete checklists, patterns, and templates, not theory.

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.

  • Prefer primary sources; report source links and dates for volatile information.

  • If web access is unavailable, state the limitation and mark guidance as unverified.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
