Data Wizard
Full-stack data science and ML engineering — from exploratory data analysis through model deployment strategy. Adapts approach based on complexity classification.
Canonical Vocabulary
| Term | Definition |
|---|---|
| EDA | Exploratory Data Analysis — systematic profiling and summarization of a dataset |
| feature | An individual measurable property used as input to a model |
| feature engineering | Creating, transforming, or selecting features to improve model performance |
| hypothesis test | A statistical procedure to determine if observed data supports a claim |
| p-value | Probability of observing data at least as extreme as the actual results, assuming the null hypothesis is true |
| effect size | Magnitude of a difference or relationship, independent of sample size |
| power analysis | Determining sample size needed to detect an effect of a given size |
| CUPED | Controlled-experiment Using Pre-Experiment Data — variance reduction technique for A/B tests |
| MLOps maturity | Level 0 (manual), Level 1 (ML pipeline), Level 2 (CI/CD + CT), Level 3 (full automation) |
| data quality score | Composite metric across completeness, consistency, accuracy, timeliness, uniqueness |
| profile | Statistical summary of a dataset: types, distributions, missing patterns, correlations |
| anomaly | Data point or pattern deviating significantly from expected behavior |
Dispatch
| $ARGUMENTS | Action |
|---|---|
| eda <data> | EDA — profile dataset, summary stats, missing patterns, distributions |
| model <task> | Model Selection — recommend models, libraries, training plan for task |
| features <data> | Feature Engineering — suggest transformations, encoding, selection pipeline |
| stats <question> | Stats — select and design statistical hypothesis test |
| viz <data> | Visualization — recommend chart types, encodings, layout for data |
| experiment <hypothesis> | Experiment Design — A/B test design, power analysis, CUPED |
| timeseries <data> | Time Series — forecasting approach, decomposition, model selection |
| anomaly <data> | Anomaly Detection — detection approach, algorithm selection, threshold strategy |
| mlops <model> | MLOps — serving strategy, deployment pipeline, monitoring plan |
| Natural language about data | Auto-detect — classify intent, route to appropriate mode |
| Empty | Gallery — show common data science tasks with mode recommendations |
Auto-Detection Heuristic
If no mode keyword matches:
- Mentions dataset, CSV, columns, rows, missing values → EDA
- Mentions predict, classify, regression, recommend → Model Selection
- Mentions transform, encode, scale, normalize, one-hot → Feature Engineering
- Mentions test, significant, p-value, hypothesis, correlation → Stats
- Mentions chart, plot, graph, visualize, dashboard → Visualization
- Mentions A/B, experiment, control group, treatment, lift → Experiment Design
- Mentions forecast, seasonal, trend, time series, lag → Time Series
- Mentions outlier, anomaly, fraud, unusual, deviation → Anomaly Detection
- Mentions deploy, serve, pipeline, monitor, retrain → MLOps
- Ambiguous → ask: "Which area: EDA, modeling, stats, or something else?"
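The heuristic above amounts to a first-match keyword lookup. The sketch below mirrors the keyword families in document order; the whitespace tokenization is an assumption for illustration, and a real router would also handle stems and multi-word phrases.

```python
# First keyword family that matches the query wins; order follows the heuristic,
# so earlier families take precedence. Ambiguous queries fall through to "ask".
KEYWORD_ROUTES = [
    ("EDA", {"dataset", "csv", "columns", "rows", "missing"}),
    ("Model Selection", {"predict", "classify", "regression", "recommend"}),
    ("Feature Engineering", {"transform", "encode", "scale", "normalize", "one-hot"}),
    ("Stats", {"test", "significant", "p-value", "hypothesis", "correlation"}),
    ("Visualization", {"chart", "plot", "graph", "visualize", "dashboard"}),
    ("Experiment Design", {"a/b", "experiment", "control", "treatment", "lift"}),
    ("Time Series", {"forecast", "seasonal", "trend", "lag"}),
    ("Anomaly Detection", {"outlier", "anomaly", "fraud", "unusual", "deviation"}),
    ("MLOps", {"deploy", "serve", "pipeline", "monitor", "retrain"}),
]

def detect_mode(query: str) -> str:
    words = set(query.lower().split())
    for mode, keywords in KEYWORD_ROUTES:
        if words & keywords:
            return mode
    return "ask"  # ambiguous: prompt the user for the area
```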
Gallery (Empty Arguments)
Present common data science tasks:
| # | Task | Mode | Example |
|---|---|---|---|
| 1 | Profile a dataset | eda | /data-wizard eda customer_data.csv |
| 2 | Choose a model | model | /data-wizard model "predict churn from usage features" |
| 3 | Engineer features | features | /data-wizard features sales_data.csv |
| 4 | Pick a stat test | stats | /data-wizard stats "is conversion rate different between groups?" |
| 5 | Choose visualizations | viz | /data-wizard viz time_series_metrics.csv |
| 6 | Design an experiment | experiment | /data-wizard experiment "new checkout flow increases conversion" |
| 7 | Forecast time series | timeseries | /data-wizard timeseries monthly_revenue.csv |
| 8 | Detect anomalies | anomaly | /data-wizard anomaly server_metrics.csv |
| 9 | Plan deployment | mlops | /data-wizard mlops "churn prediction model" |
Pick a number or describe your data science task.
Skill Awareness
Before starting, check if another skill is a better fit:
| Signal | Redirect |
|---|---|
| Database schema, SQL optimization, indexing | Suggest database-architect |
| Frontend dashboard code, React/D3 components | Suggest relevant frontend skill |
| Data pipeline, ETL, orchestration (Airflow, dbt) | Out of scope — suggest data engineering tools |
| Production infrastructure, Kubernetes, scaling | Suggest devops-engineer or infrastructure-coder |
Complexity Classification
Score the query on 4 dimensions (0-2 each, total 0-8):
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Data complexity | Single table, clean | Multi-table, some nulls | Messy, multi-source, mixed types |
| Analysis depth | Descriptive stats | Inferential / predictive | Multi-stage pipeline, iteration |
| Domain specificity | General / well-known | Domain conventions apply | Deep domain expertise needed |
| Tooling breadth | Single library suffices | 2-3 libraries needed | Full ML stack integration |
| Total | Tier | Strategy |
|---|---|---|
| 0-2 | Quick | Single inline analysis — eda, viz, stats |
| 3-5 | Standard | Multi-step workflow — features, model, experiment, timeseries, anomaly |
| 6-8 | Full Pipeline | Orchestrated — mlops, complex multi-stage analysis |
Present the scoring to the user, who can override the tier.
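The rubric reduces to a small function: sum the four dimension scores (each 0-2) and map the total to a tier. A minimal sketch:

```python
def classify_tier(data: int, depth: int, domain: int, tooling: int) -> tuple[int, str]:
    """Sum four 0-2 dimension scores and map the 0-8 total to a complexity tier."""
    scores = (data, depth, domain, tooling)
    assert all(0 <= s <= 2 for s in scores), "each dimension is scored 0-2"
    total = sum(scores)
    if total <= 2:
        tier = "Quick"
    elif total <= 5:
        tier = "Standard"
    else:
        tier = "Full Pipeline"
    return total, tier
```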
Mode Protocols
EDA (Quick)
- If file path provided, run: `!uv run python skills/data-wizard/scripts/data-profiler.py "$1"`
- Parse JSON output — present: row/col counts, dtypes, missing patterns, top correlations
- Highlight: data quality issues, distribution skews, potential target leakage
- Recommend next steps: cleaning, feature engineering, or modeling
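As a rough illustration of the profile fields this step presents (this is not the bundled `data-profiler.py`, whose internals are not shown here, just a pandas sketch of the same kind of output):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Sketch of an EDA profile: shape, dtypes, missing patterns, top correlations."""
    numeric = df.select_dtypes("number")
    pairs = (
        numeric.corr().abs().unstack()
        .loc[lambda s: s < 1.0]          # drop self-correlations
        .sort_values(ascending=False)
    )
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        "top_correlations": pairs.head(6).to_dict(),  # each pair appears twice
    }
```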
Model Selection (Standard)
- Run: `!uv run python skills/data-wizard/scripts/model-recommender.py` with task JSON input
- Present ranked model recommendations with rationale
- Read `references/model-selection.md` for detailed guidance by data size and type
- Suggest: train/val/test split strategy, evaluation metrics, baseline approach
Feature Engineering (Standard)
- If file path, run data profiler first for column analysis
- Read `references/feature-engineering.md` for patterns by data type
- Load `data/feature-engineering-patterns.json` for structured recommendations
- Suggest: transformations, encodings, interaction features, selection methods
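A minimal pandas sketch of the kinds of transformations this mode suggests. The column names (`plan`, `spend`, `signup`) are hypothetical, chosen to cover a categorical, a skewed numeric, and a timestamp column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: plan (categorical), spend (right-skewed), signup (timestamp).
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "team"],
    "spend": [0.0, 49.0, 0.0, 199.0],
    "signup": pd.to_datetime(["2024-01-05", "2024-03-10", "2024-06-01", "2024-07-15"]),
})

features = pd.concat(
    [
        pd.get_dummies(df["plan"], prefix="plan"),       # one-hot encode categorical
        np.log1p(df["spend"]).rename("log_spend"),       # tame right skew
        df["signup"].dt.dayofweek.rename("signup_dow"),  # temporal feature
    ],
    axis=1,
)
```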
Stats (Quick)
- Run: `!uv run python skills/data-wizard/scripts/statistical-test-selector.py` with question parameters
- Load `data/statistical-tests-tree.json` for decision tree
- Read `references/statistical-tests.md` for assumptions and interpretation guidance
- Present: recommended test, alternatives, assumptions to verify, interpretation template
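For example, a Welch's t-test reported together with Cohen's d, per the critical rule that effect size always travels with the p-value. This is an illustrative sketch, not the bundled selector script:

```python
import numpy as np
from scipy import stats

def welch_with_effect(a, b) -> dict:
    """Welch's t-test plus Cohen's d, so significance and magnitude are reported together."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch: no equal-variance assumption
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled_sd
    return {"t": float(t), "p": float(p), "cohens_d": float(d)}
```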
Visualization (Quick)
- Load `data/visualization-grammar.json` for chart type selection
- Match data characteristics to visualization types
- Recommend: chart type, encoding channels, color palette, layout
Experiment Design (Standard)
- Read `references/experiment-design.md` for A/B test patterns
- Design: hypothesis, metrics, sample size (power analysis), duration
- Address: novelty effects, multiple comparisons, CUPED variance reduction
- Output: experiment brief with decision criteria
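The power-analysis and CUPED steps can be sketched directly. The sample-size function uses the standard normal approximation for a two-sided two-proportion z-test; both functions are illustrations, not a prescribed implementation:

```python
import numpy as np
from scipy.stats import norm

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided two-proportion z-test (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil(z**2 * var / (p1 - p2) ** 2))

def cuped_adjust(y, x):
    """Adjust metric y with pre-experiment covariate x: variance drops, mean is preserved."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    theta = np.cov(y, x, ddof=1)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())
```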
Time Series (Standard)
- If file path, run data profiler for temporal patterns
- Assess: stationarity, seasonality, trend, autocorrelation
- Recommend: decomposition method, forecasting model, validation strategy
- Address: cross-validation for time series (walk-forward), feature lags
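A minimal walk-forward split using scikit-learn's `TimeSeriesSplit`, showing that every fold trains strictly on the past and tests on the next block:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(12)  # toy monthly series
folds = list(TimeSeriesSplit(n_splits=3).split(series))
for train_idx, test_idx in folds:
    # No temporal leakage: the entire training window precedes the test window.
    assert train_idx.max() < test_idx.min()
```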
Anomaly Detection (Standard)
- Classify: point anomalies, contextual anomalies, collective anomalies
- Recommend: algorithm (Isolation Forest, LOF, DBSCAN, autoencoder, etc.)
- Address: threshold selection, false positive management, interpretability
- Suggest: alerting strategy, root cause investigation framework
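For point anomalies, an Isolation Forest sketch on synthetic data with planted outliers; `contamination` is exactly the threshold choice discussed above, since it fixes the fraction of points flagged:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=(500, 2))
X[:5] += 8.0  # plant five obvious point anomalies

# contamination is the threshold choice: the fraction of points labeled -1 (anomalous)
labels = IsolationForest(contamination=0.02, random_state=42).fit_predict(X)
flagged = set(np.where(labels == -1)[0])
```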
MLOps (Full Pipeline)
- Read `references/mlops-maturity.md` for maturity model
- Assess current maturity level (0-3)
- Design: serving strategy (batch vs real-time), monitoring, retraining triggers
- Address: model versioning, A/B testing in production, rollback strategy
- Output: deployment architecture brief
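One concrete retraining trigger is feature drift measured by the Population Stability Index between training and live distributions. This is an illustrative implementation with conventional rule-of-thumb cutoffs, not a prescribed part of the mode:

```python
import numpy as np

def psi(expected, actual, bins: int = 10) -> float:
    """Population Stability Index between a training sample and a live sample.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 retrain trigger."""
    expected, actual = np.asarray(expected, float), np.asarray(actual, float)
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # equal-frequency bins
    actual = np.clip(actual, edges[0], edges[-1])  # fold outliers into the edge bins
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))
```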
Data Quality Assessment
Run: `!uv run python skills/data-wizard/scripts/data-quality-scorer.py <path>`
Dimensions scored:
| Dimension | Weight | Checks |
|---|---|---|
| Completeness | 25% | Missing values, null patterns |
| Consistency | 20% | Type uniformity, format violations |
| Accuracy | 20% | Range violations, statistical outliers |
| Timeliness | 15% | Stale records, temporal gaps |
| Uniqueness | 20% | Duplicates, near-duplicates |
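The composite follows directly from the weights table, assuming each dimension is already scored in [0, 1] (how the bundled scorer normalizes its checks is not specified here):

```python
# Weights taken from the dimensions table above; they sum to 1.0.
WEIGHTS = {
    "Completeness": 0.25, "Consistency": 0.20, "Accuracy": 0.20,
    "Timeliness": 0.15, "Uniqueness": 0.20,
}

def quality_score(scores: dict[str, float]) -> float:
    """Weighted composite of per-dimension scores in [0, 1]."""
    assert set(scores) == set(WEIGHTS), "score every dimension exactly once"
    return round(sum(WEIGHTS[d] * s for d, s in scores.items()), 4)
```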
Reference File Index
| File | Content | Read When |
|---|---|---|
| `references/statistical-tests.md` | Decision tree for test selection, assumptions, interpretation | Stats mode |
| `references/model-selection.md` | Model catalog by task type, data size, interpretability needs | Model Selection mode |
| `references/feature-engineering.md` | Patterns by data type: numeric, categorical, temporal, text, geospatial | Feature Engineering mode |
| `references/experiment-design.md` | A/B test patterns, CUPED, power analysis, multiple comparison corrections | Experiment Design mode |
| `references/mlops-maturity.md` | Maturity levels 0-3, deployment patterns, monitoring strategy | MLOps mode |
| `references/data-quality.md` | Quality framework, scoring dimensions, remediation strategies | EDA mode, Data Quality Assessment |
Loading rule: Load ONE reference at a time per the "Read When" column. Do not preload.
Critical Rules
- Always run data profiler before recommending models or features — never guess at data characteristics without evidence
- Present classification scoring before executing analysis — user must see and can override complexity tier
- Never recommend a statistical test without stating its assumptions — untested assumptions invalidate results
- Always specify effect size alongside p-values — statistical significance without practical significance is misleading
- Model recommendations must include a baseline — always start with the simplest viable model (logistic regression, linear regression, naive forecast)
- Never skip train/test split strategy — leakage is the most common ML mistake
- Experiment designs must include power analysis — underpowered experiments waste resources
- Feature engineering must address target leakage risk — flag any feature derived from post-outcome data
- Time series cross-validation must use walk-forward — random splits violate temporal ordering
- MLOps recommendations must assess current maturity — do not recommend Level 3 automation for Level 0 teams
- Load ONE reference file at a time — do not preload all references into context
- Data quality scores must be computed, not estimated — run the scorer script on actual data
Canonical terms (use these exactly throughout):
- Modes: "EDA", "Model Selection", "Feature Engineering", "Stats", "Visualization", "Experiment Design", "Time Series", "Anomaly Detection", "MLOps"
- Tiers: "Quick", "Standard", "Full Pipeline"
- Quality dimensions: "Completeness", "Consistency", "Accuracy", "Timeliness", "Uniqueness"
- MLOps levels: "Level 0" (manual), "Level 1" (pipeline), "Level 2" (CI/CD+CT), "Level 3" (full auto)