feature-engineer

Feature engineering often makes the difference between mediocre and excellent ML models. This skill transforms raw data into model-ready features through systematic data quality assessment, feature creation, selection, and transformation—all integrated with SpecWeave's increment workflow.

Install skill "feature-engineer" with this command: npx skills add anton-abyzov/specweave/anton-abyzov-specweave-feature-engineer

Feature Engineer

The Feature Engineering Pipeline

Phase 1: Data Quality Assessment

Before creating features, understand your data:

from specweave import DataQualityReport

# Automated data quality check
report = DataQualityReport(df, increment="0042")

# Generates:
# - Missing value analysis
# - Outlier detection
# - Data type validation
# - Distribution analysis
# - Correlation matrix
# - Duplicate detection

Quality Report Output:

Data Quality Report

Dataset Overview

  • Rows: 100,000
  • Columns: 45
  • Memory: 34.2 MB

Missing Values

Column          Missing    Percentage
email           15,234     15.2%
phone           8,901      8.9%
purchase_date   0          0.0%

Outliers Detected

  • transaction_amount: 234 outliers (>3 std dev)
  • user_age: 12 outliers (<18 or >100)

Data Type Issues

  • user_id: Stored as float, should be int
  • date_joined: Stored as string, should be datetime

Recommendations

  1. Impute email/phone or create "missing" indicator features
  2. Cap/remove outliers in transaction_amount
  3. Convert data types for efficiency
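The missing-value table and the ">3 std dev" outlier rule in the report above can be approximated with a few lines of plain Python. This is only a dependency-free sketch: `missing_summary`, `three_sigma_outliers`, and the toy rows are hypothetical names and data, not part of SpecWeave's `DataQualityReport`.

```python
from statistics import mean, stdev


def missing_summary(rows, columns):
    """Return {column: (missing_count, missing_percent)} for a list of dict rows."""
    total = len(rows)
    summary = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        percent = round(100.0 * missing / total, 1) if total else 0.0
        summary[col] = (missing, percent)
    return summary


def three_sigma_outliers(values, n_std=3.0):
    """Return values more than n_std sample standard deviations from the mean."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return []
    mu, sigma = mean(present), stdev(present)
    if sigma == 0:
        return []
    return [v for v in present if abs(v - mu) > n_std * sigma]


# Toy data (hypothetical): 19 ordinary rows plus one extreme row with no email.
rows = [{"email": f"user{i}@example.com", "transaction_amount": 20.0} for i in range(19)]
rows.append({"email": None, "transaction_amount": 1000.0})

summary = missing_summary(rows, ["email", "transaction_amount"])
outliers = three_sigma_outliers([row["transaction_amount"] for row in rows])
```

A single extreme value inflates the standard deviation it is measured against, so on small samples the three-sigma rule can miss outliers that a percentile or IQR rule would flag.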

Phase 2: Feature Creation

Create features from domain knowledge:

from specweave import FeatureCreator

creator = FeatureCreator(df, increment="0042")

# Temporal features (from datetime)
creator.add_temporal_features(
    date_column="purchase_date",
    features=["hour", "day_of_week", "month", "is_weekend", "is_holiday"]
)

# Aggregation features (user behavior)
creator.add_aggregation_features(
    group_by="user_id",
    target="purchase_amount",
    aggs=["mean", "std", "count", "min", "max"]
)
# Creates: user_purchase_amount_mean, user_purchase_amount_std, etc.

# Interaction features
creator.add_interaction_features(
    features=[("age", "income"), ("clicks", "impressions")],
    operations=["multiply", "divide", "subtract"]
)
# Creates: age_x_income, clicks_per_impression, etc.

# Ratio features
creator.add_ratio_features([
    ("revenue", "cost"),
    ("conversions", "visits")
])
# Creates: revenue_to_cost_ratio, conversion_rate

# Binning (discretization)
creator.add_binned_features(
    column="age",
    bins=[0, 18, 25, 35, 50, 65, 100],
    labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"]
)

# Text features (from text columns)
creator.add_text_features(
    column="product_description",
    features=["length", "word_count", "unique_words", "sentiment"]
)

# Generate all features
df_enriched = creator.generate()

# Auto-documents in increment folder
creator.save_feature_definitions(
    path=".specweave/increments/0042.../features/feature_definitions.yaml"
)

Feature Definitions (auto-generated):

# .specweave/increments/0042.../features/feature_definitions.yaml
features:
  - name: purchase_hour
    type: temporal
    source: purchase_date
    description: Hour of purchase (0-23)

  - name: user_purchase_amount_mean
    type: aggregation
    source: purchase_amount
    group_by: user_id
    description: Average purchase amount per user

  - name: age_x_income
    type: interaction
    sources: [age, income]
    operation: multiply
    description: Product of age and income

  - name: conversion_rate
    type: ratio
    sources: [conversions, visits]
    description: Conversion rate (conversions / visits)
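To make the temporal and aggregation features in this phase concrete, here is a minimal standard-library sketch of what such features compute. The helpers `temporal_features` and `group_mean` and the sample purchases are hypothetical; they illustrate the idea and are not SpecWeave's implementation. `is_holiday` is left out because it needs a holiday calendar.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def temporal_features(timestamp):
    """Derive simple calendar features from a datetime (Monday is day 0)."""
    return {
        "hour": timestamp.hour,
        "day_of_week": timestamp.weekday(),
        "month": timestamp.month,
        "is_weekend": timestamp.weekday() >= 5,
    }


def group_mean(rows, group_by, target):
    """Mean of `target` per distinct value of `group_by`."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_by]].append(row[target])
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical sample purchases.
purchases = [
    {"user_id": 1, "purchase_amount": 10.0, "purchase_date": datetime(2024, 1, 6, 14, 30)},
    {"user_id": 1, "purchase_amount": 30.0, "purchase_date": datetime(2024, 1, 8, 9, 0)},
    {"user_id": 2, "purchase_amount": 5.0, "purchase_date": datetime(2024, 1, 7, 23, 15)},
]

user_mean = group_mean(purchases, "user_id", "purchase_amount")

# Attach the derived columns to each row.
enriched = [
    {
        **row,
        **temporal_features(row["purchase_date"]),
        "user_purchase_amount_mean": user_mean[row["user_id"]],
    }
    for row in purchases
]
```

Per-user aggregates like this should be computed from training rows only when the target is derived from the same transactions; otherwise the aggregate can leak information.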

Phase 3: Feature Selection

Reduce dimensionality, improve performance:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from specweave import FeatureSelector

selector = FeatureSelector(X_train, y_train, increment="0042")

# Method 1: Correlation-based (remove redundant features)
selector.remove_correlated_features(threshold=0.95)
# Removes one feature from each pair with correlation above 0.95

# Method 2: Variance-based (remove constant features)
selector.remove_low_variance_features(threshold=0.01)
# Removes features whose variance is below 0.01

# Method 3: Statistical tests
selector.select_by_statistical_test(k=50)
# SelectKBest with chi2/f_classif

# Method 4: Model-based (tree importance)
selector.select_by_model_importance(
    model=RandomForestClassifier(),
    threshold=0.01
)
# Removes features with <1% importance

# Method 5: Recursive Feature Elimination
selector.select_by_rfe(
    model=LogisticRegression(),
    n_features=30
)

# Get selected features
selected_features = selector.get_selected_features()

# Generate selection report
selector.generate_report()

Feature Selection Report:

Feature Selection Report

Original Features: 125

Selected Features: 35 (72% reduction)

Selection Process

  1. Removed 12 correlated features (>95% correlation)
  2. Removed 8 low-variance features
  3. Statistical test: Selected top 50 (chi-squared)
  4. Model importance: Removed 15 low-importance features (<1%)

Top 10 Features (by importance)

  1. user_purchase_amount_mean (0.18)
  2. days_since_last_purchase (0.12)
  3. total_purchases (0.10)
  4. age_x_income (0.08)
  5. conversion_rate (0.07) ...

Removed Features

  • user_id_hash (constant)
  • temp_feature_1 (99% correlated with temp_feature_2)
  • random_noise (0% importance) ...
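The correlation-based step in the report above can be illustrated with a small greedy filter: keep a column only if it is not too strongly correlated with any column already kept. This is a sketch under assumptions. `pearson` and `drop_correlated` are hypothetical helpers, the data is made up, and SpecWeave's selector may break ties between correlated features differently.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: temp_feature_1 is roughly 2 * temp_feature_2.
columns = {
    "temp_feature_2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp_feature_1": [2.1, 4.0, 6.2, 8.1, 9.9],
    "days_since_last_purchase": [30.0, 2.0, 14.0, 7.0, 21.0],
}

kept = drop_correlated(columns)
```

The result depends on column order, because the first feature of a correlated pair is the one that survives.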

Phase 4: Feature Transformation

Scale, normalize, encode for model compatibility:

import numpy as np
from specweave import FeatureTransformer

transformer = FeatureTransformer(increment="0042")

# Numerical transformations
transformer.add_numerical_transformer(
    columns=["age", "income", "purchase_amount"],
    method="standard_scaler"  # Or: min_max, robust, quantile
)

# Categorical encoding
transformer.add_categorical_encoder(
    columns=["country", "device_type", "product_category"],
    method="onehot",  # Or: label, target, binary
    handle_unknown="ignore"
)

# Ordinal encoding (for ordered categories)
transformer.add_ordinal_encoder(
    column="education",
    order=["high_school", "bachelors", "masters", "phd"]
)

# Log transformation (for skewed distributions)
transformer.add_log_transform(
    columns=["transaction_amount", "page_views"],
    method="log1p"  # log(1 + x) to handle zeros
)

# Box-Cox transformation (reduces skew; requires strictly positive values)
transformer.add_power_transform(
    columns=["revenue", "engagement_score"],
    method="box-cox"
)

# Custom transformation: clip to the 1st and 99th percentiles
def clip_outliers(x):
    return np.clip(x, x.quantile(0.01), x.quantile(0.99))

transformer.add_custom_transformer(
    columns=["outlier_prone_feature"],
    func=clip_outliers
)

# Fit and transform
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

# Save transformer pipeline
transformer.save(
    path=".specweave/increments/0042.../features/transformer.pkl"
)
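Two of the transformations above are easy to show in isolation: one-hot encoding that ignores categories unseen at fit time (the behaviour `handle_unknown="ignore"` suggests) and `log1p` for skewed values containing zeros. The class `OneHotSketch` is a hypothetical stand-in written for this example, not the encoder SpecWeave uses.

```python
from math import log1p


class OneHotSketch:
    """Minimal one-hot encoder: unknown categories become an all-zero row."""

    def fit(self, values):
        # Categories are learned once, from the training values only.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        return [[1 if v == c else 0 for c in self.categories_] for v in values]


encoder = OneHotSketch().fit(["mobile", "desktop", "mobile"])
train_encoded = encoder.transform(["mobile", "desktop", "mobile"])
# "tablet" was never seen during fit, so it encodes as all zeros.
test_encoded = encoder.transform(["tablet", "desktop"])

# log1p(x) = log(1 + x), so zero page views map to 0.0 instead of failing.
page_views = [0, 9, 99]
page_views_log = [log1p(v) for v in page_views]
```

Because the category list is fixed at fit time, the encoded width stays the same between training and serving, which is what lets a saved transformer be reused in production.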

Phase 5: Feature Validation

Ensure features are production-ready:

from specweave import FeatureValidator

validator = FeatureValidator(
    X_train, X_test,
    increment="0042"
)

# Check for data leakage
leakage_report = validator.check_data_leakage()
# Detects: perfectly correlated features, future data in training

# Check for distribution drift
drift_report = validator.check_distribution_drift()
# Compares train vs test distributions

# Check for missing values after transformation
missing_report = validator.check_missing_values()

# Check for infinite/NaN values
invalid_report = validator.check_invalid_values()

# Generate validation report
validator.generate_report()

Validation Report:

Feature Validation Report

Data Leakage: ✅ PASS

No perfect correlations detected between train and test.

Distribution Drift: ⚠️ WARNING

Features with significant drift (KS test p < 0.05):

  • user_age: p=0.023 (minor drift)
  • device_type: p=0.001 (major drift)

Recommendation: Check if test data is from different time period.

Missing Values: ✅ PASS

No missing values after transformation.

Invalid Values: ✅ PASS

No infinite or NaN values detected.

Overall: READY FOR TRAINING

2 warnings, 0 critical issues.
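The drift check in the report above is based on the two-sample Kolmogorov-Smirnov test. Its statistic is simply the largest gap between the two empirical CDFs, which the sketch below computes in plain Python. `ks_statistic` and the sample ages are hypothetical; a p-value like those in the report would need something such as `scipy.stats.ks_2samp`, and the test is meant for continuous features rather than categorical ones.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap always occurs at one of the observed data points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical train/test samples for one numeric feature.
train_age = [20, 25, 30, 35, 40]
test_age = [30, 35, 40, 45, 50]

drift = ks_statistic(train_age, test_age)
no_drift = ks_statistic(train_age, train_age)
```

A statistic near 0 means the two distributions look alike, and a value near 1 means they barely overlap.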

Integration with SpecWeave

Automatic Feature Documentation

# All feature engineering steps logged to increment
with track_experiment("feature-engineering-v1", increment="0042") as exp:
    # Create features
    df_enriched = creator.generate()

    # Select features
    selected = selector.select()

    # Transform features
    X_transformed = transformer.fit_transform(X)

    # Validate
    validation = validator.validate()

    # Auto-logs:
    exp.log_param("original_features", 125)
    exp.log_param("created_features", 45)
    exp.log_param("selected_features", 35)
    exp.log_metric("feature_reduction", 0.72)
    exp.save_artifact("feature_definitions.yaml")
    exp.save_artifact("transformer.pkl")
    exp.save_artifact("validation_report.md")

Living Docs Integration

After completing feature engineering:

/sw:sync-docs update

Updates:

<!-- .specweave/docs/internal/architecture/feature-engineering.md -->

Recommendation Model Features (Increment 0042)

Feature Engineering Pipeline

  1. Data Quality: 100K rows, 45 columns
  2. Created: 45 new features (temporal, aggregation, interaction)
  3. Selected: 35 features (72% reduction via importance + RFE)
  4. Transformed: StandardScaler for numerical, OneHot for categorical

Key Features

  • user_purchase_amount_mean: Average user spend (top feature, 18% importance)
  • days_since_last_purchase: Recency indicator (12% importance)
  • age_x_income: Interaction feature (8% importance)

Feature Store

All features documented in: .specweave/increments/0042.../features/

  • feature_definitions.yaml: Feature catalog
  • transformer.pkl: Production transformation pipeline
  • validation_report.md: Quality checks

Best Practices

  1. Document Feature Rationale

# Bad: Create features without explanation
df["feature_1"] = df["col_a"] * df["col_b"]

# Good: Document why features were created
creator.add_interaction_feature(
    sources=["age", "income"],
    operation="multiply",
    rationale="High-income older users have different behavior patterns"
)

  2. Handle Missing Values Systematically

from sklearn.ensemble import RandomForestRegressor

# Options for missing values:

# 1. Imputation (mean, median, mode)
creator.impute_missing(column="age", strategy="median")

# 2. Indicator features (flag missing as signal)
creator.add_missing_indicator(column="email")
# Creates: email_missing (0/1)

# 3. Forward/backward fill (for time series)
creator.fill_missing(column="sensor_reading", method="ffill")

# 4. Model-based imputation
creator.impute_with_model(column="income", model=RandomForestRegressor())
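The first three options above come down to very little logic, sketched here without any dependencies. The helper names and sample values are hypothetical and only illustrate what each strategy does to a column where `None` marks a missing entry.

```python
from statistics import median


def missing_indicator(values):
    """1 where the value is missing, 0 otherwise."""
    return [1 if v is None else 0 for v in values]


def impute_median(values):
    """Replace missing entries with the median of the observed entries."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


ages = [25, None, 40, 31]
age_missing = missing_indicator(ages)  # build the flag before imputing
ages_imputed = impute_median(ages)

sensor_reading = [1.0, None, None, 2.5]
sensor_filled = forward_fill(sensor_reading)
```

In a real pipeline the fill value would be computed on the training split and reused for test and production data, for the same reason scalers are fit on training data only.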

  3. Avoid Data Leakage

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# ❌ WRONG: Fit on all data (includes test set!)
scaler.fit(X)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ✅ CORRECT: Fit only on train, transform both
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SpecWeave's transformer enforces this pattern
transformer.fit_transform(X_train)  # Fits on train, then transforms it
transformer.transform(X_test)       # Only transforms
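Why fitting on all data leaks can be seen with a three-line scaler. In this sketch, `StandardScalerSketch` is a hypothetical minimal version of standard scaling written for the example: the statistics learned in `fit` change as soon as test rows are included.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Minimal standard scaler: statistics are learned in fit() only."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


x_train = [10.0, 20.0, 30.0]
x_test = [40.0]

# Correct: statistics come from the training split alone.
scaler = StandardScalerSketch().fit(x_train)
x_test_scaled = scaler.transform(x_test)

# Leaky: the test value shifts the mean that training rows are scaled with.
leaky_scaler = StandardScalerSketch().fit(x_train + x_test)
```

Here the training mean is 20.0, but the leaky scaler learns 25.0, so every training row would be scaled using information from the test set.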

  4. Version Feature Engineering Pipeline

# Version features with increment
transformer.save(
    path=".specweave/increments/0042.../features/transformer-v1.pkl",
    metadata={
        "version": "v1",
        "features": selected_features,
        "transformations": ["standard_scaler", "onehot"]
    }
)

# Load specific version for reproducibility
transformer_v1 = FeatureTransformer.load(
    ".specweave/increments/0042.../features/transformer-v1.pkl"
)

  5. Test Feature Engineering on New Data

# Before deploying, test on a sample of production data
X_production_sample = load_production_data()

try:
    X_transformed = transformer.transform(X_production_sample)
except Exception as e:
    # Chain the original exception so the root cause stays in the traceback
    raise FeatureEngineeringError(f"Failed on production data: {e}") from e

# Check for unexpected values
validator = FeatureValidator(X_train, X_production_sample)
validation_report = validator.validate()

if validation_report["status"] == "CRITICAL":
    raise FeatureEngineeringError("Feature engineering failed validation")

Common Feature Engineering Patterns

Pattern 1: RFM (Recency, Frequency, Monetary)

# For e-commerce / customer analytics
creator.add_rfm_features(
    user_id="user_id",
    transaction_date="purchase_date",
    transaction_amount="purchase_amount"
)
# Creates:
# - recency: days since last purchase
# - frequency: total purchases
# - monetary: total spend
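The three RFM values listed above can be computed per user from a plain transaction list. This is a sketch with hypothetical names and data. The `as_of` reference date is an assumption of the example: recency is always measured relative to some cutoff, and the source does not say which one `add_rfm_features` uses.

```python
from collections import defaultdict
from datetime import date


def rfm_features(transactions, as_of):
    """Compute recency (days), frequency (count) and monetary (sum) per user."""
    per_user = defaultdict(list)
    for user_id, purchase_date, amount in transactions:
        per_user[user_id].append((purchase_date, amount))

    features = {}
    for user_id, items in per_user.items():
        last_purchase = max(purchase_date for purchase_date, _ in items)
        features[user_id] = {
            "recency": (as_of - last_purchase).days,
            "frequency": len(items),
            "monetary": sum(amount for _, amount in items),
        }
    return features


# Hypothetical transactions: (user_id, purchase_date, purchase_amount).
transactions = [
    ("u1", date(2024, 3, 1), 40.0),
    ("u1", date(2024, 3, 20), 60.0),
    ("u2", date(2024, 1, 15), 25.0),
]

rfm = rfm_features(transactions, as_of=date(2024, 3, 31))
```

Only transactions dated before the prediction cutoff should feed these features, otherwise they describe behaviour the model would not know at prediction time.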

Pattern 2: Rolling Window Aggregations

# For time series
creator.add_rolling_features(
    column="daily_sales",
    windows=[7, 14, 30],
    aggs=["mean", "std", "min", "max"]
)
# Creates: daily_sales_7day_mean, daily_sales_7day_std, etc.
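A rolling aggregation is a statistic over the trailing `window` observations. The sketch below shows a trailing mean in plain Python. `rolling_mean` is a hypothetical helper, and returning `None` until a full window is available is one possible convention, not necessarily the one SpecWeave follows.

```python
from statistics import mean


def rolling_mean(values, window):
    """Trailing mean over `window` points; None until a full window exists."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(mean(values[i + 1 - window : i + 1]))
    return result


daily_sales = [10, 20, 30, 40, 50]
daily_sales_3day_mean = rolling_mean(daily_sales, 3)
```

This window includes the current day. When the feature is used to forecast that same day's value, the series should be shifted by one step first so the window only covers past observations.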

Pattern 3: Target Encoding (Categorical → Numerical)

# Encode categorical as target mean (careful: can leak!)
creator.add_target_encoding(
    column="product_category",
    target="purchase_amount",
    cv_folds=5  # Cross-validation to prevent leakage
)
# Creates: product_category_target_encoded
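The reason cross-validation folds prevent leakage here is that each row is encoded with target means computed from the other folds only, so a row never sees its own target. A minimal out-of-fold sketch follows. `target_encode_oof` and its round-robin fold assignment are assumptions made for illustration, not SpecWeave's implementation.

```python
from statistics import mean


def target_encode_oof(categories, targets, n_folds=5):
    """Out-of-fold target encoding: each row uses means from the other folds."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Round-robin split: row i belongs to fold i % n_folds.
        train_idx = [i for i in range(n) if i % n_folds != fold]
        holdout_idx = [i for i in range(n) if i % n_folds == fold]

        by_category = {}
        for i in train_idx:
            by_category.setdefault(categories[i], []).append(targets[i])
        fallback = mean(targets[i] for i in train_idx)

        for i in holdout_idx:
            seen = by_category.get(categories[i])
            # Categories absent from the training folds get the fold's overall mean.
            encoded[i] = mean(seen) if seen else fallback
    return encoded


categories = ["a", "a", "b", "b"]
targets = [1.0, 3.0, 10.0, 20.0]
encoded = target_encode_oof(categories, targets, n_folds=2)
```

Row 0 (category "a", target 1.0) is encoded as 3.0, the mean of the other "a" row, rather than as a value that includes its own target.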

Pattern 4: Polynomial Features

# For non-linear relationships
creator.add_polynomial_features(
    columns=["age", "income"],
    degree=2,
    interaction_only=False  # assuming scikit-learn semantics, True would keep only age*income
)
# Creates: age^2, income^2, age*income
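Degree-2 polynomial features are all pairwise products of the input columns, including each column with itself. The sketch below makes the effect of an `interaction_only` flag explicit, assuming it behaves as in scikit-learn's `PolynomialFeatures`, where `interaction_only=True` keeps only products of distinct features. `degree2_features` is a hypothetical helper.

```python
from itertools import combinations, combinations_with_replacement


def degree2_features(row, interaction_only=False):
    """Degree-2 terms for one row given as {feature_name: value}."""
    names = list(row)
    if interaction_only:
        pairs = combinations(names, 2)  # distinct features only
    else:
        pairs = combinations_with_replacement(names, 2)  # includes squares
    terms = {}
    for a, b in pairs:
        key = f"{a}^2" if a == b else f"{a}*{b}"
        terms[key] = row[a] * row[b]
    return terms


row = {"age": 30, "income": 50}
all_terms = degree2_features(row)
cross_terms = degree2_features(row, interaction_only=True)
```

With two input columns the full expansion has three terms and the interaction-only expansion has one, so the squared terms appear only when the flag is off.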

Commands

# Generate feature engineering pipeline for increment
/ml:engineer-features 0042

# Validate features before training
/ml:validate-features 0042

# Generate feature importance report
/ml:feature-importance 0042

Integration with Other Skills

  • ml-pipeline-orchestrator: Task 2 is "Feature Engineering" (uses this skill)

  • experiment-tracker: Logs all feature engineering experiments

  • model-evaluator: Uses feature importance from models

  • ml-deployment-helper: Packages feature transformer for production

Summary

Feature engineering often determines how well a model performs. This skill ensures:

  • ✅ Systematic approach (quality → create → select → transform → validate)

  • ✅ No data leakage (train/test separation enforced)

  • ✅ Production-ready (versioned, validated, documented)

  • ✅ Reproducible (all steps tracked in increment)

  • ✅ Traceable (feature definitions in living docs)

Good features make mediocre models great. Great features make good models excellent.
