# Feature Engineering
Use this skill for creating, transforming, and selecting features that improve model performance.
## When to use this skill

- After EDA — convert insights into features
- Model underperforming — need better representations
- Handling different data types (numerical, categorical, text, datetime)
- Reducing dimensionality or selecting the most predictive features
## Feature engineering workflow
### Numerical features

- Scaling (StandardScaler, MinMaxScaler, RobustScaler)
- Transformations (log, sqrt, Box-Cox for skewness)
- Binning (equal-width, quantile, custom)
- Interaction features
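A minimal sketch of these transforms; the `income`/`age` columns, bin count, and toy values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer, StandardScaler

# Hypothetical numerical frame
X = pd.DataFrame({"income": [30_000, 52_000, 41_000, 250_000],
                  "age": [22, 35, 47, 61]})

# Scaling: zero mean, unit variance (MinMaxScaler / RobustScaler are drop-in alternatives)
X_scaled = StandardScaler().fit_transform(X)

# Transformations for skew: log1p for non-negative values, Box-Cox for strictly positive ones
X["log_income"] = np.log1p(X["income"])
X["boxcox_income"] = PowerTransformer(method="box-cox").fit_transform(X[["income"]]).ravel()

# Binning: quantile bins put roughly equal counts in each bin
X["income_bin"] = KBinsDiscretizer(n_bins=3, encode="ordinal",
                                   strategy="quantile").fit_transform(X[["income"]]).ravel()

# Interaction feature: simple product of two columns
X["income_x_age"] = X["income"] * X["age"]
```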
### Categorical features

- One-hot encoding (low cardinality)
- Target/mean encoding (high cardinality)
- Ordinal encoding (ordered categories)
- Frequency encoding / rare-category handling
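A sketch of the main options, assuming hypothetical `city` and `size` columns plus toy labels `y`; note that `sparse_output` requires scikit-learn ≥ 1.2, and the target-encoding lines need the optional `category_encoders` package:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({"city": ["NYC", "SF", "NYC", "LA"],
                  "size": ["small", "large", "medium", "small"]})
y = [1, 0, 1, 0]  # toy target, only needed for target encoding

# One-hot for low-cardinality columns; unseen categories become all-zero rows
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
city_ohe = ohe.fit_transform(X[["city"]])

# Ordinal encoding with an explicit category order
size_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
X["size_ord"] = size_enc.fit_transform(X[["size"]]).ravel()

# Frequency encoding: map each category to its relative frequency
X["city_freq"] = X["city"].map(X["city"].value_counts(normalize=True))

# Target/mean encoding for high-cardinality columns (category_encoders)
import category_encoders as ce
X["city_te"] = ce.TargetEncoder(cols=["city"]).fit_transform(X[["city"]], y)["city"]
```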
### Datetime features

- Extract components (year, month, day, hour, dayofweek)
- Cyclical encoding (sin/cos for time cycles)
- Time since / duration features
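A sketch of the three patterns, using a hypothetical `ts` timestamp column and an arbitrary reference date:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-05 08:30",
                                          "2024-06-20 17:45",
                                          "2024-11-30 23:10"])})

# Component extraction
df["month"] = df["ts"].dt.month
df["dayofweek"] = df["ts"].dt.dayofweek
df["hour"] = df["ts"].dt.hour

# Cyclical encoding: hour 23 and hour 0 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Time since a reference event (e.g. signup), in days
reference = pd.Timestamp("2024-01-01")
df["days_since_ref"] = (df["ts"] - reference).dt.days
```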
### Text features

- TF-IDF, CountVectorizer
- Embeddings (sentence-transformers)
- Basic text stats (length, word count)
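A sketch with a few toy documents; the embedding lines are commented out because they need the `sentence-transformers` package and a model download, and the model name is just a common default:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = pd.Series(["great product, fast shipping",
                  "terrible support and slow refund",
                  "fast refund, great support"])

# TF-IDF: sparse matrix of term weights over unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_text = tfidf.fit_transform(docs)

# Basic text stats as extra numeric features
stats = pd.DataFrame({"char_len": docs.str.len(),
                      "word_count": docs.str.split().str.len()})

# Dense semantic embeddings (requires sentence-transformers)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = model.encode(docs.tolist())
```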
### Feature selection

- Filter methods (correlation, mutual information)
- Wrapper methods (recursive feature elimination)
- Embedded methods (L1 regularization, tree importance)
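A sketch of one method from each family on synthetic data; the estimators and the `k` / `n_features_to_select` values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: keep the k features with the highest mutual information with the target
X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Wrapper: recursive feature elimination around an estimator
X_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: an L1-penalised model or tree importances decide which features survive
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
X_l1 = SelectFromModel(l1).fit_transform(X, y)
X_tree = SelectFromModel(RandomForestClassifier(random_state=0)).fit_transform(X, y)
```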
## Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| sklearn pipelines | `sklearn.pipeline` + `ColumnTransformer` | Reproducible, cross-validation safe |
| Categorical encoding | `category_encoders` | Beyond sklearn's limited options |
| Feature selection | `sklearn.feature_selection` | Mutual info, RFE, SelectFromModel |
| Text embeddings | `sentence-transformers` | Pre-trained semantic embeddings |
| Auto feature engineering | `Feature-engine` | Comprehensive transformations |
## Core implementation rules
- Use pipelines to prevent leakage

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# numerical_features / categorical_features: lists of column names defined earlier
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])
```
- Fit on train only, transform on all

```python
# Correct: fit_transform on train, transform on test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # Only transform!
```
- Handle unknown categories

```python
OneHotEncoder(handle_unknown='ignore')                # Unknown → all zeros
# OR
OneHotEncoder(handle_unknown='infrequent_if_exist')   # Group rare/unknown (sklearn ≥ 1.1)
```
- Document feature importance
Track which features were created, why, and their expected impact.
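One lightweight way to keep that record is a small registry checked in next to the pipeline code; the structure and field names below are only a suggestion:

```python
# Hypothetical feature registry; names, sources, and impact labels are illustrative
FEATURE_LOG = [
    {"name": "hour_sin", "source": "ts",
     "rationale": "cyclical hour-of-day encoding", "expected_impact": "medium"},
    {"name": "city_freq", "source": "city",
     "rationale": "frequency encoding for a high-cardinality column", "expected_impact": "low"},
]
```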
## Common anti-patterns

- ❌ Fitting preprocessors on the full dataset (leakage!)
- ❌ One-hot encoding high-cardinality features (dimension explosion)
- ❌ Ignoring feature scaling for distance-based models
- ❌ Creating features without domain reasoning
- ❌ Not validating that feature distributions match between train and test (a quick drift check is sketched below)
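One way to catch the last problem is a per-feature two-sample Kolmogorov–Smirnov test between train and test; the threshold and synthetic data below are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(train_col: np.ndarray, test_col: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag a numeric feature whose train/test distributions differ noticeably."""
    result = ks_2samp(train_col, test_col)
    return result.pvalue < alpha

rng = np.random.default_rng(0)
train_income = rng.lognormal(mean=10, sigma=1, size=1000)
test_income = rng.lognormal(mean=10.5, sigma=1, size=1000)  # shifted on purpose
print(drifted(train_income, test_income))  # True: the distributions have drifted
```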
## Progressive disclosure

- ../references/categorical-encoding.md — Comprehensive encoding guide
- ../references/datetime-features.md — Time-based feature patterns
- ../references/text-features.md — NLP feature engineering
- ../references/feature-selection.md — Selection strategies and implementations
## Related skills

- @data-science-eda — Understand data before engineering
- @data-science-model-evaluation — Validate feature impact
- @data-engineering-core — Data processing fundamentals
## References

- sklearn Preprocessing
- category_encoders
- Feature-engine Documentation
- Sentence Transformers