Data Science Expert
Comprehensive data science frameworks for analytics, machine learning, and data-driven decision making.
Data Strategy
Data Maturity Model
| Level | Name | Characteristics |
|---|---|---|
| 1 | Ad Hoc | Manual, inconsistent, siloed |
| 2 | Opportunistic | Some automation, point solutions |
| 3 | Systematic | Defined processes, governance emerging |
| 4 | Differentiating | Data-driven decisions, advanced analytics |
| 5 | Transformative | AI-first, competitive advantage |
Analytics Value Chain
DATA → INFORMATION → INSIGHT → ACTION → VALUE
PROGRESSION:
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should we do?
Autonomous: Self-optimizing systems
Statistical Analysis
Descriptive Statistics
CENTRAL TENDENCY:
- Mean: Sum / Count (sensitive to outliers)
- Median: Middle value (robust to outliers)
- Mode: Most frequent value
DISPERSION:
- Range: Max - Min
- Variance: Average squared deviation from the mean
- Standard Deviation: √Variance
- IQR: Q3 - Q1 (robust)
DISTRIBUTION SHAPE:
- Skewness: Asymmetry (0 = symmetric, positive = longer right tail)
- Kurtosis: Tail heaviness (3 = normal distribution, or 0 when reported as excess kurtosis)
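The measures above can be computed with the Python standard library alone. The following is a minimal sketch, not a reference implementation: the function name `describe` and the sample values are illustrative assumptions, the variance is the population form (divide by the count), and skewness and kurtosis are the plain standardized moments, so a normal distribution has kurtosis 3.

```python
import math
import statistics


def describe(values):
    """Return the descriptive statistics listed above for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population variance: average squared deviation from the mean.
    variance = statistics.pvariance(values, mu=mean)
    std = math.sqrt(variance)
    # Quartile cut points; "inclusive" treats the data as the whole population.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Standardized third and fourth central moments.
    skewness = sum((x - mean) ** 3 for x in values) / n / std ** 3
    kurtosis = sum((x - mean) ** 4 for x in values) / n / std ** 4
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": variance,
        "std": std,
        "iqr": q3 - q1,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical sample data, used only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Libraries such as pandas or SciPy often report sample variance (divide by count minus one) and excess kurtosis by default, so their numbers can differ from this sketch.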
For detailed inferential statistics and hypothesis testing, see Statistical Methods Reference.
Machine Learning
Algorithm Selection
| Task | Algorithms | When to Use |
|---|---|---|
| Classification | Logistic Regression, Random Forest, XGBoost, Neural Networks | Categorical outcomes |
| Regression | Linear Regression, Ridge/Lasso, Random Forest, XGBoost | Continuous outcomes |
| Clustering | K-Means, Hierarchical, DBSCAN | Group discovery |
| Dimensionality Reduction | PCA, t-SNE, UMAP | Feature reduction, visualization |
| Anomaly Detection | Isolation Forest, One-Class SVM, Autoencoders | Outlier detection |
| Time Series | ARIMA, Prophet, LSTM | Sequential data |
| Recommendation | Collaborative Filtering, Content-Based, Matrix Factorization | Personalization |
| NLP | Transformer models (e.g., BERT, GPT) | Text understanding/generation |
For detailed ML pipelines, feature engineering, and model monitoring, see ML Pipelines Reference.
Data Governance
Data Governance Framework
GOVERNANCE PILLARS:
POLICIES:
- Data ownership
- Data classification
- Data retention
- Data access
- Data quality standards
ROLES:
- Data Owner: Accountable for data domain
- Data Steward: Day-to-day quality management
- Data Custodian: Technical implementation
- Data Consumer: End user
PROCESSES:
- Data cataloging
- Metadata management
- Data lineage
- Issue resolution
- Change management
METRICS:
- Data quality scores
- Policy compliance
- Data access requests
- Issue resolution time
Data Quality Dimensions
| Dimension | Definition | Measurement |
|---|---|---|
| Accuracy | Correct representation of reality | % records matching source |
| Completeness | All required data present | % non-null values |
| Consistency | Same across systems | % matching across sources |
| Timeliness | Available when needed | Latency, freshness |
| Validity | Conforms to format/rules | % passing validation |
| Uniqueness | No unwanted duplicates | Duplicate rate |
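Three of the measurements in the table (completeness, validity, uniqueness) reduce to simple ratios over a set of records. The sketch below is one possible way to compute them, assuming records arrive as a list of dictionaries. The field names `customer_id` and `email`, the email pattern, and the sample records are hypothetical and exist only to make the example runnable.

```python
import re

# Deliberately loose pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(records, field):
    """Share of records where `field` is present and not None."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def validity(records, field, pattern):
    """Share of non-null values of `field` that match `pattern`."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


def duplicate_rate(records, key):
    """Share of records whose `key` repeats a key already seen."""
    if not records:
        return 0.0
    keys = [r.get(key) for r in records]
    return 1 - len(set(keys)) / len(keys)


# Hypothetical records: one missing email, one malformed email, one duplicate ID.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "not-an-email"},
    {"customer_id": 3, "email": "c@example.com"},
]

email_completeness = completeness(records, "email")
email_validity = validity(records, "email", EMAIL_PATTERN)
id_duplicate_rate = duplicate_rate(records, "customer_id")
print(email_completeness, email_validity, id_duplicate_rate)
```

Accuracy, consistency, and timeliness need a second system or a timestamp to compare against, so they are left out of this sketch.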
Business Intelligence
BI Architecture
ARCHITECTURE LAYERS:
DATA SOURCES:
- Operational systems
- External data
- IoT/streaming
DATA INTEGRATION:
- ETL/ELT pipelines
- Data lakes
- Data warehouses
SEMANTIC LAYER:
- Business definitions
- Calculated metrics
- Hierarchies
- Relationships
PRESENTATION:
- Dashboards
- Reports
- Ad-hoc analysis
- Embedded analytics
Dashboard Design Principles
DESIGN PRINCIPLES:
PURPOSE:
- One clear objective per dashboard
- Know your audience
- Enable decisions
LAYOUT:
- Most important top-left
- Related items grouped
- Progressive disclosure
- Whitespace for clarity
VISUALS:
- Right chart for data type
- Consistent formatting
- Minimal decoration
- Color with purpose
INTERACTIVITY:
- Filters for exploration
- Drill-down capability
- Cross-filtering
- Tooltip details
Metric Design
METRIC DEFINITION TEMPLATE:
NAME: [Metric name]
DEFINITION: [Clear business definition]
FORMULA: [Precise calculation]
OWNER: [Responsible person]
DATA SOURCE: [Where it comes from]
GRAIN: [Level of detail]
FREQUENCY: [Update cadence]
DIMENSIONS: [Slicing attributes]
TARGETS: [Goals/benchmarks]
RELATED: [Related metrics]
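The template can also be kept as a typed record so that incomplete definitions are caught before a metric is published. This is a sketch under assumptions: the class name `MetricDefinition`, the `missing_fields` helper, the choice of which fields are required, and the example metric are all hypothetical rather than part of any standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MetricDefinition:
    """One entry of the metric definition template."""
    name: str
    definition: str
    formula: str
    owner: str
    data_source: str
    grain: str
    frequency: str
    dimensions: List[str] = field(default_factory=list)
    targets: str = ""
    related: List[str] = field(default_factory=list)

    def missing_fields(self):
        """Return the names of required text fields that were left blank."""
        required = ("name", "definition", "formula", "owner",
                    "data_source", "grain", "frequency")
        return [f for f in required if not getattr(self, f).strip()]


# Hypothetical example metric, for illustration only.
mau = MetricDefinition(
    name="Monthly Active Users",
    definition="Distinct users with at least one session in the calendar month",
    formula="COUNT(DISTINCT user_id) over sessions in the month",
    owner="Product Analytics",
    data_source="sessions table",
    grain="One row per month",
    frequency="Daily refresh",
    dimensions=["platform", "region"],
    targets="Set by the product team",
    related=["Daily Active Users"],
)
print(mau.missing_fields())
```

Treating TARGETS and RELATED as optional is a design choice in this sketch. A team that requires them can add them to the `required` tuple.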
Predictive Modeling
Use Case Framework
| Use Case | Business Application | Approach |
|---|---|---|
| Churn Prediction | Retention programs | Classification |
| Demand Forecasting | Inventory planning | Time series |
| Lead Scoring | Sales prioritization | Classification |
| Price Optimization | Revenue management | Regression / reinforcement learning |
| Fraud Detection | Risk mitigation | Anomaly detection |
| Recommendation | Personalization | Collaborative filtering |
| Customer Segmentation | Marketing targeting | Clustering |
| Lifetime Value | Customer investment | Regression |
Data Ethics & Privacy
Ethical AI Framework
PRINCIPLES:
FAIRNESS:
- No discriminatory outcomes
- Bias testing across groups
- Regular auditing
ACCOUNTABILITY:
- Clear ownership
- Decision audit trails
- Escalation process
TRANSPARENCY:
- Explainable decisions
- Clear documentation
- User communication
PRIVACY:
- Data minimization
- Consent management
- Security controls
Bias Detection
BIAS TYPES:
HISTORICAL: Reflects past discrimination
REPRESENTATION: Training data not representative
MEASUREMENT: Features or labels are imperfect proxies, or correlate with protected attributes
AGGREGATION: Single model for diverse populations
EVALUATION: Benchmarks or test data not representative of the target population
FAIRNESS METRICS:
- Demographic Parity: Equal positive prediction rates across groups
- Equalized Odds: Equal TPR and FPR across groups
- Individual Fairness: Similar individuals receive similar predictions
- Calibration: Predicted probabilities match observed outcome rates within each group
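Demographic parity and equalized odds can be checked by comparing per-group rates. The following is a minimal sketch for binary labels and binary predictions. The helper names `group_rates` and `max_gap`, the group labels, and the toy data are assumptions made for illustration, and what counts as an acceptable gap is a policy decision that this sketch does not set.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group positive prediction rate, TPR, and FPR for 0/1 labels."""
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0,
                                  "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "positive_rate": c["pred_pos"] / c["n"],
            # Undefined when a group has no actual positives or negatives.
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates


def max_gap(rates, metric):
    """Largest difference in `metric` between any two groups."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)


# Toy data: four records in each of two hypothetical groups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
parity_gap = max_gap(rates, "positive_rate")  # demographic parity
tpr_gap = max_gap(rates, "tpr")               # equalized odds, part 1
fpr_gap = max_gap(rates, "fpr")               # equalized odds, part 2
print(rates, parity_gap, tpr_gap, fpr_gap)
```

A gap of 0 on `positive_rate` corresponds to demographic parity, and gaps of 0 on both `tpr` and `fpr` correspond to equalized odds. Calibration needs predicted probabilities rather than hard labels, so it is not covered here.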
Analytics Team Structure
Team Roles
| Role | Focus | Skills |
|---|---|---|
| Data Engineer | Pipelines, infrastructure | SQL, Python, Spark, Cloud |
| Data Analyst | Reporting, ad-hoc analysis | SQL, BI tools, Statistics |
| Data Scientist | Modeling, ML | Python/R, ML, Statistics |
| ML Engineer | Model deployment | MLOps, Software Engineering |
| Analytics Engineer | Data modeling | dbt, SQL, Data Modeling |
Operating Models
| Model | Description | Best For |
|---|---|---|
| Centralized | Single analytics team | Consistency, efficiency |
| Decentralized | Embedded in business units | Business alignment |
| Hub & Spoke | Central CoE + embedded | Balance of both |
| Federated | Shared platform, domain teams | Scale with autonomy |
References
See Also