# Exploratory Data Analysis

Discover patterns, anomalies, and relationships in tabular data through statistical analysis and visualization.

Supported formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle.
## Standard Workflow

1. Run statistical analysis:

   ```bash
   python scripts/eda_analyzer.py <data_file> -o <output_dir>
   ```

2. Generate visualizations:

   ```bash
   python scripts/visualizer.py <data_file> -o <output_dir>
   ```

3. Read analysis results from `<output_dir>/eda_analysis.json`.
4. Create a report using the `assets/report_template.md` structure.
5. Present findings with key insights and visualizations.
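If it is more convenient to drive both steps from Python rather than a shell, a minimal sketch is below; the file name and output directory are just the example values used later in this document.

```python
import subprocess

# Steps 1 and 2 of the workflow, invoked from Python; the analyzer and
# visualizer write into the same output directory
for script in ("scripts/eda_analyzer.py", "scripts/visualizer.py"):
    subprocess.run(["python", script, "sales_data.csv", "-o", "./output"], check=True)
```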
## Analysis Capabilities

### Statistical Analysis

Run `scripts/eda_analyzer.py` to generate a comprehensive analysis:

```bash
python scripts/eda_analyzer.py sales_data.csv -o ./output
```

This produces `output/eda_analysis.json` containing:
- Dataset shape, types, and memory usage
- Missing data patterns and percentages
- Summary statistics (numeric and categorical)
- Outlier detection (IQR and Z-score methods; see the sketch after this list)
- Distribution analysis with normality tests
- Correlation matrices (Pearson and Spearman)
- Data quality metrics (completeness, duplicates)
- Automated insights
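For context on the outlier figures in that JSON, here is a minimal sketch of the two rules named above (IQR and Z-score) applied to a numeric pandas Series; the actual implementation in `scripts/eda_analyzer.py` may differ, and the `revenue` column in the usage comment is hypothetical.

```python
import pandas as pd

def outlier_counts(values: pd.Series, z_thresh: float = 3.0) -> dict:
    """Count outliers in a numeric column using the IQR and Z-score rules."""
    values = values.dropna()

    # IQR rule: flag points more than 1.5 * IQR outside the quartiles
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

    # Z-score rule: flag points more than z_thresh standard deviations from the mean
    z_scores = (values - values.mean()) / values.std(ddof=0)
    z_mask = z_scores.abs() > z_thresh

    return {"iqr_outliers": int(iqr_mask.sum()), "zscore_outliers": int(z_mask.sum())}

# Example (hypothetical column): outlier_counts(pd.read_csv("sales_data.csv")["revenue"])
```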
### Visualizations

Run `scripts/visualizer.py` to generate plots:

```bash
python scripts/visualizer.py sales_data.csv -o ./output
```

This creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`:
- Missing data heatmaps and bar charts
- Distribution plots (histograms with KDE)
- Box plots and violin plots for outliers
- Correlation heatmaps
- Scatter matrices for numeric relationships
- Categorical bar charts
- Time series plots (if datetime columns are detected)
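When embedding these plots in a report, a small sketch like the following can turn every generated PNG into a markdown image line; the exact file names depend on the dataset, so the glob is an assumption.

```python
from pathlib import Path

viz_dir = Path("./output/eda_visualizations")

# One markdown image line per generated plot, using the file stem as alt text
image_lines = [
    f"![{png.stem.replace('_', ' ')}]({png.as_posix()})"
    for png in sorted(viz_dir.glob("*.png"))
]
print("\n\n".join(image_lines))
```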
### Automated Insights

Access generated insights from the `"insights"` key in the analysis JSON (a loading sketch follows this list):
- Dataset size considerations
- Missing data warnings (when exceeding thresholds)
- Strong correlations for feature engineering
- High outlier rate flags
- Skewness requiring transformations
- Duplicate detection
- Categorical imbalance warnings
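A minimal loading sketch, assuming only that `eda_analysis.json` exposes a top-level `"insights"` key as described above:

```python
import json
from pathlib import Path

results = json.loads(Path("./output/eda_analysis.json").read_text())

# The analyzer's automated findings live under the "insights" key
for insight in results.get("insights", []):
    print(f"- {insight}")
```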
## Reference Materials

### Statistical Interpretation

See `references/statistical_tests_guide.md` for detailed guidance on:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation methods (Pearson, Spearman)
- Outlier detection (IQR, Z-score)
- Hypothesis testing and data transformations
Use when interpreting statistical results or explaining findings.
### Methodology

See `references/eda_best_practices.md` for comprehensive guidance on:
- The 6-step EDA process framework
- Univariate, bivariate, and multivariate analysis approaches
- Visualization and statistical analysis guidelines
- Common pitfalls and domain-specific considerations
- Communication strategies for different audiences
Use when planning analysis or handling specific scenarios.
### Report Template

Use `assets/report_template.md` to structure findings. The template includes:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and recommendations
- Limitations and appendices
Fill sections with analysis JSON results and embed visualizations using markdown image syntax.
## Example: Complete Analysis

User request: "Explore this sales_data.csv file"

1. Run the analysis:

   ```bash
   python scripts/eda_analyzer.py sales_data.csv -o ./output
   ```

2. Generate visualizations:

   ```bash
   python scripts/visualizer.py sales_data.csv -o ./output
   ```

3. Read the results:

   ```python
   import json

   with open('./output/eda_analysis.json') as f:
       results = json.load(f)
   ```

4. Build the report from `assets/report_template.md`:
   - Fill sections with the results
   - Embed images using markdown image syntax, e.g. `![Plot title](./output/eda_visualizations/<plot>.png)`
   - Include insights from `results['insights']`
   - Add recommendations
## Special Cases

### Dataset Size Strategy

- If < 100 rows: Note the sample size limitations and use non-parametric methods.
- If 100 rows to 1M rows: The standard workflow applies.
- If > 1M rows: Sample first for quick exploration (as sketched below), note the sample size in the report, and recommend distributed computing for the full analysis.
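A minimal sampling sketch for the > 1M-row case, assuming a CSV that still fits in memory (otherwise see the chunked-reading sketch under Error Handling); the sample size and file names are illustrative.

```python
import pandas as pd

# Explore a random sample first and record the sample size in the report
df = pd.read_csv("sales_data.csv")
sample = df.sample(n=min(len(df), 100_000), random_state=42)
sample.to_csv("./output/sales_data_sample.csv", index=False)
# Then point eda_analyzer.py and visualizer.py at the sampled file
```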
### Data Characteristics

- High-dimensional (> 50 columns): Focus on key variables first, use correlation analysis to identify variable groups, and consider PCA or feature selection (see the sketch after this list). See `references/eda_best_practices.md` for guidance.
- Time series: Datetime columns are auto-detected and temporal visualizations are generated automatically. Consider trends, seasonality, and recurring patterns.
- Imbalanced categories: The categorical analysis flags imbalances automatically. Report the distributions prominently and recommend stratified sampling if needed.
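As one option for the high-dimensional case, a hedged sketch using scikit-learn's PCA on the numeric columns; scikit-learn is an assumption here and is not required by the bundled scripts.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sales_data.csv")
numeric = df.select_dtypes("number").dropna()

# Standardize, then ask how many components explain ~90% of the variance
pca = PCA(n_components=0.90).fit(StandardScaler().fit_transform(numeric))
print(f"{pca.n_components_} components explain 90% of the variance")
```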
## Output Guidelines

Format findings as markdown:

- Use headers, tables, and lists for structure
- Embed visualizations using markdown image syntax, e.g. `![Plot title](./output/eda_visualizations/<plot>.png)`
- Include code blocks for suggested transformations (for example, the log-transform sketch after this list)
- Highlight key insights
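One example of such a transformation block, assuming a non-negative, right-skewed numeric column flagged by the skewness insight; the `revenue` column name is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sales_data.csv")

# log1p handles zeros and compresses the long right tail of a skewed column
df["revenue_log"] = np.log1p(df["revenue"])
print(df[["revenue", "revenue_log"]].skew())
```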
Make reports actionable:

- Provide clear recommendations
- Flag data quality issues requiring attention
- Suggest next steps (modeling, feature engineering, further analysis)
- Tailor communication to the user's technical level
## Error Handling

- Unsupported formats: Request conversion to a supported format (CSV, Excel, JSON, Parquet).
- Files too large: Recommend sampling or chunked processing (see the chunked-reading sketch below).
- Corrupted data: Report the specific errors, suggest cleaning steps, and attempt a partial analysis.
- Empty columns: Flag them in the data quality section and recommend removal or investigation.
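For the files-too-large case, a minimal chunked-reading sketch with pandas; the chunk size and the CSV assumption are illustrative.

```python
import pandas as pd

row_count = 0
missing_totals = None

# Stream the file in chunks so it never has to fit in memory at once
for chunk in pd.read_csv("sales_data.csv", chunksize=100_000):
    row_count += len(chunk)
    counts = chunk.isna().sum()
    missing_totals = counts if missing_totals is None else missing_totals + counts

print(f"rows: {row_count}")
print(missing_totals.sort_values(ascending=False).head())
```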
## Resources

Scripts (handle all formats automatically):

- `scripts/eda_analyzer.py` - Statistical analysis engine
- `scripts/visualizer.py` - Visualization generator

References (load as needed):

- `references/statistical_tests_guide.md` - Test interpretation and methodology
- `references/eda_best_practices.md` - EDA process and best practices

Template:

- `assets/report_template.md` - Professional report structure
## Key Points

- Run both scripts for a complete analysis
- Structure reports using the template
- Provide actionable insights, not just statistics
- Use the reference guides for detailed interpretations
- Document data quality issues and limitations
- Make clear recommendations for next steps