CSV Data Summary Skill
This skill helps you analyze CSV files and generate comprehensive summaries with statistical insights and visualizations. It automatically detects the type of data you're working with and adapts the analysis accordingly.
Use Cases
- Quick data exploration and understanding
- Identifying data quality issues (missing values, outliers)
- Discovering patterns and correlations in datasets
- Creating visual summaries for reports and presentations
- Time-series analysis when date columns are present
- Categorical data distribution analysis
Prerequisites
You'll need Python with the following libraries:
pip install "pandas>=2.0.0" "matplotlib>=3.7.0" "seaborn>=0.12.0"
When to Use This Skill
Use this skill whenever you need to:
- Understand the structure and content of a CSV file
- Get summary statistics for numeric columns
- Identify missing data and data quality issues
- Visualize distributions and correlations
- Analyze time-series trends
- Get a comprehensive overview of categorical variables
How It Works
The skill automatically:
- Loads and inspects the CSV file
- Identifies data structure: column types, date columns, numeric columns, categories
- Adapts analysis based on data type:
  - Sales/E-commerce data: Time-series trends, revenue analysis, product performance
  - Customer data: Distribution analysis, segmentation, geographic patterns
  - Financial data: Trend analysis, statistical summaries, correlations
  - Operational data: Time-series, performance metrics, distributions
  - Survey data: Frequency analysis, cross-tabulations, distributions
- Generates visualizations relevant to the specific dataset:
  - Time-series plots (if date/timestamp columns exist)
  - Correlation heatmaps (if multiple numeric columns exist)
  - Category distributions (if categorical columns exist)
  - Histograms for numeric distributions
- Provides comprehensive output including:
  - Data overview (rows, columns, types)
  - Key statistics and metrics
  - Missing data analysis
  - Multiple relevant visualizations
  - Actionable insights
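The structure-detection step can be sketched roughly as follows. This is a minimal illustration, not the code in analyze.py: the helper name `classify_columns` is hypothetical, and the rule of treating every column that is neither date-named nor numeric as categorical is a simplifying assumption.

```python
import io
import pandas as pd

def classify_columns(df):
    """Group column names into date, numeric, and categorical buckets (hypothetical helper)."""
    # Date columns are picked by name, mirroring the 'date'/'time' naming tip in this document.
    date_cols = [c for c in df.columns if "date" in c.lower() or "time" in c.lower()]
    # Numeric columns come from the pandas dtype.
    numeric_cols = [
        c for c in df.select_dtypes(include="number").columns if c not in date_cols
    ]
    # Assumption: anything left over is treated as categorical.
    categorical_cols = [
        c for c in df.columns if c not in date_cols and c not in numeric_cols
    ]
    return {"date": date_cols, "numeric": numeric_cols, "categorical": categorical_cols}

sample = io.StringIO(
    "order_date,total_revenue,customer_segment\n"
    "2024-01-05,120.50,retail\n"
    "2024-01-06,80.00,wholesale\n"
)
df = pd.read_csv(sample)
kinds = classify_columns(df)
print(kinds)
```

The resulting buckets could then decide which analyses and plots to run, for example skipping the time-series plot when the date bucket is empty.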
Python Implementation
Basic Usage
from analyze import summarize_csv
# Analyze any CSV file
summary = summarize_csv('your_data.csv')
print(summary)
The script will automatically generate:
- A comprehensive text summary
- Multiple visualization files (PNG format)
Example Output
============================================================
📊 DATA OVERVIEW
============================================================
Rows: 5,000 | Columns: 8
📋 DATA TYPES:
• order_date: object
• total_revenue: float64
• customer_segment: object
...
🔍 DATA QUALITY:
✓ No missing values - dataset is complete!
📈 NUMERICAL ANALYSIS:
[Summary statistics for all numeric columns]
🔗 CORRELATIONS:
[Correlation matrix showing relationships]
📅 TIME SERIES ANALYSIS:
Date range: 2024-01-05 to 2024-04-11
Span: 97 days
📊 VISUALIZATIONS CREATED:
✓ correlation_heatmap.png
✓ time_series_analysis.png
✓ distributions.png
✓ categorical_distributions.png
Command Line Usage
You can run the analysis from the command line:
# Analyze a specific CSV file
python scripts/analyze.py path/to/your/data.csv
# Use the sample data
python scripts/analyze.py resources/sample.csv
Understanding the Output
Data Overview
- Shows the dimensions of your dataset (rows × columns)
- Lists all column names
- Shows data type for each column
Data Quality
- Reports missing values by column
- Shows percentage of missing data
- Helps identify data cleaning needs
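A missing-value report of this kind might be computed as in the sketch below. The function name `missing_report` and the sample data are illustrative assumptions, not the script's actual implementation.

```python
import pandas as pd

def missing_report(df):
    """Return missing counts and percentages per column, worst first (hypothetical helper)."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

df = pd.DataFrame({
    "total_revenue": [120.5, None, 80.0, None],
    "customer_segment": ["retail", "wholesale", None, "retail"],
    "units": [1, 2, 3, 4],
})
report = missing_report(df)
print(report)
```

An empty report corresponds to the "No missing values" message shown in the example output.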
Numerical Analysis
- Provides descriptive statistics (mean, std, min, max, quartiles)
- Shows correlations between numeric columns
- Creates correlation heatmap visualization
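The descriptive statistics and correlations described above map onto two standard pandas calls. The sketch below uses made-up sample data; the 2+ numeric column guard follows the rule stated in this document for the heatmap.

```python
import pandas as pd

df = pd.DataFrame({
    "units": [1, 2, 3, 4],
    "total_revenue": [10.0, 20.0, 30.0, 40.0],
    "discount": [4.0, 3.0, 2.0, 1.0],
    "customer_segment": ["a", "b", "a", "b"],
})

# Restrict the analysis to numeric columns.
numeric = df.select_dtypes(include="number")

# count, mean, std, min, quartiles, and max for each numeric column.
stats = numeric.describe()

# Pairwise Pearson correlations, only meaningful with two or more numeric columns.
corr = numeric.corr() if numeric.shape[1] >= 2 else None

print(stats)
print(corr)
```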
Categorical Analysis
- Shows frequency distribution for each categorical variable
- Displays top 10 values per category
- Creates bar charts for categorical distributions
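A top-N frequency table like the one described here can be sketched with `value_counts`. The helper name `top_values` is hypothetical, and the default of 10 mirrors the limit mentioned above.

```python
import pandas as pd

def top_values(series, n=10):
    """Return the n most frequent values of a column, most common first (hypothetical helper)."""
    return series.value_counts().head(n)

segments = pd.Series(["retail", "wholesale", "retail", "online", "retail", "online"])
top = top_values(segments, n=2)
print(top)
```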
Time Series Analysis
- Automatically detected when date/time columns are present
- Shows date range and span
- Creates trend plots for numeric metrics over time
- Calculates daily/periodic aggregations
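The date range, span, and daily aggregation could be derived as in this sketch. The column names are taken from the example output, the values are invented, and summing per day is only one assumed choice of aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-05", "2024-01-07", "2024-04-11"],
    "total_revenue": [100.0, 50.0, 25.0, 75.0],
})

# Parse the text column; unparseable values become NaT instead of raising.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

start, end = df["order_date"].min(), df["order_date"].max()
span_days = (end - start).days

# Daily totals; days without rows appear with a sum of 0.
daily = df.set_index("order_date")["total_revenue"].resample("D").sum()

print(f"Date range: {start.date()} to {end.date()}")
print(f"Span: {span_days} days")
```

Swapping `"D"` for `"W"` would give weekly aggregation instead of daily.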
Visualizations Generated
The skill automatically creates relevant visualizations:
- Correlation Heatmap (correlation_heatmap.png)
  - Shows relationships between numeric variables
  - Color-coded for easy interpretation
  - Only generated when 2+ numeric columns exist
- Time Series Analysis (time_series_analysis.png)
  - Trend lines for numeric metrics over time
  - Only generated when date/time columns exist
  - Shows up to 3 key metrics
- Distributions (distributions.png)
  - Histograms for numeric columns
  - Shows up to 4 numeric variables
  - Helps identify outliers and data shape
- Categorical Distributions (categorical_distributions.png)
  - Bar charts for categorical variables
  - Shows top 10 values per category
  - Up to 4 categorical variables
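As an illustration of the conditional generation described above, here is a minimal sketch of the correlation heatmap. It uses plain matplotlib rather than seaborn to keep the example small, and the function name `save_correlation_heatmap` is hypothetical; the actual script may draw the plot differently.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import pandas as pd

def save_correlation_heatmap(df, path):
    """Save a heatmap of numeric correlations; return False if fewer than 2 numeric columns."""
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] < 2:
        return False
    corr = numeric.corr()
    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return True

df = pd.DataFrame({
    "units": [1, 2, 3, 4],
    "total_revenue": [10.0, 25.0, 28.0, 41.0],
})
out_path = os.path.join(tempfile.mkdtemp(), "correlation_heatmap.png")
created = save_correlation_heatmap(df, out_path)
print(created, out_path)
```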
Tips and Best Practices
- Clean column names: Use lowercase and underscores for better readability
- Date columns: Include 'date' or 'time' in the column name so the column is detected as a date
- Numeric data: Ensure numeric columns are properly typed (not strings)
- Large files: pandas reads the whole CSV into memory by default, so for files that approach available memory, consider sampling or reading in chunks
- Missing data: Review the data quality section carefully before analysis
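For the large-file tip, one option is to stream the CSV with the `chunksize` parameter of `pandas.read_csv`. This is a sketch under the assumption that the metric you need can be accumulated chunk by chunk; it is not something the script is documented to do. An in-memory CSV stands in for a real file here.

```python
import io
import pandas as pd

# Stand-in for a large file: a single 'units' column holding 1..10.
csv_text = "units\n" + "\n".join(str(i) for i in range(1, 11)) + "\n"

total_rows = 0
total_units = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    total_units += int(chunk["units"].sum())

print(total_rows, total_units)
```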
Troubleshooting
Issue: Date columns not detected
- Ensure column names contain 'date' or 'time'
- Check date format is recognizable (YYYY-MM-DD, MM/DD/YYYY, etc.)
Issue: Numeric columns treated as text
- Check for non-numeric characters in the data
- Clean data or use pandas type conversion
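A typical conversion might look like the sketch below. The regular expression strips everything except digits, the decimal point, and the minus sign, which is an assumption that suits values like "$1,200.50" but not every locale.

```python
import pandas as pd

raw = pd.Series(["1,200.50", "$80", "n/a", " 15 "])

# Remove currency symbols, thousands separators, spaces, and other stray characters.
cleaned = raw.str.replace(r"[^0-9.\-]", "", regex=True)

# Values that still cannot be parsed become NaN instead of raising an error.
values = pd.to_numeric(cleaned, errors="coerce")
print(values)
```

Check how many values turned into NaN after conversion, since a high count usually means the column needs a different cleaning rule.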
Issue: Too many visualizations
- The script automatically limits visualizations to the most relevant ones
- Focus on the first few metrics of each type
Issue: Import errors
- Ensure all dependencies are installed: pip install -r requirements.txt
- Check Python version (3.8+ recommended)
Advanced Usage
Customizing the Analysis
You can modify analyze.py to:
- Add custom metrics specific to your domain
- Change visualization styles and colors
- Adjust the number of categories shown
- Add domain-specific insights
Integration with Other Tools
The script outputs:
- Plain text summary (easy to parse)
- PNG images (ready for reports)
It can also be extended to output JSON, HTML, or PDF reports.
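As one possible shape for a JSON extension, the sketch below serializes a few overview fields. The function name `summary_to_json` and the chosen keys are hypothetical and are not part of analyze.py.

```python
import json
import pandas as pd

def summary_to_json(df):
    """Serialize a small data overview as a JSON string (hypothetical helper)."""
    summary = {
        "rows": int(len(df)),
        "columns": int(df.shape[1]),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        # Cast NumPy integers to plain int so json can serialize them.
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
    }
    return json.dumps(summary, indent=2)

df = pd.DataFrame({
    "total_revenue": [120.5, None, 80.0],
    "customer_segment": ["retail", "wholesale", "retail"],
})
payload = summary_to_json(df)
print(payload)
```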
Additional Resources
Differences from Excel Sheet Reference Skill
This skill focuses on:
- Data analysis and visualization (not Excel formula creation)
- CSV file format (not Excel workbooks)
- Statistical insights (not cross-sheet references)
- Python pandas (not openpyxl)
Use the excel-sheet-reference skill when you need to:
- Create Excel files with multiple sheets
- Use cross-sheet formulas (VLOOKUP, COUNTIFS, etc.)
- Maintain data in Excel format with formulas