Data Report Generator
Transform raw CSV or Excel files into professional, insight-rich reports with charts — output as Word (.docx) or PDF.
When This Skill Activates
Trigger when user provides:
- A CSV, Excel (.xlsx/.xls), or TSV file + asks for a report/analysis
- A request to "visualize", "summarize", or "analyze" tabular data
- Any mention of automated weekly/monthly reporting
If no file is uploaded yet, ask the user to upload their data file first.
Step 1: Gather Inputs
Ask for (or infer from context):
- Data file — CSV, Excel, TSV (required)
- Report format — Word (.docx) or PDF? (default: Word)
- Report purpose — sales analysis? operations? marketing? financial? (affects framing)
- Key questions — what does the user most want to understand from this data?
- Audience — internal team? executive summary? client-facing?
- Language — Chinese or English? (default: match user's language)
If purpose/questions aren't specified, proceed with a comprehensive general analysis.
Step 2: Read and Profile the Data
import pandas as pd
import numpy as np
# Support both CSV and Excel
def load_data(filepath):
ext = filepath.rsplit('.', 1)[-1].lower()
if ext in ['xlsx', 'xls']:
df = pd.read_excel(filepath)
elif ext == 'csv':
# Try common encodings
for enc in ['utf-8', 'gbk', 'gb2312', 'utf-8-sig']:
try:
df = pd.read_csv(filepath, encoding=enc)
break
except:
continue
elif ext == 'tsv':
df = pd.read_csv(filepath, sep='\t')
return df
df = load_data('/path/to/file')
Data Profile to Extract
- Shape: row count, column count
- Column types: numeric, categorical, date/time, text
- Missing values: count and % per column
- Basic stats: mean, median, min, max, std for numeric columns
- Unique value counts for categorical columns
- Date range if time columns exist
- Obvious data quality issues
Step 3: Auto-Detect Report Type
Based on column names and data types, auto-select the analysis approach:
| Data Pattern | Report Type | Reference File |
|---|---|---|
| Date column + numeric values | Time Series / Trend | references/time-series.md |
| Category + numeric (sales/revenue) | Sales / Performance | references/sales-analysis.md |
| Multiple numeric columns | Correlation / Distribution | references/statistical.md |
| Category breakdowns only | Segmentation | references/segmentation.md |
| Mixed / unknown | General Analysis | references/general.md |
Read the relevant reference file for chart selection and narrative guidance.
If the user specifies a report type, use that. Otherwise, auto-detect.
Step 4: Generate Charts
Install dependencies first:
pip install matplotlib seaborn pandas openpyxl --break-system-packages --quiet
Chart Generation Rules
import matplotlib
matplotlib.use('Agg') # Non-interactive backend — ALWAYS set this
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
# Style setup — professional look
plt.style.use('seaborn-v0_8-whitegrid')
COLORS = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D', '#3B1F2B', '#44BBA4']
def save_chart(fig, filename):
fig.savefig(filename, dpi=150, bbox_inches='tight', facecolor='white')
plt.close(fig)
Chart Selection Guide
Time series data → Line chart (trend) + bar chart (period comparison)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(df['date'], df['value'], color=COLORS[0], linewidth=2, marker='o', markersize=4)
ax.set_title('Trend Over Time', fontsize=14, fontweight='bold')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.xticks(rotation=45)
save_chart(fig, 'chart_trend.png')
Category comparison → Horizontal bar chart (easier to read labels)
fig, ax = plt.subplots(figsize=(10, 6))
df_sorted = df.sort_values('value', ascending=True)
bars = ax.barh(df_sorted['category'], df_sorted['value'], color=COLORS[0])
ax.bar_label(bars, fmt='%.1f', padding=3)
ax.set_title('Performance by Category', fontsize=14, fontweight='bold')
save_chart(fig, 'chart_category.png')
Distribution → Histogram + optional KDE
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(df['value'].dropna(), bins=30, color=COLORS[0], edgecolor='white', alpha=0.8)
ax.set_title('Value Distribution', fontsize=14, fontweight='bold')
save_chart(fig, 'chart_dist.png')
Composition/share → Pie or stacked bar (prefer stacked bar for >5 categories)
fig, ax = plt.subplots(figsize=(8, 6))
sizes = df['value']
labels = df['category']
wedges, texts, autotexts = ax.pie(sizes, labels=labels, autopct='%1.1f%%',
colors=COLORS, startangle=90)
ax.set_title('Composition', fontsize=14, fontweight='bold')
save_chart(fig, 'chart_pie.png')
Correlation → Heatmap
fig, ax = plt.subplots(figsize=(10, 8))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
ax=ax, square=True, cbar_kws={'shrink': 0.8})
ax.set_title('Correlation Matrix', fontsize=14, fontweight='bold')
save_chart(fig, 'chart_corr.png')
Generate 3–6 charts total — enough to be comprehensive, not overwhelming.
Step 5: Write the Report (Word .docx)
Use the docx Python library. Install: pip install python-docx --break-system-packages --quiet
Report Structure
1. Cover Page
- Report title, data source name, date generated
2. Executive Summary (1 page)
- 3-5 key findings in plain language
- Most important metric highlighted
3. Data Overview
- Dataset dimensions, time range, data quality notes
4. [Core Analysis Sections — vary by report type]
- Each section: narrative paragraph + chart + key takeaways
5. Key Insights & Recommendations
- Actionable bullet points
6. Appendix: Data Statistics Table
Docx Code Pattern
from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT
import datetime
doc = Document()
# Page setup
section = doc.sections[0]
section.page_width = Inches(8.5)
section.page_height = Inches(11)
section.left_margin = Inches(1)
section.right_margin = Inches(1)
section.top_margin = Inches(1)
section.bottom_margin = Inches(1)
# Title style
def add_title(doc, text, level=1):
heading = doc.add_heading(text, level=level)
heading.alignment = WD_ALIGN_PARAGRAPH.LEFT
run = heading.runs[0]
run.font.color.rgb = RGBColor(0x2E, 0x86, 0xAB) # Brand blue
return heading
# Body text
def add_paragraph(doc, text):
p = doc.add_paragraph(text)
p.paragraph_format.space_after = Pt(6)
run = p.runs[0] if p.runs else p.add_run()
run.font.size = Pt(11)
return p
# Insert chart image
def add_chart(doc, image_path, caption, width=6.0):
doc.add_picture(image_path, width=Inches(width))
last_paragraph = doc.paragraphs[-1]
last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
cap = doc.add_paragraph(caption)
cap.alignment = WD_ALIGN_PARAGRAPH.CENTER
cap.runs[0].font.size = Pt(9)
cap.runs[0].font.color.rgb = RGBColor(0x66, 0x66, 0x66)
# Stats table
def add_stats_table(doc, df):
stats = df.describe().round(2)
table = doc.add_table(rows=len(stats)+1, cols=len(stats.columns)+1)
table.style = 'Table Grid'
# Header row
table.cell(0, 0).text = 'Metric'
for j, col in enumerate(stats.columns):
table.cell(0, j+1).text = str(col)
# Data rows
for i, idx in enumerate(stats.index):
table.cell(i+1, 0).text = str(idx)
for j, col in enumerate(stats.columns):
table.cell(i+1, j+1).text = str(stats.loc[idx, col])
# Save
doc.save('/home/claude/report_output.docx')
Step 6: Convert to PDF (if requested)
# Use LibreOffice for conversion
python /path/to/skills/docx/scripts/office/soffice.py \
--headless --convert-to pdf /home/claude/report_output.docx \
--outdir /home/claude/
If LibreOffice is unavailable, fall back to reportlab:
pip install reportlab --break-system-packages --quiet
See references/pdf-fallback.md for reportlab-based PDF generation.
Step 7: Output to User
- Copy final file to
/mnt/user-data/outputs/ - Use
present_filestool to share it - Include a brief summary in chat:
- What data was analyzed
- How many charts were generated
- Top 3 insights found
- Any data quality issues to be aware of
Output Quality Checklist
Before delivering, verify:
- Charts have titles, axis labels, and readable fonts
- No blank/empty chart images
- Report has executive summary with plain-language findings
- Data quality issues are noted (missing values, anomalies)
- File opens cleanly (test docx structure validity)
- Chinese text renders correctly if using Chinese (use
SimHeiorMicrosoft YaHeifont in matplotlib) - All chart image files exist before embedding in docx
Handling Chinese / CJK Text in Charts
import matplotlib
# Set Chinese-compatible font
matplotlib.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei',
'PingFang SC', 'DejaVu Sans']
matplotlib.rcParams['axes.unicode_minus'] = False # Fix minus sign rendering
If font not available on system:
pip install matplotlib --break-system-packages
# Download and install a CJK font
apt-get install -y fonts-wqy-zenhei 2>/dev/null || true
Edge Cases
Large datasets (>100k rows): Sample or aggregate before visualization. Note sampling in report.
Messy/dirty data: Document cleaning steps in the report's "Data Overview" section. Show before/after counts.
Single column data: Generate distribution analysis only. Note limitation.
All categorical data: Focus on frequency analysis and cross-tabulations. No numeric charts.
Date parsing issues: Try multiple date formats (dayfirst=True, yearfirst=True, infer_datetime_format=True).
Multiple sheets in Excel: Ask user which sheet(s) to analyze, or analyze all and create multi-section report.
Reference Files
references/time-series.md— Trend analysis, seasonality detection, period-over-period comparisonsreferences/sales-analysis.md— Sales/revenue specific charts, KPI calculations, funnel analysisreferences/statistical.md— Correlation, distribution, outlier detection, regressionreferences/segmentation.md— Category breakdowns, cohort analysis, rankingreferences/general.md— General-purpose analysis for unknown data typesreferences/pdf-fallback.md— PDF generation with reportlab when LibreOffice unavailablereferences/chart-styling.md— Advanced chart styling, brand colors, annotation patterns