# Data Analysis Assistant Workflow

This skill turns the agent into a powerful local data analysis assistant that strictly adheres to a Local Code Execution paradigm.
## Core Architecture & Principles

- Local Execution First: NEVER read large datasets directly into the context window. Always generate Python scripts or SQL commands and execute them locally via `RunCommand`.
- SQLite as the Engine: Import all CSV/Excel files into a local SQLite database (default: `workspace.db`). Rely on SQL for robust data manipulation (filtering, joining, grouping).
- Non-Destructive Operations (Undo Mechanism): Do not overwrite original tables. When modifying data, create a new table (e.g., `CREATE TABLE table_v2 AS SELECT ...`) or a view. This guarantees the user can always say "undo the last step"; see the sketch after this list.
- Data Privacy: Keep data local. Send only aggregated statistics or schema information into the context window.
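A minimal sketch of the non-destructive pattern, assuming a hypothetical `orders` table in `workspace.db`:

```python
# Each transformation lands in a NEW table, so "undo" means dropping
# the latest version -- the original table is never touched.
import sqlite3

con = sqlite3.connect("workspace.db")

# Step 1: a filtered copy instead of an in-place DELETE.
con.execute(
    "CREATE TABLE orders_v2 AS SELECT * FROM orders WHERE status != 'cancelled'"
)
con.commit()

# Undo the step later by simply dropping the derived table:
# con.execute("DROP TABLE orders_v2")
con.close()
```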
## Scenarios & Procedures

### Scenario 1: Data Import & Auto-Cleaning
Trigger: User uploads or specifies a CSV/Excel/WPS (.et)/Numbers file.

Action:
- Run the built-in importer script (supports `.csv`, `.xlsx`, `.xls`, `.et`, `.numbers`):

  ```bash
  python scripts/data_importer.py "path/to/file.xlsx" --db workspace.db
  ```

  Note: This script calculates the MD5 hash of the file. If an identical file was already imported, it skips the import and returns the existing table name. It also automatically handles merged cells, detects the real header row, chunks large CSVs, and sanitizes column names for SQLite.
- Once imported, run a quick check to understand the schema and data:

  ```bash
  sqlite3 workspace.db "PRAGMA table_info(table_name);"
  sqlite3 -header -column workspace.db "SELECT * FROM table_name LIMIT 3;"
  ```

- Ask the user if they want to perform standard cleaning (e.g., handling missing values, deduplication). Execute these via SQL; see the sketch after this list.
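A minimal cleaning sketch, assuming hypothetical `orders` columns (`id`, `price`, `customer`); the cleaned result goes into a new table per the Undo principle:

```python
# Deduplicate, fill missing values, and trim whitespace -- written to a
# new table so the raw import stays intact.
import sqlite3

con = sqlite3.connect("workspace.db")
con.execute("""
    CREATE TABLE orders_clean AS
    SELECT DISTINCT                        -- drop exact duplicate rows
           id,
           COALESCE(price, 0)  AS price,   -- fill missing prices with 0
           TRIM(customer)      AS customer -- strip stray whitespace
    FROM orders
    WHERE id IS NOT NULL                   -- drop rows missing the key
""")
con.commit()
con.close()
```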
### Scenario 2: Continuous Queries & Manipulation

Trigger: User asks to filter, sort, aggregate, or add columns.

Action:
- Formulate the SQL query.
- Execute it via `RunCommand`:

  ```bash
  sqlite3 workspace.db "SELECT ..."
  ```

- For structural changes, remember the Undo principle:

  ```sql
  CREATE TABLE table_name_step2 AS SELECT ...
  ```
### Scenario 3: Semantic Extraction & Fuzzy Join

Trigger: User wants to split addresses, run sentiment analysis, or join tables with mismatched keys (e.g., "Beijing Branch" vs "Beijing Office").

Action:
- Generate a Python script using `pandas` and `sqlite3`.
- For fuzzy joins, use libraries like `thefuzz` or `difflib` in the Python script to match keys, then write the mapping back to SQLite; see the sketch after this list.
- For semantic extraction, use regex or heuristic rules in Python. If LLM analysis is strictly required, write a script that processes the column locally, or ask the user for permission to send a sample.
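A fuzzy-join sketch using the standard-library `difflib`; the tables `branches` and `offices` and their `name` columns are hypothetical:

```python
# Build a best-match mapping between two key columns and persist it as
# its own table, so the actual join stays in plain SQL and the source
# tables remain untouched.
import difflib
import sqlite3

import pandas as pd

con = sqlite3.connect("workspace.db")
left = pd.read_sql("SELECT DISTINCT name FROM branches", con)
right = pd.read_sql("SELECT DISTINCT name FROM offices", con)

candidates = right["name"].tolist()
rows = []
for key in left["name"]:
    # get_close_matches returns the closest candidates above the cutoff.
    match = difflib.get_close_matches(key, candidates, n=1, cutoff=0.6)
    rows.append({"branch_name": key, "office_name": match[0] if match else None})

pd.DataFrame(rows).to_sql("key_mapping", con, if_exists="replace", index=False)
con.close()
```

Downstream, a plain SQL `JOIN` against `key_mapping` connects the two tables without mutating either one.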
### Scenario 4: Chart Generation

Trigger: User requests a visualization (bar, pie, line, scatter, map, funnel, 3D charts, etc.).

Action:
- Do NOT write custom Python scripts from scratch. We have a powerful template-based rendering engine: use the built-in `scripts/chart_generator.py` script.
- First, identify the required chart type. Look into the `references/prompts/` directory to find the corresponding prompt skeleton for the exact chart type (e.g., `references/prompts/line/stacked_area.md`). Read the prompt to understand the data structure requirements.
- Formulate the SQL query that aggregates the data correctly according to the prompt's requirements.
- Generate the `custom_js` and `echarts_option` based on the prompt template.
- Construct a JSON configuration file (save it in `outputs/configs/`) matching this structure; see the sketch after this list:

  ```jsonc
  {
    "db_path": "workspace.db",
    "query": "SELECT category, SUM(value) as val FROM table GROUP BY category",
    "title": "Chart Title",
    "output_path": "/Users/wuliang/workspace/data-skill/outputs/html/output_chart.html",
    "echarts_option": { ... },  // Generated option from the prompt
    "custom_js": "..."          // Optional JS logic for complex data binding
  }
  ```

  Note: For map charts requiring coordinates, use the built-in geocoding capabilities or ECharts' native geo coordinate systems. Output files MUST be stored in the isolated `outputs/html/` directory.
- Execute the command:

  ```bash
  python scripts/chart_generator.py --config outputs/configs/your_config.json
  ```

- The script will automatically start a local HTTP server and return an access URL. Provide this URL to the user to view the interactive chart.
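A sketch of assembling and running a config, following the schema above; the `sales` table and the minimal bar-chart `echarts_option` are hypothetical stand-ins for what the real prompt template would produce:

```python
# Write the config to outputs/configs/ and invoke the generator.
import json
import pathlib
import subprocess

config = {
    "db_path": "workspace.db",
    "query": "SELECT category, SUM(value) AS val FROM sales GROUP BY category",
    "title": "Sales by Category",
    "output_path": "outputs/html/sales_by_category.html",
    # Placeholder option -- in practice this comes from the prompt skeleton.
    "echarts_option": {
        "xAxis": {"type": "category"},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar"}],
    },
}

path = pathlib.Path("outputs/configs/sales_by_category.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(config, indent=2))

subprocess.run(
    ["python", "scripts/chart_generator.py", "--config", str(path)], check=True
)
```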
### Scenario 5: File Merging & Splitting

Trigger: User needs to combine multiple reports with identical structure, or split a master sheet by department.

Action:
- Merge: Iterate over the files and run `data_importer.py` pointing at the same table name (the script appends automatically if the table exists), or write a custom Python script.
- Split: Generate a Python script that reads the master table from SQLite and exports it into multiple Excel files using `pandas.DataFrame.to_excel()` inside a loop; see the sketch after this list.
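A minimal split sketch, assuming a hypothetical `master_sheet` table with a `department` column:

```python
# Export one Excel file per department (requires openpyxl for .xlsx).
import pathlib
import sqlite3

import pandas as pd

con = sqlite3.connect("workspace.db")
df = pd.read_sql("SELECT * FROM master_sheet", con)
con.close()

out_dir = pathlib.Path("outputs")
out_dir.mkdir(exist_ok=True)

# The department name becomes the filename.
for dept, group in df.groupby("department"):
    group.to_excel(out_dir / f"{dept}.xlsx", index=False)
```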
### Scenario 6: Export & Reporting

Trigger: User wants to download the final result or generate a summary report.

Action:
- Export CSV/Excel: Use the built-in exporter script to dump a table or query result to `.csv` or `.xlsx`:

  ```bash
  # Export an entire table
  python scripts/data_exporter.py "outputs/final_result.csv" --table "final_table"

  # Export a specific query
  python scripts/data_exporter.py "outputs/final_result.xlsx" --query "SELECT category, SUM(value) FROM sales GROUP BY category"
  ```

- Report Generation: Write a Markdown file that summarizes the analysis steps and key metrics (retrieved via SQL) and references any generated charts. Provide the user with the path to the report; see the sketch after this list.
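A report-generation sketch; the `sales` table, the metric, and the chart path are hypothetical:

```python
# Pull a headline metric via SQL and write a Markdown summary.
import sqlite3

con = sqlite3.connect("workspace.db")
(total,) = con.execute("SELECT SUM(value) FROM sales").fetchone()
con.close()

report = f"""# Analysis Summary

## Key Metrics
- Total sales: {total}

## Charts
- [Sales by Category](outputs/html/sales_by_category.html)
"""

with open("outputs/report.md", "w") as f:
    f.write(report)
```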
### Scenario 7: Data Cleanup

Trigger: Routine maintenance or a user request to clean up old data.

Action:
- Run the cleaner script to remove tables and metadata not accessed in the last 30 days (a sketch of one possible implementation follows):

  ```bash
  python scripts/data_cleaner.py --db workspace.db --days 30
  ```
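One way such a cleaner might work (a sketch, not the actual script), assuming a hypothetical `_meta` table tracking `(table_name, last_accessed)`:

```python
# Drop tables whose last access is older than the cutoff, then remove
# their metadata rows.
import sqlite3
import time

DAYS = 30
cutoff = time.time() - DAYS * 86400

con = sqlite3.connect("workspace.db")
stale = con.execute(
    "SELECT table_name FROM _meta WHERE last_accessed < ?", (cutoff,)
).fetchall()

for (name,) in stale:
    # Table names cannot be bound as parameters; they were sanitized
    # at import time, so quoting them directly is acceptable here.
    con.execute(f'DROP TABLE IF EXISTS "{name}"')
    con.execute("DELETE FROM _meta WHERE table_name = ?", (name,))

con.commit()
con.close()
```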
### Scenario 8: Metrics Management

Trigger: User describes or defines a specific metric calculation logic or business definition (口径).

Action:
- When the user provides a metric definition, save it to the local Markdown file `references/metrics.md` to build up context for future SQL generation.
- Use the built-in script `scripts/metrics_manager.py` to append the metric:

  ```bash
  python scripts/metrics_manager.py --name "Metric Name" --desc "Metric calculation logic or business description"
  ```

- When generating SQL queries later, ALWAYS read `references/metrics.md` to ensure the generated SQL aligns with the saved business definitions.