# Data Analysis Assistant Workflow

This skill turns the agent into a powerful local data analysis assistant that strictly adheres to a Local Code Execution paradigm.
## Core Architecture & Principles

- Local Execution First: NEVER read large datasets directly into the context window. Always generate Python scripts or SQL commands and execute them locally via `RunCommand`.
- SQLite as the Engine: Import all CSV/Excel files into a local SQLite database (default: `workspace.db`). Rely on SQL for robust data manipulation (filtering, joining, grouping).
- Non-Destructive Operations (Undo Mechanism): Do not overwrite original tables. When modifying data, create a new table (e.g., `CREATE TABLE table_v2 AS SELECT ...`) or a view. This guarantees the user can always say "undo the last step"; see the sketch after this list.
- Data Privacy: Keep data local. Send only aggregated statistics or schema information into the context window.
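A minimal sketch of the non-destructive pattern, assuming a hypothetical `orders` table in `workspace.db`:

```python
# Each transformation lands in a NEW table, so "undo" means dropping
# the latest version -- the original table is never touched.
import sqlite3

con = sqlite3.connect("workspace.db")

# Step 1: a filtered copy instead of an in-place DELETE.
con.execute(
    "CREATE TABLE orders_v2 AS SELECT * FROM orders WHERE status != 'cancelled'"
)
con.commit()

# Undo the step later by simply dropping the derived table:
# con.execute("DROP TABLE orders_v2")
con.close()
```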
## Scenarios & Procedures

### Scenario 1: Data Import & Auto-Cleaning
Trigger: User uploads or specifies a CSV/Excel/WPS (.et)/Numbers file.

Action:
- Run the built-in importer script (supports `.csv`, `.xlsx`, `.xls`, `.et`, `.numbers`):

  ```bash
  python scripts/data_importer.py "path/to/file.xlsx" --db workspace.db
  ```

  Note: This script calculates the MD5 hash of the file. If an identical file was already imported, it skips the import and returns the existing table name. It also automatically handles merged cells, detects the real header row, chunks large CSVs, and sanitizes column names for SQLite.
- Once imported, run a quick check to understand the schema and data:

  ```bash
  sqlite3 workspace.db "PRAGMA table_info(table_name);"
  sqlite3 -header -column workspace.db "SELECT * FROM table_name LIMIT 3;"
  ```

- Ask the user if they want to perform standard cleaning (e.g., handling missing values, deduplication). Execute these via SQL; see the sketch after this list.
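A minimal cleaning sketch, assuming hypothetical `orders` columns (`id`, `price`, `customer`); the cleaned result goes into a new table per the Undo principle:

```python
# Deduplicate, fill missing values, and trim whitespace -- written to a
# new table so the raw import stays intact.
import sqlite3

con = sqlite3.connect("workspace.db")
con.execute("""
    CREATE TABLE orders_clean AS
    SELECT DISTINCT                        -- drop exact duplicate rows
           id,
           COALESCE(price, 0)  AS price,   -- fill missing prices with 0
           TRIM(customer)      AS customer -- strip stray whitespace
    FROM orders
    WHERE id IS NOT NULL                   -- drop rows missing the key
""")
con.commit()
con.close()
```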
### Scenario 2: Continuous Queries & Manipulation

Trigger: User asks to filter, sort, aggregate, or add columns.

Action:
- Formulate the SQL query.
- Execute it via `RunCommand`:

  ```bash
  sqlite3 workspace.db "SELECT ..."
  ```

- For structural changes, remember the Undo principle:

  ```sql
  CREATE TABLE table_name_step2 AS SELECT ...
  ```
### Scenario 3: Semantic Extraction & Fuzzy Join

Trigger: User wants to split addresses, run sentiment analysis, or join tables with mismatched keys (e.g., "Beijing Branch" vs "Beijing Office").

Action:
- Generate a Python script using `pandas` and `sqlite3`.
- For fuzzy joins, use libraries like `thefuzz` or `difflib` in the Python script to match keys, then write the mapping back to SQLite; see the sketch after this list.
- For semantic extraction, use regex or heuristic rules in Python. If LLM analysis is strictly required, write a script that processes the column locally, or ask the user for permission to send a sample.
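A fuzzy-join sketch using the standard-library `difflib`; the tables `branches` and `offices` and their `name` columns are hypothetical:

```python
# Build a best-match mapping between two key columns and persist it as
# its own table, so the actual join stays in plain SQL and the source
# tables remain untouched.
import difflib
import sqlite3

import pandas as pd

con = sqlite3.connect("workspace.db")
left = pd.read_sql("SELECT DISTINCT name FROM branches", con)
right = pd.read_sql("SELECT DISTINCT name FROM offices", con)

candidates = right["name"].tolist()
rows = []
for key in left["name"]:
    # get_close_matches returns the closest candidates above the cutoff.
    match = difflib.get_close_matches(key, candidates, n=1, cutoff=0.6)
    rows.append({"branch_name": key, "office_name": match[0] if match else None})

pd.DataFrame(rows).to_sql("key_mapping", con, if_exists="replace", index=False)
con.close()
```

Downstream, a plain SQL `JOIN` against `key_mapping` connects the two tables without mutating either one.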
### Scenario 4: Chart Generation

Trigger: User requests a visualization (bar, pie, line, scatter, map, funnel, 3D charts, etc.).

Action:
- Do NOT write custom Python scripts from scratch. We have a powerful template-based rendering engine: use the built-in `scripts/chart_generator.py` script.
- First, identify the required chart type. Look into the `references/prompts/` directory to find the corresponding prompt skeleton for the exact chart type (e.g., `references/prompts/line/stacked_area.md`). Read the prompt to understand the data structure requirements.
- Formulate the SQL query that aggregates the data correctly according to the prompt's requirements.
- Generate the `custom_js` and `echarts_option` based on the prompt template.
- Construct a JSON configuration file (save it in `outputs/configs/`) matching this structure; see the sketch after this list:

  ```jsonc
  {
    "db_path": "workspace.db",
    "query": "SELECT category, SUM(value) as val FROM table GROUP BY category",
    "title": "Chart Title",
    "output_path": "/Users/wuliang/workspace/data-skill/outputs/html/output_chart.html",
    "echarts_option": { ... },  // Generated option from the prompt
    "custom_js": "..."          // Optional JS logic for complex data binding
  }
  ```

  Note: For map charts requiring coordinates, use the built-in geocoding capabilities or ECharts' native geo coordinate systems. Output files MUST be stored in the isolated `outputs/html/` directory.
- Execute the command:

  ```bash
  python scripts/chart_generator.py --config outputs/configs/your_config.json
  ```

- The script will automatically start a local HTTP server and return an access URL. Provide this URL to the user to view the interactive chart.
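A sketch of assembling and running a config, following the schema above; the `sales` table and the minimal bar-chart `echarts_option` are hypothetical stand-ins for what the real prompt template would produce:

```python
# Write the config to outputs/configs/ and invoke the generator.
import json
import pathlib
import subprocess

config = {
    "db_path": "workspace.db",
    "query": "SELECT category, SUM(value) AS val FROM sales GROUP BY category",
    "title": "Sales by Category",
    "output_path": "outputs/html/sales_by_category.html",
    # Placeholder option -- in practice this comes from the prompt skeleton.
    "echarts_option": {
        "xAxis": {"type": "category"},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar"}],
    },
}

path = pathlib.Path("outputs/configs/sales_by_category.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(config, indent=2))

subprocess.run(
    ["python", "scripts/chart_generator.py", "--config", str(path)], check=True
)
```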
### Scenario 5: File Merging & Splitting

Trigger: User needs to combine multiple reports with identical structure, or split a master sheet by department.

Action:
- Merge: Iterate over the files and run `data_importer.py` pointing at the same table name (the script appends automatically if the table exists), or write a custom Python script.
- Split: Generate a Python script that reads the master table from SQLite and exports it into multiple Excel files using `pandas.DataFrame.to_excel()` inside a loop; see the sketch after this list.
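A minimal split sketch, assuming a hypothetical `master_sheet` table with a `department` column:

```python
# Export one Excel file per department (requires openpyxl for .xlsx).
import pathlib
import sqlite3

import pandas as pd

con = sqlite3.connect("workspace.db")
df = pd.read_sql("SELECT * FROM master_sheet", con)
con.close()

out_dir = pathlib.Path("outputs")
out_dir.mkdir(exist_ok=True)

# The department name becomes the filename.
for dept, group in df.groupby("department"):
    group.to_excel(out_dir / f"{dept}.xlsx", index=False)
```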
### Scenario 6: Export & Reporting

Trigger: User wants to download the final result or generate a summary report.

Action:
- Export CSV/Excel: Use the built-in exporter script to dump a table or query result to `.csv` or `.xlsx`:

  ```bash
  # Export an entire table
  python scripts/data_exporter.py "outputs/final_result.csv" --table "final_table"

  # Export a specific query
  python scripts/data_exporter.py "outputs/final_result.xlsx" --query "SELECT category, SUM(value) FROM sales GROUP BY category"
  ```

- Report Generation: Write a Markdown file that summarizes the analysis steps and key metrics (retrieved via SQL) and references any generated charts. Provide the user with the path to the report; see the sketch after this list.
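A report-generation sketch; the `sales` table, the metric, and the chart path are hypothetical:

```python
# Pull a headline metric via SQL and write a Markdown summary.
import sqlite3

con = sqlite3.connect("workspace.db")
(total,) = con.execute("SELECT SUM(value) FROM sales").fetchone()
con.close()

report = f"""# Analysis Summary

## Key Metrics
- Total sales: {total}

## Charts
- [Sales by Category](outputs/html/sales_by_category.html)
"""

with open("outputs/report.md", "w") as f:
    f.write(report)
```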
### Scenario 7: Data Cleanup

Trigger: Routine maintenance or a user request to clean up old data.

Action:
- Run the cleaner script to remove tables and metadata not accessed in the last 30 days (a sketch of one possible implementation follows):

  ```bash
  python scripts/data_cleaner.py --db workspace.db --days 30
  ```
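One way such a cleaner might work (a sketch, not the actual script), assuming a hypothetical `_meta` table tracking `(table_name, last_accessed)`:

```python
# Drop tables whose last access is older than the cutoff, then remove
# their metadata rows.
import sqlite3
import time

DAYS = 30
cutoff = time.time() - DAYS * 86400

con = sqlite3.connect("workspace.db")
stale = con.execute(
    "SELECT table_name FROM _meta WHERE last_accessed < ?", (cutoff,)
).fetchall()

for (name,) in stale:
    # Table names cannot be bound as parameters; they were sanitized
    # at import time, so quoting them directly is acceptable here.
    con.execute(f'DROP TABLE IF EXISTS "{name}"')
    con.execute("DELETE FROM _meta WHERE table_name = ?", (name,))

con.commit()
con.close()
```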
### Scenario 8: Metrics Management

Trigger: User describes or defines a specific metric calculation logic or business definition (口径).

Action:
- When the user provides a metric definition, save it to the local Markdown file `references/metrics.md` to build up context for future SQL generation.
- Use the built-in script `scripts/metrics_manager.py` to append the metric:

  ```bash
  python scripts/metrics_manager.py --name "Metric Name" --desc "Metric calculation logic or business description"
  ```

- When generating SQL queries later, ALWAYS read `references/metrics.md` to ensure the generated SQL aligns with the saved business definitions.