Sheet Data Enrichment & Summarization
Workflow
Phase 1: Reconnaissance
- Read the full sheet (or specify range) to understand the schema
- Identify the source column (e.g., URLs/links) and target column (data to fill)
- Count empty rows in the target column — these are the work items
- Confirm the plan with the user before proceeding
Phase 2: Extraction
Process rows in batches. For each row with an empty target column:
-
Classify the source before fetching:
- Regular article URL →
web_fetch - SPA / JS-rendered page → browser automation
- WeChat article (mp.weixin.qq.com) → browser (captcha may block
web_fetch) - Social media (Weibo, Douyin, etc.) → browser or note as "no extractable data"
- Dead link / paywall → flag for user
- Regular article URL →
-
Extract the target data from the fetched content:
- Search for the data in multiple locations (top, bottom, metadata)
- Use targeted patterns (see references/extraction-patterns.md)
- If ambiguous, flag for manual review rather than guessing
-
Verify before writing (每篇先核实):
- Confirm the extracted value is correct before writing to the sheet
- Cross-reference with known mappings if available
- When batch-processing, verify a sample first, then apply patterns
Phase 3: Write-back
- Write confirmed values to the target column
- Always verify row alignment — read the row before writing to confirm the target cell matches the expected entry
- Use single-cell writes (
G5:G5) not bare references (G5) - For Feishu Sheets: range format must be
sheetId!G5:G5
Phase 4: Summarization
When the user requests aggregation:
- Re-read all enriched sheets to capture any manual corrections
- Group by the requested dimensions (e.g., assignee → contributor → sum)
- Sort groups by total descending for readability
- Output to a new sheet/spreadsheet with headers + subtotals + grand total
Key Lessons (Extraction)
Where to find data on a web page
Data placement varies by source. Always check multiple locations:
| Position | Example |
|---|---|
| Below title | 作者:张三 / By Jane Doe |
| End of article | (记者 李四) / Reporter: John |
| Before timestamp | 王五 2026-03-18 14:00 |
| Metadata line | 文|赵六 / Author: Sarah |
| Combined format | 采写:记者 孙七 编辑:周八 |
Never conclude "no data" after checking only the top of the page. Always check the end too.
When to use browser vs web_fetch
| Signal | Tool |
|---|---|
| Static HTML, server-rendered | web_fetch (fast, cheap) |
| Returns empty/minimal content | Switch to browser |
URL contains mp.weixin.qq.com | Browser (WeChat captcha) |
| SPA framework (React/Vue/Angular) | Browser |
Baidu mini-program (smartapps.cn) | Browser or find alternate URL |
| Social media embeds (Weibo, Douyin) | Browser, but often no structured data |
Reusable mappings
When the same source consistently maps to the same value across rows/sheets:
- Build a mapping table after confirming 2+ instances
- Apply mappings to future rows without re-fetching
- Always note the mapping source for audit
Common "no data" patterns
These formats typically have no individual attribution:
- Flash news / wire alerts (e.g., "财联社电", "每经AI快讯")
- Press releases / corporate announcements
- Social media reposts without original attribution
- Aggregated roundup articles
Mark these clearly (leave blank or use a placeholder like "/" per user preference).
Row Alignment Safety
Critical: Off-by-one errors are the #1 failure mode when writing to spreadsheets.
Before every write:
- Read the target row to verify the adjacent cells match expectations
- If processing multiple rows, re-read periodically to catch drift
- Never assume row N in your data maps to row N in the sheet — always verify
Feishu Sheets Specifics
- Range format:
sheetId!A1:B2(not sheet name) - Single-cell write:
sheetId!G5:G5(notsheetId!G5) - Get sheet IDs via
feishu_sheetactioninfo - Write returns
revisionnumber — useful for tracking changes - Cannot create new sheet tabs in existing spreadsheet via API; create a new spreadsheet for summaries
Output Format
Summary sheets should include:
- Headers: Group dimension, detail dimension, count, total
- Subtotals: Per group
- Grand total: Final row
- Sorting: By total descending within each group, groups ordered by subtotal descending