# News Digest - 每日新闻摘要

Automated three-stage pipeline (plus an optional LLM stage 2.5) for Chinese news aggregation and digest generation.

## Quick Start

```bash
python scripts/news_digest_v2/run_all_stages.py
```

Output: `.news-digest-out.md` (workspace) + `新闻摘要_YYYYMMDD_HHMMSS.txt` (desktop)

## Architecture
- Stage 1: Fetch → scrape 42 websites → filter → save to SQLite DB
- Stage 2: Process → deduplicate (≥90% similarity) → tag keywords
- Stage 2.5: LLM → batch LLM summarization (optional, requires API key)
- Stage 3: Output → read LLM summaries (fall back to rule-based summaries) → save to files
## Setup

### Prerequisites

- Python 3.8+ with `requests` and `beautifulsoup4`
- SQLite (built-in)

### Initialize Database

The database stores articles and configuration. Default path: `news_digest_v2/news.db` (relative to the scripts directory).

Override with an environment variable: `NEWS_DIGEST_DB=/your/path/news.db`

Then seed the database with monitored websites and system keywords by inserting rows into the `monitor_websites` and `system_keywords` tables.
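The seeding step can be sketched with Python's built-in `sqlite3`. The column names below are inferred from the table descriptions in this document and may differ from the real schema in `database.py`; the `:memory:` fallback just keeps the sketch self-contained.

```python
import os
import sqlite3

# NEWS_DIGEST_DB overrides the default path; ":memory:" keeps this demo self-contained.
db_path = os.environ.get("NEWS_DIGEST_DB", ":memory:")
conn = sqlite3.connect(db_path)

# Minimal schemas inferred from the table descriptions in this document
# (the real columns may differ -- check database.py).
conn.executescript("""
CREATE TABLE IF NOT EXISTS monitor_websites (
    name TEXT, url TEXT, selector TEXT, category TEXT, enabled INTEGER
);
CREATE TABLE IF NOT EXISTS system_keywords (
    keyword TEXT, category TEXT, weight INTEGER, enabled INTEGER
);
""")

conn.execute(
    "INSERT INTO monitor_websites (name, url, selector, category, enabled) VALUES (?, ?, ?, ?, ?)",
    ("示例网站", "https://example.com", "a", "财经", 1),
)
conn.execute(
    "INSERT INTO system_keywords (keyword, category, weight, enabled) VALUES (?, ?, ?, ?)",
    ("新能源", "core", 5, 1),
)
conn.commit()

sites = conn.execute("SELECT COUNT(*) FROM monitor_websites").fetchone()[0]
print(sites)
```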
### Core Database Tables

| Table | Purpose |
|---|---|
| `articles` | Scraped news articles (title, content, URL, date, keywords, duplicate flag) |
| `monitor_websites` | 42 monitored websites (name, URL, CSS selector, category, enabled) |
| `system_keywords` | Keywords for relevance scoring (core vs auxiliary, with weight) |
## Usage

### Full Pipeline

```bash
python scripts/news_digest_v2/run_all_stages.py
```

Takes ~5 minutes (network-bound, 42 websites).
### Entry Point (PowerShell wrapper)

For OpenClaw or automated integration, create a wrapper script that:

- Runs the pipeline
- Reads the output file
- Sends it to your preferred messaging platform
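A minimal Python wrapper following those three steps might look like this; the messaging call is left as a stub, since the platform (and its client API) is your choice:

```python
import subprocess
from pathlib import Path

PIPELINE = ["python", "scripts/news_digest_v2/run_all_stages.py"]
OUT_FILE = Path(".news-digest-out.md")

def run_pipeline() -> None:
    # check=True raises if any stage exits non-zero; 600 s matches
    # the recommended cron timeout below.
    subprocess.run(PIPELINE, check=True, timeout=600)

def read_digest(path: Path = OUT_FILE) -> str:
    # The digest file is written by stage 3 into the workspace directory.
    return path.read_text(encoding="utf-8")

# run_pipeline()
# send_to_messaging(read_digest())  # plug in your platform's client here
```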
### Cron Job Example

```yaml
schedule: "0 20 * * *"   # Daily 20:00
payload:
  run: python scripts/news_digest_v2/run_all_stages.py
  then: read .news-digest-out.md and send to messaging
timeout: 600             # 10 minutes
```
## Output Format

```text
【来源:标题】
摘要内容(智能选段,300字以内,包含关键数据和核心事实)
发布时间:YYYY-MM-DD
原文链接:http://...
```

Each entry: 【Source: Title】, then the summary (smart excerpt, ≤300 characters, covering key data and core facts), the publish date (YYYY-MM-DD), and the original URL.
### Summary Quality Assurance

Incomplete sentences are filtered automatically:

- Summary ends with a comma, enumeration comma, semicolon, colon, etc. → truncate back to the previous full stop
- No full stop anywhere in the text (the whole fragment is incomplete) → discard, output nothing
- Truncation would lose more than 40% of the information → drop the whole paragraph (better nothing than a fragment)

Tutorial/guide content is filtered entirely:

- Title or content contains "教程" (tutorial), "指南" (guide), "攻略" (walkthrough), "手把手" (hands-on), "从零开始" (from scratch), etc. → automatically excluded
- Scientific figure drawing / Photoshop tutorials / Illustrator tutorials → automatically excluded
- See the tutorial keyword list under the `social` category in `rules_config.py`
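The rules above can be sketched as a single cleanup function. This is an illustrative reimplementation, not the code in `formatter.py`; `TUTORIAL_MARKERS` is a small subset of the real keyword list in `rules_config.py`, and the 0.6 factor encodes the ">40% information loss" rule.

```python
from typing import Optional

# Subset of the tutorial keywords from rules_config.py (illustrative, not the full list)
TUTORIAL_MARKERS = ("教程", "指南", "攻略", "手把手", "从零开始")

def clean_summary_sketch(text: str) -> Optional[str]:
    """Sketch of the quality rules above; the real logic lives in formatter.py."""
    if any(m in text for m in TUTORIAL_MARKERS):
        return None                          # tutorial/guide content: drop entirely
    if text and text[-1] in ",、;:,;:":
        end = max(text.rfind("。"), text.rfind("!"), text.rfind("?"))
        if end == -1:
            return None                      # no full stop anywhere: discard
        kept = text[:end + 1]                # roll back to the previous full stop
        if len(kept) < 0.6 * len(text):      # >40% information loss: drop the paragraph
            return None
        return kept
    return text
```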
## Key Features

### Smart Summary Extraction (fetcher.py → extract_brief_summary)

Not simple truncation. Each paragraph is scored by:

- Position: lead paragraph +10, top-3 +5 (inverted-pyramid journalism)
- Data density: number count × 1.5
- Signal words: 印发/发布/宣布/决定/完成/启动 (issue/release/announce/decide/complete/launch), +2 each
- Entity density: organizations, locations, +1 each
- Completeness: full sentence ending +3

Paragraphs are then filtered to remove image captions, journalist bylines, ads, subtitles, and boilerplate.

Truncation guard: if truncation would lose >40% of the information, the whole paragraph is dropped.
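The scoring bullets above can be sketched as a toy function. This is not the code in `fetcher.py`; entity density is omitted because it needs named-entity recognition, and the weights are copied from the list above.

```python
import re

SIGNAL_WORDS = ("印发", "发布", "宣布", "决定", "完成", "启动")

def score_paragraph(text: str, position: int) -> float:
    """Toy scorer mirroring the bullets above (entity density omitted: it needs NER)."""
    score = 0.0
    if position == 0:
        score += 10                                     # lead paragraph (inverted pyramid)
    elif position < 3:
        score += 5                                      # still in the top 3
    score += len(re.findall(r"\d+", text)) * 1.5        # data density: numbers x 1.5
    score += sum(2 for w in SIGNAL_WORDS if w in text)  # signal words, +2 each
    if text.endswith(("。", "!", "?")):
        score += 3                                      # complete sentence ending
    return score
```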
### Summary Post-Processing (formatter.py → clean_summary)

- Dateline/byline removal (precompiled regexes; supports Xinhua, China News Service, Cailian Press, etc.)
- Incomplete-sentence filtering: ends with a comma/enumeration comma/semicolon → truncate back to the previous full stop
- No full stop in the text → discard (incomplete content is never output)
### Filtering Rules (rules_config.py)

Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.

Tutorial content (all of it filtered). Literal match keywords include: 教程、指南、攻略、入门、自学、从零开始、手把手、保姆级教程、怎么做、如何使用、操作步骤、图文教程、视频教程、科研绘图、PS教程、Illustrator、AI教程、钢笔工具、高斯模糊、路径查找器, etc. (tutorial, guide, walkthrough, beginner, self-taught, from scratch, hands-on, "nanny-level" tutorial, how-to, how to use, operation steps, illustrated tutorial, video tutorial, scientific figure drawing, Photoshop/Illustrator/AI tutorials, pen tool, Gaussian blur, pathfinder).

Invalid keywords: clickbait patterns, advertising, webpage navigation elements.

See `scripts/news_digest_v2/rules_config.py` for the full lists.
### Deduplication (similarity.py)

- Jaccard similarity on keyword sets
- Threshold: ≥90% → mark as duplicate
- Only one version appears in the output
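Jaccard similarity on keyword sets is compact enough to show whole; this sketch mirrors the rule above but is not the code in `similarity.py`:

```python
from typing import Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0                       # two empty sets: treat as no overlap
    return len(a & b) / len(a | b)

def is_duplicate(a: Set[str], b: Set[str], threshold: float = 0.90) -> bool:
    # At or above the threshold, only one version of the article is kept.
    return jaccard(a, b) >= threshold
```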
### Date Filtering

- Normal: within 3 days
- Holidays: within 7 days
- No publish date → discard
- URLs containing a year more than 1 year in the past → skip
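The first three rules reduce to one window check. A minimal sketch, assuming the pipeline passes in a parsed publish date and a holiday flag (the actual holiday calendar lives in `config.py`):

```python
from datetime import date, timedelta
from typing import Optional

def within_window(pub: Optional[date], today: date, is_holiday: bool = False) -> bool:
    """Sketch of the date rules above; not the pipeline's actual implementation."""
    if pub is None:
        return False                         # no publish date: discard
    window = 7 if is_holiday else 3          # wider window around holidays
    return today - pub <= timedelta(days=window)
```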
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `NEWS_DIGEST_DB` | `news_digest_v2/news.db` | SQLite database path |
| `NEWS_DIGEST_LLM_API_KEY` | (empty) | LLM API key for Stage 2.5 summarization |
| `NEWS_DIGEST_LLM_BASE_URL` | (empty) | LLM API base URL |
| `NEWS_DIGEST_LLM_MODEL` | `qwen-plus` | LLM model name |
If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
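That fallback can be sketched as a small config reader; the `llm_config` helper and its return shape are illustrative, not the actual code in `config.py`:

```python
import os

def llm_config() -> dict:
    """Stage 2.5 runs only when the API key is set; otherwise it is skipped."""
    key = os.environ.get("NEWS_DIGEST_LLM_API_KEY", "")
    if not key:
        return {"enabled": False}            # Stage 2.5 silently skipped
    return {
        "enabled": True,
        "api_key": key,
        "base_url": os.environ.get("NEWS_DIGEST_LLM_BASE_URL", ""),
        "model": os.environ.get("NEWS_DIGEST_LLM_MODEL", "qwen-plus"),
    }
```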
### Add/Remove Websites

Edit the `monitor_websites` table:

```sql
INSERT INTO monitor_websites (name, url, selector, category, enabled)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);
```
### Customize Keywords

Edit the `system_keywords` table:

```sql
INSERT INTO system_keywords (keyword, category, weight, enabled)
VALUES ('新能源', 'core', 5, 1);
```
### Adjust Output

In `config.py`:

```python
MAX_OUTPUT_COUNT = 35        # max articles per digest
SIMILARITY_THRESHOLD = 0.90  # dedup threshold
```
## Files

```text
news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py                # DB path, websites, keywords, holidays, LLM config
        ├── database.py              # SQLite operations
        ├── fetcher.py               # Web scraping + smart summary extraction
        ├── filters.py               # Content filtering logic
        ├── formatter.py             # Output formatting + incomplete sentence handling
        ├── rules_config.py          # Exclusion rules, keywords, dateline patterns
        ├── similarity.py            # Jaccard deduplication
        ├── stage1_fetch.py          # Stage 1 entry (fetch)
        ├── stage2_process.py        # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py  # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py         # Stage 3 entry (read + format + save)
        └── run_all_stages.py        # Full pipeline entry
```
## Performance Notes

- ~5 minutes for a full 42-website scrape (network I/O bound)
- Some sites may fail (SSL issues, 521 errors, 404s); the pipeline continues
- Recommended cron timeout: 600 seconds
- The database is append-only and never cleared. New articles are inserted with URL-based dedup (`INSERT OR IGNORE`); old articles are kept.
- Duplicate articles are marked `is_duplicate = 1`, not deleted.
- The database grows by roughly 30-50 rows/day; periodic cleanup is recommended (optional).
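The URL-based dedup insert can be demonstrated in a few lines; the `articles` schema here is a hypothetical minimal version with a unique `url`, which is what makes `INSERT OR IGNORE` skip already-seen articles:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal articles table: the UNIQUE constraint on url is what
# lets INSERT OR IGNORE silently skip articles that were already scraped.
conn.execute(
    "CREATE TABLE articles (url TEXT PRIMARY KEY, title TEXT, is_duplicate INTEGER DEFAULT 0)"
)

rows = [
    ("http://a/1", "first"),
    ("http://a/1", "re-scraped"),   # same URL: ignored, old row kept
    ("http://a/2", "second"),
]
for url, title in rows:
    conn.execute("INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)", (url, title))
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)
```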