# News Digest - 每日新闻摘要

Automated three-stage pipeline (plus an optional LLM stage 2.5) for Chinese news aggregation and digest generation.

## Quick Start

```bash
python scripts/news_digest_v2/run_all_stages.py
```

Output: `.news-digest-out.md` (workspace) + `新闻摘要_YYYYMMDD_HHMMSS.txt` (desktop)

## Architecture
- Stage 1: Fetch → scrape 42 websites → filter → save to SQLite DB
- Stage 2: Process → deduplicate (≥90% similarity) → tag keywords
- Stage 2.5: LLM → batch LLM summarization (optional, requires API key)
- Stage 3: Output → read LLM summaries (fall back to rule-based summaries) → save to files
## Setup

### Prerequisites

- Python 3.8+ with `requests` and `beautifulsoup4`
- SQLite (built-in)

### Initialize Database

The database stores articles and configuration. Default path: `news_digest_v2/news.db` (relative to the scripts directory).

Override with an environment variable: `NEWS_DIGEST_DB=/your/path/news.db`

Then seed the database with monitored websites and system keywords by inserting rows into the `monitor_websites` and `system_keywords` tables.
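The seeding step can be sketched with Python's built-in `sqlite3`. The column names below are inferred from the table descriptions in this document and may differ from the real schema in `database.py`; the `:memory:` fallback just keeps the sketch self-contained.

```python
import os
import sqlite3

# NEWS_DIGEST_DB overrides the default path; ":memory:" keeps this demo self-contained.
db_path = os.environ.get("NEWS_DIGEST_DB", ":memory:")
conn = sqlite3.connect(db_path)

# Minimal schemas inferred from the table descriptions in this document
# (the real columns may differ -- check database.py).
conn.executescript("""
CREATE TABLE IF NOT EXISTS monitor_websites (
    name TEXT, url TEXT, selector TEXT, category TEXT, enabled INTEGER
);
CREATE TABLE IF NOT EXISTS system_keywords (
    keyword TEXT, category TEXT, weight INTEGER, enabled INTEGER
);
""")

conn.execute(
    "INSERT INTO monitor_websites (name, url, selector, category, enabled) VALUES (?, ?, ?, ?, ?)",
    ("示例网站", "https://example.com", "a", "财经", 1),
)
conn.execute(
    "INSERT INTO system_keywords (keyword, category, weight, enabled) VALUES (?, ?, ?, ?)",
    ("新能源", "core", 5, 1),
)
conn.commit()

sites = conn.execute("SELECT COUNT(*) FROM monitor_websites").fetchone()[0]
print(sites)
```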
### Core Database Tables

| Table | Purpose |
|---|---|
| `articles` | Scraped news articles (title, content, URL, date, keywords, duplicate flag) |
| `monitor_websites` | 42 monitored websites (name, URL, CSS selector, category, enabled) |
| `system_keywords` | Keywords for relevance scoring (core vs auxiliary, with weight) |
## Usage

### Full Pipeline

```bash
python scripts/news_digest_v2/run_all_stages.py
```

Takes ~5 minutes (network-bound, 42 websites).
### Entry Point (PowerShell wrapper)

For OpenClaw or automated integration, create a wrapper script that:

- Runs the pipeline
- Reads the output file
- Sends it to your preferred messaging platform
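A minimal Python wrapper following those three steps might look like this; the messaging call is left as a stub, since the platform (and its client API) is your choice:

```python
import subprocess
from pathlib import Path

PIPELINE = ["python", "scripts/news_digest_v2/run_all_stages.py"]
OUT_FILE = Path(".news-digest-out.md")

def run_pipeline() -> None:
    # check=True raises if any stage exits non-zero; 600 s matches
    # the recommended cron timeout below.
    subprocess.run(PIPELINE, check=True, timeout=600)

def read_digest(path: Path = OUT_FILE) -> str:
    # The digest file is written by stage 3 into the workspace directory.
    return path.read_text(encoding="utf-8")

# run_pipeline()
# send_to_messaging(read_digest())  # plug in your platform's client here
```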
### Cron Job Example

```yaml
schedule: "0 20 * * *"   # Daily 20:00
payload:
  run: python scripts/news_digest_v2/run_all_stages.py
  then: read .news-digest-out.md and send to messaging
timeout: 600             # 10 minutes
```
## Output Format

```text
【来源:标题】
摘要内容(智能选段,300字以内,包含关键数据和核心事实)
发布时间:YYYY-MM-DD
原文链接:http://...
```

Each entry: 【Source: Title】, then the summary (smart excerpt, ≤300 characters, covering key data and core facts), the publish date (YYYY-MM-DD), and the original URL.
### Summary Quality Assurance

Incomplete sentences are filtered automatically:

- Summary ends with a comma, enumeration comma, semicolon, colon, etc. → truncate back to the previous full stop
- No full stop anywhere in the text (the whole fragment is incomplete) → discard, output nothing
- Truncation would lose more than 40% of the information → drop the whole paragraph (better nothing than a fragment)

Tutorial/guide content is filtered entirely:

- Title or content contains "教程" (tutorial), "指南" (guide), "攻略" (walkthrough), "手把手" (hands-on), "从零开始" (from scratch), etc. → automatically excluded
- Scientific figure drawing / Photoshop tutorials / Illustrator tutorials → automatically excluded
- See the tutorial keyword list under the `social` category in `rules_config.py`
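The rules above can be sketched as a single cleanup function. This is an illustrative reimplementation, not the code in `formatter.py`; `TUTORIAL_MARKERS` is a small subset of the real keyword list in `rules_config.py`, and the 0.6 factor encodes the ">40% information loss" rule.

```python
from typing import Optional

# Subset of the tutorial keywords from rules_config.py (illustrative, not the full list)
TUTORIAL_MARKERS = ("教程", "指南", "攻略", "手把手", "从零开始")

def clean_summary_sketch(text: str) -> Optional[str]:
    """Sketch of the quality rules above; the real logic lives in formatter.py."""
    if any(m in text for m in TUTORIAL_MARKERS):
        return None                          # tutorial/guide content: drop entirely
    if text and text[-1] in ",、;:,;:":
        end = max(text.rfind("。"), text.rfind("!"), text.rfind("?"))
        if end == -1:
            return None                      # no full stop anywhere: discard
        kept = text[:end + 1]                # roll back to the previous full stop
        if len(kept) < 0.6 * len(text):      # >40% information loss: drop the paragraph
            return None
        return kept
    return text
```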
## Key Features

### Smart Summary Extraction (fetcher.py → extract_brief_summary)

Not simple truncation. Each paragraph is scored by:

- Position: lead paragraph +10, top-3 +5 (inverted-pyramid journalism)
- Data density: number count × 1.5
- Signal words: 印发/发布/宣布/决定/完成/启动 (issue/release/announce/decide/complete/launch), +2 each
- Entity density: organizations, locations, +1 each
- Completeness: full sentence ending +3

Paragraphs are then filtered to remove image captions, journalist bylines, ads, subtitles, and boilerplate.

Truncation guard: if truncation would lose >40% of the information, the whole paragraph is dropped.
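The scoring bullets above can be sketched as a toy function. This is not the code in `fetcher.py`; entity density is omitted because it needs named-entity recognition, and the weights are copied from the list above.

```python
import re

SIGNAL_WORDS = ("印发", "发布", "宣布", "决定", "完成", "启动")

def score_paragraph(text: str, position: int) -> float:
    """Toy scorer mirroring the bullets above (entity density omitted: it needs NER)."""
    score = 0.0
    if position == 0:
        score += 10                                     # lead paragraph (inverted pyramid)
    elif position < 3:
        score += 5                                      # still in the top 3
    score += len(re.findall(r"\d+", text)) * 1.5        # data density: numbers x 1.5
    score += sum(2 for w in SIGNAL_WORDS if w in text)  # signal words, +2 each
    if text.endswith(("。", "!", "?")):
        score += 3                                      # complete sentence ending
    return score
```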
### Summary Post-Processing (formatter.py → clean_summary)

- Dateline/byline removal (precompiled regexes; supports Xinhua, China News Service, Cailian Press, etc.)
- Incomplete-sentence filtering: ends with a comma/enumeration comma/semicolon → truncate back to the previous full stop
- No full stop in the text → discard (incomplete content is never output)
### Filtering Rules (rules_config.py)

Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.

Tutorial content (all of it filtered). Literal match keywords include: 教程、指南、攻略、入门、自学、从零开始、手把手、保姆级教程、怎么做、如何使用、操作步骤、图文教程、视频教程、科研绘图、PS教程、Illustrator、AI教程、钢笔工具、高斯模糊、路径查找器, etc. (tutorial, guide, walkthrough, beginner, self-taught, from scratch, hands-on, "nanny-level" tutorial, how-to, how to use, operation steps, illustrated tutorial, video tutorial, scientific figure drawing, Photoshop/Illustrator/AI tutorials, pen tool, Gaussian blur, pathfinder).

Invalid keywords: clickbait patterns, advertising, webpage navigation elements.

See `scripts/news_digest_v2/rules_config.py` for the full lists.
### Deduplication (similarity.py)

- Jaccard similarity on keyword sets
- Threshold: ≥90% → mark as duplicate
- Only one version appears in the output
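Jaccard similarity on keyword sets is compact enough to show whole; this sketch mirrors the rule above but is not the code in `similarity.py`:

```python
from typing import Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0                       # two empty sets: treat as no overlap
    return len(a & b) / len(a | b)

def is_duplicate(a: Set[str], b: Set[str], threshold: float = 0.90) -> bool:
    # At or above the threshold, only one version of the article is kept.
    return jaccard(a, b) >= threshold
```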
### Date Filtering

- Normal: within 3 days
- Holidays: within 7 days
- No publish date → discard
- URLs containing a year more than 1 year in the past → skip
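The first three rules reduce to one window check. A minimal sketch, assuming the pipeline passes in a parsed publish date and a holiday flag (the actual holiday calendar lives in `config.py`):

```python
from datetime import date, timedelta
from typing import Optional

def within_window(pub: Optional[date], today: date, is_holiday: bool = False) -> bool:
    """Sketch of the date rules above; not the pipeline's actual implementation."""
    if pub is None:
        return False                         # no publish date: discard
    window = 7 if is_holiday else 3          # wider window around holidays
    return today - pub <= timedelta(days=window)
```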
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `NEWS_DIGEST_DB` | `news_digest_v2/news.db` | SQLite database path |
| `NEWS_DIGEST_LLM_API_KEY` | (empty) | LLM API key for Stage 2.5 summarization |
| `NEWS_DIGEST_LLM_BASE_URL` | (empty) | LLM API base URL |
| `NEWS_DIGEST_LLM_MODEL` | `qwen-plus` | LLM model name |
If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
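That fallback can be sketched as a small config reader; the `llm_config` helper and its return shape are illustrative, not the actual code in `config.py`:

```python
import os

def llm_config() -> dict:
    """Stage 2.5 runs only when the API key is set; otherwise it is skipped."""
    key = os.environ.get("NEWS_DIGEST_LLM_API_KEY", "")
    if not key:
        return {"enabled": False}            # Stage 2.5 silently skipped
    return {
        "enabled": True,
        "api_key": key,
        "base_url": os.environ.get("NEWS_DIGEST_LLM_BASE_URL", ""),
        "model": os.environ.get("NEWS_DIGEST_LLM_MODEL", "qwen-plus"),
    }
```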
### Add/Remove Websites

Edit the `monitor_websites` table:

```sql
INSERT INTO monitor_websites (name, url, selector, category, enabled)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);
```
### Customize Keywords

Edit the `system_keywords` table:

```sql
INSERT INTO system_keywords (keyword, category, weight, enabled)
VALUES ('新能源', 'core', 5, 1);
```
### Adjust Output

In `config.py`:

```python
MAX_OUTPUT_COUNT = 35        # max articles per digest
SIMILARITY_THRESHOLD = 0.90  # dedup threshold
```
## Files

```text
news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py                # DB path, websites, keywords, holidays, LLM config
        ├── database.py              # SQLite operations
        ├── fetcher.py               # Web scraping + smart summary extraction
        ├── filters.py               # Content filtering logic
        ├── formatter.py             # Output formatting + incomplete sentence handling
        ├── rules_config.py          # Exclusion rules, keywords, dateline patterns
        ├── similarity.py            # Jaccard deduplication
        ├── stage1_fetch.py          # Stage 1 entry (fetch)
        ├── stage2_process.py        # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py  # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py         # Stage 3 entry (read + format + save)
        └── run_all_stages.py        # Full pipeline entry
```
## Performance Notes

- ~5 minutes for a full 42-website scrape (network I/O bound)
- Some sites may fail (SSL issues, 521 errors, 404s); the pipeline continues
- Recommended cron timeout: 600 seconds
- The database is append-only and never cleared. New articles are inserted with URL-based dedup (`INSERT OR IGNORE`); old articles are kept.
- Duplicate articles are marked `is_duplicate = 1`, not deleted.
- The database grows by roughly 30-50 rows/day; periodic cleanup is recommended (optional).
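The URL-based dedup insert can be demonstrated in a few lines; the `articles` schema here is a hypothetical minimal version with a unique `url`, which is what makes `INSERT OR IGNORE` skip already-seen articles:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal articles table: the UNIQUE constraint on url is what
# lets INSERT OR IGNORE silently skip articles that were already scraped.
conn.execute(
    "CREATE TABLE articles (url TEXT PRIMARY KEY, title TEXT, is_duplicate INTEGER DEFAULT 0)"
)

rows = [
    ("http://a/1", "first"),
    ("http://a/1", "re-scraped"),   # same URL: ignored, old row kept
    ("http://a/2", "second"),
]
for url, title in rows:
    conn.execute("INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)", (url, title))
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)
```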