news-digest

Automatically scrape, process, and generate daily news digests from 42 Chinese news sources. Covers industry dynamics, policy updates, economy, tech, energy, and pricing information. Use when: user asks for daily news summary, news digest, 每日新闻摘要, 新闻汇总, 新闻摘要, or wants to set up automated news monitoring from Chinese news websites. Outputs formatted summaries with source attribution and original links.

Install

npx skills add zigu-creator/news-digest-v1

News Digest - 每日新闻摘要

Automated three-stage pipeline (plus an optional LLM stage) for Chinese news aggregation and digest generation.

Quick Start

python scripts/news_digest_v2/run_all_stages.py

Output: .news-digest-out.md (workspace) + 新闻摘要_YYYYMMDD_HHMMSS.txt (desktop)

Architecture

Stage 1:   Fetch     →  Scrape 42 websites → Filter → Save to SQLite DB
Stage 2:   Process   →  Deduplicate (≥90% similarity) → Tag keywords
Stage 2.5: LLM       →  Batch LLM summarization (optional, requires API key)
Stage 3:   Output    →  Read LLM summaries (fallback to rule summaries) → Save to files

Setup

Prerequisites

  • Python 3.8+ with: requests, beautifulsoup4
  • SQLite (built-in)

Initialize Database

The database stores articles and configuration. Default path: news_digest_v2/news.db (relative to the scripts directory).

Override with environment variable: NEWS_DIGEST_DB=/your/path/news.db

Then seed the database with the monitored websites and system keywords by inserting rows into the monitor_websites and system_keywords tables, as sketched below.
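A minimal seeding sketch, assuming the schema already exists and matches the column lists in the next section (the SQL mirrors the examples under Configuration below):

import os
import sqlite3

# Resolve the DB path the same way the pipeline does: env var overrides the default.
db_path = os.environ.get("NEWS_DIGEST_DB", "news_digest_v2/news.db")
conn = sqlite3.connect(db_path)

# Seed one monitored website and one scoring keyword.
conn.execute(
    "INSERT INTO monitor_websites (name, url, selector, category, enabled) "
    "VALUES (?, ?, ?, ?, ?)",
    ("示例网站", "https://example.com", "a", "财经", 1),
)
conn.execute(
    "INSERT INTO system_keywords (keyword, category, weight, enabled) "
    "VALUES (?, ?, ?, ?)",
    ("新能源", "core", 5, 1),
)
conn.commit()
conn.close()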

Core Database Tables

Table              Purpose
articles           Scraped news articles (title, content, URL, date, keywords, duplicate flag)
monitor_websites   42 monitored websites (name, URL, CSS selector, category, enabled)
system_keywords    Keywords for relevance scoring (core vs auxiliary, with weight)

Usage

Full Pipeline

python scripts/news_digest_v2/run_all_stages.py

Takes ~5 minutes (network-bound, 42 websites).

Entry Point (PowerShell wrapper)

For OpenClaw or other automated integration, create a wrapper script that (see the sketch after this list):

  1. Runs the pipeline
  2. Reads the output file
  3. Sends to your preferred messaging platform
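A minimal Python sketch of such a wrapper (a PowerShell version follows the same three steps); send_to_messaging is a hypothetical stand-in for your platform's send call:

import subprocess
from pathlib import Path

def send_to_messaging(text: str) -> None:
    # Hypothetical stand-in: replace with your messaging platform's API call.
    print(text)

# 1. Run the full pipeline; fail loudly if any stage errors out.
subprocess.run(
    ["python", "scripts/news_digest_v2/run_all_stages.py"],
    check=True,
    timeout=600,  # matches the recommended cron timeout below
)

# 2. Read the workspace output file.
digest = Path(".news-digest-out.md").read_text(encoding="utf-8")

# 3. Send it to your preferred messaging platform.
send_to_messaging(digest)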

Cron Job Example

schedule: "0 20 * * *"  # Daily 20:00
payload:
  run: python scripts/news_digest_v2/run_all_stages.py
  then: read .news-digest-out.md and send to messaging
timeout: 600  # 10 minutes

Output Format

【来源:标题】
摘要内容(智能选段,300字以内,包含关键数据和核心事实)
发布时间:YYYY-MM-DD
原文链接:http://...

Each entry gives 来源 (source) and 标题 (title), a smart-excerpt summary (up to 300 characters, including key data and core facts), the publish date (发布时间), and the original link (原文链接).

Summary Quality Assurance

Incomplete sentences are filtered automatically

  • Summary ends with a comma, enumeration comma, semicolon, colon, etc. → roll back to the last full stop
  • No full stop anywhere in the text (the whole passage is fragmentary) → discard outright, output nothing
  • Truncation would lose more than 40% of the information → drop the whole passage (better nothing than a fragment)

Tutorial/guide content is filtered out entirely

  • Title or content contains "教程" (tutorial), "指南" (guide), "攻略" (walkthrough), "手把手" (hands-on), "从零开始" (from scratch), etc. → automatically excluded
  • Scientific-figure, Photoshop, and Illustrator tutorials → automatically excluded
  • See the tutorial keyword list under the social category in rules_config.py

Key Features

Smart Summary Extraction (fetcher.py → extract_brief_summary)

Not simple truncation. Each paragraph is scored by:

  • Position: Lead paragraph +10, top-3 +5 (inverted pyramid journalism)
  • Data density: Numbers × 1.5
  • Signal words: 印发/发布/宣布/决定/完成/启动 (issue/release/announce/decide/complete/launch; +2 each)
  • Entity density: Organizations, locations (+1 each)
  • Completeness: Full sentence ending +3

Then filtered: removes image captions, journalist bylines, ads, subtitles, boilerplate.

Truncation protection: if truncation would lose more than 40% of the information, the whole passage is dropped. A sketch of the scoring heuristic follows.
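A minimal sketch of this scoring heuristic, assuming a hypothetical score_paragraph helper (the real weights live in fetcher.py → extract_brief_summary; the entity-density term is omitted here because it needs entity recognition):

import re

SIGNAL_WORDS = ("印发", "发布", "宣布", "决定", "完成", "启动")

def score_paragraph(text: str, position: int) -> float:
    # Higher score = more digest-worthy paragraph.
    score = 0.0
    if position == 0:
        score += 10                                     # lead paragraph (inverted pyramid)
    elif position < 3:
        score += 5                                      # top-3 paragraph
    score += len(re.findall(r"\d+", text)) * 1.5        # data density: numbers x 1.5
    score += sum(2 for w in SIGNAL_WORDS if w in text)  # signal words: +2 each
    if text.endswith(("。", "!", "?")):
        score += 3                                      # complete sentence ending
    return score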

Summary Post-Processing (formatter.py → clean_summary)

  • Dateline and journalist-byline removal (precompiled regexes covering 新华社 (Xinhua), 中新网 (China News Service), 财联社 (Cailian Press), and others)
  • Incomplete-sentence filtering: a summary ending in a comma/enumeration comma/semicolon → roll back to the last full stop
  • No full stop in the whole text → discard (never output fragments); see the sketch below
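A minimal sketch of the rollback-and-discard step, assuming full-width Chinese punctuation (the real clean_summary also strips datelines first):

from typing import Optional

def rollback_incomplete(summary: str) -> Optional[str]:
    # Return a summary trimmed to its last complete sentence, or None to discard.
    enders = ("。", "!", "?")
    if summary.endswith(enders):
        return summary                      # already ends on a full sentence
    cut = max(summary.rfind(p) for p in enders)
    if cut == -1:
        return None                         # no full stop anywhere -> discard
    return summary[: cut + 1]               # roll back to the last full stop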

Filtering Rules (rules_config.py)

Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.

Tutorial content (all filtered), matched by keywords including: 教程, 指南, 攻略, 入门, 自学, 从零开始, 手把手, 保姆级教程, 怎么做, 如何使用, 操作步骤, 图文教程, 视频教程, 科研绘图, PS教程, Illustrator, AI教程, 钢笔工具, 高斯模糊, 路径查找器, etc.

Invalid keywords: clickbait patterns, advertising, webpage navigation elements.

See scripts/news_digest_v2/rules_config.py for full lists.

Deduplication (similarity.py)

  • Jaccard similarity on keyword sets
  • Threshold: ≥90% → mark as duplicate
  • Only one version appears in the output (see the sketch below)
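A minimal sketch of the Jaccard check (the real version lives in similarity.py):

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union| of the two keyword sets.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_duplicate(kw_a: set, kw_b: set, threshold: float = 0.90) -> bool:
    # At >= 90% keyword overlap, the later article is flagged as a duplicate.
    return jaccard(kw_a, kw_b) >= threshold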

Date Filtering

  • Normal: within 3 days
  • Holidays: within 7 days
  • No date → discard
  • Old URLs (the URL contains a year more than 1 year in the past) → skip; see the sketch below
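A minimal sketch of the date window (the holiday calendar itself lives in config.py); the URL-year skip is a separate, earlier check:

from datetime import date, timedelta
from typing import Optional

def keep_article(pub_date: Optional[date], holiday: bool) -> bool:
    # Apply the date window: 3 days normally, 7 days during holidays.
    if pub_date is None:
        return False                        # no date -> discard
    window = timedelta(days=7 if holiday else 3)
    return date.today() - pub_date <= window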

Configuration

Environment Variables

Variable                   Default                  Description
NEWS_DIGEST_DB             news_digest_v2/news.db   SQLite database path
NEWS_DIGEST_LLM_API_KEY    (empty)                  LLM API key for Stage 2.5 summarization
NEWS_DIGEST_LLM_BASE_URL   (empty)                  LLM API base URL
NEWS_DIGEST_LLM_MODEL      qwen-plus                LLM model name

If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
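A minimal sketch of that skip logic, assuming Stage 2.5 checks the env vars up front:

import os

def llm_configured() -> bool:
    # Stage 2.5 needs at least an API key and base URL; the model has a default.
    return bool(os.environ.get("NEWS_DIGEST_LLM_API_KEY")) and \
           bool(os.environ.get("NEWS_DIGEST_LLM_BASE_URL"))

# If llm_configured() is False, skip Stage 2.5 silently and fall back
# to the rule-based summaries produced in Stage 2.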

Add/Remove Websites

Edit monitor_websites table:

INSERT INTO monitor_websites (name, url, selector, category, enabled)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);

Customize Keywords

Edit system_keywords table:

INSERT INTO system_keywords (keyword, category, weight, enabled)
VALUES ('新能源', 'core', 5, 1);

Adjust Output

In config.py:

  • MAX_OUTPUT_COUNT = 35 (max articles per digest)
  • SIMILARITY_THRESHOLD = 0.90

Files

news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py              # DB path, websites, keywords, holidays, LLM config
        ├── database.py            # SQLite operations
        ├── fetcher.py             # Web scraping + smart summary extraction
        ├── filters.py             # Content filtering logic
        ├── formatter.py           # Output formatting + incomplete sentence handling
        ├── rules_config.py        # Exclusion rules, keywords, dateline patterns
        ├── similarity.py          # Jaccard deduplication
        ├── stage1_fetch.py        # Stage 1 entry (fetch)
        ├── stage2_process.py      # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py       # Stage 3 entry (read + format + save)
        └── run_all_stages.py      # Full pipeline entry

Performance Notes

  • ~5 minutes for full 42-website scrape (network I/O bound)
  • Some sites may fail (SSL issues, 521 errors, 404s) — pipeline continues
  • Recommended cron timeout: 600 seconds
  • The database grows incrementally and is never cleared. New articles are inserted with URL-based deduplication (INSERT OR IGNORE); old articles are kept.
  • Duplicate articles are marked is_duplicate = 1, not deleted.
  • The database grows by roughly 30-50 rows per day; periodic cleanup is recommended but optional, as sketched below.
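A minimal cleanup sketch, assuming articles has a date column (the column name is an assumption; check your schema first):

import os
import sqlite3

db_path = os.environ.get("NEWS_DIGEST_DB", "news_digest_v2/news.db")
conn = sqlite3.connect(db_path)
# Keep 90 days of history; adjust the retention window to taste.
conn.execute("DELETE FROM articles WHERE date < date('now', '-90 days')")
conn.commit()
conn.execute("VACUUM")  # reclaim file space after the delete
conn.close()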
