WeChat Article Summarize
把一个或多个微信公众号文章链接整理成结构化 markdown,支持单篇整理和多篇日报汇总。
功能简介
- 读取一个或多个
mp.weixin.qq.com文章链接 - 抽取文章正文、标题、发布时间,以及可选的图片链接
- 自动修复常见的微信正文乱码问题
- 调用
summarize用中文总结全文内容 - 生成结构化 markdown 文件
- 单篇文章整理
- 多篇文章汇总 / 日报
- 支持按日期 + 标题,或日期 + 篇数 + 汇总说明命名
- 支持把文件保存到用户指定目录
使用前需要确认
在真正开始抓取文章之前,需要先确认:
summarize已经配置好 API key,并且可正常使用- 是否需要在最终 markdown 中保留图片链接
- 最终文件保存到哪个目录
适用场景
- 总结单篇微信文章
- 把多篇微信文章汇总成一份日报
- 输出适合继续阅读、归档或二次整理的 markdown 文件
Workflow
Step 0: Confirm prerequisites before fetching anything
Do not fetch article content until all three items are clear:
-
summarize is ready
- Ask the user to configure summarize API access first if needed.
- Verify summarize by running a tiny Chinese test.
- Proceed only if summarize returns a usable
summary.
-
Image preference
- Ask whether the final markdown should include image links.
- Map user intent to
include_images=true|false.
-
Output directory
- Ask where to save the final markdown file.
- If the user says “下载文件夹”, use
~/Downloads. - Create the target directory if it does not exist.
If any of the three items is missing, stop and ask before continuing.
Step 1: Extract each WeChat article
For each mp.weixin.qq.com URL, run:
python3 scripts/read_wechat_article.py '<wechat_url>' --out '<temp_dir>'
This produces structured metadata, raw HTML, and a first-pass markdown export.
Step 2: Clean the body text
Do not trust the first-pass article markdown blindly.
If the body contains mojibake or obvious encoding corruption, repair it from raw.html by running:
python3 scripts/fix_wechat_body.py '<raw.html>' --out '<body-fixed.txt>'
Use the cleaned body text as the canonical input for summarization.
Step 3: Summarize in Chinese
Always summarize the cleaned local text, not the original WeChat URL.
Run:
python3 scripts/summarize_cn.py '<body-fixed.txt>' --out '<summary.json>' --length short
or for a combined report:
python3 scripts/summarize_cn.py '<combined-input.md>' --out '<summary.json>' --length medium
The script enforces Chinese output and fails if the returned summary is not sufficiently Chinese.
Step 4: Normalize summary text before writing markdown
Never write summarize output directly into the final file.
Normalize paragraph breaks and spacing with:
python3 scripts/normalize_markdown_text.py '<input.txt>' --out '<normalized.txt>'
Use this for:
- each single-article summary
- the combined daily-report overview
This prevents ugly line wrapping and mixed-language formatting artifacts.
Step 5: Build the final markdown
Single article
Run:
python3 scripts/build_mindmap_markdown.py \
--result '<result.json>' \
--body '<body-fixed.txt>' \
--summary '<summary.json>' \
--output-dir '<chosen-dir>' \
--include-images true
Multiple articles / daily report
Run:
python3 scripts/build_batch_report.py \
--inputs '<dir1>' '<dir2>' '<dir3>' \
--output-dir '<chosen-dir>' \
--include-images true \
--report-label '微信文章日报'
The batch report must:
- summarize all articles individually
- summarize the full set as one combined overview
- place the combined overview first
- then append each single article section
Output rules
Naming
Single article
YYYYMMDD-文章标题.md
Multiple articles
YYYYMMDD-<总文章数量>篇-<汇总说明>.md
Content rules
Single article output should contain
- title
- source URL
- publish time
- summarize-generated Chinese summary
- mindmap-style structure
- optional image section
Batch report output should contain
- combined daily overview at the top
- combined mindmap
- per-article title, URL, date, summary, and mindmap
- optional image overview
Non-negotiable quality gates
Before writing the final markdown:
-
Summary language check
- If the summary is not mainly Chinese, retry or fail.
-
Paragraph normalization
- Collapse unnatural line breaks inside prose.
- Keep markdown headings and bullet lists intact.
-
Clean body source
- Prefer repaired text from
raw.htmlwhen the extracted body is corrupted.
- Prefer repaired text from
Bundled scripts
scripts/read_wechat_article.py— fetch WeChat article metadata, body, raw HTML, and image linksscripts/fix_wechat_body.py— repair mojibake and extract clean text from raw HTMLscripts/summarize_cn.py— run summarize in Chinese and enforce a language checkscripts/normalize_markdown_text.py— normalize prose paragraphs and line breaksscripts/build_mindmap_markdown.py— generate single-article markdown filesscripts/build_batch_report.py— generate multi-article combined reportsscripts/run_wechat_mindmap_workflow.py— orchestrate the full workflow end to end after the required user confirmations