Xiaohongshu Collector
Overview
Use this skill when working on Xiaohongshu collection in forbidden_company, especially for post bodies, comment pagination, cookie updates, single-URL refreshes, or browser-plugin integration.
What To Use
Prefer the existing repo implementation instead of inventing a new flow:
- scripts/collect_xiaohongshu.py
- scripts/admin_server.py
- scripts/run_xiaohongshu_collection.sh
- browser-extension/xhs-collector/
- docs/xiaohongshu-collector.md
- docs/xhs-plugin-api.md
Core Rules
- Keep cookies private. Never repeat them in final output.
- comment_limit=0 means collect all available comments.
- Comment collection must paginate.
- If the direct comment API returns a login/account error, use the browser-rendered fallback.
- Do not rely on Firecrawl for comment pagination.
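The pagination and fallback rules above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/collect_xiaohongshu.py: `LoginRequiredError`, `iter_comments`, `collect_comments`, and the cursor-based page shape are all hypothetical names and assumptions chosen for the example.

```python
from typing import Callable, Iterator, List, Optional, Tuple


class LoginRequiredError(Exception):
    """Hypothetical error for a login/account rejection from the direct comment API."""


# Assumed shape: a page fetcher takes a cursor and returns (comments, next_cursor or None).
PageFetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def iter_comments(fetch_page: PageFetcher, comment_limit: int = 0) -> Iterator[dict]:
    """Yield comments page by page; comment_limit=0 means no cap."""
    cursor: Optional[str] = None
    yielded = 0
    while True:
        comments, cursor = fetch_page(cursor)
        for comment in comments:
            yield comment
            yielded += 1
            if comment_limit and yielded >= comment_limit:
                return
        # Stop when the source reports no further page or returns an empty page.
        if cursor is None or not comments:
            return


def collect_comments(
    api_fetch: PageFetcher, browser_fetch: PageFetcher, comment_limit: int = 0
) -> List[dict]:
    """Try the direct API first; on a login/account error, restart with the browser fallback."""
    try:
        return list(iter_comments(api_fetch, comment_limit))
    except LoginRequiredError:
        # Restart from the first page so partial API results are not mixed in.
        return list(iter_comments(browser_fetch, comment_limit))
```

Restarting from the first page on fallback keeps the result from a single source, at the cost of refetching pages the API already returned.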
Workflow
- Confirm whether the task is batch collection or single-URL refresh.
- Load the saved cookie from data/xiaohongshu-cookie.txt unless a newer cookie is provided.
- Run or update scripts/collect_xiaohongshu.py with the requested URL(s), --db, --refresh-url, and --comment-limit 0 when full comments are needed.
- For browser plugin work, wire the popup/background scripts to the local backend endpoints in scripts/admin_server.py.
- Verify that post rows, comment rows, and exported artifacts are written correctly.
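The cookie-loading and invocation steps above might look like this in outline. The flag names come from this document, but how the script expects their values is an assumption here: the sketch treats batch URLs as positional arguments and gives --refresh-url a single URL, so check the script's own argument parser before relying on it. `load_cookie` and `build_collect_command` are hypothetical helpers.

```python
import sys
from pathlib import Path
from typing import List, Optional, Sequence

COOKIE_PATH = Path("data/xiaohongshu-cookie.txt")
COLLECTOR = "scripts/collect_xiaohongshu.py"


def load_cookie(new_cookie: Optional[str] = None, path: Path = COOKIE_PATH) -> str:
    """Prefer a newly supplied cookie; otherwise read the saved one."""
    if new_cookie and new_cookie.strip():
        return new_cookie.strip()
    return path.read_text(encoding="utf-8").strip()


def build_collect_command(
    urls: Sequence[str],
    db_path: str,
    refresh_url: Optional[str] = None,
    comment_limit: int = 0,
) -> List[str]:
    """Assemble an argv list; the cookie is deliberately never placed on the command line."""
    cmd = [
        sys.executable,
        COLLECTOR,
        "--db",
        str(db_path),
        "--comment-limit",
        str(comment_limit),  # 0 requests all available comments
    ]
    if refresh_url is not None:
        # Assumption: --refresh-url takes the single URL to refresh.
        cmd += ["--refresh-url", refresh_url]
    cmd += list(urls)  # Assumption: batch URLs are positional.
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the cookie out of the argument list stops it from showing up in process listings or logs.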
Endpoint Map
Use these backend endpoints when integrating the browser plugin:
- GET/POST /api/xhs-cookie
- GET /api/xhs-plugin/status
- POST /api/xhs-plugin/collect
- POST /api/xhs-plugin/refresh
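For smoke-testing these endpoints outside the extension, a small Python client along these lines may help. The base URL, the JSON request and response format, and the `url` payload field are assumptions made for the example; docs/xhs-plugin-api.md is the authority on the actual contract.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical address; use whatever host and port scripts/admin_server.py listens on.
BASE_URL = "http://127.0.0.1:8000"


def build_request(
    method: str, path: str, payload: Optional[dict] = None
) -> urllib.request.Request:
    """Build a request for a local backend endpoint, JSON-encoding any payload."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(method: str, path: str, payload: Optional[dict] = None, timeout: float = 30.0):
    """Send the request and decode the response, assuming the server replies with JSON."""
    with urllib.request.urlopen(build_request(method, path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `call("GET", "/api/xhs-plugin/status")` would check plugin status, and `call("POST", "/api/xhs-plugin/refresh", {"url": note_url})` would request a single-URL refresh under the assumed payload shape.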
Validation Notes
- Refresh mode must delete the old note rows before writing the new ones.
- The plugin should expose downloadable CSV and JSON artifacts.
- When debugging, check whether the failure is cookie-related, pagination-related, or page-structure related.
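The refresh rule above (delete the old note rows, then write the new ones) is safest inside a single transaction, so a failed refresh does not leave the note half-deleted. The sketch below assumes SQLite and a hypothetical two-table schema; the repo's real table and column names may differ.

```python
import sqlite3
from typing import Iterable

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (note_url TEXT PRIMARY KEY, body TEXT, fetched_at TEXT);
CREATE TABLE IF NOT EXISTS comments (note_url TEXT, comment_id TEXT, content TEXT);
"""


def refresh_note(
    conn: sqlite3.Connection,
    note_url: str,
    body: str,
    fetched_at: str,
    comments: Iterable[dict],
) -> None:
    """Replace all stored rows for one note atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM comments WHERE note_url = ?", (note_url,))
        conn.execute("DELETE FROM posts WHERE note_url = ?", (note_url,))
        conn.execute(
            "INSERT INTO posts (note_url, body, fetched_at) VALUES (?, ?, ?)",
            (note_url, body, fetched_at),
        )
        conn.executemany(
            "INSERT INTO comments (note_url, comment_id, content) VALUES (?, ?, ?)",
            [(note_url, c["id"], c["content"]) for c in comments],
        )
```

After a refresh, counting rows for the note URL in both tables is a quick check that stale comments were removed rather than duplicated.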
Safety Notes
- Do not propose or implement shared-server mass scraping.
- Keep the browser/plugin model user-driven and local-first.
- Preserve source URLs and timestamps for traceability.
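One way to honor the traceability note when producing the CSV and JSON artifacts is to stamp every exported row with its source URL and fetch time. The column names and the helpers `stamp`, `to_csv`, and `to_json` are hypothetical; this is a sketch of the idea, not the plugin's export code.

```python
import csv
import io
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Hypothetical export columns.
FIELDS = ["source_url", "fetched_at", "comment_id", "content"]


def stamp(rows: List[Dict], source_url: str, fetched_at: Optional[str] = None) -> List[Dict]:
    """Attach the source URL and a UTC ISO-8601 timestamp to every row."""
    ts = fetched_at or datetime.now(timezone.utc).isoformat()
    return [{**row, "source_url": source_url, "fetched_at": ts} for row in rows]


def to_csv(rows: List[Dict]) -> str:
    """Serialize rows to CSV text, ignoring keys outside FIELDS."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows: List[Dict]) -> str:
    """Serialize rows to JSON text, keeping non-ASCII characters readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)
```

Both exports come from the same stamped rows, so the CSV and JSON artifacts stay consistent with each other.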
Reference
See collector-workflow.md for operational details.