# Sustainability RSS Fetch

## Core Goal
- Ingest all RSS/Atom items into SQLite before topic filtering.
- Use `doi` as the primary key in `entries`.
- Keep RSS metadata isolated in its own DB file.
- After semantic screening, keep relevant rows and prune non-relevant rows to DOI-only.
## Triggering Conditions
- Receive a request to import sustainability feeds and persist all fetched records first.
- Receive a request to do prompt-based topic screening after DB ingestion.
- Receive a request to convert irrelevant rows into lightweight DOI-only records.
- Need stable DOI-keyed storage for downstream API/fulltext/summarization.
## Mandatory Workflow
- Prepare runtime and the RSS metadata DB path.

  ```bash
  python3 -m pip install feedparser
  export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
  python3 scripts/rss_subscribe.py init-db --db "$SUSTAIN_RSS_DB_PATH"
  ```
- Collect the RSS window and ingest all fetched items first.

  ```bash
  python3 scripts/rss_subscribe.py collect-window \
    --db "$SUSTAIN_RSS_DB_PATH" \
    --opml assets/journal.opml \
    --start 2026-02-01 \
    --end 2026-02-10 \
    --max-items-per-feed 150 \
    --topic-prompt "Select articles related to sustainability topics: life cycle assessment, material flow analysis, green supply chains, green electricity, green design, pollution and carbon reduction" \
    --output /tmp/sustainability-candidates.json \
    --pretty
  ```
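The date-window filter behind `collect-window` can be sketched as below. The function names and the dict layout of feed items are illustrative assumptions, not the script's actual implementation:

```python
from datetime import date


def in_window(published_iso: str, start: date, end: date) -> bool:
    """Check whether an ISO date string falls inside the inclusive window."""
    d = date.fromisoformat(published_iso[:10])  # tolerate full timestamps
    return start <= d <= end


def collect_window(items, start, end, max_items=150):
    """Keep at most `max_items` items whose published date is in range."""
    kept = [it for it in items if in_window(it["published"], start, end)]
    return kept[:max_items]
```

All matching items are kept before any topic screening, which preserves the "ingest everything first" guarantee.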
- Screen candidates in agent context (semantic, not regex-only).
- Use `topic_prompt` + user instructions.
- Produce the selected `candidate_id` list.
- Mark selected rows as relevant and prune unselected rows.

  ```bash
  python3 scripts/rss_subscribe.py insert-selected \
    --db "$SUSTAIN_RSS_DB_PATH" \
    --candidates /tmp/sustainability-candidates.json \
    --selected-ids 3,7,12,21
  ```

  Result:
  - selected candidates: `is_relevant=1`, keep metadata.
  - unselected candidates: clear metadata fields, keep a DOI-only row (`is_relevant=0`).
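The label/prune write behind `insert-selected` can be sketched as follows. This is a minimal illustration assuming the `entries` columns from the Data Contract section; `label_and_prune` and the candidate dict shape are hypothetical names, not the script's API:

```python
import sqlite3


def label_and_prune(conn, candidates, selected_ids):
    """Mark selected rows relevant; strip unselected rows down to DOI-only."""
    cur = conn.cursor()
    for c in candidates:
        if c["candidate_id"] in selected_ids:
            # Selected: flag as relevant, keep all metadata.
            cur.execute("UPDATE entries SET is_relevant = 1 WHERE doi = ?",
                        (c["doi"],))
        else:
            # Unselected: clear metadata, keep only the DOI-keyed row.
            cur.execute(
                "UPDATE entries SET title = NULL, url = NULL, summary = NULL, "
                "is_relevant = 0 WHERE doi = ?",
                (c["doi"],),
            )
    conn.commit()
```

Doing both writes in one pass keeps the table consistent: every fetched DOI stays queryable, while only relevant rows carry metadata.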
## Optional Maintenance Sync

```bash
python3 scripts/rss_subscribe.py sync --db "$SUSTAIN_RSS_DB_PATH" --max-feeds 20 --max-items-per-feed 100
```
## Source Management

```bash
python3 scripts/rss_subscribe.py add-feed --db "$SUSTAIN_RSS_DB_PATH" --url "https://example.com/feed.xml"
python3 scripts/rss_subscribe.py import-opml --db "$SUSTAIN_RSS_DB_PATH" --opml assets/journal.opml
```
## Query Data

```bash
python3 scripts/rss_subscribe.py list-feeds --db "$SUSTAIN_RSS_DB_PATH" --limit 50
python3 scripts/rss_subscribe.py list-entries --db "$SUSTAIN_RSS_DB_PATH" --limit 100
```
Data Contract
feedstable: subscription and fetch state.entriestable (doiPK):- metadata fields (
title/url/summary/categories/...) doi_is_surrogate(when no DOI is present in source)is_relevant(1relevant,0pruned non-relevant,NULLnot labeled yet)
- metadata fields (
- Non-relevant rows are pruned to DOI-only payload for storage efficiency.
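The contract above can be sketched as a SQLite schema. The exact column names beyond those listed (e.g. `last_status`, `fetched_at` on `feeds`) are assumptions for illustration, not the script's real DDL:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS feeds (
    url         TEXT PRIMARY KEY,  -- subscription URL
    last_status TEXT,              -- last fetch state / error (assumed name)
    fetched_at  TEXT               -- ISO timestamp of last sync (assumed name)
);
CREATE TABLE IF NOT EXISTS entries (
    doi              TEXT PRIMARY KEY,
    title            TEXT,
    url              TEXT,
    summary          TEXT,
    categories       TEXT,
    doi_is_surrogate INTEGER DEFAULT 0,  -- 1 when no DOI was in the source
    is_relevant      INTEGER             -- 1 relevant, 0 pruned, NULL unlabeled
);
"""


def init_db(path=":memory:"):
    """Create the metadata DB with both tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```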
## Configurable Parameters
- `--db` / `SUSTAIN_RSS_DB_PATH`
- `--opml`
- `--feed-url`
- `--use-subscribed-feeds`
- `--topic-prompt`
- `--start` / `--end`
- `--max-feeds`
- `--max-items-per-feed`
- `--user-agent`
- `--cleanup-ttl-days`
## Error and Boundary Handling
- Feed/network failure: continue with other feeds and record errors in feed state.
- Missing `feedparser`: return install guidance.
- Missing DOI in an RSS item: create a deterministic surrogate DOI key to keep the full-ingestion guarantee.
- Invalid selected IDs: fail fast before the label/prune write.
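A surrogate key that is deterministic (the same item always maps to the same key) can be derived by hashing stable item fields. The prefix, helper name, and field priority below are illustrative assumptions, not the script's actual scheme:

```python
import hashlib

SURROGATE_PREFIX = "urn:surrogate:"  # assumed namespace for non-DOI keys


def entry_key(item: dict):
    """Return (key, is_surrogate): the real DOI when present, otherwise a
    deterministic surrogate hashed from the item's link (or title)."""
    doi = (item.get("doi") or "").strip()
    if doi:
        return doi, False
    basis = item.get("link") or item.get("title", "")
    digest = hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]
    return SURROGATE_PREFIX + digest, True
```

Because the key depends only on item content, re-fetching the same feed yields the same primary key, so upserts stay idempotent and `doi_is_surrogate` can be set from the boolean.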
## References
- references/input-model.md
- references/output-rules.md
- references/time-range-rules.md
## Assets
- assets/journal.opml
- assets/config.example.json
## Scripts
- scripts/rss_subscribe.py