Sustainability Fulltext Fetch
Core Goal
- Read relevant DOI entries from RSS metadata DB.
- Write fetched content into a separate fulltext DB.
- Process only relevant entries (is_relevant=1).
- Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
- Fall back to webpage fulltext extraction when API metadata is unavailable.
- Persist one content row per DOI in entry_content.
Triggering Conditions
- Receive a request to enrich relevant DOI records with abstract/fulltext content.
- Receive a request to replace webpage-first crawling with API-first enrichment.
- Need retry-safe incremental updates without duplicate rows.
Workflow
- Ensure upstream DOI/relevance data exists, then set both DB paths and initialize the fulltext DB.
export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"
- Run incremental sync (API first, webpage fallback).
python3 scripts/fulltext_fetch.py sync \
--rss-db "$SUSTAIN_RSS_DB_PATH" \
--content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
--limit 50 \
--openalex-email "you@example.com" \
--api-min-chars 80 \
--min-chars 300
- Fetch one DOI on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
--rss-db "$SUSTAIN_RSS_DB_PATH" \
--content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
--doi "10.1038/nature12373"
- Inspect stored content state.
python3 scripts/fulltext_fetch.py list-content \
--rss-db "$SUSTAIN_RSS_DB_PATH" \
--content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
--status ready \
--limit 100
Data Contract
- Reads from RSS DB table entries: doi, doi_is_surrogate, is_relevant, canonical_url, url, title.
- Writes to fulltext DB table entry_content (primary key doi):
  - source URL/status/extractor
  - content_kind (abstract or fulltext)
  - content_text, content_hash, content_length
  - retry fields and timestamps.
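The write side of this contract can be pictured as a single SQLite table keyed by DOI. The following is a minimal sketch, not the actual schema: only doi, status, content_kind, content_text, content_hash, and content_length are named in this document, and every other column name (source_url, extractor, retry_count, last_error, next_retry_at, updated_at) is a hypothetical placeholder. The authoritative layout is in references/schema.md.

```python
import sqlite3

# Hypothetical schema sketch. Columns not named in this document
# (source_url, extractor, retry_count, last_error, next_retry_at, updated_at)
# are illustrative placeholders only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    source_url     TEXT,
    status         TEXT NOT NULL,
    extractor      TEXT,
    content_kind   TEXT CHECK (content_kind IN ('abstract', 'fulltext')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    updated_at     TEXT
);
"""


def init_db(path: str) -> sqlite3.Connection:
    """Create the table if it is missing and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


# Use an in-memory database so the sketch has no side effects on disk.
conn = init_db(":memory:")
table_info = list(conn.execute("PRAGMA table_info(entry_content)"))
columns = [row[1] for row in table_info]
primary_key = [row[1] for row in table_info if row[5]]
```

Because doi is the primary key, a second plain INSERT for the same DOI is rejected, which is what makes the one-row-per-DOI guarantee enforceable at the storage layer.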
Extraction Priority
- API metadata path:
  - OpenAlex by DOI.
  - Semantic Scholar fallback by DOI.
  - If accepted (--api-min-chars), persist as content_kind=abstract.
- Webpage fallback path:
  - Use canonical_url, then url.
  - Extract with trafilatura when available, else built-in HTML parser.
  - Persist as content_kind=fulltext.
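The priority order above can be sketched as a pure selection function. This is an illustration under assumptions, not the script's actual code: the function names, the extractor labels ("openalex", "semantic_scholar", "webpage"), and the default thresholds (copied from the sync example in Workflow) are hypothetical. One fact the sketch relies on is that OpenAlex returns abstracts as an inverted index (abstract_inverted_index, mapping each word to its positions), so the text has to be rebuilt before its length can be checked.

```python
from typing import Dict, List, Optional, Tuple


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from an OpenAlex-style word -> positions mapping."""
    if not inverted_index:
        return ""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


def choose_content(
    openalex_abstract: Optional[str],
    s2_abstract: Optional[str],
    webpage_text: Optional[str],
    api_min_chars: int = 80,   # assumed; mirrors --api-min-chars in the sync example
    min_chars: int = 300,      # assumed; mirrors --min-chars in the sync example
) -> Optional[Tuple[str, str, str]]:
    """Return (content_kind, extractor, text), or None if nothing qualifies."""
    # API metadata path: OpenAlex first, then Semantic Scholar.
    for extractor, text in (
        ("openalex", openalex_abstract),
        ("semantic_scholar", s2_abstract),
    ):
        text = (text or "").strip()
        if len(text) >= api_min_chars:
            return ("abstract", extractor, text)
    # Webpage fallback path: only reached when no API abstract was accepted.
    text = (webpage_text or "").strip()
    if len(text) >= min_chars:
        return ("fulltext", "webpage", text)
    return None


rebuilt = rebuild_abstract({"Soil": [0], "carbon": [1, 3], "and": [2]})
picked = choose_content("too short", "s" * 100, "w" * 500)
```

In this sketch a too-short OpenAlex abstract does not end the API path; Semantic Scholar is still consulted before the webpage is fetched.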
Update Semantics
- Upsert key: doi.
- Success: status ready, reset retry counters.
- Failure with existing ready row: keep old content, record latest error.
- Failure without ready row: set status=failed, increment retry state.
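These rules map naturally onto SQLite's INSERT ... ON CONFLICT(doi) DO UPDATE. The sketch below is one possible way to express them, assuming a reduced table with hypothetical retry_count and last_error columns; it is not taken from scripts/fulltext_fetch.py. It also assumes that a failure on an already-ready row leaves the retry counter untouched, which the rules above do not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical table: retry_count and last_error are assumed names.
conn.execute(
    """CREATE TABLE entry_content (
           doi TEXT PRIMARY KEY,
           status TEXT,
           content_text TEXT,
           retry_count INTEGER DEFAULT 0,
           last_error TEXT)"""
)


def record_success(conn: sqlite3.Connection, doi: str, text: str) -> None:
    """Insert or overwrite the row, mark it ready, and reset retry state."""
    conn.execute(
        """INSERT INTO entry_content (doi, status, content_text, retry_count, last_error)
           VALUES (?, 'ready', ?, 0, NULL)
           ON CONFLICT(doi) DO UPDATE SET
               status = 'ready',
               content_text = excluded.content_text,
               retry_count = 0,
               last_error = NULL""",
        (doi, text),
    )


def record_failure(conn: sqlite3.Connection, doi: str, error: str) -> None:
    """Record the error; never overwrite content of a row that is already ready."""
    conn.execute(
        """INSERT INTO entry_content (doi, status, retry_count, last_error)
           VALUES (?, 'failed', 1, ?)
           ON CONFLICT(doi) DO UPDATE SET
               last_error = excluded.last_error,
               status = CASE WHEN entry_content.status = 'ready'
                             THEN 'ready' ELSE 'failed' END,
               retry_count = CASE WHEN entry_content.status = 'ready'
                                  THEN entry_content.retry_count
                                  ELSE entry_content.retry_count + 1 END""",
        (doi, error),
    )


def get_row(conn: sqlite3.Connection, doi: str):
    """Fetch (status, content_text, retry_count, last_error) for one DOI."""
    return conn.execute(
        "SELECT status, content_text, retry_count, last_error "
        "FROM entry_content WHERE doi = ?",
        (doi,),
    ).fetchone()


record_success(conn, "10.1/a", "stored abstract")
record_failure(conn, "10.1/a", "timeout")   # ready row keeps its content
record_failure(conn, "10.1/b", "HTTP 404")  # first failure, no prior row
record_failure(conn, "10.1/b", "HTTP 404")  # second failure increments retries
```

Running the same operation twice never creates a second row, which is what makes the sync retry-safe.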
Configurable Parameters
- --rss-db
- --content-db
- SUSTAIN_RSS_DB_PATH
- SUSTAIN_FULLTEXT_DB_PATH
- --limit
- --force
- --only-failed
- --refetch-days
- --timeout
- --max-bytes
- --min-chars
- --openalex-email / OPENALEX_EMAIL
- --s2-api-key / S2_API_KEY
- --api-timeout
- --api-min-chars
- --disable-api-metadata
- --max-retries
- --retry-backoff-minutes
- --user-agent
- --disable-trafilatura
- --fail-on-errors
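Several parameters above are listed as a flag paired with an environment variable. A common way to wire such a pair is to use the environment value as the flag's default, so an explicit flag always wins. The sketch below assumes that behavior; the build_parser helper is hypothetical, and the pairing of --rss-db and --content-db with the SUSTAIN_* variables is likewise an assumption rather than something this document confirms.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Hypothetical parser: environment variables act as flag defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="fulltext_fetch.py")
    # Assumed pairing: the SUSTAIN_* variables back the two DB path flags.
    parser.add_argument("--rss-db", default=env.get("SUSTAIN_RSS_DB_PATH"))
    parser.add_argument("--content-db", default=env.get("SUSTAIN_FULLTEXT_DB_PATH"))
    # Pairings listed explicitly in the parameter list above.
    parser.add_argument("--openalex-email", default=env.get("OPENALEX_EMAIL"))
    parser.add_argument("--s2-api-key", default=env.get("S2_API_KEY"))
    parser.add_argument("--limit", type=int)
    return parser


# An explicit flag overrides the environment value.
fake_env = {"OPENALEX_EMAIL": "env@example.com", "S2_API_KEY": "key-from-env"}
args = build_parser(fake_env).parse_args(
    ["--openalex-email", "cli@example.com", "--limit", "50"]
)
```

Passing the environment as a mapping keeps the sketch testable without modifying the real process environment.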
Error Handling
- Missing DOI-keyed entries table: stop with actionable message.
- RSS DB and fulltext DB path collision: fail fast and require separate files.
- API/network/HTTP failures: record failures and continue queue.
- Webpage non-text content: mark failed for that DOI.
- Short extraction: fail by threshold to avoid low-quality content.
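Two of these behaviors are easy to illustrate: failing fast when both DB paths resolve to the same file, and recording per-DOI failures without stopping the queue. The helpers below (ensure_separate_dbs, process_queue) are hypothetical names for a minimal sketch and are not the script's real functions.

```python
import os
from typing import Callable, Dict, Iterable


def ensure_separate_dbs(rss_db: str, content_db: str) -> None:
    """Raise if both paths point at the same file after resolving symlinks."""
    if os.path.realpath(rss_db) == os.path.realpath(content_db):
        raise ValueError(
            "RSS DB and fulltext DB resolve to the same file; "
            "use separate files for --rss-db and --content-db."
        )


def process_queue(dois: Iterable[str], fetch: Callable[[str], None]) -> Dict[str, str]:
    """Run fetch for every DOI, collecting errors instead of aborting."""
    failures: Dict[str, str] = {}
    for doi in dois:
        try:
            fetch(doi)
        except Exception as exc:  # one bad DOI must not stop the rest
            failures[doi] = str(exc)
    return failures


def fake_fetch(doi: str) -> None:
    """Stand-in fetcher that fails for one DOI to exercise the error path."""
    if doi == "10.1/bad":
        raise TimeoutError("timed out")


failures = process_queue(["10.1/ok", "10.1/bad", "10.1/also-ok"], fake_fetch)
```

A caller could then decide, based on whether failures is empty, to exit non-zero when --fail-on-errors is set; that mapping is an assumption about the flag's intent.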
References
references/schema.md
references/fetch-rules.md
Assets
assets/config.example.json
Scripts
scripts/fulltext_fetch.py