Sustainability Fulltext Fetch

Core Goal

Read relevant DOI entries from RSS metadata DB.
Write fetched content into a separate fulltext DB.
Process only relevant entries (is_relevant=1).
Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
Fall back to webpage fulltext extraction when API metadata is unavailable.
Persist one content row per DOI in entry_content.

Triggering Conditions

Receive a request to enrich relevant DOI records with abstract/fulltext content.
Receive a request to replace webpage-first crawling with API-first enrichment.
Need retry-safe incremental updates without duplicate rows.

Workflow

Ensure upstream DOI/relevance data exists, then initialize the fulltext DB.

export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"

Run incremental sync (API first, webpage fallback).

python3 scripts/fulltext_fetch.py sync \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --limit 50 \
  --openalex-email "you@example.com" \
  --api-min-chars 80 \
  --min-chars 300

Fetch one DOI on demand.

python3 scripts/fulltext_fetch.py fetch-entry \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --doi "10.1038/nature12373"

Inspect stored content state.

python3 scripts/fulltext_fetch.py list-content \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --status ready \
  --limit 100

Data Contract

Reads from the RSS DB entries table:
- doi, doi_is_surrogate, is_relevant, canonical_url, url, title.
Writes to the fulltext DB entry_content table (primary key: doi):
- source URL/status/extractor
- content_kind (abstract or fulltext)
- content_text, content_hash, content_length
- retry fields and timestamps.
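The read side of this contract can be sketched as a single query. This is a minimal illustration, assuming the table is named entries (as the contract suggests) and has the listed columns. The function name load_relevant_entries and the ordering by doi are hypothetical; references/schema.md remains the authoritative definition.

```python
import sqlite3
from typing import Dict, List


def load_relevant_entries(rss_conn: sqlite3.Connection, limit: int = 50) -> List[Dict]:
    """Return relevant, DOI-keyed rows from the RSS DB (hypothetical helper)."""
    cur = rss_conn.execute(
        """
        SELECT doi, doi_is_surrogate, canonical_url, url, title
        FROM entries
        WHERE is_relevant = 1          -- process only relevant entries
          AND doi IS NOT NULL
          AND doi <> ''
        ORDER BY doi                   -- assumed ordering, for stable batches
        LIMIT ?
        """,
        (limit,),
    )
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur]
```

Each returned dict carries the two URL candidates and the title, which is everything the fetch step needs without querying the RSS DB again.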

Extraction Priority

API metadata path:

OpenAlex by DOI.
Semantic Scholar fallback by DOI.
If accepted (--api-min-chars), persist as content_kind=abstract.
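A minimal sketch of the API path, for illustration only. It relies on two public API conventions: OpenAlex looks up a work by its DOI URL and returns the abstract as an abstract_inverted_index (word to positions), and the Semantic Scholar Graph API accepts a DOI: prefixed paper id. The helper names (openalex_work_url, s2_paper_url, abstract_from_inverted_index, first_accepted) and the default of 80 characters are assumptions, not the script's actual internals.

```python
from typing import Dict, List, Optional, Sequence, Tuple
from urllib.parse import quote, urlencode


def openalex_work_url(doi: str, email: Optional[str] = None) -> str:
    """Build an OpenAlex single-work lookup URL; mailto is optional."""
    base = "https://api.openalex.org/works/https://doi.org/" + quote(doi, safe="/")
    return base + ("?" + urlencode({"mailto": email}) if email else "")


def s2_paper_url(doi: str) -> str:
    """Build a Semantic Scholar Graph API lookup URL for the same DOI."""
    return (
        "https://api.semanticscholar.org/graph/v1/paper/DOI:"
        + quote(doi, safe="/")
        + "?fields=title,abstract"
    )


def abstract_from_inverted_index(index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from an OpenAlex abstract_inverted_index."""
    if not index:
        return ""
    positioned = [(pos, word) for word, positions in index.items() for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


def first_accepted(
    candidates: Sequence[Tuple[str, str]], api_min_chars: int = 80
) -> Optional[Tuple[str, str]]:
    """Return the first (source, text) long enough to accept, else None."""
    for source, text in candidates:
        if len(text.strip()) >= api_min_chars:
            return source, text.strip()
    return None  # caller would fall through to the webpage path
```

Passing candidates in priority order (OpenAlex, then Semantic Scholar) keeps the fallback rule in one place; a None result signals that the webpage path should run.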

Webpage fallback path:

Use canonical_url then url.
Extract with trafilatura when available, else built-in HTML parser.
Persist as content_kind=fulltext.
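The webpage path above could look roughly like the sketch below. It tries canonical_url before url, uses trafilatura.extract when the package can be imported, and otherwise falls back to the standard library's html.parser. The class and function names are hypothetical, and the set of skipped tags is an assumption about what a simple built-in parser would ignore.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _TextCollector(HTMLParser):
    """Collect visible text, skipping script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def builtin_extract(html: str) -> str:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def extract_text(html: str, use_trafilatura: bool = True) -> Tuple[str, str]:
    """Return (text, extractor_name); trafilatura is used only if importable."""
    if use_trafilatura:
        try:
            import trafilatura  # optional dependency
        except ImportError:
            pass
        else:
            return trafilatura.extract(html) or "", "trafilatura"
    return builtin_extract(html), "html.parser"


def candidate_urls(canonical_url: Optional[str], url: Optional[str]) -> List[str]:
    """canonical_url first, then url, without duplicates or empty values."""
    ordered: List[str] = []
    for candidate in (canonical_url, url):
        if candidate and candidate not in ordered:
            ordered.append(candidate)
    return ordered
```

Calling extract_text(html, use_trafilatura=False) mirrors what --disable-trafilatura presumably does; the returned extractor name is the kind of value that could be stored in the extractor field.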

Update Semantics

Upsert key: doi.
Success: set status=ready and reset retry counters.
Failure with existing ready row: keep old content, record latest error.
Failure without ready row: set status=failed, increment retry state.
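These rules map naturally onto SQLite's INSERT ... ON CONFLICT upsert. The sketch below is an illustration under assumptions: the trimmed table, the retry_count and last_error column names, and the choice of SHA-256 for content_hash are hypothetical, and the real schema lives in references/schema.md.

```python
import hashlib
import sqlite3

# Trimmed, hypothetical schema: only the columns needed to show the rules.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    content_kind   TEXT,
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT
)
"""


def record_success(conn: sqlite3.Connection, doi: str, kind: str, text: str) -> None:
    """Success: status ready, content replaced, retry state reset."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    conn.execute(
        """
        INSERT INTO entry_content
            (doi, status, content_kind, content_text, content_hash,
             content_length, retry_count, last_error)
        VALUES (?, 'ready', ?, ?, ?, ?, 0, NULL)
        ON CONFLICT(doi) DO UPDATE SET
            status = 'ready',
            content_kind = excluded.content_kind,
            content_text = excluded.content_text,
            content_hash = excluded.content_hash,
            content_length = excluded.content_length,
            retry_count = 0,
            last_error = NULL
        """,
        (doi, kind, text, digest, len(text)),
    )


def record_failure(conn: sqlite3.Connection, doi: str, error: str) -> None:
    """Failure: keep a ready row's content; otherwise mark failed and count it."""
    conn.execute(
        """
        INSERT INTO entry_content (doi, status, retry_count, last_error)
        VALUES (?, 'failed', 1, ?)
        ON CONFLICT(doi) DO UPDATE SET
            last_error = excluded.last_error,
            retry_count = CASE
                WHEN entry_content.status = 'ready' THEN entry_content.retry_count
                ELSE entry_content.retry_count + 1
            END
        """,
        (doi, error),
    )
```

Because doi is the primary key, repeated runs update the same row rather than inserting duplicates, which is what makes the sync retry-safe.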

Configurable Parameters

--rss-db
--content-db
SUSTAIN_RSS_DB_PATH
SUSTAIN_FULLTEXT_DB_PATH
--limit
--force
--only-failed
--refetch-days
--timeout
--max-bytes
--min-chars
--openalex-email / OPENALEX_EMAIL
--s2-api-key / S2_API_KEY
--api-timeout
--api-min-chars
--disable-api-metadata
--max-retries
--retry-backoff-minutes
--user-agent
--disable-trafilatura
--fail-on-errors
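For flags paired with an environment variable, one common pattern is to use the variable as the argparse default so an explicit flag always wins. The sketch below assumes that precedence, and also assumes that SUSTAIN_RSS_DB_PATH and SUSTAIN_FULLTEXT_DB_PATH back --rss-db and --content-db. The numeric defaults are simply the values from the sync example, not confirmed script defaults, and build_sync_parser is a hypothetical name.

```python
import argparse
import os
from typing import Mapping, Optional


def build_sync_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser for a subset of the sync flags (illustrative only)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="fulltext_fetch.py sync")
    # Assumed: environment variables act as fallbacks for the matching flags.
    parser.add_argument("--rss-db", default=env.get("SUSTAIN_RSS_DB_PATH"))
    parser.add_argument("--content-db", default=env.get("SUSTAIN_FULLTEXT_DB_PATH"))
    parser.add_argument("--openalex-email", default=env.get("OPENALEX_EMAIL"))
    parser.add_argument("--s2-api-key", default=env.get("S2_API_KEY"))
    # Assumed defaults, copied from the workflow example.
    parser.add_argument("--limit", type=int, default=50)
    parser.add_argument("--api-min-chars", type=int, default=80)
    parser.add_argument("--min-chars", type=int, default=300)
    return parser
```

Accepting env as a parameter keeps the parser easy to exercise without touching the real process environment.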

Error Handling

Missing DOI-keyed entries table: stop with actionable message.
RSS DB and fulltext DB path collision: fail fast and require separate files.
API/network/HTTP failures: record failures and continue queue.
Webpage non-text content: mark failed for that DOI.
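Short extraction: fail when the extracted text is shorter than the configured threshold (--min-chars, or --api-min-chars for API abstracts) to avoid storing low-quality content.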
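Two of these rules are easy to illustrate: the fail-fast path collision check and the "record failures and continue" queue loop. This is a sketch under assumptions; ensure_separate_dbs, process_queue, and the fetch_one callback are hypothetical names, and comparing resolved paths is one possible way to detect a collision.

```python
import os
from typing import Callable, Dict, Iterable


def ensure_separate_dbs(rss_db: str, content_db: str) -> None:
    """Fail fast if both paths resolve to the same file."""
    if os.path.realpath(rss_db) == os.path.realpath(content_db):
        raise SystemExit(
            "RSS DB and fulltext DB resolve to the same file; "
            "point --content-db at a separate path."
        )


def process_queue(dois: Iterable[str], fetch_one: Callable[[str], None]) -> Dict[str, str]:
    """Run fetch_one per DOI; collect errors instead of aborting the batch."""
    failures: Dict[str, str] = {}
    for doi in dois:
        try:
            fetch_one(doi)
        except Exception as exc:  # API, network, and HTTP errors alike
            failures[doi] = str(exc)
    return failures
```

A caller could exit non-zero when the returned dict is non-empty, which is presumably the kind of behavior --fail-on-errors enables.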

References

references/schema.md
references/fetch-rules.md

Assets

assets/config.example.json

Scripts

scripts/fulltext_fetch.py
