AI Tech Fulltext Fetch

Core Goal

Reuse the same SQLite database populated by ai-tech-rss-fetch.
Fetch article body text from each RSS entry URL.
Persist extraction status and text in a companion table (entry_content).
Support incremental runs and safe retries without creating duplicate fulltext rows.
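
The duplicate-free guarantee can be illustrated with a minimal sketch that keys the companion table on entry_id and writes through an upsert. The column list below is an assumption made for illustration (only entry_id, status, retry_count, next_retry_at, and last_error are named in this document); the schema that init-db actually creates may differ.

```python
import sqlite3

# Hypothetical schema: the real init-db command may define different columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id      INTEGER PRIMARY KEY,   -- one row per entries row
    status        TEXT NOT NULL,         -- e.g. 'ready' or 'failed'
    content_text  TEXT,                  -- assumed column name
    retry_count   INTEGER NOT NULL DEFAULT 0,
    next_retry_at TEXT,
    last_error    TEXT
)
"""


def save_text(conn, entry_id, text):
    """Insert or update the row for entry_id; never creates a second row.

    Uses SQLite's ON CONFLICT upsert clause (SQLite 3.24 or newer).
    """
    conn.execute(
        """
        INSERT INTO entry_content (entry_id, status, content_text, retry_count)
        VALUES (?, 'ready', ?, 0)
        ON CONFLICT(entry_id) DO UPDATE SET
            status = 'ready',
            content_text = excluded.content_text,
            retry_count = 0,
            next_retry_at = NULL,
            last_error = NULL
        """,
        (entry_id, text),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_text(conn, 1234, "first version")
save_text(conn, 1234, "second version")  # a rerun or retry of the same entry
row_count = conn.execute("SELECT COUNT(*) FROM entry_content").fetchone()[0]
stored = conn.execute(
    "SELECT content_text FROM entry_content WHERE entry_id = 1234"
).fetchone()[0]
```

Because entry_id is the primary key, running the same entry twice updates the existing row instead of adding another one.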

Use this skill in any of these situations:
A request arrives to fetch article body or full text for entries already in ai_rss.db.
A request arrives to build a second-stage pipeline after RSS metadata sync.
A stable, resumable queue over existing entries rows is needed.
URL-based fulltext persistence is needed before chunking, indexing, or summarization.

Run ai-tech-rss-fetch and populate entries in SQLite before using this skill.
This skill requires the entries table to exist.
In multi-agent runtimes, pin the database to the same absolute path used by ai-tech-rss-fetch:

export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"

python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"

python3 scripts/fulltext_fetch.py sync \
  --db "$AI_RSS_DB_PATH" \
  --limit 50 \
  --timeout 20 \
  --min-chars 300

python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$AI_RSS_DB_PATH" \
  --entry-id 1234

python3 scripts/fulltext_fetch.py list-content \
  --db "$AI_RSS_DB_PATH" \
  --status ready \
  --limit 100

URL source priority: use canonical_url first and fall back to url.
Attempt trafilatura extraction when the dependency is available; otherwise fall back to the built-in HTML parser.
Upsert by entry_id:
- Success: write/update full text and reset retry_count to 0.
- Failure with existing ready content: keep old text, keep status ready, record last_error.
- Failure without ready content: status becomes failed, increment retry_count, set next_retry_at.
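
The three upsert outcomes above can be sketched as a pure function that computes the row to write. The row is modeled as a dict, content_text is an assumed column name, and a fixed backoff interval is assumed because the document does not say whether the delay grows between attempts.

```python
from datetime import datetime, timedelta, timezone


def next_state(existing, ok, text=None, error=None, now=None, backoff_minutes=30):
    """Return the row to upsert for one entry after a fetch attempt.

    existing is the current row as a dict, or None when no row exists yet.
    backoff_minutes=30 is an illustrative value, not the script's default.
    """
    now = now or datetime.now(timezone.utc)
    row = dict(existing or {
        "status": None,
        "content_text": None,
        "retry_count": 0,
        "next_retry_at": None,
        "last_error": None,
    })
    if ok:
        # Success: store the text and reset the retry counter.
        row.update(status="ready", content_text=text, retry_count=0,
                   next_retry_at=None, last_error=None)
    elif row["status"] == "ready":
        # Failure with existing ready content: keep text and status.
        row["last_error"] = error
    else:
        # Failure without ready content: mark failed and schedule a retry.
        row["status"] = "failed"
        row["retry_count"] += 1
        row["next_retry_at"] = (now + timedelta(minutes=backoff_minutes)).isoformat()
        row["last_error"] = error
    return row
```

Keeping the decision separate from the SQL write makes each branch easy to test without a database.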
Failed retries are capped by --max-retries (default 3) and paced by --retry-backoff-minutes.
--force allows refetching rows that are already ready.
--refetch-days N allows refreshing rows older than N days.
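
One way to read the retry and refresh rules together is as a single eligibility check per entry, sketched below. The fetched_at field is a hypothetical timestamp of the last successful fetch, and the sketch assumes --force does not override the retry cap, which this document does not state either way.

```python
from datetime import datetime, timedelta, timezone


def should_fetch(row, now, max_retries=3, force=False, refetch_days=None):
    """Decide whether an entry is due for a fetch attempt.

    row is the entry_content row as a dict, or None if never fetched.
    fetched_at is a hypothetical column holding the last success time.
    """
    if row is None:
        return True  # never attempted
    if row["status"] == "ready":
        if force:
            return True
        if refetch_days is not None and row.get("fetched_at"):
            age = now - datetime.fromisoformat(row["fetched_at"])
            return age > timedelta(days=refetch_days)
        return False
    # Failed rows: respect the retry cap and the backoff timestamp.
    if row["retry_count"] >= max_retries:
        return False
    due = row.get("next_retry_at")
    return due is None or datetime.fromisoformat(due) <= now
```

With this shape, an incremental run simply skips every entry for which the check returns False, so repeated runs converge instead of refetching everything.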

Missing entries table: return actionable error and stop.
Network/HTTP/parse errors: store failure state and continue processing other entries.
Non-text content types (PDF/image/audio/video/zip): mark as failed for that entry.
Extracted text shorter than --min-chars: treat as a failure to avoid storing low-quality body text.
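
The error-handling rules above might look like the following sketch. All helper names and the content-type prefix list are hypothetical, and min_chars=300 mirrors the value in the example sync command rather than a confirmed default.

```python
import sqlite3

# Hypothetical prefix list; the script's actual rules are not shown here.
NON_TEXT_PREFIXES = ("application/pdf", "application/zip", "image/", "audio/", "video/")


def require_entries_table(conn):
    """Raise an actionable error if the entries table is missing."""
    found = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'entries'"
    ).fetchone()
    if found is None:
        raise RuntimeError(
            "entries table not found: run ai-tech-rss-fetch against this database first"
        )


def classify(content_type, text, min_chars=300):
    """Return None when the result is acceptable, else a failure reason."""
    ctype = (content_type or "").split(";")[0].strip().lower()
    if ctype.startswith(NON_TEXT_PREFIXES):
        return f"non-text content type: {ctype}"
    if len(text.strip()) < min_chars:
        return f"extracted text shorter than {min_chars} characters"
    return None


def process_all(entry_ids, fetch_one):
    """Run fetch_one per entry; one failure must not stop the batch."""
    results = {}
    for entry_id in entry_ids:
        try:
            results[entry_id] = ("ready", fetch_one(entry_id))
        except Exception as exc:  # network, HTTP, or parse error
            results[entry_id] = ("failed", str(exc))
    return results
```

Only the missing-table check aborts the run; every other failure is recorded per entry so the remaining entries are still processed.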