AI Tech Fulltext Fetch
Core Goal
- Reuse the same SQLite database populated by
ai-tech-rss-fetch.
- Fetch article body text from each RSS entry URL.
- Persist extraction status and text in a companion table (
entry_content).
- Support incremental runs and safe retries without creating duplicate fulltext rows.
Triggering Conditions
- Receive a request to fetch article body/full text for entries already in
ai_rss.db.
- Receive a request to build a second-stage pipeline after RSS metadata sync.
- Need a stable, resumable queue over existing
entries rows.
- Need URL-based fulltext persistence before chunking, indexing, or summarization.
Workflow
- Ensure metadata table exists first.
- Run
ai-tech-rss-fetch and populate entries in SQLite before using this skill.
- This skill requires the
entries table to exist.
- In multi-agent runtimes, pin DB to the same absolute path used by
ai-tech-rss-fetch:
export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
- Initialize fulltext table.
python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
- Run incremental fulltext sync.
- Default behavior fetches rows that are missing full text or currently failed.
python3 scripts/fulltext_fetch.py sync \
--db "$AI_RSS_DB_PATH" \
--limit 50 \
--timeout 20 \
--min-chars 300
- Fetch one entry on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
--db "$AI_RSS_DB_PATH" \
--entry-id 1234
- Inspect extracted content state.
python3 scripts/fulltext_fetch.py list-content \
--db "$AI_RSS_DB_PATH" \
--status ready \
--limit 100
Data Contract
- Reads from existing
entries table:
id, canonical_url, url, title.
- Writes to
entry_content table:
entry_id (unique, one row per entry)
source_url, final_url, http_status
extractor (trafilatura, html-parser, or none)
content_text, content_hash, content_length
status (ready or failed)
retry_count, last_error, timestamps.
Extraction and Update Rules
- URL source priority:
canonical_url first, fallback to url.
- Attempt
trafilatura extraction when dependency is available, fallback to built-in HTML parser.
- Upsert by
entry_id:
- Success: write/update full text and reset
retry_count to 0.
- Failure with existing
ready content: keep old text, keep status ready, record last_error.
- Failure without ready content: status becomes
failed, increment retry_count, set next_retry_at.
- Failed retries are capped by
--max-retries (default 3) and paced by --retry-backoff-minutes.
--force allows refetching already ready rows.
--refetch-days N allows refreshing rows older than N days.
Configurable Parameters
--db
AI_RSS_DB_PATH (recommended absolute path in multi-agent runtime)
--limit
--force
--only-failed
--refetch-days
--oldest-first
--timeout
--max-bytes
--min-chars
--max-retries
--retry-backoff-minutes
--user-agent
--disable-trafilatura
--fail-on-errors
Error Handling
- Missing
entries table: return actionable error and stop.
- Network/HTTP/parse errors: store failure state and continue processing other entries.
- Non-text content types (PDF/image/audio/video/zip): mark as failed for that entry.
- Extraction too short (
--min-chars): treat as failure to avoid low-quality body text.
References
references/schema.md
references/fetch-rules.md
Assets
assets/config.example.json
Scripts
scripts/fulltext_fetch.py