Literature Engineer (evidence collector)
Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.
This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports or network access, not to fabricate.
Load Order
Always read:
- references/domain_pack_overview.md — how domain packs drive topic-specific behavior
Domain packs (loaded by topic match):
- assets/domain_packs/llm_agents.json — pinned classic/survey arXiv IDs for LLM agent topics
Script Boundary
Use scripts/run.py only for:
- multi-route offline import, normalization, and provenance tagging
- online arXiv/Semantic Scholar API retrieval
- snowball expansion and deduplication
- retrieval report generation
Do not treat run.py as the place for:
- hardcoded pinned arXiv ID lists (use domain packs)
- hardcoded topic detection logic (use domain packs)
Inputs
- queries.md
  - keywords, exclude, max_results, time window
- Optional offline sources (any combination; all are merged):
  - papers/import.(csv|json|jsonl|bib)
  - papers/arxiv_export.(csv|json|jsonl|bib)
  - papers/imports/*.(csv|json|jsonl|bib)
- Optional snowball exports (offline):
  - papers/snowball/*.(csv|json|jsonl|bib)
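The merge rule above ("any combination; all are merged") can be sketched as a simple discovery pass. This is a hypothetical illustration of the lookup order, not the actual run.py implementation; the helper name `discover_inputs` is ours.

```python
from pathlib import Path

# Extensions accepted for every offline route.
EXTS = (".csv", ".json", ".jsonl", ".bib")

def discover_inputs(workspace: str) -> list[Path]:
    """Collect every offline export the pipeline would merge (sketch)."""
    ws = Path(workspace)
    candidates: list[Path] = []
    # Single-file routes: papers/import.* and papers/arxiv_export.*
    for stem in ("papers/import", "papers/arxiv_export"):
        candidates += [ws / f"{stem}{ext}" for ext in EXTS]
    # Directory routes: papers/imports/* and papers/snowball/*
    for pattern in ("papers/imports/*", "papers/snowball/*"):
        candidates += [p for p in ws.glob(pattern) if p.suffix in EXTS]
    return [p for p in candidates if p.is_file()]
```

Every file found this way would get its own provenance label downstream, so a record can be traced back to the export that produced it.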
Outputs
- papers/papers_raw.jsonl
  - 1 record per line; minimum fields:
    - title (str), authors (list[str]), year (int|""), url (str)
    - stable identifier(s): arxiv_id and/or doi
    - abstract (str; may be empty in offline mode)
    - source (str) + provenance (list[dict])
- papers/papers_raw.csv (human scan)
- papers/retrieval_report.md (route counts, missing-meta stats, next actions)
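The papers_raw.jsonl record shape above can be checked with a small validator. Field names follow the spec; the exact validation rules (e.g. treating an empty arxiv_id and doi as failing) are our assumption.

```python
def is_valid_record(rec: dict) -> bool:
    """Check one papers_raw.jsonl record against the minimum-field spec (sketch)."""
    has_min_fields = (
        isinstance(rec.get("title"), str)
        and isinstance(rec.get("authors"), list)
        and isinstance(rec.get("url"), str)
    )
    # At least one stable identifier must be non-empty.
    has_stable_id = bool(rec.get("arxiv_id") or rec.get("doi"))
    # Provenance must say which route/file/API produced the record.
    has_provenance = isinstance(rec.get("provenance"), list) and bool(rec.get("source"))
    return has_min_fields and has_stable_id and has_provenance
```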
Workflow (multi-route)
- Offline-first merge: ingest all available offline exports (and label provenance per file).
- Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
- Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
- Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
- Report: write a concise retrieval report with coverage buckets and missing-meta counts.
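The normalize + dedupe step above ("merge duplicates while unioning provenance") can be sketched as keying records on a canonical identifier. The key priority (arxiv_id, then doi, then normalized url) is our assumption, not necessarily what run.py does.

```python
def canonical_key(rec: dict) -> str:
    """Pick a canonical identifier for deduplication (sketch)."""
    return rec.get("arxiv_id") or rec.get("doi") or rec.get("url", "").rstrip("/").lower()

def dedupe(records: list[dict]) -> list[dict]:
    """Merge records sharing a key; union their provenance lists."""
    merged: dict[str, dict] = {}
    for rec in records:
        key = canonical_key(rec)
        if key in merged:
            # Union provenance instead of dropping the duplicate outright.
            merged[key]["provenance"].extend(
                p for p in rec.get("provenance", [])
                if p not in merged[key]["provenance"]
            )
        else:
            merged[key] = {**rec, "provenance": list(rec.get("provenance", []))}
    return list(merged.values())
```

Unioning (rather than discarding) provenance is what lets the retrieval report show which routes independently found the same paper.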
Quality checklist
- Candidate pool size target met (A150++: ≥1200) without fabrication.
- Each record has a stable identifier (arxiv_id or doi, plus url).
- Each record has provenance: which route/file/API produced it.
Script
Quick Start
- python .codex/skills/literature-engineer/scripts/run.py --help
All Options
- See python .codex/skills/literature-engineer/scripts/run.py --help.
- Reads retrieval config from queries.md.
- Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
- Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
- Online expansion requires network: use --online and/or --snowball.
- Online retrieval is best-effort: the arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.
- For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts plus a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
- If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still self-boot without manual exports.
- When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).
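The pinned id_list fetch mentioned above maps onto the public arXiv API's id_list query parameter. Here is a minimal sketch of constructing such a request (no network needed to build the URL); the example IDs are illustrative and the helper is ours, not part of run.py.

```python
from urllib.parse import urlencode

# Public arXiv query endpoint (returns an Atom feed).
ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_id_list_url(ids: list[str]) -> str:
    """Build a pinned id_list fetch URL for a batch of arXiv IDs."""
    params = {"id_list": ",".join(ids), "max_results": len(ids)}
    return f"{ARXIV_API}?{urlencode(params)}"
```

Fetching this URL and parsing the Atom feed yields full metadata for the pinned classics, independent of whether keyword search surfaced them.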
Examples
Offline imports only:
- Put exports under papers/imports/, then run:
  - python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
Explicit offline inputs (multi-route):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
Online arXiv retrieval (needs network):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
Snowballing (needs network unless you provide offline snowball exports):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball
Troubleshooting
Issue: can't reach ≥1200 papers
Symptom:
- papers/papers_raw.jsonl is far below the target size; later stages will fail mapping/bindings and citation density.
Causes:
- Only a small offline export was provided.
- Network is blocked, so online retrieval/snowballing can't run.
Solutions:
- Provide additional exports under papers/imports/ (multiple routes/queries).
- Provide snowball exports under papers/snowball/.
- Enable network and rerun with --online --snowball.
Issue: many records missing stable IDs
Symptom:
- Report shows many entries with empty arxiv_id and doi.
Solutions:
- Prefer arXiv/OpenReview/ACL exports that include stable IDs.
- If you have network access, rerun with --online to backfill arXiv IDs.
- Filter out ID-less entries before downstream citation generation.
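The last solution above can be done with a short filter pass over papers_raw.jsonl. This is a sketch assuming the record schema from the Outputs section; the function name is ours.

```python
import json

def filter_with_ids(path_in: str, path_out: str) -> int:
    """Keep only records with a non-empty arxiv_id or doi; return the kept count."""
    kept = 0
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            if rec.get("arxiv_id") or rec.get("doi"):
                fout.write(json.dumps(rec) + "\n")
                kept += 1
    return kept
```

Running this before citation generation keeps ID-less noise out of ref.bib while leaving papers_raw.jsonl itself untouched.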