pdf-text-extractor

Optionally collect full-text snippets to deepen evidence beyond abstracts.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "pdf-text-extractor" with this command: npx skills add willoscar/research-units-pipeline-skills/willoscar-research-units-pipeline-skills-pdf-text-extractor

PDF Text Extractor

This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.

Inputs

  • papers/core_set.csv (expects paper_id, title, and ideally pdf_url / arxiv_id / url)

  • Optional: outline/mapping.tsv (to prioritize mapped papers)

Outputs

  • papers/fulltext_index.jsonl (one record per attempted paper)

  • Side artifacts:

      • papers/pdfs/<paper_id>.pdf (cached downloads)

      • papers/fulltext/<paper_id>.txt (extracted text)

Decision: evidence mode

  • queries.md can set evidence_mode: "abstract" | "fulltext".

  • abstract (default template): do not download; write an index whose records clearly mark each paper as skipped.

  • fulltext: download PDFs (when possible) and extract text to papers/fulltext/.
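
As a rough illustration of the decision above, the following sketch reads evidence_mode from the text of queries.md. The function name read_evidence_mode and the regular expression are assumptions made for this example; the actual parsing in run.py may differ.

```python
import re


def read_evidence_mode(queries_text: str, default: str = "abstract") -> str:
    """Return the evidence_mode found in queries.md text, or the default.

    Hypothetical helper: accepts lines such as
        - evidence_mode: "fulltext"
    with or without the leading dash and the quotes.
    """
    pattern = re.compile(
        r'^\s*-?\s*evidence_mode:\s*"?(abstract|fulltext)"?\s*$',
        re.MULTILINE,
    )
    match = pattern.search(queries_text)
    # Fall back to abstract mode, which the text above describes as the default.
    return match.group(1) if match else default
```

Falling back to "abstract" keeps the conservative behavior: nothing is downloaded unless fulltext mode is requested explicitly.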

Local PDFs Mode

When you cannot or should not download PDFs (restricted network, rate limits, no permission), provide the PDFs manually and run in “local PDFs only” mode.

  • PDF naming convention: papers/pdfs/<paper_id>.pdf, where <paper_id> matches the paper_id column in papers/core_set.csv.

  • Set - evidence_mode: "fulltext" in queries.md.

  • Run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only

If PDFs are missing, the script writes a to-do list:

  • output/MISSING_PDFS.md (human-readable summary)

  • papers/missing_pdfs.csv (machine-readable list)
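
To preview which PDFs would end up on that to-do list before running the script, a minimal sketch along these lines may help. It assumes only the paths and the paper_id column named above; find_missing_pdfs is a hypothetical helper and is not part of the skill.

```python
import csv
from pathlib import Path
from typing import List


def find_missing_pdfs(workspace: Path) -> List[str]:
    """List paper_ids from papers/core_set.csv that have no papers/pdfs/<paper_id>.pdf."""
    core_set = workspace / "papers" / "core_set.csv"
    pdf_dir = workspace / "papers" / "pdfs"
    missing = []
    with core_set.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            paper_id = (row.get("paper_id") or "").strip()
            # A paper counts as missing when its expected PDF file is absent.
            if paper_id and not (pdf_dir / f"{paper_id}.pdf").is_file():
                missing.append(paper_id)
    return missing
```

The result preserves the row order of core_set.csv, so it can be used directly as a checklist while collecting files.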

Workflow (heuristic)

  • Read papers/core_set.csv.

  • If outline/mapping.tsv exists, prioritize mapped papers first.

  • For each selected paper (fulltext mode):

      • resolve pdf_url (use pdf_url, else derive from arxiv_id / url when possible)

      • download to papers/pdfs/<paper_id>.pdf if missing

      • extract a reasonable prefix of text to papers/fulltext/<paper_id>.txt

      • append/update a JSONL record in papers/fulltext_index.jsonl with status + stats

  • Never overwrite existing extracted text unless explicitly requested (delete the .txt to re-extract).
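
The URL-resolution step in the workflow above could look roughly like the sketch below. It is an assumption about how the fallback might work, not the script's actual logic: resolve_pdf_url is a hypothetical name, and the only external fact relied on is that arXiv serves PDFs under https://arxiv.org/pdf/<id>.

```python
from typing import Optional


def resolve_pdf_url(row: dict) -> Optional[str]:
    """Pick a PDF URL for one core_set.csv row, or None if none can be derived."""
    pdf_url = (row.get("pdf_url") or "").strip()
    if pdf_url:
        return pdf_url  # an explicit pdf_url always wins
    arxiv_id = (row.get("arxiv_id") or "").strip()
    if arxiv_id:
        return f"https://arxiv.org/pdf/{arxiv_id}"
    url = (row.get("url") or "").strip()
    if "arxiv.org/abs/" in url:
        # Turn an arXiv abstract page link into the matching PDF link.
        return url.replace("arxiv.org/abs/", "arxiv.org/pdf/", 1)
    if url.lower().endswith(".pdf"):
        return url
    return None
```

Returning None rather than guessing lets the caller record the paper as unresolved in the index instead of attempting a download that is likely to fail.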

Quality checklist

  • papers/fulltext_index.jsonl exists and is non-empty.

  • If evidence_mode: "fulltext": at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).

  • If evidence_mode: "abstract": the index records clearly reflect skip status (no downloads attempted).
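
The coverage check in this list can be approximated with a short sketch. The record schema is not documented here, so the "status" field and the "ok" value below are assumptions made for illustration; adjust them to whatever run.py actually writes.

```python
import json
from typing import Iterable


def extraction_coverage(index_lines: Iterable[str], ok_status: str = "ok") -> float:
    """Fraction of JSONL records whose (assumed) status field equals ok_status."""
    total = 0
    ok = 0
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the index file
        record = json.loads(line)
        total += 1
        if str(record.get("status", "")).lower() == ok_status:
            ok += 1
    return ok / total if total else 0.0
```

For example, passing the lines of papers/fulltext_index.jsonl gives a number between 0.0 and 1.0 that can be compared against whatever threshold a run treats as "near-zero".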

Script

Quick Start

  • python .codex/skills/pdf-text-extractor/scripts/run.py --help

  • python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>

All Options

  • --max-papers <n>: cap number of papers processed (can be overridden by queries.md)

  • --max-pages <n>: extract at most <n> pages per PDF

  • --min-chars <n>: minimum extracted chars to count as OK

  • --sleep <sec>: delay between downloads

  • --local-pdfs-only: do not download; only use papers/pdfs/<paper_id>.pdf if present

  • queries.md supports: evidence_mode, fulltext_max_papers, fulltext_max_pages, fulltext_min_chars
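
The note that queries.md can override --max-papers suggests a precedence rule like the one sketched below. The mapping of fulltext_* keys to CLI-style names, the line format, and the choice to apply the same precedence to all three keys are assumptions for this example; the source only states the override for --max-papers.

```python
import re

# Assumed mapping from queries.md keys (listed above) to CLI-style setting names.
QUERY_KEYS = {
    "fulltext_max_papers": "max_papers",
    "fulltext_max_pages": "max_pages",
    "fulltext_min_chars": "min_chars",
}


def apply_queries_overrides(cli_settings: dict, queries_text: str) -> dict:
    """Return a copy of cli_settings where values found in queries.md take precedence."""
    merged = dict(cli_settings)
    for key, setting in QUERY_KEYS.items():
        # Accept lines such as: - fulltext_max_papers: 50
        match = re.search(
            rf'^\s*-?\s*{key}:\s*"?(\d+)"?\s*$', queries_text, re.MULTILINE
        )
        if match:
            merged[setting] = int(match.group(1))
    return merged
```

Settings that queries.md does not mention keep their command-line values, and the input dictionary is left unmodified.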

Examples

  • Abstract mode (no downloads):

      • Set - evidence_mode: "abstract" in queries.md, then run the script (it will emit papers/fulltext_index.jsonl with skip statuses)

  • Fulltext mode with local PDFs only:

      • Set - evidence_mode: "fulltext" in queries.md, put PDFs under papers/pdfs/, then run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only

  • Fulltext mode with smaller budget:

      • python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200

Notes

  • Downloads are cached under papers/pdfs/; extracted text is cached under papers/fulltext/.

  • The script does not overwrite existing extracted text unless you delete the .txt file.
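
The caching rule in these notes amounts to a "write only if absent" check. A minimal sketch, assuming nothing beyond the paths above (write_text_if_absent is a hypothetical helper, not the script's own function):

```python
from pathlib import Path


def write_text_if_absent(txt_path: Path, text: str) -> bool:
    """Write extracted text only when no cached file exists; return True if written."""
    if txt_path.exists():
        return False  # keep the cached extraction; delete the .txt to re-extract
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(text, encoding="utf-8")
    return True
```

Because the check is purely file-based, deleting papers/fulltext/<paper_id>.txt is enough to force a fresh extraction on the next run.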

Troubleshooting

Issue: no PDFs are available to download

Fix:

  • Use evidence_mode: "abstract" (default) or provide local PDFs under papers/pdfs/ and rerun with --local-pdfs-only.

Issue: extracted text is empty/garbled

Fix:

  • Try a different extraction backend if supported; otherwise mark the paper as abstract evidence level and avoid strong fulltext claims.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
