PDF & Document Extraction

For DOCX: use python-docx (parses actual document structure, far better than OCR). For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support). This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try web_extract first:

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
| --- | --- | --- |
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
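
The decision rule above can be sketched as a small function. The capability names in `MARKER_ONLY` are hypothetical labels invented for this example (they mirror the table rows where only marker-pdf has a check mark); nothing here is part of either library.

```python
from typing import Set

# Hypothetical capability labels: the features that only marker-pdf
# supports according to the comparison table.
MARKER_ONLY = {
    "ocr",
    "equations",
    "code_blocks",
    "forms",
    "header_footer_removal",
    "reading_order",
    "image_ocr",
}


def choose_extractor(needs: Set[str]) -> str:
    """Return 'marker-pdf' if any requested capability needs it, else 'pymupdf'."""
    return "marker-pdf" if needs & MARKER_ONLY else "pymupdf"


print(choose_extractor({"tables"}))
print(choose_extractor({"tables", "ocr"}))
```

Keeping pymupdf as the fallthrough matches the guidance that it is the default and marker-pdf is opt-in.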

If the user needs marker capabilities but the system has less than ~5GB of free disk space, tell them:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

pymupdf (lightweight)

pip install pymupdf pymupdf4llm

Via helper script:

python scripts/extract_pymupdf.py document.pdf                  # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown       # Markdown
python scripts/extract_pymupdf.py document.pdf --tables         # Tables
python scripts/extract_pymupdf.py document.pdf --images out/    # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata       # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4      # Specific pages

Inline:

python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
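
When only some pages are needed without the helper script, the inline approach can be extended along these lines. This is a sketch under assumptions: `parse_page_range` and `extract_pages` are hypothetical names, and treating a spec such as `0-4` as zero-based with an inclusive end is a guess that may not match how the helper script's `--pages` option behaves.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Turn '0-4' or '3' into zero-based page indices.

    Assumption: the end of the range is inclusive.
    """
    start, sep, end = spec.partition("-")
    first = int(start)
    last = int(end) if sep else first
    if first < 0 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(first, last + 1))


def extract_pages(path: str, spec: str) -> str:
    """Return the plain text of the selected pages, skipping indices past the end."""
    import pymupdf  # imported lazily so the parser works without pymupdf installed

    with pymupdf.open(path) as doc:
        indices = [i for i in parse_page_range(spec) if i < doc.page_count]
        return "\n".join(doc[i].get_text() for i in indices)
```

For example, `extract_pages("document.pdf", "0-4")` would return the text of the first five pages under the inclusive-range assumption.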

marker-pdf (high-quality OCR)

Check disk space first

python scripts/extract_marker.py --check

pip install marker-pdf
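
What the `--check` flag does internally is not shown here. As a rough stand-in, free disk space can be measured with the standard library before installing. The function names and the 5GB threshold constant below are assumptions for illustration; the message text follows the wording suggested earlier for the low-disk case.

```python
import shutil
from typing import Optional

REQUIRED_GB = 5.0  # assumed threshold, taken from the "~5GB" figure


def free_gb(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def marker_disk_message(free: float, required: float = REQUIRED_GB) -> Optional[str]:
    """Return a message for the user if there is too little space, else None."""
    if free >= required:
        return None
    return (
        "This document needs OCR/advanced extraction (marker-pdf), which "
        f"requires ~{required:g}GB for PyTorch and models. "
        f"Your system has {free:.1f}GB free. Options: free up space, provide "
        "a URL so I can use web_extract, or I can try pymupdf which works for "
        "text-based PDFs but not scanned documents or equations."
    )


message = marker_disk_message(free_gb())
print(message or "Enough free space for marker-pdf.")
```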

Via helper script:

python scripts/extract_marker.py document.pdf                    # Markdown
python scripts/extract_marker.py document.pdf --json             # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                     # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm          # LLM-boosted accuracy

CLI (installed with marker-pdf):

marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
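
To drive the CLI from Python, one option is a thin `subprocess` wrapper. The sketch below uses only the `marker_single` invocation and `--output_dir` flag shown above; `build_marker_cmd` and `run_marker` are hypothetical helper names, not part of marker-pdf.

```python
import shutil
import subprocess
from typing import List


def build_marker_cmd(pdf_path: str, output_dir: str) -> List[str]:
    """Assemble the marker_single invocation as an argv list."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def run_marker(pdf_path: str, output_dir: str = "./output") -> int:
    """Run marker_single if it is on PATH and return its exit code."""
    if shutil.which("marker_single") is None:
        raise FileNotFoundError(
            "marker_single not found; run `pip install marker-pdf` first"
        )
    return subprocess.run(build_marker_cmd(pdf_path, output_dir), check=False).returncode
```

Passing an argv list rather than a shell string avoids quoting problems with file names that contain spaces.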

Arxiv Papers

Abstract only (fast)

web_extract(urls=["https://arxiv.org/abs/2402.03300"])

Full paper

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

Search

web_search(query="arxiv GRPO reinforcement learning 2026")
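
Since the abstract and full-paper URLs differ only in the `abs` versus `pdf` path segment, both can be derived from one identifier. The helper below is a sketch with a hypothetical name, `arxiv_urls`; it assumes new-style identifiers such as `2402.03300` (optionally with a version suffix) and does not handle older category-prefixed IDs.

```python
import re
from typing import Dict

# New-style arXiv identifier, e.g. 2402.03300 or 2402.03300v2.
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5}(?:v\d+)?)")


def arxiv_urls(ref: str) -> Dict[str, str]:
    """Map an arXiv ID or URL to its abstract and PDF URLs."""
    match = _ARXIV_ID.search(ref)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {ref!r}")
    paper_id = match.group(1)
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }


urls = arxiv_urls("https://arxiv.org/abs/2402.03300")
print(urls["pdf"])
```

The resulting URLs can then be passed to `web_extract` as in the calls above.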

Notes

web_extract is always first choice for URLs
pymupdf is the safe default — instant, no models, works everywhere
marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
Both helper scripts accept --help for full usage
marker-pdf downloads ~2.5GB of models to ~/.cache/huggingface/ on first use
For Word docs: pip install python-docx (better than OCR — parses actual structure)
For PowerPoint: see the powerpoint skill (uses python-pptx)
