PDF & Document Extraction
For DOCX: use python-docx (parses actual document structure, far better than OCR). For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support). This skill covers PDFs and scanned documents.
Step 1: Remote URL Available?
If the document has a URL, always try web_extract first:
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
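The routing rule above can be sketched as a small function. This is only an illustration: `choose_route` and its parameters are hypothetical names that do not exist in this skill, and `web_extract` itself is a tool call rather than a Python function.

```python
from urllib.parse import urlparse


def choose_route(source: str, batch: bool = False, web_extract_failed: bool = False) -> str:
    """Return "web_extract" or "local" for a document source.

    Hypothetical helper: remote URLs go to web_extract unless it has
    already failed or batch processing is required.
    """
    is_url = urlparse(source).scheme in ("http", "https")
    if is_url and not batch and not web_extract_failed:
        return "web_extract"
    return "local"


if __name__ == "__main__":
    print(choose_route("https://arxiv.org/pdf/2402.03300"))
    print(choose_route("./report.pdf"))
```

Keeping the decision in one place makes the fallback order explicit: URL first, local extraction only when one of the three conditions holds.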
Step 2: Choose Local Extractor
| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Image extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
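One way to check the OCR part of this decision automatically is to test whether the PDF has a usable text layer. The sketch below is a heuristic under stated assumptions: `looks_scanned`, `pick_extractor`, and the 20-character threshold are hypothetical and not part of the helper scripts, and the check says nothing about equations, forms, or complex layouts.

```python
from typing import List


def looks_scanned(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: treat a PDF as scanned if most pages have almost no text layer.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def pick_extractor(pdf_path: str) -> str:
    """Return "marker-pdf" for PDFs that look scanned, otherwise "pymupdf"."""
    import pymupdf  # imported lazily so looks_scanned has no dependencies

    with pymupdf.open(pdf_path) as doc:
        texts = [page.get_text() for page in doc]
    return "marker-pdf" if looks_scanned(texts) else "pymupdf"
```

Because pymupdf is the lightweight install, running this check first avoids pulling in marker-pdf for documents that already carry extractable text.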
If the user needs marker capabilities but the system lacks ~5GB free disk:
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
pymupdf (lightweight)
pip install pymupdf pymupdf4llm
Via helper script:
python scripts/extract_pymupdf.py document.pdf               # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
Inline:
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
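For page-limited or Markdown extraction without the helper script, something like the following may work. It is a sketch, not the script's implementation: `parse_page_range` and `extract_pages` are hypothetical names, and treating a spec such as `0-4` as zero-based with an inclusive end is an assumption about what `--pages` means.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Parse "0-4" or "3" into zero-based page indices (end assumed inclusive)."""
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first < 0 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(first, last + 1))


def extract_pages(pdf_path: str, spec: str, markdown: bool = False) -> str:
    """Extract the requested pages as plain text or Markdown."""
    pages = parse_page_range(spec)
    if markdown:
        import pymupdf4llm  # lazy import: only needed for Markdown output

        return pymupdf4llm.to_markdown(pdf_path, pages=pages)

    import pymupdf  # lazy import: only needed for plain-text output

    with pymupdf.open(pdf_path) as doc:
        # Skip indices past the end of the document instead of raising.
        return "\n".join(doc[i].get_text() for i in pages if i < doc.page_count)
```

A call such as `extract_pages("document.pdf", "0-4", markdown=True)` would then return Markdown for the first five pages.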
marker-pdf (high-quality OCR)
Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdf
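The disk-space check can also be done directly in Python before installing. This is a minimal sketch using the standard library; `free_gb` and `marker_space_message` are hypothetical names, the 5GB threshold is the approximate figure quoted in this document, and the helper script's `--check` flag may work differently.

```python
import shutil
from typing import Optional

REQUIRED_GB = 5.0  # approximate requirement quoted in this document


def free_gb(path: str = ".") -> float:
    """Free disk space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def marker_space_message(free: float, required: float = REQUIRED_GB) -> Optional[str]:
    """Return a user-facing warning if there is too little space, else None."""
    if free >= required:
        return None
    return (
        "This document needs OCR/advanced extraction (marker-pdf), which requires "
        f"~{required:.0f}GB for PyTorch and models. Your system has {free:.1f}GB free. "
        "Options: free up space, provide a URL so I can use web_extract, or I can try "
        "pymupdf which works for text-based PDFs but not scanned documents or equations."
    )


if __name__ == "__main__":
    message = marker_space_message(free_gb())
    print(message or "Enough free space for marker-pdf.")
```

Note that the models are cached under the home directory, so the filesystem worth checking is the one holding `~/.cache/huggingface/`, which may differ from the current directory.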
Via helper script:
python scripts/extract_marker.py document.pdf                   # Markdown
python scripts/extract_marker.py document.pdf --json            # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf                    # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm         # LLM-boosted accuracy
CLI (installed with marker-pdf):
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4   # Batch
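When the CLI needs to be driven from Python, the two commands above can be wrapped with `subprocess`. The function names here are hypothetical; only the executables and flags shown above (`marker_single`, `marker`, `--output_dir`, `--workers`) are taken from this document.

```python
import subprocess
from typing import List


def marker_single_command(pdf_path: str, output_dir: str = "./output") -> List[str]:
    """Build the argument list for converting one PDF."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def marker_batch_command(folder: str, workers: int = 4) -> List[str]:
    """Build the argument list for converting every PDF in a folder."""
    return ["marker", folder, "--workers", str(workers)]


def run(command: List[str]) -> None:
    """Run a marker command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.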
Arxiv Papers
Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
Search
web_search(query="arxiv GRPO reinforcement learning 2026")
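The abstract and full-paper URLs in this section differ only in one path segment, so both can be derived from a paper ID. A small sketch, where `arxiv_urls` is a hypothetical helper name:

```python
from typing import Dict


def arxiv_urls(arxiv_id: str) -> Dict[str, str]:
    """Build the abstract and PDF URLs for an arXiv ID such as "2402.03300"."""
    arxiv_id = arxiv_id.strip()
    # Accept the common "arXiv:2402.03300" citation form as well.
    if arxiv_id.lower().startswith("arxiv:"):
        arxiv_id = arxiv_id[len("arxiv:"):]
    return {
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }
```

Either URL can then be passed to `web_extract`, depending on whether the abstract or the full paper is needed.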
Notes
- web_extract is always first choice for URLs
- pymupdf is the safe default: instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts; install only when needed
- Both helper scripts accept --help for full usage
- marker-pdf downloads ~2.5GB of models to ~/.cache/huggingface/ on first use
- For Word docs: pip install python-docx (better than OCR because it parses actual structure)
- For PowerPoint: see the powerpoint skill (uses python-pptx)