Extracting PDF Text for LLMs

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "extracting-pdf-text" with this command: npx skills add letta-ai/skills/letta-ai-skills-extracting-pdf-text

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |

Recommended Workflow

1. Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
2. If tables are mangled, switch to pdfplumber.
3. If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR (free but slower).
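
The workflow above can be read as a small routing decision. The following is a minimal sketch, not part of the skill's scripts: the `choose_extractor` helper, its parameters, and the 50-characters-per-page threshold for detecting scanned pages are all hypothetical and only illustrate the order of the checks.

```python
from typing import Sequence

# Hypothetical threshold: a page that yields fewer characters than this from
# plain text extraction is treated as a scanned image.
MIN_CHARS_PER_PAGE = 50


def choose_extractor(chars_per_page: Sequence[int],
                     tables_mangled: bool = False,
                     api_available: bool = True) -> str:
    """Return the name of the tool the workflow would reach for next."""
    if not chars_per_page:
        raise ValueError("document has no pages")
    scanned = sum(1 for n in chars_per_page if n < MIN_CHARS_PER_PAGE)
    # Treat the document as scanned when most pages have almost no text layer.
    if scanned > len(chars_per_page) / 2:
        return "mistral-ocr" if api_available else "pytesseract"
    if tables_mangled:
        return "pdfplumber"
    return "pymupdf"


print(choose_extractor([1200, 980, 1430]))
print(choose_extractor([1200, 980], tables_mangled=True))
print(choose_extractor([0, 3, 12], api_available=False))
```

In practice the per-page character counts would come from a first PyMuPDF pass, which is why that tool is tried first.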

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

uv run scripts/extract_pymupdf.py input.pdf output.md

The script outputs markdown that preserves headings and paragraphs. For LLM-optimized output it uses pymupdf4llm, which formats text for RAG systems.
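
For readers who want to call the library directly, here is a minimal sketch of the same idea. It assumes the `pymupdf4llm` package is installed; the `pdf_to_markdown` wrapper is a hypothetical name and may differ from what scripts/extract_pymupdf.py actually does.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str, md_path: str) -> str:
    """Convert a text-based PDF to markdown and write it to md_path."""
    # Imported lazily so this module still loads where pymupdf4llm is absent.
    import pymupdf4llm  # third-party: pip install pymupdf4llm

    markdown = pymupdf4llm.to_markdown(pdf_path)
    Path(md_path).write_text(markdown, encoding="utf-8")
    return markdown
```

Calling `pdf_to_markdown("input.pdf", "output.md")` mirrors the command line shown above.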

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

uv run scripts/extract_pdfplumber.py input.pdf output.md

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
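
The table-to-markdown step can be sketched as follows. This is an illustration under stated assumptions rather than the skill's own code: `table_to_markdown` and `extract_tables_as_markdown` are hypothetical helpers, and the first row of each table is assumed to be its header. pdfplumber's `page.extract_tables()` returns each table as a list of rows, with `None` for empty cells.

```python
from typing import List, Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render one extracted table (first row = header) as a pipe table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def line(cells: Sequence[Optional[str]]) -> str:
        padded = list(cells) + [None] * (width - len(cells))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    out = [line(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    out.extend(line(r) for r in rows[1:])
    return "\n".join(out)


def extract_tables_as_markdown(pdf_path: str) -> List[str]:
    """Collect every table in the PDF, one markdown string per table."""
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                tables.append(table_to_markdown(rows))
    return tables


print(table_to_markdown([["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", None]]))
```

Short rows are padded to the widest row so that every line has the same number of columns, which keeps the markdown table valid even when extraction drops trailing cells.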

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

uv run scripts/extract_with_ocr.py input.pdf output.txt

Requires: pytesseract, pdf2image, and Tesseract installed (brew install tesseract on macOS).
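
The local OCR path boils down to rasterizing each page and passing the image to Tesseract. A minimal sketch, assuming `pdf2image` (which itself relies on Poppler being installed) and `pytesseract` are available; the `ocr_pdf` helper, the 300 dpi default, and the page separator format are hypothetical choices, not taken from scripts/extract_with_ocr.py.

```python
from typing import List


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize each page and run Tesseract on it, returning plain text."""
    # Lazy imports: both packages wrap external binaries that may be missing.
    from pdf2image import convert_from_path  # needs Poppler
    import pytesseract                        # needs the tesseract binary

    pages: List[str] = []
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        text = pytesseract.image_to_string(image, lang=lang)
        # Hypothetical separator so downstream chunking can recover page breaks.
        pages.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(pages)
```

Higher dpi values generally improve recognition of small print at the cost of speed and memory, which is one reason local OCR is the slower option.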

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)

export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md

Features:

Outputs clean markdown
Preserves document structure (headings, lists, tables)
Handles images, math equations, multilingual text
95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.
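
Once an OCR response is in hand, the per-page markdown still has to be joined into one document. The sketch below assumes the response is a JSON object with a `pages` list whose entries carry `index` and `markdown` fields; that shape is an assumption to check against the current API reference, and `ocr_response_to_markdown` is a hypothetical helper rather than part of scripts/extract_mistral_ocr.py.

```python
from typing import Any, Mapping


def ocr_response_to_markdown(response: Mapping[str, Any]) -> str:
    """Join per-page markdown from an OCR response into one document."""
    # Assumed shape: {"pages": [{"index": 0, "markdown": "..."}, ...]}
    pages = sorted(response.get("pages", []), key=lambda p: p.get("index", 0))
    parts = [p.get("markdown", "").strip() for p in pages]
    joined = "\n\n".join(part for part in parts if part)
    return joined + "\n" if joined else ""


sample = {
    "pages": [
        {"index": 1, "markdown": "Second page text."},
        {"index": 0, "markdown": "# Report\n\nFirst page text."},
    ]
}
print(ocr_response_to_markdown(sample))
```

Sorting by page index before joining guards against pages arriving out of order, and blank pages are dropped so they do not leave empty gaps in the output.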

Output Format Recommendations

For LLM consumption, markdown is preferred:

Preserves semantic structure (headings become context boundaries)
Tables remain readable
Compatible with most RAG chunking strategies
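
To illustrate how headings act as context boundaries, here is a minimal heading-based chunker. It is a sketch of one common chunking strategy, not a method the skill prescribes; `chunk_by_headings` and the chunk dictionary layout are hypothetical.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, starting a new chunk at every heading."""
    chunks: List[Dict[str, str]] = []
    title, lines = "", []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if title or body:
            chunks.append({"heading": title, "text": body})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside a code fence are not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


doc = "Intro text\n\n# Setup\nInstall the tool.\n\n## Usage\nRun it.\n"
for chunk in chunk_by_headings(doc):
    print(repr(chunk["heading"]), "->", repr(chunk["text"]))
```

Each chunk keeps its heading alongside its text, so a retriever can surface the section title as context even when only the body matches a query.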

For detailed comparisons of local tools, see references/local-tools.md.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
