Extract PDF Text

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "Extract PDF Text" with this command: npx skills add ivangdavila/extract-pdf-text

When to Use

Use this skill when an agent needs to extract text from PDFs. PyMuPDF (fitz) provides fast local extraction and works with text-based documents, forms, and complex layouts; scanned pages additionally need OCR.

Quick Reference

| Topic | File |
| --- | --- |
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()

3. Pick the Right Method

| PDF Type | Method |
| --- | --- |
| Text-based | page.get_text() (fast, accurate) |
| Scanned | OCR with pytesseract (slower) |
| Mixed | Check each page, use OCR when needed |

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
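The check above can be paired with an OCR fallback. The following is a minimal sketch rather than the skill's own implementation: it assumes pytesseract, Pillow, and a Tesseract binary are installed, and the helper names `ocr_page` and `page_text`, the `min_chars` parameter, and the 300 dpi render are assumptions made for illustration.

```python
def needs_ocr(page, min_chars=50):
    """Return True when the page's text layer is too thin to trust."""
    return len(page.get_text().strip()) < min_chars


def ocr_page(page, dpi=300):
    """Render the page to an image and run Tesseract on it.

    Hypothetical helper: assumes pytesseract, Pillow, and the Tesseract
    binary are installed. Imports are local so the other helpers work
    without them.
    """
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # RGB without alpha by default
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img)


def page_text(page, ocr=ocr_page):
    """Return (text, method), using OCR only when the text layer is thin."""
    if needs_ocr(page):
        return ocr(page), "ocr"
    return page.get_text(), "text"
```

Passing the OCR function as a parameter keeps the routing logic independent of any particular OCR engine, so a different backend can be swapped in without touching `page_text`.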

5. Handle Errors Gracefully

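import fitz

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate.
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")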

Extraction Traps

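| Trap | What Happens | Fix |
| --- | --- | --- |
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | File handle and memory stay in use | Use a with block or call doc.close() |
| Assume block order | Text comes out in the wrong reading order | Pass sort=True to get_text() |
| Ignore encoding | Garbled characters in saved output | get_text() returns a Unicode str; write files with encoding="utf-8" |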
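Several of the fixes above fit in one short helper. This is a minimal sketch, assuming a PyMuPDF version whose `get_text()` accepts `sort=True`; the names `join_pages` and `pdf_to_text_file` are hypothetical and not part of PyMuPDF.

```python
def join_pages(pages, sep="\n"):
    """Concatenate page text, sorted by vertical then horizontal position."""
    return sep.join(page.get_text(sort=True) for page in pages)


def pdf_to_text_file(pdf_path, out_path):
    """Extract a PDF to a UTF-8 text file; the document closes on exit."""
    import fitz  # PyMuPDF; imported here so join_pages has no hard dependency

    with fitz.open(pdf_path) as doc:
        text = join_pages(doc)
    # get_text() returns a Python str, so the encoding is chosen when writing.
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return len(text)
```

The `with` block closes the document even if extraction raises, which avoids the leak described in the table.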

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:

  • Gives code examples for PyMuPDF
  • Explains OCR setup when needed
  • Troubleshoots common issues

This skill NEVER:

  • Accesses files without user request
  • Sends data externally
  • Modifies original PDFs

Security & Privacy

All processing is local:

  • PyMuPDF runs entirely on your machine
  • No external API calls
  • No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
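Span sizes from the dict output can drive simple layout heuristics. As an illustration, the sketch below collects lines whose largest span meets a size threshold, which is one rough way to spot headings. The function name `large_lines` and the 14-point default are assumptions for this example, not part of PyMuPDF.

```python
def large_lines(blocks, min_size=14.0):
    """Return the text of lines whose largest span is at least min_size points.

    `blocks` is the list from page.get_text("dict")["blocks"].
    """
    found = []
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            spans = line["spans"]
            if spans and max(s["size"] for s in spans) >= min_size:
                found.append("".join(s["text"] for s in spans).strip())
    return found
```

A typical call would be `large_lines(page.get_text("dict")["blocks"])`; the threshold usually needs tuning per document, since body text size varies.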

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)

Full Example

import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")

Feedback

  • Useful? clawhub star extract-pdf-text
  • Stay updated: clawhub sync

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
