docs-pdf

Use this skill whenever the user wants to do anything with PDF files. Triggers include: reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs from scratch, filling PDF forms, encrypting or password-protecting PDFs, extracting images from PDFs, OCR on scanned PDFs, compressing/optimizing PDFs, viewing PDF info/metadata, converting images to PDF, converting PDF to images, comparing two PDFs, reordering pages, repairing corrupted PDFs, and listing fonts. Also trigger when the user uploads a .pdf file and asks you to do something with it, or when they mention "PDF" in any context involving file creation, editing, or data extraction — even if they don't say "PDF skill" explicitly.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "docs-pdf" with this command: npx skills add echodjx/docs-pdf

PDF Skill

Complete guide for PDF operations using Python libraries and CLI tools.

⚡ Feature Cheat Sheet

One-line lookup for every supported operation — find the right tool instantly.

| What you want to do | Command / Script | One-liner example |
|---|---|---|
| 📖 Extract text | scripts/extract_text.py | python scripts/extract_text.py doc.pdf |
| 📊 Extract tables → Excel | scripts/extract_tables.py | python scripts/extract_tables.py report.pdf -o tables.xlsx |
| 🔗 Merge PDFs | scripts/merge_pdfs.py | python scripts/merge_pdfs.py "*.pdf" -o merged.pdf |
| ✂️ Split PDF | scripts/split_pdf.py | python scripts/split_pdf.py big.pdf --each |
| 🔄 Rotate pages | scripts/batch_convert.py rotate | python scripts/batch_convert.py rotate input.pdf -d 90 |
| 🔀 Reorder pages | scripts/reorder_pdf.py | python scripts/reorder_pdf.py input.pdf --order "3,1,2,4-" -o reordered.pdf |
| 💧 Add text watermark | scripts/watermark.py | python scripts/watermark.py doc.pdf -t "CONFIDENTIAL" |
| 🖼️ Add image watermark | scripts/watermark.py | python scripts/watermark.py doc.pdf --image logo.png --alpha 0.3 |
| 🔒 Encrypt PDF | pypdf (inline) | see Password Protect below |
| 📝 Fill PDF form | scripts/fill_pdf_form.py | python scripts/fill_pdf_form.py form.pdf -o filled.pdf --set name="Alice" |
| 🔍 Check form fields | scripts/check_fillable_fields.py | python scripts/check_fillable_fields.py form.pdf |
| 🖼️ OCR scanned PDF | scripts/ocr_pdf.py | python scripts/ocr_pdf.py scan.pdf --lang eng |
| 📄 Create PDF from scratch | reportlab (inline) | see references/create.md |
| 📦 Batch operations | scripts/batch_convert.py | python scripts/batch_convert.py merge --help |
| 📏 Compress / optimize | scripts/compress_pdf.py | python scripts/compress_pdf.py input.pdf -o output.pdf --quality medium |
| ℹ️ View PDF info | scripts/pdf_info.py | python scripts/pdf_info.py input.pdf |
| 🖼️→📄 Images to PDF | scripts/images_to_pdf.py | python scripts/images_to_pdf.py "photos/*.jpg" -o album.pdf --page-size A4 |
| 📄→🖼️ PDF to images | scripts/pdf_to_images.py | python scripts/pdf_to_images.py input.pdf -o pages/ --format png --dpi 200 |
| 🔎 Compare two PDFs | scripts/compare_pdf.py | python scripts/compare_pdf.py old.pdf new.pdf -o diff_report.html |
| 🔧 Repair corrupted PDF | scripts/repair_pdf.py | python scripts/repair_pdf.py broken.pdf -o fixed.pdf |
| 🔤 List fonts | scripts/list_fonts.py | python scripts/list_fonts.py input.pdf |

💡 Run any script with --help to see all available options.
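
The reorder example above uses the order string "3,1,2,4-". The exact grammar accepted by scripts/reorder_pdf.py is not documented here, so the following is only a minimal sketch of how such a string could be expanded, assuming pages are 1-based and an open-ended range like "4-" means "page 4 through the last page". The function name parse_order is hypothetical.

```python
def parse_order(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based order spec such as "3,1,2,4-" into 0-based page indices.

    Assumed grammar (not confirmed against scripts/reorder_pdf.py):
    "N" is a single page, "A-B" an inclusive range, "A-" runs to the last
    page, and "-B" starts at page 1.
    """
    indices: list[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else page_count
        else:
            start = end = int(token)
        if not (1 <= start <= page_count and 1 <= end <= page_count):
            raise ValueError(f"page range {token!r} is outside 1..{page_count}")
        # A descending range such as "5-3" is expanded in reverse order.
        step = 1 if start <= end else -1
        indices.extend(range(start - 1, end - 1 + step, step))
    return indices


print(parse_order("3,1,2,4-", 6))  # [2, 0, 1, 3, 4, 5]
```

The resulting indices can be passed one by one to writer.add_page(reader.pages[i]), the pypdf call listed for reordering in the Quick Reference Table.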


Quick Decision Guide

What do you need?
├── Create a new PDF from scratch     → reportlab  (see references/create.md)
├── Extract text / tables             → pdfplumber  (see references/extract.md)
├── Merge / split / rotate pages      → pypdf or qpdf CLI
├── Reorder pages                     → scripts/reorder_pdf.py
├── Add watermark / encrypt / protect → pypdf
├── Fill out a PDF form               → pdf-lib (JS) or pypdf  (see FORMS.md)
├── Extract images from PDF           → pdfimages CLI or pypdf
├── OCR a scanned PDF                 → pdf2image + pytesseract
├── Compress / reduce file size       → scripts/compress_pdf.py (qpdf + pypdf)
├── View PDF info / metadata          → scripts/pdf_info.py
├── Convert images → PDF              → scripts/images_to_pdf.py (reportlab)
├── Convert PDF → images              → scripts/pdf_to_images.py (pdf2image)
├── Compare / diff two PDFs           → scripts/compare_pdf.py
├── Repair a corrupted PDF            → scripts/repair_pdf.py (qpdf + pypdf)
└── List fonts in a PDF               → scripts/list_fonts.py

Installation

Linux (Ubuntu/Debian)

# Python libraries
pip install pypdf pdfplumber reportlab pdf2image pytesseract Pillow pandas --break-system-packages

# System tools
sudo apt-get install -y poppler-utils tesseract-ocr qpdf

# For Chinese OCR
sudo apt-get install -y tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# Node.js (form filling)
npm install pdf-lib

macOS (Homebrew)

# System tools (required for OCR and CLI operations)
brew install qpdf poppler tesseract

# IMPORTANT: Language packs must be installed separately for non-English OCR
brew install tesseract-lang

# Python libraries
pip install pypdf pdfplumber reportlab pdf2image pytesseract Pillow pandas --break-system-packages

# Node.js (form filling)
npm install pdf-lib

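⚠️ macOS note: tesseract-lang must be installed separately; otherwise OCR for Chinese, Japanese, and other non-English languages will fail. After installing, run tesseract --list-langs to confirm the available languages.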

Verify Installation

# Check Python libraries
python3 -c "import pypdf, pdfplumber, reportlab, PIL; print('✓ Python libs OK')"

# Check system tools
which qpdf       && echo "✓ qpdf OK"       || echo "✗ qpdf not installed"
which tesseract   && echo "✓ tesseract OK"  || echo "✗ tesseract not installed"
which pdftotext   && echo "✓ poppler OK"    || echo "✗ poppler not installed"

# Check OCR languages
tesseract --list-langs 2>/dev/null | head -5
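
The same checks can be run from Python when a script needs to decide up front which operations are available. This is a rough sketch using only the standard library; the function name check_environment and the "python:" / "cli:" key prefixes are illustrative choices, not part of the skill.

```python
import importlib.util
import shutil

# Module and tool names taken from the installation steps above.
PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "PIL", "pdf2image", "pytesseract"]
CLI_TOOLS = ["qpdf", "tesseract", "pdftotext"]


def check_environment(modules=PYTHON_MODULES, tools=CLI_TOOLS) -> dict[str, bool]:
    """Report which Python modules are importable and which CLI tools are on PATH."""
    status: dict[str, bool] = {}
    for name in modules:
        # find_spec returns None for a top-level module that is not installed.
        status[f"python:{name}"] = importlib.util.find_spec(name) is not None
    for name in tools:
        status[f"cli:{name}"] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print("✓" if ok else "✗", item)
```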

Core Operations

Read & Extract Text

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

→ For advanced extraction options, see references/extract.md

Extract Tables → DataFrame

import pdfplumber, pandas as pd

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            df = pd.DataFrame(table[1:], columns=table[0])
            print(df)
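
pdfplumber returns each table as a list of rows, and empty cells typically come back as None, so the first row is not always a clean header. The sketch below shows one possible way to tidy a raw table before building a DataFrame from it. The helper name table_to_records and the col_N placeholder headers are assumptions made for illustration.

```python
def table_to_records(table):
    """Turn a raw table (list of rows, cells may be None) into a list of dicts."""
    if not table:
        return []
    # Use the first row as the header; fill blank header cells with col_<index>.
    header = [
        str(cell).strip() if cell not in (None, "") else f"col_{i}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        cells += [""] * (len(header) - len(cells))  # pad short rows
        # zip stops at the header length, so extra trailing cells are dropped.
        records.append(dict(zip(header, cells)))
    return records


raw = [["Item", None, "Price"], ["Apple", "x", "1.20"], [None, None, None], ["Pear", None]]
for record in table_to_records(raw):
    print(record)
```

The list of dicts can be handed to pd.DataFrame(records) in place of the table[1:] / table[0] split used above.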

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    writer.append(PdfReader(path))
with open("merged.pdf", "wb") as f:
    writer.write(f)

Split PDF

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        w.write(f)
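
Splitting into one file per page is not always what is wanted. As a sketch of a variant, the code below splits a PDF into chunks of N pages. The names chunk_ranges and split_every and the part_N.pdf output pattern are hypothetical; pypdf is imported inside split_every only so that the range helper can be used and tested without it.

```python
def chunk_ranges(page_count: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, stop) page ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def split_every(path: str, chunk_size: int, prefix: str = "part") -> list[str]:
    """Write consecutive chunks of `path` to <prefix>_1.pdf, <prefix>_2.pdf, ..."""
    from pypdf import PdfReader, PdfWriter  # lazy import: only needed for real files

    reader = PdfReader(path)
    outputs = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_path = f"{prefix}_{number}.pdf"
        with open(out_path, "wb") as f:
            writer.write(f)
        outputs.append(out_path)
    return outputs


print(chunk_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```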

Rotate Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("scan.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)   # clockwise; must be a multiple of 90 (90 / 180 / 270)
    writer.add_page(page)
with open("rotated.pdf", "wb") as f:
    writer.write(f)

Password Protect

from pypdf import PdfReader, PdfWriter

reader = PdfReader("doc.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
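# AES-256 must be requested explicitly; use_128bit=False would select weak 40-bit RC4.
# AES encryption needs a crypto backend: pip install "pypdf[crypto]"
writer.encrypt("user_pass", "owner_pass", algorithm="AES-256")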
with open("encrypted.pdf", "wb") as f:
    writer.write(f)

CLI Quick Reference (qpdf)

# Merge
qpdf --empty --pages a.pdf b.pdf -- merged.pdf

# Extract pages 1-5
qpdf input.pdf --pages . 1-5 -- out.pdf

# Rotate all pages 90°
qpdf input.pdf output.pdf --rotate=+90

# Remove password
qpdf --password=secret --decrypt locked.pdf unlocked.pdf

# Linearize (web-optimized)
qpdf --linearize input.pdf output.pdf
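
When qpdf is driven from a Python batch job, building the argument list explicitly avoids shell-quoting problems with file names. The following is a minimal sketch that mirrors the merge and page-extraction commands above; the helper names qpdf_merge_cmd, qpdf_extract_cmd, and run_qpdf are made up for this example.

```python
import shutil
import subprocess


def qpdf_merge_cmd(inputs: list[str], output: str) -> list[str]:
    """Build the argument list for: qpdf --empty --pages a.pdf b.pdf -- merged.pdf"""
    if not inputs:
        raise ValueError("at least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_extract_cmd(source: str, page_range: str, output: str) -> list[str]:
    """Build the argument list for: qpdf input.pdf --pages . 1-5 -- out.pdf"""
    return ["qpdf", source, "--pages", ".", page_range, "--", output]


def run_qpdf(cmd: list[str]) -> None:
    """Run a qpdf command, failing early if the tool is not installed."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    # qpdf exits with status 3 when it succeeds with warnings;
    # check=True treats that as a failure, which is the cautious choice here.
    subprocess.run(cmd, check=True)


print(qpdf_merge_cmd(["a.pdf", "b.pdf"], "merged.pdf"))
```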

Available Scripts

Use these scripts directly — no need to rewrite from scratch:

| Script | Purpose |
|---|---|
| scripts/extract_text.py | Extract all text, page by page, to .txt |
| scripts/extract_tables.py | Extract all tables to .xlsx |
| scripts/merge_pdfs.py | Merge multiple PDFs from a glob pattern |
| scripts/split_pdf.py | Split by page ranges |
| scripts/reorder_pdf.py | Reorder pages (flexible syntax: "3,1,2,4-") |
| scripts/watermark.py | Add text or image watermark |
| scripts/ocr_pdf.py | Full OCR pipeline for scanned PDFs |
| scripts/batch_convert.py | Batch operations (merge/split/rotate) CLI |
| scripts/check_fillable_fields.py | List all form fields in a PDF |
| scripts/fill_pdf_form.py | Fill AcroForm fields programmatically |
| scripts/create_test_form.py | Generate a sample fillable PDF form for testing |
| scripts/compress_pdf.py | Compress / optimize PDF to reduce file size |
| scripts/pdf_info.py | View PDF metadata, page count, encryption, fonts |
| scripts/images_to_pdf.py | Convert images (JPG/PNG/etc.) to PDF |
| scripts/pdf_to_images.py | Convert PDF pages to PNG/JPEG images |
| scripts/compare_pdf.py | Compare two PDFs and generate diff report |
| scripts/repair_pdf.py | Attempt to repair corrupted PDF files |
| scripts/list_fonts.py | List all fonts used in a PDF |

Run any script with --help to see its options.


Reference Files

Load these when you need deeper guidance:

  • references/create.md — Building PDFs from scratch with reportlab (Platypus, Canvas, styles, tables, headers/footers)
  • references/extract.md — Advanced text/table/image extraction, coordinate-based cropping, word-level data
  • references/security.md — Watermarks, encryption, permissions, digital signatures
  • references/ocr.md — OCR pipeline, language packs, image preprocessing, quality tuning
  • FORMS.md — Complete guide to PDF form filling (AcroForm + XFA, pdf-lib JS)

Quick Reference Table

| Task | Best Tool | Key Method |
|---|---|---|
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Merge PDFs | pypdf | writer.append() |
| Split PDFs | pypdf | one page per writer |
| Rotate pages | pypdf | page.rotate(90) |
| Reorder pages | pypdf | writer.add_page(reader.pages[i]) |
| Create PDF | reportlab | Platypus or Canvas |
| Watermark | pypdf + reportlab | page.merge_page() |
| Encrypt | pypdf | writer.encrypt() |
| Fill form | pypdf / pdf-lib | see FORMS.md |
| OCR scanned | pytesseract | see references/ocr.md |
| Compress PDF | qpdf + pypdf | compress_identical_objects() |
| View PDF info | pypdf | PdfReader metadata + fields |
| Images → PDF | reportlab | canvas.drawImage() |
| PDF → images | pdf2image | convert_from_path() |
| Compare PDFs | pdfplumber + difflib | text diff per page |
| Repair PDF | qpdf / pypdf | qpdf --linearize or re-write |
| List fonts | pypdf | page /Resources/Font |
| CLI merge | qpdf | --empty --pages |
| Extract images | pypdf / pdfimages | page.images |
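
The "text diff per page" approach listed for comparing PDFs can be illustrated with difflib alone. The sketch below assumes the page texts have already been extracted into two lists of strings (for example with page.extract_text() or ""). How scripts/compare_pdf.py actually works is not shown here, and the function name diff_pages is hypothetical.

```python
import difflib
from itertools import zip_longest


def diff_pages(old_pages: list[str], new_pages: list[str]) -> dict[int, list[str]]:
    """Return unified-diff lines for every page whose text differs (1-based page numbers)."""
    changes: dict[int, list[str]] = {}
    # zip_longest pads the shorter document with empty pages,
    # so added or removed pages show up as whole-page diffs.
    pairs = zip_longest(old_pages, new_pages, fillvalue="")
    for number, (old, new) in enumerate(pairs, start=1):
        if old == new:
            continue
        diff = difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"old page {number}",
            tofile=f"new page {number}",
            lineterm="",
        )
        changes[number] = list(diff)
    return changes


old = ["Title\nTotal: 100", "Appendix"]
new = ["Title\nTotal: 120", "Appendix", "New page"]
for page_number, lines in diff_pages(old, new).items():
    print("\n".join(lines))
```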

Common Pitfalls

  • Never use Unicode subscripts/superscripts (₂, ⁰) in reportlab — use <sub> / <super> XML tags instead, or they render as black boxes
  • pdfplumber, not pypdf, for text extraction — pypdf's extract_text() loses layout; pdfplumber is layout-aware
  • Encrypted PDFs: pass password= to PdfReader() and pdfplumber.open()
  • pip in sandbox: always add --break-system-packages flag
  • qpdf for speed: for large batch jobs, prefer qpdf CLI over Python loops
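  • macOS OCR language packs: brew install tesseract ships English only; non-English OCR also requires brew install tesseract-lang
  • macOS system dependencies: OCR and CLI operations require running brew install qpdf poppler tesseract first
  • Testing form filling: if you have no fillable PDF, run python scripts/create_test_form.py first to generate a test form
  • OCR vs pdfplumber: OCR is only for scanned (image-based) PDFs. For native-text PDFs, extract content with pdfplumber (faster and more accurate)
  • Chinese form filling: pypdf's built-in fonts do not support CJK characters, so Chinese values may render as boxes. For Chinese form filling, use the pdf-lib (JS) approach (see FORMS.md)
  • Rotating pages: there is no standalone rotate script; use python scripts/batch_convert.py rotate input.pdf -d 90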
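
The OCR vs pdfplumber pitfall suggests trying text extraction first and falling back to OCR only for pages that yield little or no text. One possible heuristic is sketched below. The 20-character threshold and the names needs_ocr and pages_needing_ocr are assumptions for illustration, not something the skill prescribes.

```python
def needs_ocr(extracted_text, min_chars: int = 20) -> bool:
    """Guess whether a page is image-only, based on how much text was extracted.

    extracted_text may be None or an empty string when no text layer exists.
    """
    text = (extracted_text or "").strip()
    return len(text) < min_chars


def pages_needing_ocr(path: str, min_chars: int = 20) -> list[int]:
    """Return the 1-based numbers of pages that look like scans."""
    import pdfplumber  # lazy import: only needed when a real file is inspected

    with pdfplumber.open(path) as pdf:
        return [
            number
            for number, page in enumerate(pdf.pages, start=1)
            if needs_ocr(page.extract_text(), min_chars)
        ]


print(needs_ocr(None))                                    # True
print(needs_ocr("Invoice 2024: total due 1,250.00 EUR"))  # False
```

A threshold this simple will misjudge nearly blank native pages, so treat it as a starting point to tune against real documents.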

⛔ Limitations (Not Suitable For)

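| Scenario | Reason | Alternative |
|---|---|---|
| Complex-layout PDFs (magazines, posters) | Extraction loses formatting and layout | Use professional layout tools |
| Table extraction from scanned documents | OCR table accuracy is limited | Use a dedicated table-recognition tool such as Camelot |
| Form filling with CJK characters | pypdf's built-in fonts lack CJK glyphs | Use pdf-lib (JS), see FORMS.md |
| Very large PDFs (>500MB) | May run out of memory | Use the qpdf CLI or process in batches |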
