pdf

全面的 PDF 操作工具包,用于提取文本和表格、创建新 PDF、合并/拆分文档以及处理表单。当 Claude 需要填写 PDF 表单或以编程方式大规模处理、生成或分析 PDF 文档时使用。

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "pdf" with this command: npx skills add leastbit/claude_skills_zh-cn/leastbit-claude-skills-zh-cn-pdf

PDF 处理指南

概述

本指南涵盖使用 Python 库和命令行工具进行的基本 PDF 处理操作。有关高级功能、JavaScript 库和详细示例,请参阅 reference.md。如果需要填写 PDF 表单,请阅读 forms.md 并按照其说明操作。

快速开始

from pypdf import PdfReader, PdfWriter

# 读取 PDF
reader = PdfReader("document.pdf")
print(f"页数: {len(reader.pages)}")

# 提取文本
text = ""
for page in reader.pages:
    text += page.extract_text()

Python 库

pypdf - 基本操作

合并 PDF

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

拆分 PDF

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

提取元数据

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"标题: {meta.title}")
print(f"作者: {meta.author}")
print(f"主题: {meta.subject}")
print(f"创建者: {meta.creator}")

旋转页面

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # 顺时针旋转90度
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

pdfplumber - 文本和表格提取

提取带布局的文本

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

提取表格

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"第 {i+1} 页的表格 {j+1}:")
            for row in table:
                print(row)

高级表格提取

import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # 检查表格是否为空
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# 合并所有表格
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)

reportlab - 创建 PDF

基本 PDF 创建

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# 添加文本
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "这是用 reportlab 创建的 PDF")

# 添加线条
c.line(100, height - 140, 400, height - 140)

# 保存
c.save()

创建多页 PDF

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# 添加内容
title = Paragraph("报告标题", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("这是报告的正文内容。" * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# 第2页
story.append(Paragraph("第2页", styles['Heading1']))
story.append(Paragraph("第2页的内容", styles['Normal']))

# 构建 PDF
doc.build(story)

命令行工具

pdftotext (poppler-utils)

# 提取文本
pdftotext input.pdf output.txt

# 提取文本并保留布局
pdftotext -layout input.pdf output.txt

# 提取指定页面
pdftotext -f 1 -l 5 input.pdf output.txt  # 第1-5页

qpdf

# 合并 PDF
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# 拆分页面
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# 旋转页面
qpdf input.pdf output.pdf --rotate=+90:1  # 将第1页旋转90度

# 移除密码
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

pdftk(如果可用)

# 合并
pdftk file1.pdf file2.pdf cat output merged.pdf

# 拆分
pdftk input.pdf burst

# 旋转
pdftk input.pdf rotate 1east output rotated.pdf

常见任务

从扫描的 PDF 提取文本

# 需要安装: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# 将 PDF 转换为图像
images = convert_from_path('scanned.pdf')

# 对每一页进行 OCR 识别
text = ""
for i, image in enumerate(images):
    text += f"第 {i+1} 页:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

添加水印

from pypdf import PdfReader, PdfWriter

# 创建水印(或加载现有的)
watermark = PdfReader("watermark.pdf").pages[0]

# 应用到所有页面
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

提取图像

# 使用 pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# 这会将所有图像提取为 output_prefix-000.jpg、output_prefix-001.jpg 等

密码保护

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# 添加密码
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

快速参考

任务最佳工具命令/代码
合并 PDFpypdfwriter.add_page(page)
拆分 PDFpypdf每页一个文件
提取文本pdfplumberpage.extract_text()
提取表格pdfplumberpage.extract_tables()
创建 PDFreportlabCanvas 或 Platypus
命令行合并qpdfqpdf --empty --pages ...
OCR 扫描 PDFpytesseract先转换为图像
填写 PDF 表单pdf-lib 或 pypdf(参见 forms.md)参见 forms.md

后续步骤

  • 有关 pypdfium2 的高级用法,请参阅 reference.md
  • 有关 JavaScript 库(pdf-lib),请参阅 reference.md
  • 如果需要填写 PDF 表单,请按照 forms.md 中的说明操作
  • 有关故障排除指南,请参阅 reference.md

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

pptx

No summary provided by upstream source.

Repository SourceNeeds Review
General

frontend-design

No summary provided by upstream source.

Repository SourceNeeds Review
General

xlsx

No summary provided by upstream source.

Repository SourceNeeds Review
General

docx

No summary provided by upstream source.

Repository SourceNeeds Review