PDF Processing Guide
MANDATORY: Working Directory
EVERY command in this skill MUST be executed from THIS skill's directory. Before running ANY command — Python or bash script — you MUST cd into this skill's directory first. Determine the absolute path of this SKILL.md file and use its parent directory.
SKILL_DIR="<absolute-path-to-this-skills-sophnet-pdf-directory>"
cd "$SKILL_DIR"
NEVER run commands from the repository root or any other directory. If you do, uv run won't find pyproject.toml, and python won't have access to required packages.
Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
⚠️ CJK Text in PDFs — READ FIRST
If the PDF will contain ANY Chinese/Japanese/Korean text (including titles, body text, table cells, or watermarks), you MUST handle CJK fonts correctly:
- reportlab: Default fonts (Helvetica, Times-Roman) render CJK as black boxes (█). You MUST register STSong-Light and call addMapping() before creating any content. See the "reportlab - Create PDFs" section for the exact setup code.
- PyMuPDF (fitz): Use fontname="china-s" for insert_text() / insert_textbox(). This is built-in and requires no extra setup.
- Rule of thumb: When in doubt, always use CJK-safe fonts. STSong-Light (reportlab) and china-s (PyMuPDF) both handle mixed CJK+Latin text correctly.
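The rule of thumb above can be automated with a small check. This is a minimal sketch, not part of this skill's scripts: `contains_cjk` and `safe_font` are hypothetical helper names, the Unicode ranges cover only the most common CJK blocks, and the font mapping simply mirrors the guidance above.

```python
# Hypothetical helpers (not part of this skill): choose a CJK-safe font name
# for the library in use. The ranges below cover common blocks only.
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul syllables
    (0xFF00, 0xFFEF),  # Half-width and full-width forms
)

def contains_cjk(text: str) -> bool:
    """Return True if any character falls inside one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)

def safe_font(text: str, library: str) -> str:
    """Return the CJK-safe font when needed, else the library's Latin default."""
    cjk = {"reportlab": "STSong-Light", "pymupdf": "china-s"}
    latin = {"reportlab": "Helvetica", "pymupdf": "helv"}
    return cjk[library] if contains_cjk(text) else latin[library]

print(safe_font("季度报告 Q3", "reportlab"))
print(safe_font("Quarterly report", "pymupdf"))
```

For reportlab, the returned STSong-Light name still has to be registered and mapped before use, as described in the "reportlab - Create PDFs" section.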
Python Runtime (uv)
CRITICAL: All Python execution in this skill MUST use uv run --project . from the skill directory. NEVER use bare python3, python, or pip install directly — the required packages (pdfplumber, reportlab, pypdf, etc.) are ONLY available inside the uv virtual environment defined by this skill's pyproject.toml. Direct python3 will fail with ModuleNotFoundError.
IMPORTANT: Only use libraries available in this skill's pyproject.toml (reportlab, pypdf, pdfplumber, pillow, pdf2image, pymupdf). Do NOT import matplotlib, numpy, pandas, or other packages not listed — they will cause ModuleNotFoundError. For charts/graphs, use reportlab.graphics.charts (VerticalBarChart, HorizontalBarChart, Pie, etc.) instead of matplotlib.
First, ensure the environment is set up (run once per session):
cd "$SKILL_DIR"
bash scripts/ensure_uv_env.sh
Then ALL Python commands must use this prefix:
cd "$SKILL_DIR" && uv run --project . python <script-or-module>
This applies to both the provided scripts AND any inline Python code you write. For inline code, use:
cd "$SKILL_DIR" && uv run --project . python -c "import pdfplumber; ..."
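When the command is launched from another Python process rather than a shell, the same two rules (skill directory as working directory, `uv run --project .` as prefix) could be wrapped as below. This is a sketch under assumptions: `uv_python_command` and `run_in_skill` are hypothetical names, and the path to SKILL.md must be supplied by the caller.

```python
import subprocess
from pathlib import Path

def uv_python_command(*python_args: str) -> list[str]:
    """Build the argv for running Python inside this skill's uv environment."""
    return ["uv", "run", "--project", ".", "python", *python_args]

def run_in_skill(skill_md: str, *python_args: str) -> subprocess.CompletedProcess:
    """Run Python via uv with the skill directory as the working directory."""
    skill_dir = Path(skill_md).resolve().parent  # parent directory of SKILL.md
    return subprocess.run(
        uv_python_command(*python_args),
        cwd=skill_dir,
        capture_output=True,
        text=True,
        check=False,
    )

print(uv_python_command("-c", "import pdfplumber"))
```

Passing `cwd=` plays the role of `cd "$SKILL_DIR"`, so `uv` can find pyproject.toml.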
Delivery
Local PDF creation/editing/analysis does not require any Sophnet API key.
IMPORTANT: After creating or modifying a PDF, ALWAYS upload it and return the download URL to the user. This is the default behavior — do not skip the upload step.
bash scripts/upload_file.sh --file <absolute-path-to-pdf>
Upload command output contract:
- FILE_PATH=<absolute-path>
- UPLOAD_STATUS=uploaded|skipped
- DOWNLOAD_URL=<https://...> (present only when uploaded)
Delivery rules:
- ALWAYS call scripts/upload_file.sh after producing a PDF file, then return the DOWNLOAD_URL to the user.
- If UPLOAD_STATUS=uploaded, return the exact DOWNLOAD_URL value to the user as a clickable link.
- If UPLOAD_STATUS=skipped (missing API key), return FILE_PATH instead of failing the whole task.
- Keep URL output logic independent inside sophnet-pdf/scripts. Do not call other skills' upload scripts.
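The delivery rules above amount to a small decision on the script's output. The sketch below assumes the script prints one KEY=VALUE pair per line on stdout; `parse_upload_output` and `delivery_value` are hypothetical helper names, and the sample path is made up for illustration.

```python
def parse_upload_output(stdout: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (assumed one per line) from the upload script."""
    result = {}
    for line in stdout.splitlines():
        # Split on the first "=" only, so URLs containing "=" stay intact
        key, sep, value = line.strip().partition("=")
        if sep and key in {"FILE_PATH", "UPLOAD_STATUS", "DOWNLOAD_URL"}:
            result[key] = value
    return result

def delivery_value(fields: dict[str, str]) -> str:
    """Return the URL when uploaded, otherwise fall back to the local path."""
    if fields.get("UPLOAD_STATUS") == "uploaded" and "DOWNLOAD_URL" in fields:
        return fields["DOWNLOAD_URL"]
    return fields["FILE_PATH"]

# Made-up sample output for a skipped upload
sample = "FILE_PATH=/tmp/out.pdf\nUPLOAD_STATUS=skipped\n"
print(delivery_value(parse_upload_output(sample)))
```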
Quick Start
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
Python Libraries
PyMuPDF (fitz) - Text Search & Replace
This is the ONLY method for replacing text in existing PDFs. Do NOT fall back to nano-pdf, pdftotext, tesseract, or any external CLI tool — use PyMuPDF exclusively.
CRITICAL FONT RULES:
- NEVER use the original PDF's embedded font names (e.g. Unnamed-T3, NotoSansCJKsc-Regular, LiberationSans) with insert_text(). They are embedded fonts and will cause Exception: need font file or buffer.
- Always use page.search_for() to locate text (it works across spans, unlike iterating get_text("dict") spans, which may split multi-word terms).
- Use the fallback font picker below to choose a suitable built-in font.
Built-in font aliases (Latin): helv (Helvetica), hebo (Helvetica-Bold), tiro (Times-Roman), tibo (Times-Bold), cour (Courier), cobo (Courier-Bold). Note that heit is Helvetica-Oblique (italic), not bold.
Built-in CJK font aliases: china-s (Simplified Chinese), china-t (Traditional Chinese), japan (Japanese), korea (Korean). These also render Latin characters correctly.
Find & Replace Text in PDF (Complete Example)
import fitz # PyMuPDF
# --- Font fallback logic ---
# Base-14 aliases: Helvetica, Times and Courier in regular/italic/bold/bold-italic, plus Symbol and ZapfDingbats
BUILTIN_FONTS = {"helv","heit","hebo","hebi","tiro","tiit","tibo","tibi","cour","coit","cobo","cobi","symb","zadb"}
CJK_FONTS = {"china-s","china-ss","china-t","china-ts","japan","japan-s","korea","korea-s"}
def has_cjk(text):
    """Check if text contains CJK characters (ideographs, CJK punctuation, kana, Hangul)."""
    return any(
        "\u4e00" <= c <= "\u9fff"      # CJK Unified Ideographs
        or "\u3000" <= c <= "\u30ff"   # CJK punctuation, Hiragana, Katakana
        or "\uac00" <= c <= "\ud7af"   # Hangul syllables
        for c in text
    )
def pick_font(original_font, text):
    """Pick a built-in font. Falls back from the original embedded font name."""
    if original_font in BUILTIN_FONTS or original_font in CJK_FONTS:
        return original_font
    if has_cjk(text):
        return "china-s" # handles both CJK and Latin
    lower = original_font.lower()
    if "bold" in lower:
        return "hebo" # Helvetica-Bold
    if "mono" in lower or "courier" in lower or "code" in lower:
        return "cour"
    return "helv"
def get_span_info(page, rect):
    """Get font size, font name, and color from the first span overlapping rect."""
    for block in page.get_text("dict")["blocks"]:
        if "lines" not in block:
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if fitz.Rect(span["bbox"]).intersects(rect):
                    return span["size"], span["font"], span.get("color", 0)
    return 12, "helv", 0
# --- Main replacement logic ---
search_terms = ["OldText", "Old Text"] # all variants to search for
replacement = "NewText"
doc = fitz.open("input.pdf")
for page in doc:
    # Step 1: Find all instances using search_for (works across spans)
    hits = []
    for term in search_terms:
        for rect in page.search_for(term):
            size, font, color = get_span_info(page, rect)
            hits.append({"rect": rect, "size": size, "font": font, "color": color})
    # Step 2: Redact (erase) old text
    for h in hits:
        page.add_redact_annot(h["rect"], fill=(1, 1, 1))
    page.apply_redactions()
    # Step 3: Insert replacement text with fallback font
    for h in hits:
        font = pick_font(h["font"], replacement)
        c = h["color"]
        rgb = (((c>>16)&0xFF)/255, ((c>>8)&0xFF)/255, (c&0xFF)/255) if isinstance(c, int) else (0,0,0)
        page.insert_text(
            (h["rect"].x0, h["rect"].y0 + h["size"] * 0.85),
            replacement,
            fontname=font,
            fontsize=h["size"],
            color=rgb,
        )
doc.save("output.pdf", garbage=4, deflate=True)
After replacement, always verify by extracting text from the output PDF:
doc2 = fitz.open("output.pdf")
for i, p in enumerate(doc2):
    text = p.get_text()
    old_count = text.lower().count("oldtext")
    new_count = text.lower().count("newtext")
    print(f"Page {i+1}: OldText={old_count}, NewText={new_count}")
Notes:
- PDF text replacement is inherently imperfect — PDFs are presentation-focused, not text-edit-focused.
- The replacement font may look slightly different from the original embedded font. This is expected.
- china-s is the best fallback for mixed CJK+Latin text; helv for Latin-only text.
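The replacement example converts each span's color, which get_text("dict") reports as a packed integer, into the 0..1 float triple that insert_text() expects. Pulled out on its own, that step could look like the sketch below; `srgb_int_to_rgb` is a hypothetical helper name.

```python
def srgb_int_to_rgb(color: int) -> tuple[float, float, float]:
    """Split a packed 0xRRGGBB integer into floats in the 0..1 range."""
    return (
        ((color >> 16) & 0xFF) / 255,  # red
        ((color >> 8) & 0xFF) / 255,   # green
        (color & 0xFF) / 255,          # blue
    )

print(srgb_int_to_rgb(0xFF0000))  # pure red
```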
PyMuPDF (fitz) - Image Operations
Use PyMuPDF for ALL image operations in existing PDFs. Key APIs:
- page.insert_image(rect, filename=...): insert a new image
- page.replace_image(xref, filename=...): replace an existing image in-place (keeps original rect)
- page.delete_image(xref): remove an image (replaces with 1×1 transparent pixel)
- page.get_images(full=True): list images on a page (returns xref, etc.)
- page.get_image_info(xrefs=True): list images with bounding box info
CRITICAL: Always use page.get_image_info(xrefs=True) (not get_images) when you need the on-page bounding box of each image.
Insert Image into Existing PDF
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# Define target rectangle (x0, y0, x1, y1) — image will be scaled to fit
# To center horizontally with 50% page width:
pw = page.rect.width
img_w = pw * 0.5
# Load image to get aspect ratio
img = fitz.Pixmap("image.png")
aspect = img.width / img.height
img_h = img_w / aspect
x0 = (pw - img_w) / 2
y0 = 700 # vertical position
rect = fitz.Rect(x0, y0, x0 + img_w, y0 + img_h)
page.insert_image(rect, filename="image.png")
doc.save("output.pdf")
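The rectangle arithmetic in this example can be factored into a pure function, which makes it easy to check that the image stays on the page before calling insert_image(). This is a sketch: `centered_image_rect` is a hypothetical helper, and the page and image sizes are illustrative.

```python
def centered_image_rect(page_width, img_px_w, img_px_h, width_fraction, y0):
    """Return (x0, y0, x1, y1) for an image centered horizontally on the page."""
    target_w = page_width * width_fraction
    target_h = target_w * img_px_h / img_px_w  # preserve the aspect ratio
    x0 = (page_width - target_w) / 2
    return (x0, y0, x0 + target_w, y0 + target_h)

# A4 is about 595 pt wide; a 1200x600 px image at 50% of the page width
print(centered_image_rect(595, 1200, 600, 0.5, 100))
```

PyMuPDF measures y from the top of the page, so it is worth comparing the returned y1 with page.rect.height; a large y0 can push the image past the bottom edge.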
Replace an Existing Image
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# Find the image to replace
images = page.get_images(full=True)
target_xref = images[0][0] # xref of the first image
# Replace it — the new image fills the SAME rect as the original
page.replace_image(target_xref, filename="new_image.png")
doc.save("output.pdf")
Note: replace_image keeps the original bounding box. The new image is scaled to fit.
Modify Image Layout (Reposition / Resize)
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# Get image placements with bounding boxes
img_info = page.get_image_info(xrefs=True)
target = img_info[0] # first image
xref = target["xref"]
orig_rect = fitz.Rect(target["bbox"])
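# Extract original image bytes (PNG cannot store CMYK, so convert such pixmaps to RGB first)
pix = fitz.Pixmap(doc, xref)
if pix.n - pix.alpha >= 4:
    pix = fitz.Pixmap(fitz.csRGB, pix)
img_bytes = pix.tobytes("png")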
# Delete the original (replaces with 1x1 transparent pixel)
page.delete_image(xref)
# Re-insert at new position/size (e.g. 70% size, centered)
pw = page.rect.width
new_w = orig_rect.width * 0.7
new_h = orig_rect.height * 0.7
new_x0 = (pw - new_w) / 2
new_rect = fitz.Rect(new_x0, orig_rect.y0, new_x0 + new_w, orig_rect.y0 + new_h)
page.insert_image(new_rect, stream=img_bytes)
doc.save("output.pdf", garbage=4, deflate=True)
Notes on image operations:
- delete_image replaces the image data with a 1×1 transparent pixel (the page reference remains but is invisible). This is normal PyMuPDF behavior.
- For layout changes: extract → delete → re-insert at new rect. Do NOT try doc.update_image() (does not exist).
- insert_image accepts filename= (file path) or stream= (bytes). Use stream= when re-inserting extracted images.
- Use garbage=4, deflate=True in doc.save() to clean up unused objects and compress.
pypdf - Basic Operations
Merge PDFs
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
Split PDF
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
# Write each page to its own single-page PDF
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
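The loop above writes one page per file. To split into fixed-size chunks instead, the page indices can be grouped first. The sketch below shows only the grouping step; `chunk_ranges` is a hypothetical helper, and each returned range would then be used to pick pages from reader.pages for its own PdfWriter.

```python
def chunk_ranges(page_count: int, chunk_size: int) -> list[range]:
    """Return zero-based page index ranges holding chunk_size pages each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        range(start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]

# A 7-page document split into chunks of 3 (the last chunk is shorter)
for r in chunk_ranges(7, 3):
    print(f"pages {r.start + 1}-{r.stop}")
```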
Extract Metadata
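from pypdf import PdfReader
reader = PdfReader("document.pdf")
meta = reader.metadata # may be None when the PDF carries no document info
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")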
Rotate Pages
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise (angle must be a multiple of 90)
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
    writer.write(output)
pdfplumber - Text and Table Extraction
Extract Text with Layout
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
Extract Tables
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
Advanced Table Extraction
# pandas is not in this skill's pyproject.toml, so this version uses the stdlib csv module
import csv
import pdfplumber
all_rows = []
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            if table: # Check if table is not empty
                all_rows.extend(table)
# Write the rows of all tables to one CSV file (empty cells come back as None and are written as blanks)
if all_rows:
    with open("extracted_tables.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(all_rows)
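When rows need to be addressed by column name, a table returned by extract_tables() can be turned into a list of dictionaries with plain Python. This is a sketch: `rows_to_records` is a hypothetical helper, it assumes the first row is the header, and the sample table is made up.

```python
def rows_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by header cell."""
    if not table:
        return []
    # Missing header cells (None) get a positional fallback name
    header = [
        str(h).strip() if h is not None else f"col{i + 1}"
        for i, h in enumerate(table[0])
    ]
    return [
        {key: (cell or "") for key, cell in zip(header, row)}
        for row in table[1:]
    ]

sample = [["Name", None], ["Widget", "3"], ["Gadget", None]]
print(rows_to_records(sample))
```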
reportlab - Create PDFs
⚠️ If the PDF contains ANY Chinese/Japanese/Korean text, you MUST follow the CJK rules below. Skipping them causes black boxes (█) or crash errors.
FORBIDDEN patterns (these ALL break CJK rendering):
- ❌ Using Helvetica, Times-Roman, or Courier fonts for CJK text → renders as black boxes (█)
- ❌ Registering STSong-Light without calling addMapping() → Paragraph() crashes with ValueError: Can't map determine family/bold/italic for stsong-light
- ❌ Using <font face="STSong-Light"> in Paragraph XML without addMapping() → same crash
- ❌ Setting style.fontName = "STSong-Light" without addMapping() → same crash
CJK font setup (MUST be at the top of EVERY script with CJK text):
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
from reportlab.lib.fonts import addMapping
# Step 1: Register the CJK font
pdfmetrics.registerFont(UnicodeCIDFont("STSong-Light"))
# Step 2: MANDATORY — register font family mapping (without this Paragraph() crashes)
addMapping("STSong-Light", 0, 0, "STSong-Light") # normal
addMapping("STSong-Light", 1, 0, "STSong-Light") # bold
addMapping("STSong-Light", 0, 1, "STSong-Light") # italic
addMapping("STSong-Light", 1, 1, "STSong-Light") # bold+italic
After setup, use "STSong-Light" as the font EVERYWHERE — Canvas (c.setFont), ParagraphStyle (fontName=), TableStyle (("FONTNAME", ...)). STSong-Light renders both CJK and Latin correctly.
Alternative: If reportlab gives trouble, use PyMuPDF with fontname="china-s" to create PDFs with CJK text (see PyMuPDF section).
Complete CJK PDF Template (Platypus — copy-paste this)
from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib import colors
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
from reportlab.lib.fonts import addMapping
# ===== CJK FONT SETUP (MANDATORY) =====
pdfmetrics.registerFont(UnicodeCIDFont("STSong-Light"))
addMapping("STSong-Light", 0, 0, "STSong-Light")
addMapping("STSong-Light", 1, 0, "STSong-Light")
addMapping("STSong-Light", 0, 1, "STSong-Light")
addMapping("STSong-Light", 1, 1, "STSong-Light")
# ===== STYLES (all use STSong-Light) =====
styles = getSampleStyleSheet()
cn_title = ParagraphStyle("CNTitle", parent=styles["Title"],
fontName="STSong-Light", fontSize=28, alignment=1, spaceAfter=20)
cn_body = ParagraphStyle("CNBody", parent=styles["Normal"],
fontName="STSong-Light", fontSize=12, leading=18)
# ===== BUILD CONTENT =====
story = []
story.append(Paragraph("标题文字", cn_title)) # <-- change title here
story.append(Spacer(1, 12))
story.append(Paragraph("正文内容...", cn_body)) # <-- change body here
story.append(PageBreak())
# Table with CJK content
data = [
["列1", "列2", "列3"], # <-- change table data here
["数据A", "数据B", "数据C"],
]
table = Table(data, colWidths=[150, 150, 150])
table.setStyle(TableStyle([
("FONTNAME", (0, 0), (-1, -1), "STSong-Light"), # ALL cells use STSong-Light
("FONTSIZE", (0, 0), (-1, 0), 14),
("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#4472C4")),
("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
("GRID", (0, 0), (-1, -1), 1, colors.black),
("ALIGN", (0, 0), (-1, -1), "CENTER"),
]))
story.append(table)
doc = SimpleDocTemplate("output.pdf", pagesize=A4)
doc.build(story)
Basic PDF Creation (Canvas API)
Note: This example includes CJK font setup. Even if your content is English-only, including the STSong-Light setup is harmless and avoids CJK issues if content later changes.
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
# Always register CJK font — safe even for English-only content
pdfmetrics.registerFont(UnicodeCIDFont("STSong-Light"))
c = canvas.Canvas("output.pdf", pagesize=A4)
w, h = A4
c.setFont("STSong-Light", 24)
c.drawCentredString(w / 2, h - 80, "Title Here")
c.setFont("STSong-Light", 12)
c.drawString(72, h - 130, "Body text here. 中文也可以正确显示。")
c.save()
Create PDF with Multiple Pages (Platypus)
from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
from reportlab.lib.fonts import addMapping
# ===== MANDATORY CJK FONT SETUP =====
pdfmetrics.registerFont(UnicodeCIDFont("STSong-Light"))
addMapping("STSong-Light", 0, 0, "STSong-Light")
addMapping("STSong-Light", 1, 0, "STSong-Light")
addMapping("STSong-Light", 0, 1, "STSong-Light")
addMapping("STSong-Light", 1, 1, "STSong-Light")
styles = getSampleStyleSheet()
# Override default styles to use STSong-Light
cn_style = ParagraphStyle("CNNormal", parent=styles["Normal"], fontName="STSong-Light")
cn_title = ParagraphStyle("CNTitle", parent=styles["Title"], fontName="STSong-Light")
cn_h1 = ParagraphStyle("CNH1", parent=styles["Heading1"], fontName="STSong-Light")
doc = SimpleDocTemplate("report.pdf", pagesize=A4)
story = []
story.append(Paragraph("Report Title", cn_title))
story.append(Spacer(1, 12))
story.append(Paragraph("This is the body of the report. " * 20, cn_style))
story.append(PageBreak())
story.append(Paragraph("Page 2", cn_h1))
story.append(Paragraph("Content for page 2", cn_style))
doc.build(story)
Subscripts and Superscripts
IMPORTANT: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.
Instead, use ReportLab's XML markup tags in Paragraph objects:
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet
styles = getSampleStyleSheet()
# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
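If incoming text already contains Unicode subscript or superscript digits, it can be rewritten into the markup above before it reaches Paragraph. The sketch below handles digits only; `unicode_scripts_to_markup` is a hypothetical helper, and other text that contains `&`, `<`, or `>` would still need XML escaping separately.

```python
import re

SUB_DIGITS = "₀₁₂₃₄₅₆₇₈₉"
SUP_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_SUB_MAP = str.maketrans(SUB_DIGITS, "0123456789")
_SUP_MAP = str.maketrans(SUP_DIGITS, "0123456789")

def unicode_scripts_to_markup(text: str) -> str:
    """Replace runs of Unicode sub/superscript digits with <sub>/<super> tags."""
    text = re.sub(
        f"[{SUB_DIGITS}]+",
        lambda m: f"<sub>{m.group().translate(_SUB_MAP)}</sub>",
        text,
    )
    text = re.sub(
        f"[{SUP_DIGITS}]+",
        lambda m: f"<super>{m.group().translate(_SUP_MAP)}</super>",
        text,
    )
    return text

print(unicode_scripts_to_markup("H₂O and x² + y¹⁰"))
```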
For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.
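For the canvas case, the adjustment comes down to a smaller font size plus a vertical shift of the baseline. The following sketch computes both; `script_metrics` is a hypothetical helper, and the 0.6 scale and shift factors are illustrative assumptions rather than ReportLab constants.

```python
def script_metrics(font_size: float, kind: str) -> tuple[float, float]:
    """Return (script_font_size, baseline_shift) for kind 'sub' or 'super'.

    The scale and shift factors are assumptions chosen for illustration;
    tune them to the font in use.
    """
    scale = 0.6
    shift = {"sub": -0.2, "super": 0.4}[kind] * font_size
    return font_size * scale, shift

size, rise = script_metrics(12, "sub")
print(size, rise)
# With a ReportLab canvas `c`, "H2O" with a lowered 2 could then be drawn as:
#   c.setFont("Helvetica", 12); c.drawString(x, y, "H")
#   x += c.stringWidth("H", "Helvetica", 12)
#   c.setFont("Helvetica", size); c.drawString(x, y + rise, "2")
```

On a ReportLab canvas y grows upward, so a negative shift lowers the text (subscript) and a positive one raises it (superscript).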
Command-Line Tools
pdftotext (poppler-utils)
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5
qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
pdftk (if available)
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split
pdftk input.pdf burst
# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
Common Tasks
Extract Text from Scanned PDFs
# Requires: uv add pytesseract (pdf2image is already in this skill env)
import pytesseract
from pdf2image import convert_from_path
# Convert PDF to images
images = convert_from_path('scanned.pdf')
# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
print(text)
Add Watermark (PyMuPDF — the ONLY correct method)
⚠️ MANDATORY: Copy-paste the template below. Do NOT invent your own watermark approach.
FORBIDDEN APIs (these ALL produce broken or removable watermarks):
- ❌ page.add_text_annot(): creates sticky-note pop-ups, not visible text
- ❌ page.add_freetext_annot(): creates annotations that any PDF editor can delete; the fontsize keyword may crash
- ❌ page.insert_text(rotate=45): rotate only accepts 0/90/180/270; raises ValueError: bad rotate value
- ❌ TextWriter.write_text(morph=...): text positioning is unreliable, renders off-center
- ❌ page.rect.ratio: does NOT exist, raises AttributeError
The ONLY correct method is Shape overlay: create a temp PDF page, draw text with Shape.insert_text(morph=...), then merge with show_pdf_page.
Complete Watermark Template (copy-paste this)
import fitz
doc = fitz.open("input.pdf")
for page in doc:
    r = page.rect
    cx, cy = r.width / 2, r.height / 2
    # 1. Create a temporary overlay page (same size as original)
    wm_doc = fitz.open()
    wm_page = wm_doc.new_page(width=r.width, height=r.height)
    # 2. Calculate text dimensions
    font = fitz.Font("helv") # Use "china-s" for CJK watermark text
    text = "CONFIDENTIAL" # <-- change watermark text here
    fs = 55 # <-- change font size here
    text_width = font.text_length(text, fontsize=fs)
    # 3. Draw rotated text via Shape (the ONLY reliable rotation method)
    shape = wm_page.new_shape()
    shape.insert_text(
        fitz.Point(cx - text_width / 2, cy + fs * 0.35), # center text on page
        text,
        fontname="helv", # Use "china-s" for CJK
        fontsize=fs,
        color=(0.82, 0.82, 0.82), # light grey
        morph=(fitz.Point(cx, cy), fitz.Matrix(45)), # 45° diagonal rotation
    )
    shape.commit()
    # 4. Merge overlay onto original page (bakes watermark into content stream)
    page.show_pdf_page(r, wm_doc, 0, overlay=True)
    wm_doc.close()
# 5. Save THEN verify (do NOT access doc after save+close)
doc.save("watermarked.pdf", garbage=4, deflate=True)
doc.close()
# 6. Verify — open the SAVED file separately
doc2 = fitz.open("watermarked.pdf")
print(f"Pages: {len(doc2)}")
for i, p in enumerate(doc2):
    annots = list(p.annots()) # annots() returns a generator, so materialize it before counting
    print(f"Page {i+1}: annots={len(annots)} (should be 0)")
doc2.close()
Key points:
- The watermark is baked into the page content stream rather than stored as an annotation, so it cannot be removed by deleting an annotation.
- Annotations count MUST be 0 after adding watermark (assuming the input PDF had no annotations of its own). If > 0, you used the wrong API.
- Call doc.close() before any verification. Open the saved file with a new fitz.open() to verify.
- To approximate a semi-transparent look, use a light color like (0.82, 0.82, 0.82) on white backgrounds.
- For CJK watermark text, use fontname="china-s" and fitz.Font("china-s").
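The template fixes the font size at 55, which may be too large or too small for long text or unusual page sizes. One way to choose it is to scale the text to the usable length of the 45° line through the page center. This sketch is an assumption-laden aid, not part of the template: `fit_watermark_font_size` is a hypothetical helper, `width_at_1pt` would come from font.text_length(text, fontsize=1) on the assumption that text width scales linearly with font size, and glyph height is ignored.

```python
import math

def fit_watermark_font_size(width_at_1pt: float, page_w: float, page_h: float,
                            fill: float = 0.8) -> float:
    """Font size at which 45-degree text spans `fill` of the usable diagonal."""
    # A 45-degree line through the page center stays on the page for
    # min(page_w, page_h) * sqrt(2) points before crossing the nearer edges.
    usable = min(page_w, page_h) * math.sqrt(2)
    return fill * usable / width_at_1pt

# 7.0 is a made-up width for the watermark text at 1 pt, for illustration only
print(round(fit_watermark_font_size(7.0, 595, 842), 1))
```

The result would replace the hard-coded fs value; everything else in the template stays unchanged.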
Add Watermark (pypdf — alternative, requires pre-made watermark PDF)
from pypdf import PdfReader, PdfWriter
# Load pre-existing watermark PDF
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
    writer.write(output)
Extract Images
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
Password Protection
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Add password
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
| Prepare Python environment | uv | bash scripts/ensure_uv_env.sh |
| Find & replace text | PyMuPDF (fitz) | search_for + redact + insert_text (see above) |
| Insert/replace images | PyMuPDF (fitz) | insert_image / replace_image / delete_image (see above) |
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Add watermark | PyMuPDF (fitz) | Shape.insert_text(morph=...) + show_pdf_page overlay |
| Create PDFs | reportlab | Canvas or Platypus (CJK: register STSong-Light first) |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |
| Optional upload for URL | upload script | bash scripts/upload_file.sh --file /abs/path/output.pdf |
Dependencies
- uv - Python environment and dependency manager for this skill (bash scripts/ensure_uv_env.sh)
- qpdf, pdftotext, pdftoppm - optional command-line PDF tools
- tesseract - required for OCR workflows using pytesseract
Next Steps
- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md