PDF Processing
Quick start
Use pdfplumber to extract text from PDFs:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
Extracting tables
Extract tables from PDFs with automatic detection:
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
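Each extracted table is a list of rows, and each row is a list of cell values (strings, or None for empty cells). As a minimal sketch of working with that shape, the hypothetical helper `table_to_records` below (not part of pdfplumber) turns one table into header-keyed records, assuming the first row holds the column headers:

```python
def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by the header row.

    Hypothetical helper; assumes the first row contains the column headers.
    """
    if not table:
        return []
    # Normalise header cells: None becomes "", surrounding whitespace is stripped
    headers = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(headers, cells)))
    return records


# Sample data shaped like one table from page.extract_tables()
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_records(sample))
```

Tables without a header row, or with merged header cells, would need different handling.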
Extracting all pages
Process multi-page documents efficiently:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    full_text = ""
    for page in pdf.pages:
        # extract_text() may return None or "" when a page has no text layer
        full_text += (page.extract_text() or "") + "\n\n"

print(full_text)
Form filling
See FORMS.md for the complete guide to PDF form filling, including field analysis and validation.
Merging PDFs
Combine multiple PDF files:
from pypdf import PdfWriter

# PdfMerger is deprecated (and removed in pypdf 5); PdfWriter.append() replaces it
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)

writer.write("merged.pdf")
writer.close()
Splitting PDFs
Extract specific pages or ranges:
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5 (zero-based indices 1 through 4)
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
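Page numbers shown to users are 1-based, while `reader.pages` is indexed from zero, which makes off-by-one mistakes easy. The sketch below shows one way to turn a range specification into indices. The `"2-5,8"` syntax and the helpers `parse_page_ranges` and `split_pdf` are assumptions made for this example, not pypdf features.

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based spec such as "2-5,8" into sorted zero-based page indices.

    Hypothetical helper; the spec syntax is an assumption made for this example.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        # Convert the inclusive 1-based range to zero-based indices
        indices.update(range(start - 1, end))
    return sorted(indices)


def split_pdf(input_path, output_path, spec):
    """Write the pages selected by spec to a new PDF (requires pypdf)."""
    # Imported lazily so parse_page_ranges() works without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output:
        writer.write(output)
```

With these helpers, `split_pdf("input.pdf", "output.pdf", "2-5")` would select the same pages as the loop above.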
Available packages
- pdfplumber - Text and table extraction (recommended)
- pypdf - PDF manipulation, merging, splitting
- pdf2image - Convert PDFs to images (requires poppler)
- pytesseract - OCR for scanned PDFs (requires tesseract)
Common patterns
Extract and save text:
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    # Treat pages without extractable text as empty strings
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
Extract tables to CSV:
import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for table in tables:
        writer.writerows(table)
Error handling
Handle common PDF issues:
import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
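When a page has no extractable text because it is a scan, one option is to fall back to OCR with the pdf2image and pytesseract packages listed above. The following is a rough sketch rather than a definitive implementation: `has_text_layer` and `extract_page_text` are hypothetical helper names, and the fallback assumes poppler and tesseract are installed on the system.

```python
def has_text_layer(text):
    """Return True when extract_text() produced usable text."""
    return bool(text and text.strip())


def extract_page_text(pdf_path, page_index=0):
    """Extract text from one page, falling back to OCR if no text layer is found.

    Sketch only: the fallback needs pdf2image (plus poppler) and
    pytesseract (plus tesseract).
    """
    import pdfplumber  # imported lazily so has_text_layer() works without it

    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[page_index].extract_text()
    if has_text_layer(text):
        return text

    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image numbers pages from 1, so shift the zero-based index
    images = convert_from_path(
        pdf_path, first_page=page_index + 1, last_page=page_index + 1
    )
    return "\n".join(pytesseract.image_to_string(image) for image in images)
```

OCR output is usually noisier than an embedded text layer, so it is worth keeping as a fallback only.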
Performance tips
- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use
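The tips above can be combined roughly as follows. This is a sketch under stated assumptions: `batched_indices`, `extract_pages`, `extract_first_page`, and `extract_many` are hypothetical helper names, and the right batch size and worker count depend on the documents and the machine.

```python
from concurrent.futures import ProcessPoolExecutor


def batched_indices(total, batch_size):
    """Yield lists of consecutive zero-based page indices, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total, batch_size):
        yield list(range(start, min(start + batch_size, total)))


def extract_pages(pdf_path, page_indices):
    """Extract text from only the requested pages of one file."""
    import pdfplumber  # imported lazily, inside the worker process

    # The context manager closes the PDF even if extraction fails
    with pdfplumber.open(pdf_path) as pdf:
        return [pdf.pages[i].extract_text() or "" for i in page_indices]


def extract_first_page(pdf_path):
    """Worker function; defined at module level so it can be pickled."""
    return extract_pages(pdf_path, [0])[0]


def extract_many(pdf_paths, max_workers=None):
    """Process several files in parallel, one file per task."""
    paths = list(pdf_paths)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(extract_first_page, paths)))
```

For a large document, looping over `batched_indices(page_count, 20)` and passing each batch to `extract_pages` limits how much extracted text is held at once. On platforms that start worker processes by spawning, call `extract_many` from under an `if __name__ == "__main__":` guard.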