PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

Quick start

Extract text from PDF

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Analyze PDF form (using included script)

python scripts/analyze_form.py input.pdf --output fields.json

Returns: JSON with all form fields, types, and positions

Fill PDF form with validation

python scripts/fill_form.py input.pdf data.json output.pdf

Validates all fields before filling, includes error reporting

Extract tables from PDF

python scripts/extract_tables.py report.pdf --output tables.csv

Extracts all tables with automatic column detection

Features

✅ Production-ready scripts

All scripts include:

Error handling: Graceful failures with detailed error messages
Validation: Input validation and type checking
Logging: Configurable logging with timestamps
Type hints: Full type annotations for IDE support
CLI interface: --help flag for all scripts
Exit codes: Proper exit codes for automation
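
As an illustration of how these properties can fit together, here is a minimal sketch of a script skeleton. It is not the source of any included script: the names `run` and `main`, the logger name, and the way failures map to exit codes are assumptions. The exit code values are the ones listed under "Error handling" in this document.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional, Sequence

# Exit codes as listed under "Error handling" in this document.
EXIT_OK = 0
EXIT_FILE_NOT_FOUND = 1
EXIT_INVALID_INPUT = 2
EXIT_PROCESSING_ERROR = 3

log = logging.getLogger("pdf_tool")  # hypothetical logger name


def run(pdf_path: Path) -> int:
    """Placeholder for the real work; here it only returns the file size."""
    return pdf_path.stat().st_size


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Hypothetical PDF tool skeleton")
    parser.add_argument("input", type=Path, help="path to the input PDF")
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    args = parser.parse_args(argv)

    # Timestamped log lines; verbosity is controlled from the command line.
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    # Validate inputs before doing any work.
    if not args.input.is_file():
        log.error("File not found: %s", args.input)
        return EXIT_FILE_NOT_FOUND
    if args.input.suffix.lower() != ".pdf":
        log.error("Not a PDF file: %s", args.input)
        return EXIT_INVALID_INPUT

    try:
        size = run(args.input)
    except Exception:
        log.exception("Processing failed for %s", args.input)
        return EXIT_PROCESSING_ERROR

    log.info("Processed %s (%d bytes)", args.input, size)
    return EXIT_OK

# A real script would end with: raise SystemExit(main())
```

Returning the code from `main` instead of calling `sys.exit` inside it keeps the function easy to call from tests and other Python code.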

✅ Comprehensive workflows

PDF Forms: Complete form processing pipeline
Table Extraction: Advanced table detection and extraction
OCR Processing: Scanned PDF text extraction
Batch Operations: Process multiple PDFs efficiently
Validation: Pre- and post-processing validation

Advanced topics

PDF Form Processing

For complete form workflows including:

Field analysis and detection
Dynamic form filling
Validation rules
Multi-page forms
Checkbox and radio button handling

See FORMS.md

Table Extraction

For complex table extraction:

Multi-page tables
Merged cells
Nested tables
Custom table detection
Export to CSV/Excel

See TABLES.md

OCR Processing

For scanned PDFs and image-based documents:

Tesseract integration
Language support
Image preprocessing
Confidence scoring
Batch OCR

See OCR.md

Included scripts

Form processing

analyze_form.py - Extract form field information

python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]

fill_form.py - Fill PDF forms with data

python scripts/fill_form.py input.pdf data.json output.pdf [--validate]

validate_form.py - Validate form data before filling

python scripts/validate_form.py data.json schema.json

Table extraction

extract_tables.py - Extract tables to CSV/Excel

python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]

Text extraction

extract_text.py - Extract text with formatting preservation

python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]

Utilities

merge_pdfs.py - Merge multiple PDFs

python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf

split_pdf.py - Split PDF into individual pages

python scripts/split_pdf.py input.pdf --output-dir pages/

validate_pdf.py - Validate PDF integrity

python scripts/validate_pdf.py input.pdf

Common workflows

Workflow 1: Process form submissions

1. Analyze form structure

python scripts/analyze_form.py template.pdf --output schema.json

2. Validate submission data

python scripts/validate_form.py submission.json schema.json

3. Fill form

python scripts/fill_form.py template.pdf submission.json completed.pdf

4. Validate output

python scripts/validate_pdf.py completed.pdf
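
The four steps above can also be chained from Python so that the run stops at the first failing step. This is a sketch, not part of the toolkit: `run_step` and `run_pipeline` are hypothetical helpers, and the file names are the same examples used in the steps.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Commands taken from the four steps above.
STEPS: List[List[str]] = [
    ["scripts/analyze_form.py", "template.pdf", "--output", "schema.json"],
    ["scripts/validate_form.py", "submission.json", "schema.json"],
    ["scripts/fill_form.py", "template.pdf", "submission.json", "completed.pdf"],
    ["scripts/validate_pdf.py", "completed.pdf"],
]


def run_step(args: Sequence[str]) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, *args]).returncode


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_step,
) -> int:
    """Run steps in order; stop at the first non-zero exit code and return it."""
    for number, args in enumerate(steps, start=1):
        code = runner(args)
        if code != 0:
            print(f"Step {number} failed with exit code {code}: {' '.join(args)}")
            return code
    print("All steps succeeded")
    return 0
```

Calling `run_pipeline(STEPS)` from the project root runs the whole workflow. The `runner` parameter exists only so the control flow can be exercised without the real scripts.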

Workflow 2: Extract data from reports

1. Extract tables

python scripts/extract_tables.py monthly_report.pdf --output data.csv

2. Extract text for analysis

python scripts/extract_text.py monthly_report.pdf --output report.txt

Workflow 3: Batch processing

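import glob
import subprocess
from pathlib import Path

# Process all PDFs in directory
output_dir = Path("processed")
output_dir.mkdir(exist_ok=True)  # make sure the output directory exists

for pdf_file in glob.glob("invoices/*.pdf"):
    # Write the extracted text to processed/<name>.txt
    output_file = output_dir / Path(pdf_file).with_suffix(".txt").name

    result = subprocess.run([
        "python", "scripts/extract_text.py",
        pdf_file,
        "--output", str(output_file)
    ], capture_output=True, text=True)  # text=True decodes stderr to str

    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr.strip()}")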
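
The loop above handles one file at a time. Because each file is processed in its own child process, several can run concurrently from a small thread pool. The sketch below assumes that approach; `extract_one` and `run_batch` are hypothetical helpers, and the worker count of 4 is an arbitrary default.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_one(pdf_file: Path, out_dir: Path) -> int:
    """Run extract_text.py on one file and return its exit code."""
    out_file = out_dir / pdf_file.with_suffix(".txt").name
    result = subprocess.run(
        [sys.executable, "scripts/extract_text.py", str(pdf_file),
         "--output", str(out_file)],
        capture_output=True,
        text=True,
    )
    return result.returncode


def run_batch(
    pdf_files: Iterable[Path],
    worker: Callable[[Path], int],
    max_workers: int = 4,
) -> Dict[str, int]:
    """Apply worker to every file concurrently; map file path to exit code."""
    files = list(pdf_files)
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(worker, files))
    return {str(f): code for f, code in zip(files, codes)}


# Example call (directory names are the ones used in the loop above):
#   out_dir = Path("processed"); out_dir.mkdir(exist_ok=True)
#   codes = run_batch(Path("invoices").glob("*.pdf"),
#                     lambda f: extract_one(f, out_dir))
```

The returned dictionary makes it easy to list the files whose exit code was non-zero and retry only those.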

Error handling

All scripts follow consistent error patterns:

Exit codes

0 - Success

1 - File not found

2 - Invalid input

3 - Processing error

4 - Validation error

Example usage in automation

import subprocess

result = subprocess.run(["python", "scripts/fill_form.py", ...])

if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
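
When several scripts are called from the same automation, a lookup table avoids repeating the if/elif chain. A minimal sketch, with the meanings copied from the exit code list above; `describe_exit` and `run_script` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List

# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "Success",
    1: "File not found",
    2: "Invalid input",
    3: "Processing error",
    4: "Validation error",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_script(args: List[str]) -> str:
    """Run the interpreter with args and describe how the process ended."""
    result = subprocess.run(
        [sys.executable, *args], capture_output=True, text=True
    )
    return describe_exit(result.returncode)
```

For example, `run_script(["scripts/validate_pdf.py", "completed.pdf"])` returns one of the strings above, or an "Unknown exit code" message for anything outside the documented range.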

Dependencies

All scripts require:

pip install pdfplumber pypdf pillow pytesseract pandas

Optional for OCR:

Install tesseract-ocr system package

macOS: brew install tesseract

Ubuntu: apt-get install tesseract-ocr

Windows: Download from GitHub releases
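
To check an environment before running anything, the packages and the Tesseract binary can be probed from Python. This is a sketch using only the standard library; the helper names are hypothetical, and the table maps each pip package above to the module name it is imported under (Pillow is imported as `PIL`).

```python
import importlib.util
import shutil
from typing import Dict, List

# pip package name -> module name used in "import ...".
REQUIRED_MODULES: Dict[str, str] = {
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
    "pillow": "PIL",
    "pytesseract": "pytesseract",
    "pandas": "pandas",
}


def missing_packages(modules: Dict[str, str] = REQUIRED_MODULES) -> List[str]:
    """Return the pip package names whose module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def tesseract_available() -> bool:
    """True if a 'tesseract' executable is on PATH (needed only for OCR)."""
    return shutil.which("tesseract") is not None
```

An empty list from `missing_packages()` means every Python dependency is importable; `tesseract_available()` only matters when OCR is used.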

Performance tips

Use batch processing for multiple PDFs
Enable multiprocessing with --parallel flag (where supported)
Cache extracted data to avoid re-processing
Validate inputs early to fail fast
Use streaming for large PDFs (>50MB)
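
One way to apply the caching tip is to key cached results on a hash of the file contents, so an unchanged PDF is never processed twice. The sketch below is an assumption about how such a cache could look, not a feature of the included scripts; `file_digest` and `cached_extract` are hypothetical names, and `extract` stands for whatever extraction function is in use.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_extract(
    pdf_path: Path,
    cache_dir: Path,
    extract: Callable[[Path], str],
) -> str:
    """Return cached text for this exact file content, or extract and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(pdf_path)}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(pdf_path)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Keying on content rather than file name means a renamed copy still hits the cache, while an edited file with the same name does not.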

Best practices

Always validate inputs before processing
Use try-except in custom scripts
Log all operations for debugging
Test with sample PDFs before production
Set timeouts for long-running operations
Check exit codes in automation
Backup originals before modification
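
Two of these practices, backing up originals and setting timeouts, take only a few lines with the standard library. A minimal sketch; the helper names and the `.bak` suffix are assumptions.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def backup_file(path: Path) -> Path:
    """Copy path to '<name>.bak' next to it, preserving file metadata."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    return backup


def run_with_timeout(args: List[str], timeout_s: float) -> Optional[int]:
    """Run a command; return its exit code, or None if it timed out."""
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
    return result.returncode
```

A typical sequence is `backup_file(Path("input.pdf"))` followed by `run_with_timeout(["python", "scripts/fill_form.py", "input.pdf", "data.json", "output.pdf"], timeout_s=60)`, treating `None` as a failure.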

Troubleshooting

Common issues

"Module not found" errors:

pip install -r requirements.txt

Tesseract not found:

Install tesseract system package (see Dependencies)

Memory errors with large PDFs:

Process page by page instead of loading entire PDF

with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # Process page immediately
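
To keep memory flat, the per-page text should also be written out immediately instead of collected in a list. The sketch below shows that pattern; `write_pages` is a hypothetical helper that accepts any objects with an `extract_text()` method, such as pdfplumber pages.

```python
from typing import Any, Iterable, TextIO


def write_pages(pages: Iterable[Any], out: TextIO) -> int:
    """Write each page's text to `out` as soon as it is extracted.

    Only one page's text is held at a time. Returns the number of pages written.
    """
    count = 0
    for page in pages:
        # A page with no extractable text (e.g. a scan) is written as an empty line.
        text = page.extract_text() or ""
        out.write(text + "\n")
        count += 1
    return count


# Intended use with pdfplumber (file names are examples):
#   with pdfplumber.open("large.pdf") as pdf, \
#           open("large.txt", "w", encoding="utf-8") as out:
#       write_pages(pdf.pages, out)
```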

Permission errors:

chmod +x scripts/*.py

Getting help

All scripts support --help:

python scripts/analyze_form.py --help
python scripts/extract_tables.py --help

For detailed documentation on specific topics, see:

FORMS.md - Complete form processing guide
TABLES.md - Advanced table extraction
OCR.md - Scanned PDF processing
