PDF Processing Pro
Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.
Quick start
Extract text from PDF
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
Analyze PDF form (using included script)
python scripts/analyze_form.py input.pdf --output fields.json
Returns: JSON with all form fields, types, and positions
Fill PDF form with validation
python scripts/fill_form.py input.pdf data.json output.pdf
Validates all fields before filling, includes error reporting
Extract tables from PDF
python scripts/extract_tables.py report.pdf --output tables.csv
Extracts all tables with automatic column detection
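For readers who want to see roughly what table extraction involves, here is a minimal sketch built on pdfplumber's `page.extract_tables()`. It is not the code of `extract_tables.py`; the function names `write_tables_csv` and `extract_tables` are hypothetical, and writing all tables into one CSV file is an assumption made for brevity.

```python
import csv
from typing import Iterable, List, Optional

Row = List[Optional[str]]


def write_tables_csv(tables: Iterable[List[Row]], csv_path: str) -> int:
    """Write every row of every table to one CSV file and return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for table in tables:
            for row in table:
                # pdfplumber reports empty cells as None; store them as ""
                writer.writerow(["" if cell is None else cell for cell in row])
                count += 1
    return count


def extract_tables(pdf_path: str) -> List[List[Row]]:
    """Collect the tables pdfplumber detects on each page (hypothetical helper)."""
    import pdfplumber  # imported lazily so the CSV helper works without it

    tables: List[List[Row]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables
```

A call such as `write_tables_csv(extract_tables("report.pdf"), "tables.csv")` would then produce a single CSV file.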
Features
✅ Production-ready scripts
All scripts include:
- Error handling: Graceful failures with detailed error messages
- Validation: Input validation and type checking
- Logging: Configurable logging with timestamps
- Type hints: Full type annotations for IDE support
- CLI interface: --help flag for all scripts
- Exit codes: Proper exit codes for automation
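The features listed above could come together in a script skeleton like the one below. This is a minimal sketch, not the source of any included script: the names `build_parser` and `main`, the logger name, and the log format are assumptions, and the return values follow the exit codes listed under Error handling.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional, Sequence

logger = logging.getLogger("pdf_script")  # hypothetical logger name


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI; argparse adds the --help flag automatically."""
    parser = argparse.ArgumentParser(description="Example PDF script skeleton")
    parser.add_argument("input", type=Path, help="PDF file to process")
    parser.add_argument("--verbose", action="store_true", help="Enable debug logging")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Return an exit code rather than exiting, which keeps main() testable."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",  # timestamped log lines
    )
    if not args.input.is_file():
        logger.error("File not found: %s", args.input)
        return 1  # file not found
    if args.input.suffix.lower() != ".pdf":
        logger.error("Expected a .pdf file, got: %s", args.input)
        return 2  # invalid input
    logger.info("Processing %s", args.input)
    return 0  # success
```

In a real script the entry point would end with `sys.exit(main())` so that the returned value becomes the process exit code.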
✅ Comprehensive workflows
- PDF Forms: Complete form processing pipeline
- Table Extraction: Advanced table detection and extraction
- OCR Processing: Scanned PDF text extraction
- Batch Operations: Process multiple PDFs efficiently
- Validation: Pre- and post-processing validation
Advanced topics
PDF Form Processing
For complete form workflows including:
- Field analysis and detection
- Dynamic form filling
- Validation rules
- Multi-page forms
- Checkbox and radio button handling
See FORMS.md
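As a rough illustration of field analysis, the sketch below lists form fields with pypdf's `PdfReader.get_fields()`. It is an assumption about how such analysis might look, not the implementation in `analyze_form.py`; `field_type_name` and `summarize_fields` are hypothetical names, and field positions are left out.

```python
from typing import Any, Dict, Optional

# Field type codes defined by the PDF specification.
# "/Btn" covers checkboxes, radio buttons, and push buttons.
FIELD_TYPES = {"/Tx": "text", "/Btn": "button", "/Ch": "choice", "/Sig": "signature"}


def field_type_name(code: Any) -> str:
    """Translate a PDF field type code into a readable name."""
    return FIELD_TYPES.get(str(code), "unknown")


def summarize_fields(pdf_path: str) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each field name to its type and current value (hypothetical helper)."""
    from pypdf import PdfReader  # imported lazily; requires pypdf

    reader = PdfReader(pdf_path)
    fields = reader.get_fields() or {}  # get_fields() returns None without a form
    summary: Dict[str, Dict[str, Optional[str]]] = {}
    for name, field in fields.items():
        value = field.get("/V")
        summary[name] = {
            "type": field_type_name(field.get("/FT")),
            "value": None if value is None else str(value),
        }
    return summary
```

The result is a plain dictionary, so it can be passed straight to `json.dump` for inspection.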
Table Extraction
For complex table extraction:
- Multi-page tables
- Merged cells
- Nested tables
- Custom table detection
- Export to CSV/Excel
See TABLES.md
OCR Processing
For scanned PDFs and image-based documents:
- Tesseract integration
- Language support
- Image preprocessing
- Confidence scoring
- Batch OCR
See OCR.md
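To make confidence scoring concrete, here is a minimal sketch that renders one page with pdfplumber and runs it through `pytesseract.image_to_data`. The helper names `mean_confidence` and `ocr_page`, the 300 DPI resolution, and the use of a plain average are assumptions for illustration; OCR.md may describe a different approach.

```python
from typing import Any, Dict, Optional


def mean_confidence(data: Dict[str, Any]) -> Optional[float]:
    """Average the word confidences in a pytesseract image_to_data dict.

    Entries with confidence -1 are layout boxes without text and are skipped.
    """
    scores = [
        float(conf)
        for conf, text in zip(data["conf"], data["text"])
        if float(conf) >= 0 and text.strip()
    ]
    return sum(scores) / len(scores) if scores else None


def ocr_page(pdf_path: str, page_number: int = 0, lang: str = "eng") -> Dict[str, Any]:
    """OCR a single page and report its text and mean confidence (hypothetical helper)."""
    import pdfplumber  # imported lazily; both packages are needed only for OCR
    import pytesseract

    with pdfplumber.open(pdf_path) as pdf:
        # Render the page to a PIL image; 300 DPI is an assumed setting
        image = pdf.pages[page_number].to_image(resolution=300).original
        data = pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )
    words = [word for word in data["text"] if word.strip()]
    return {"text": " ".join(words), "confidence": mean_confidence(data)}
```

A low mean confidence is a reasonable signal that the page needs preprocessing or manual review.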
Included scripts
Form processing
analyze_form.py - Extract form field information
python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]
fill_form.py - Fill PDF forms with data
python scripts/fill_form.py input.pdf data.json output.pdf [--validate]
validate_form.py - Validate form data before filling
python scripts/validate_form.py data.json schema.json
Table extraction
extract_tables.py - Extract tables to CSV/Excel
python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
Text extraction
extract_text.py - Extract text with formatting preservation
python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]
Utilities
merge_pdfs.py - Merge multiple PDFs
python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf
split_pdf.py - Split PDF into individual pages
python scripts/split_pdf.py input.pdf --output-dir pages/
validate_pdf.py - Validate PDF integrity
python scripts/validate_pdf.py input.pdf
Common workflows
Workflow 1: Process form submissions
1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json
2. Validate submission data
python scripts/validate_form.py submission.json schema.json
3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf
4. Validate output
python scripts/validate_pdf.py completed.pdf
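The four steps above can also be chained from Python so that the pipeline stops at the first failure. This is a sketch under assumptions: `run_pipeline` and `FORM_STEPS` are hypothetical names, and the commands simply repeat the ones shown in this workflow.

```python
import subprocess
import sys
from typing import List, Sequence


def run_pipeline(steps: Sequence[List[str]]) -> int:
    """Run each command in order and stop at the first non-zero exit code."""
    for command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Step failed ({result.returncode}): {' '.join(command)}", file=sys.stderr)
            print(result.stderr, file=sys.stderr)
            return result.returncode
    return 0


# The same commands as in the workflow, run with the current interpreter
FORM_STEPS = [
    [sys.executable, "scripts/analyze_form.py", "template.pdf", "--output", "schema.json"],
    [sys.executable, "scripts/validate_form.py", "submission.json", "schema.json"],
    [sys.executable, "scripts/fill_form.py", "template.pdf", "submission.json", "completed.pdf"],
    [sys.executable, "scripts/validate_pdf.py", "completed.pdf"],
]
```

Calling `run_pipeline(FORM_STEPS)` returns 0 when every step succeeds, or the exit code of the step that failed.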
Workflow 2: Extract data from reports
1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv
2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt
Workflow 3: Batch processing
import glob
import subprocess
from pathlib import Path

# Process all PDFs in the directory
Path("processed").mkdir(exist_ok=True)
for pdf_file in glob.glob("invoices/*.pdf"):
    # Write the extracted text to processed/<name>.txt
    output_file = Path("processed") / Path(pdf_file).with_suffix(".txt").name
    result = subprocess.run([
        "python", "scripts/extract_text.py",
        pdf_file,
        "--output", str(output_file)
    ], capture_output=True, text=True)  # text=True decodes stderr to str
    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr}")
Error handling
All scripts follow consistent error patterns:
Exit codes
0 - Success
1 - File not found
2 - Invalid input
3 - Processing error
4 - Validation error
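In automation code, these numbers are easier to work with when they have names. The sketch below mirrors the list above in an `IntEnum`; the class name `ExitCode` and the helper `describe_exit_code` are hypothetical and are not claimed to exist in the included scripts.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Named versions of the documented exit codes."""
    SUCCESS = 0
    FILE_NOT_FOUND = 1
    INVALID_INPUT = 2
    PROCESSING_ERROR = 3
    VALIDATION_ERROR = 4


def describe_exit_code(code: int) -> str:
    """Return a readable label for an exit code, including unknown ones."""
    try:
        return ExitCode(code).name.replace("_", " ").capitalize()
    except ValueError:
        return f"Unknown exit code {code}"
```

Because `IntEnum` members compare equal to plain integers, `result.returncode == ExitCode.VALIDATION_ERROR` works directly on a `subprocess.run` result.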
Example usage in automation
result = subprocess.run(["python", "scripts/fill_form.py", ...])
if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
Dependencies
All scripts require:
pip install pdfplumber pypdf pillow pytesseract pandas
Optional for OCR:
Install tesseract-ocr system package
macOS: brew install tesseract
Ubuntu: apt-get install tesseract-ocr
Windows: Download from GitHub releases
Performance tips
- Use batch processing for multiple PDFs
- Enable multiprocessing with the --parallel flag (where supported)
- Cache extracted data to avoid re-processing
- Validate inputs early to fail fast
- Use streaming for large PDFs (>50MB)
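One way to cache extracted data is to key the cache on a hash of the file contents, so a renamed but unchanged PDF is not processed twice. The following is a minimal sketch; `file_digest`, `cached_extract`, and the one-text-file-per-hash layout are assumptions, and `extract` stands for whatever extraction function you already use.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large PDFs are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_extract(pdf_path: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return cached text when the same content was processed before."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(pdf_path)}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(pdf_path)  # only runs on a cache miss
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Hashing still reads the whole file once, so the saving comes from skipping the extraction step, which is usually far more expensive.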
Best practices
- Always validate inputs before processing
- Use try-except in custom scripts
- Log all operations for debugging
- Test with sample PDFs before production
- Set timeouts for long-running operations
- Check exit codes in automation
- Backup originals before modification
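Two of these practices, backups and timeouts, take only a few lines with the standard library. The sketch below is illustrative: `backup_original` and `run_with_timeout` are hypothetical helpers, and the `.bak` suffix is an arbitrary choice.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def backup_original(path: Path) -> Path:
    """Copy the file next to itself with a .bak suffix, keeping its metadata."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    return backup


def run_with_timeout(command: List[str], seconds: float) -> Optional[int]:
    """Run a command and return its exit code, or None if it timed out."""
    try:
        return subprocess.run(command, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return None
```

For example, `run_with_timeout([sys.executable, "scripts/fill_form.py", "input.pdf", "data.json", "output.pdf"], 60)` gives the script one minute to finish.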
Troubleshooting
Common issues
"Module not found" errors:
pip install -r requirements.txt
Tesseract not found:
Install tesseract system package (see Dependencies)
Memory errors with large PDFs:
Process page by page instead of loading entire PDF
with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # Process page immediately
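A slightly fuller version of the page-by-page approach writes each page's text to an output stream as soon as it is extracted, so only one page of text is held at a time. This is a sketch; `iter_page_text` and `stream_text` are hypothetical names.

```python
from typing import Any, Iterable, Iterator, TextIO


def iter_page_text(pages: Iterable[Any]) -> Iterator[str]:
    """Yield text one page at a time; pages without text yield an empty string."""
    for page in pages:
        yield page.extract_text() or ""


def stream_text(pdf_path: str, out: TextIO) -> int:
    """Write each page's text to out immediately and return the page count."""
    import pdfplumber  # imported lazily; requires pdfplumber

    count = 0
    with pdfplumber.open(pdf_path) as pdf:
        for text in iter_page_text(pdf.pages):
            out.write(text + "\n")
            count += 1
    return count
```

Passing an open file as `out` keeps memory use roughly constant regardless of how many pages the PDF has.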
Permission errors:
chmod +x scripts/*.py
Getting help
All scripts support --help:
python scripts/analyze_form.py --help
python scripts/extract_tables.py --help
For detailed documentation on specific topics, see:
- FORMS.md - Complete form processing guide
- TABLES.md - Advanced table extraction
- OCR.md - Scanned PDF processing