Document Parsers
Purpose: Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.
Activation Triggers:
- Building RAG (Retrieval-Augmented Generation) pipelines
- Extracting text, tables, or metadata from documents
- Processing large document collections
- Converting documents to structured formats
- Handling complex PDFs with tables and layouts
- OCR for scanned documents
- Chunking documents for vector embeddings
- Building document search systems
Key Resources:
- scripts/setup-llamaparse.sh: Install and configure LlamaParse (AI-powered parsing)
- scripts/setup-unstructured.sh: Install the Unstructured.io library
- scripts/parse-pdf.py: Functional PDF parser with multiple backend options
- scripts/parse-docx.py: DOCX document parser
- scripts/parse-html.py: HTML to structured text parser
- templates/multi-format-parser.py: Universal document parser template
- templates/table-extraction.py: Specialized table extraction template
- examples/parse-research-paper.py: Research paper parsing with citations
- examples/parse-legal-document.py: Legal document parsing with sections
Parser Comparison & Selection Guide
- LlamaParse (AI-Powered Premium)
Best For:
- Complex PDFs with tables, charts, and mixed layouts
- Scanned documents requiring OCR
- Documents where accuracy is critical
- Multi-column layouts and scientific papers
- Financial reports and invoices
Pros:
- AI-powered layout understanding
- Excellent table extraction accuracy
- Built-in OCR support
- Handles complex formatting
- Structured output (Markdown/JSON)
Cons:
- Requires API key (paid service)
- API rate limits
- Network dependency
- Slower than local parsers
Documentation: https://docs.cloud.llamaindex.ai/llamaparse
Setup:

```bash
./scripts/setup-llamaparse.sh
```

Usage Pattern:

```python
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",  # or "text"
    language="en",
    verbose=True,
)

documents = parser.load_data("document.pdf")
for doc in documents:
    print(doc.text)
```
- Unstructured.io (Local Processing)
Best For:
- Batch processing many documents
- Multiple format support (PDF, DOCX, HTML, PPTX, images)
- Local processing without API dependencies
- Structured element extraction
- Production RAG pipelines
Pros:
- Open-source and free
- Multi-format support
- Runs locally (no API keys)
- Good table detection
- Element-based chunking
Cons:
- Requires system dependencies (poppler, tesseract)
- Complex installation
- Less accurate than LlamaParse for complex layouts
Documentation: https://unstructured-io.github.io/unstructured/
Setup:

```bash
./scripts/setup-unstructured.sh
```

Usage Pattern:

```python
from unstructured.partition.auto import partition

elements = partition("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")
```
- PyPDF2 (Simple PDF Text Extraction)
Best For:
- Simple text-based PDFs
- Quick prototyping
- Metadata extraction
- PDF manipulation (merge, split)
Pros:
- Pure Python (no system dependencies)
- Fast and lightweight
- Good for simple PDFs
- Active maintenance
Cons:
- Poor table extraction
- Struggles with complex layouts
- No OCR support
- Limited formatting preservation
Documentation: https://github.com/py-pdf/pypdf2
Setup:

```bash
pip install pypdf2
```

Usage Pattern:

```python
from PyPDF2 import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())
```
- PDFPlumber (Advanced PDF Analysis)
Best For:
- Table extraction from PDFs
- PDFs with tabular data
- Financial statements and reports
- Coordinate-based extraction
Pros:
- Excellent table extraction
- Visual debugging tools
- Coordinate-level control
- Metadata and layout info
Cons:
- Slower than PyPDF2
- Requires pdfminer.six dependency
- No OCR support
Documentation: https://github.com/jsvine/pdfplumber
Setup:

```bash
pip install pdfplumber
```

Usage Pattern:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()
```
- python-docx (Word Documents)
Best For:
- Microsoft Word (.docx) documents
- Extracting paragraphs, tables, headers
- Document metadata
- Template-based document generation
Pros:
- Native DOCX support
- Preserves structure (paragraphs, tables, sections)
- Access to styles and formatting
- Can also write/modify DOCX
Cons:
- Only works with .docx (not .doc)
- Limited image extraction
Documentation: https://github.com/python-openxml/python-docx
Setup:

```bash
pip install python-docx
```

Usage Pattern:

```python
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)

for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
```
Decision Matrix

| Use Case | Recommended Parser | Alternative |
| --- | --- | --- |
| Simple PDF text extraction | PyPDF2 | Unstructured |
| Complex PDFs with tables | LlamaParse | PDFPlumber |
| Scanned documents (OCR) | LlamaParse | Unstructured + Tesseract |
| Word documents (.docx) | python-docx | Unstructured |
| HTML to text | parse-html.py | Unstructured |
| Multi-format batch processing | Unstructured | multi-format-parser.py |
| Table extraction | PDFPlumber | LlamaParse |
| Research papers | LlamaParse | Unstructured |
| Legal documents | LlamaParse | PDFPlumber |
| Production RAG pipeline | Unstructured | LlamaParse |
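The recommendations above can be encoded as a small selection helper when a pipeline has to pick a backend programmatically. This is an illustrative sketch: the function name and flags are invented here, not part of the bundled scripts.

```python
def choose_parser(is_scanned: bool = False, has_tables: bool = False,
                  fmt: str = "pdf", allow_api: bool = True) -> str:
    """Map document traits to a parser, mirroring the decision matrix.

    allow_api=False forces local-only parsers (no LlamaParse).
    """
    if fmt == "docx":
        return "python-docx"
    if fmt == "html":
        return "parse-html.py"
    if is_scanned:  # needs OCR
        return "llamaparse" if allow_api else "unstructured"
    if has_tables:
        return "llamaparse" if allow_api else "pdfplumber"
    return "pypdf2"  # simple text-based PDF
```

A caller would then dispatch to the matching backend, e.g. `choose_parser(has_tables=True, allow_api=False)` returns `"pdfplumber"`.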
Functional Scripts
- Parse PDF (scripts/parse-pdf.py)
Command-line PDF parser supporting multiple backends:

```bash
# Using PyPDF2 (default)
python scripts/parse-pdf.py document.pdf

# Using PDFPlumber (better for tables)
python scripts/parse-pdf.py document.pdf --backend pdfplumber

# Using LlamaParse (AI-powered)
python scripts/parse-pdf.py document.pdf --backend llamaparse --api-key llx-...

# Output to file
python scripts/parse-pdf.py document.pdf --output output.txt

# Extract tables as JSON
python scripts/parse-pdf.py document.pdf --backend pdfplumber --tables-only --output tables.json
```
Features:
- Multiple backend support (PyPDF2, PDFPlumber, LlamaParse)
- Table extraction
- Metadata extraction
- Page range selection
- JSON/text output formats
- Parse DOCX (scripts/parse-docx.py)
Word document parser with structure preservation:

```bash
# Basic extraction
python scripts/parse-docx.py document.docx

# Extract with structure
python scripts/parse-docx.py document.docx --preserve-structure

# Extract tables only
python scripts/parse-docx.py document.docx --tables-only

# Output as JSON
python scripts/parse-docx.py document.docx --output output.json --format json
```
Features:
- Paragraph extraction with styles
- Table extraction
- Header/footer extraction
- Metadata (author, created date, etc.)
- Structured JSON output
- Parse HTML (scripts/parse-html.py)
HTML to clean text converter:

```bash
# Basic HTML parsing
python scripts/parse-html.py document.html

# From URL
python scripts/parse-html.py https://example.com/article

# Preserve links
python scripts/parse-html.py document.html --preserve-links

# Extract specific selector
python scripts/parse-html.py document.html --selector "article.content"
```
Features:
- Clean text extraction (removes scripts, styles)
- Link preservation
- CSS selector support
- URL fetching
- Markdown output option
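Under the hood, clean text extraction amounts to walking the markup and skipping non-content tags. A minimal stdlib-only sketch of the idea (the actual parse-html.py may use BeautifulSoup instead; this class and its names are illustrative):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


print(html_to_text("<html><script>var x=1;</script><p>Hello</p></html>"))  # → Hello
```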
Templates
Multi-Format Parser (templates/multi-format-parser.py)
Universal parser handling multiple formats with automatic format detection:

```python
from multi_format_parser import MultiFormatParser

parser = MultiFormatParser(
    llamaparse_api_key="llx-...",  # Optional
    use_ocr=True,
    chunk_size=1000,
)

# Automatic format detection
result = parser.parse_file("document.pdf")
print(result.text)
print(result.metadata)
print(result.tables)

# Batch processing
results = parser.parse_directory("./documents/")
for filename, result in results.items():
    print(f"{filename}: {len(result.text)} characters")
```
Supports:
- PDF, DOCX, HTML, Markdown, TXT
- Automatic chunking for RAG
- Metadata extraction
- Table extraction across all formats
- Error handling and fallbacks
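Automatic format detection is typically just a file-extension lookup with an explicit failure for unsupported types. A hedged sketch of that dispatch step (the template's real internals may differ):

```python
from pathlib import Path

# Map file extensions to the parser pipeline to use (illustrative names).
PARSER_BY_EXT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".txt": "text",
}


def detect_format(path: str) -> str:
    """Return the parser key for a file, based on its (case-insensitive) extension."""
    ext = Path(path).suffix.lower()
    try:
        return PARSER_BY_EXT[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext!r}")
```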
Table Extraction (templates/table-extraction.py)
Specialized table extraction with multiple strategies:

```python
from table_extraction import TableExtractor

extractor = TableExtractor(
    prefer_llamaparse=True,
    fallback_to_pdfplumber=True,
)

# Extract all tables from document
tables = extractor.extract_tables("financial_report.pdf")

for i, table in enumerate(tables):
    print(f"Table {i + 1}:")
    print(table.to_markdown())  # or .to_csv(), .to_json()
    print(f"Confidence: {table.confidence}")
```
Features:
- Multiple extraction strategies
- Automatic fallback
- Table validation
- Format conversion (CSV, JSON, Markdown, DataFrame)
- Confidence scoring
Examples
Research Paper Parsing (examples/parse-research-paper.py)
Complete example for parsing academic papers:

```bash
# Extracts title, abstract, sections, citations, tables, figures
python examples/parse-research-paper.py paper.pdf --output paper.json
```
Extracts:
- Title and authors
- Abstract
- Section structure (Introduction, Methods, Results, etc.)
- Citations and references
- Tables and figures with captions
- Metadata (DOI, publication date, journal)
Legal Document Parsing (examples/parse-legal-document.py)
Specialized parser for legal documents:

```bash
# Extracts clauses, sections, definitions, parties
python examples/parse-legal-document.py contract.pdf --output contract.json
```
Extracts:
- Document type (contract, agreement, etc.)
- Parties involved
- Definitions section
- Numbered clauses and sections
- Signature blocks
- Dates and deadlines
RAG Pipeline Integration
Document Chunking for Embeddings
```python
from multi_format_parser import MultiFormatParser

parser = MultiFormatParser(chunk_size=512, chunk_overlap=50)
result = parser.parse_file("document.pdf")

# Chunks ready for embedding
for chunk in result.chunks:
    print(f"Chunk {chunk.id}: {chunk.text[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    # Send to embedding model
```
Batch Processing Pipeline
```python
import glob

from multi_format_parser import MultiFormatParser

parser = MultiFormatParser()

# Process all documents in directory
for filepath in glob.glob("./documents/**/*", recursive=True):
    try:
        result = parser.parse_file(filepath)
        # Store in vector database
        store_embeddings(result.chunks)
        print(f"✓ Processed {filepath}")
    except Exception as e:
        print(f"✗ Failed {filepath}: {e}")
```
Best Practices
Parser Selection:
- Start with PyPDF2 for simple PDFs; upgrade if needed
- Use LlamaParse for complex layouts (budget permitting)
- Use Unstructured for multi-format production systems
- Use PDFPlumber specifically for table extraction
Performance:
- Cache parsed results to avoid re-processing
- Use batch processing for multiple documents
- Consider async processing for large collections
- Monitor API rate limits for LlamaParse
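Caching parsed results is easiest to get right when the cache key is a hash of the file contents, so renamed or re-uploaded but identical files hit the cache. A minimal sketch, assuming JSON-serializable parse output (the cache directory name and layout are arbitrary choices here):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".parse_cache")  # hypothetical local cache location


def parse_with_cache(path, parse_fn):
    """Return cached parse output if this exact file content was seen before."""
    CACHE_DIR.mkdir(exist_ok=True)
    data = Path(path).read_bytes()
    key = hashlib.sha256(data).hexdigest()  # content hash, not filename
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = parse_fn(path)  # expensive: call the real parser
    cache_file.write_text(json.dumps(result))
    return result
```

`parse_fn` stands in for any of the backends above, e.g. a wrapper around PDFPlumber that returns a dict of text and tables.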
Accuracy:
- Validate table extraction results
- Implement fallback strategies
- Log parsing errors for debugging
- Use confidence scores when available
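A fallback strategy can be as simple as trying backends in order and keeping the first usable output, while collecting errors for the log. An illustrative sketch (the backend callables are placeholders for the real parser wrappers):

```python
def parse_with_fallback(path, backends):
    """Try each (name, parse_fn) pair in order; return the first non-empty text.

    backends: list of (str, callable) pairs, e.g. [("pypdf2", f1), ("pdfplumber", f2)].
    """
    errors = []
    for name, parse_fn in backends:
        try:
            text = parse_fn(path)
            if text and text.strip():
                return text
            errors.append(f"{name}: empty output")
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # log and move to the next backend
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

In practice each callable would wrap one of the parsers above (PyPDF2 first for speed, then PDFPlumber, then LlamaParse as the paid last resort).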
RAG Optimization:
- Chunk size: 512-1024 tokens for embeddings
- Overlap: 10-20% for context preservation
- Preserve metadata (page numbers, sections) for retrieval
- Clean extracted text (remove headers/footers)
Troubleshooting
PyPDF2 returns garbled text:
- Try PDFPlumber or LlamaParse
- PDF may have non-standard encoding
- Check if PDF is scanned (needs OCR)
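A common heuristic for the scanned-PDF check is to look at how much text extraction actually returns per page: image-only pages yield little or nothing. A sketch over already-extracted page texts (e.g. the values of `page.extract_text()` per page; the threshold values here are arbitrary):

```python
def looks_scanned(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is scanned: most pages yield almost no extractable text.

    page_texts: list of per-page strings (may contain None for empty pages).
    """
    if not page_texts:
        return True  # no pages extracted at all
    sparse = sum(
        1 for t in page_texts
        if len((t or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) > 0.5  # majority of pages are near-empty
```

If this returns True, route the document to an OCR-capable backend (LlamaParse, or Unstructured with Tesseract) instead of PyPDF2.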
Unstructured installation fails:
- Install system dependencies: sudo apt-get install poppler-utils tesseract-ocr
- On macOS: brew install poppler tesseract
LlamaParse API errors:
- Verify the API key is correct
- Check rate limits in the dashboard
- Ensure document size is within limits
Table extraction misses columns:
- Try a different parser (PDFPlumber vs LlamaParse)
- Adjust table detection settings
- Validate table structure manually
DOCX parsing fails:
- Ensure the file is .docx, not legacy .doc
- Check that the file is not corrupted
- Try converting to .docx with LibreOffice: libreoffice --headless --convert-to docx document.doc
Dependencies
Core:

```bash
pip install pypdf2 pdfplumber python-docx beautifulsoup4 lxml markdown
```

Optional (Unstructured):

```bash
pip install "unstructured[local-inference]"
sudo apt-get install poppler-utils tesseract-ocr  # Linux
brew install poppler tesseract                    # macOS
```

Optional (LlamaParse):

```bash
pip install llama-parse
```

Requires an API key from https://cloud.llamaindex.ai
Supported Formats: PDF, DOCX, HTML, Markdown, TXT
Parsers: LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, python-docx
Version: 1.0.0