opendataloader-pdf

PDF parsing tool for AI/RAG. Convert PDF to Markdown, JSON, HTML with layout preservation, bounding boxes, and image extraction. Use when you need to extract content from PDF files for AI processing, RAG pipelines, or document analysis.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "opendataloader-pdf" with this command: npx skills add wuxie-guru/opendataloader-pdf-wuxie

opendataloader-pdf Skill

PDF parsing tool for AI/RAG scenarios. Converts PDF to Markdown, JSON, HTML with layout preservation.

Installation

pipx install opendataloader-pdf

Requires a Java runtime on the system. The tool's JAR is bundled with the package, so only the runtime itself has to be installed separately.

Quick Usage

# PDF to Markdown (most common)
opendataloader-pdf input.pdf -o output_dir -f markdown

# PDF to JSON (with bounding boxes)
opendataloader-pdf input.pdf -o output_dir -f json

# Multiple formats at once
opendataloader-pdf input.pdf -o output_dir -f json,markdown,html

# Extract specific pages
opendataloader-pdf input.pdf -o output_dir -f markdown --pages "1,3,5-10"

# Extract images
opendataloader-pdf input.pdf -o output_dir -f markdown --image-dir images/

# Use PDF structure tree (for tagged PDFs)
opendataloader-pdf input.pdf -o output_dir -f markdown --use-struct-tree

# Output to stdout
opendataloader-pdf input.pdf -f markdown --to-stdout
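When these commands are driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of the tool: `build_command` is a hypothetical helper that uses only the flags shown above and does not run anything itself.

```python
from typing import List, Optional, Sequence


def build_command(
    inputs: Sequence[str],
    output_dir: Optional[str] = None,
    formats: Sequence[str] = ("markdown",),
    pages: Optional[str] = None,
    image_dir: Optional[str] = None,
    use_struct_tree: bool = False,
    to_stdout: bool = False,
) -> List[str]:
    """Assemble an argv list from the flags listed in Quick Usage.

    Hypothetical helper: it only builds the list, it never launches the tool.
    """
    cmd = ["opendataloader-pdf", *inputs]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    # Multiple formats are passed as one comma-separated value.
    cmd += ["-f", ",".join(formats)]
    if pages is not None:
        cmd += ["--pages", pages]
    if image_dir is not None:
        cmd += ["--image-dir", image_dir]
    if use_struct_tree:
        cmd.append("--use-struct-tree")
    if to_stdout:
        cmd.append("--to-stdout")
    return cmd


example = build_command(
    ["input.pdf"], "output_dir", ["json", "markdown"], pages="1,3,5-10"
)
print(" ".join(example))
# prints: opendataloader-pdf input.pdf -o output_dir -f json,markdown --pages 1,3,5-10
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.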

Output Formats

| Format | Description |
| --- | --- |
| json | Structured JSON with bounding boxes, fonts, reading order |
| markdown | Markdown text with images as references |
| html | HTML with styling |
| text | Plain text |
| pdf | Rebuilt PDF |
| markdown-with-html | Markdown with HTML for complex elements |
| markdown-with-images | Markdown with embedded base64 images |
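A script that accepts format names from users may want to check them before launching the tool. The sketch below assumes the seven names listed above are the complete set. `KNOWN_FORMATS` and `format_argument` are hypothetical names used for illustration only.

```python
from typing import Iterable

# Format names copied from the Output Formats list; assumed to be complete.
KNOWN_FORMATS = (
    "json",
    "markdown",
    "html",
    "text",
    "pdf",
    "markdown-with-html",
    "markdown-with-images",
)


def format_argument(formats: Iterable[str]) -> str:
    """Return the comma-separated value for -f, rejecting unknown names."""
    requested = list(formats)
    unknown = [name for name in requested if name not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown format(s): {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the original order.
    return ",".join(dict.fromkeys(requested))


print(format_argument(["json", "markdown", "json"]))
# prints: json,markdown
```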

Key Options

| Option | Description |
| --- | --- |
| --pages | Page range, e.g., "1,3,5-10" |
| --image-dir | Directory for extracted images |
| --use-struct-tree | Use PDF structure tree for reading order |
| --table-method | Table detection: default (border-based) or cluster |
| --reading-order | Algorithm: off or xycut (default) |
| --hybrid | Hybrid AI mode: docling-fast for complex tables |
| --sanitize | Remove sensitive data (emails, phones, etc.) |
| --include-header-footer | Include page headers/footers |
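To make the `--pages` syntax concrete, here is a small sketch that expands a specification such as "1,3,5-10" into individual page numbers. It assumes pages are 1-based and ranges are inclusive, which the listing does not state explicitly. `expand_pages` is a hypothetical helper, not something the tool provides.

```python
from typing import List


def expand_pages(spec: str) -> List[int]:
    """Expand a page spec like "1,3,5-10" into a list of page numbers.

    Assumption: pages are 1-based and "a-b" includes both endpoints.
    """
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_pages("1,3,5-10"))
# prints: [1, 3, 5, 6, 7, 8, 9, 10]
```

A check like this can catch malformed specifications before a JVM is launched.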

Examples

Basic Conversion

# Convert to markdown
opendataloader-pdf document.pdf -o ./output -f markdown

# Convert to JSON with structure
opendataloader-pdf document.pdf -o ./output -f json --use-struct-tree
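For RAG pipelines, the JSON output is usually post-processed to pull out each element together with its position. The sketch below walks arbitrary nested JSON and yields every object that carries a given key. The key names used here ("bounding box", "type", "content", "kids") and the sample document are assumptions for illustration. Inspect a real output file to confirm the actual schema before relying on them.

```python
import json
from typing import Any, Dict, Iterator


def iter_elements(node: Any, key: str = "bounding box") -> Iterator[Dict[str, Any]]:
    """Yield every dict in a nested JSON structure that contains `key`.

    The default key name is an assumption about the output schema.
    """
    if isinstance(node, dict):
        if key in node:
            yield node
        for value in node.values():
            yield from iter_elements(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item, key)


# Hypothetical sample, not real tool output.
sample = json.loads(
    """
    {"kids": [
        {"type": "heading", "content": "Intro", "bounding box": [72, 700, 540, 730]},
        {"type": "paragraph", "content": "Body text", "bounding box": [72, 600, 540, 690]}
    ]}
    """
)

for element in iter_elements(sample):
    print(element["type"], element["bounding box"])
# prints:
# heading [72, 700, 540, 730]
# paragraph [72, 600, 540, 690]
```

Because the walker does not depend on a fixed nesting depth, it keeps working if elements are grouped differently than in the sample.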

Batch Processing

# Multiple files
opendataloader-pdf "file1.pdf" "file2.pdf" "folder/" -o output/

# All PDFs in directory
opendataloader-pdf ./pdfs/ -o ./output/ -f markdown
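Passing a folder is the simplest route. When a script needs control over exactly which files are included, it can list them explicitly and still make a single call. This is a sketch under that assumption: `batch_command` is a hypothetical helper that only builds the argument list, and the temporary directory stands in for a real folder of PDFs.

```python
import tempfile
from pathlib import Path
from typing import List


def batch_command(pdf_dir: str, output_dir: str, fmt: str = "markdown") -> List[str]:
    """Build one command that covers every *.pdf directly inside pdf_dir."""
    pdfs = sorted(str(path) for path in Path(pdf_dir).glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {pdf_dir}")
    return ["opendataloader-pdf", *pdfs, "-o", output_dir, "-f", fmt]


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        Path(tmp, name).touch()
    demo = batch_command(tmp, "./output/")

print([Path(arg).name for arg in demo[1:3]], demo[3:])
# prints: ['a.pdf', 'b.pdf'] ['-o', './output/', '-f', 'markdown']
```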

Advanced Options

# Use AI hybrid mode for complex tables
opendataloader-pdf input.pdf -o output/ -f markdown --hybrid docling-fast

# Extract only pages 1-5
opendataloader-pdf input.pdf -o output/ -f markdown --pages "1-5"

# Sanitize sensitive data
opendataloader-pdf input.pdf -o output/ -f json --sanitize

Performance Notes

  • Each convert() call spawns a new JVM process, so the startup cost is paid on every call
  • For batch processing, pass multiple files in one call
  • About 6 seconds for a typical 300-page PDF
  • Images are extracted to the {output_name}_images/ directory
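The effect of batching on JVM launches can be sketched with a small grouping helper. `batches` and its `batch_size` parameter are hypothetical, not options of the tool. The idea is simply that each group becomes one call, so the number of launches drops from one per file to one per group.

```python
from typing import List, Sequence


def batches(files: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split files into consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(files[start:start + batch_size])
        for start in range(0, len(files), batch_size)
    ]


files = [f"doc{i}.pdf" for i in range(1, 8)]
groups = batches(files, 3)
print(len(files), "launches unbatched vs", len(groups), "batched")
# prints: 7 launches unbatched vs 3 batched
```

Grouping rather than passing everything at once can be useful when the file list is long enough to approach command-line length limits.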

Troubleshooting

Java not found

Ensure a Java runtime is installed and available on your PATH. The tool bundles its own PDFBox JAR, so only the runtime itself is needed.
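A quick pre-flight check can confirm that a `java` executable is reachable before the tool is called. The sketch below uses only the Python standard library. `find_java` and `java_version_line` are hypothetical helper names, and the check says nothing about which Java version the tool requires.

```python
import shutil
import subprocess
from typing import Optional


def find_java() -> Optional[str]:
    """Return the full path of the java executable on PATH, or None."""
    return shutil.which("java")


def java_version_line() -> Optional[str]:
    """Return the first line of `java -version`, or None if unavailable."""
    exe = find_java()
    if exe is None:
        return None
    result = subprocess.run([exe, "-version"], capture_output=True, text=True)
    # `java -version` writes its banner to stderr rather than stdout.
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else None


if find_java() is None:
    print("Java not found on PATH; install a Java runtime first.")
else:
    print("Java found:", java_version_line())
```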

Font warnings

Warnings about missing fonts are normal and don't affect output quality.

Slow performance

Use batch mode (multiple files in one call) instead of invoking the tool once per file, since every call starts a new JVM.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
