MinerU PDF Parser
Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.
Installation
Option 1: Install MinerU MCP (for Claude Code)
claude mcp add --transport stdio --scope user mineru -- \
uvx --from mcp-mineru python -m mcp_mineru.server
This installs and configures MinerU for all Claude projects. Models are downloaded on first use.
Option 2: Use Direct Tool (preserves files)
The skill includes a direct parsing tool that saves output to a persistent directory:
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]
Advantages:
- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU
Quick Start
Method 1: Using the Direct Tool (Recommended)
# Parse entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output"
# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--start-page 0 --end-page 2
# Use Apple Silicon optimization
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--backend vlm-mlx-engine
# Text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--no-table --no-formula
Method 2: Using MinerU MCP (Temporary Files)
Parse a PDF document
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'pipeline',
'formula_enable': True,
'table_enable': True,
'start_page': 0,
'end_page': -1 # -1 for all pages
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
Check system capabilities
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def list_backends():
result = await call_tool(
name='list_backends',
arguments={}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(list_backends())
"
Parameters
parse_pdf
Required:
file_path- Absolute path to the PDF file
Optional:
backend- Processing backend (default:pipeline)pipeline- Fast, general-purpose (recommended)vlm-mlx-engine- Fastest on Apple Silicon (M1/M2/M3/M4)vlm-transformers- Slowest but most accurate
formula_enable- Enable formula recognition (default:true)table_enable- Enable table recognition (default:true)start_page- Starting page (0-indexed, default:0)end_page- Ending page (default:-1for all pages)
list_backends
No parameters required. Returns system information and backend recommendations.
Usage Examples
Extract tables from a specific page range
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'pipeline',
'table_enable': True,
'start_page': 5,
'end_page': 10
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
Parse with formula recognition only (faster)
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'vlm-mlx-engine',
'formula_enable': True,
'table_enable': False # Disable for speed
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
Parse single page (fastest for testing)
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'pipeline',
'formula_enable': False,
'table_enable': False,
'start_page': 0,
'end_page': 0
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
Performance
On Apple Silicon M4 (16GB RAM):
pipeline: ~32s/page, CPU-only, good qualityvlm-mlx-engine: ~38s/page, Apple Silicon optimized, excellent qualityvlm-transformers: ~148s/page, highest quality, slowest
Note: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs.
Output Format
Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX
Supported Formats
- PDF documents (
.pdf) - JPEG images (
.jpg,.jpeg) - PNG images (
.png) - Other image formats (WebP, GIF, etc.)
Troubleshooting
Module not found error
If you get "No module named 'mcp_mineru'", make sure you installed it:
claude mcp add --transport stdio --scope user mineru -- \
uvx --from mcp-mineru python -m mcp_mineru.server
Slow processing on first run
This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.
Timeout errors
Increase timeout for large documents or use smaller page ranges for testing.
Notes
- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend
File Persistence
Why Files Get Deleted (MCP Method)
The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.
How to Preserve Files
Method A: Use the Direct Tool (Recommended)
The skill provides parse.py which saves files to a persistent directory:
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
/path/to/input.pdf \
/path/to/output_dir
Advantages:
- ✅ Files are never auto-deleted
- ✅ Full control over output location
- ✅ Can be used in batch processing
- ✅ No MCP connection needed
Generated Structure:
/path/to/output_dir/
├── input.pdf_name/
│ └── auto/ # or vlm/ depending on backend
│ ├── input.pdf_name.md
│ └── images/
│ └── *.jpg
└── input.pdf_name_parsed.md # Copy at root for easy access
Method B: Redirect MCP Output
If using the MCP method, capture the output and save it:
# Capture to file
claude -p "Parse this PDF: /path/to/file.pdf" > /tmp/output.md
# Or use within a script that saves the result
Comparison
| Feature | Direct Tool | MCP Method |
|---|---|---|
| Files persisted | ✅ Yes | ❌ No (auto-deleted) |
| Custom output dir | ✅ Yes | ❌ No (temp only) |
| Claude Code integration | ⚠️ Manual | ✅ Native |
| Speed | ✅ Fast | ⚠️ MCP overhead |
| Offline use | ✅ Yes | ⚠️ Needs Claude Code |
Recommendation
- Use Direct Tool when you need to keep the files for later use
- Use MCP Method when working within Claude Code and only need the text content