Document Extraction with mineru

Installation

Linux / macOS

curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh

Windows (PowerShell)

irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex

Verify installation

mineru version

Authentication

Before using, configure your API token (get one from https://mineru.net):

mineru auth                    # Interactive token setup
export MINERU_TOKEN="your-token"  # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.

Supported input formats

The extract command accepts the following input types:

PDF (.pdf) — primary use case, supports scanned and digital PDFs
Images (.png, .jpg, .jpeg, .webp, .gif,.bmp) — use --ocr for best results on scanned content
DOCX (.docx) — Microsoft Word documents
URLs — remote files are downloaded automatically

The crawl command accepts any HTTP/HTTPS URL and extracts web page content.

Default behavior

Table recognition: ON by default. Tables in documents are extracted and converted to Markdown tables. Use --no-table to disable.
Formula recognition: ON by default. Mathematical formulas are extracted as LaTeX. Use --no-formula to disable.
Language: defaults to ch (Chinese). Use --language en for English documents.
Model: auto-selected. Use --model vlm for complex layouts, --model pipeline for speed.

Quick start

mineru extract report.pdf                    # PDF → Markdown to stdout
mineru extract report.pdf -o ./out/          # Save to file
mineru extract report.pdf -f md,docx         # Multiple formats
mineru crawl https://example.com/article     # Web page → Markdown

Core workflow

Authenticate: mineru auth or set MINERU_TOKEN
Extract: mineru extract <file-or-url> for documents
Crawl: mineru crawl <url> for web pages
Check results: output goes to stdout (default) or -o directory

Commands

extract — Document extraction

Convert PDFs, images, and other documents to Markdown or other formats.

mineru extract report.pdf                         # Markdown to stdout
mineru extract report.pdf -f html                 # HTML to stdout
mineru extract report.pdf -o ./out/               # Save to directory
mineru extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru extract *.pdf -o ./results/                # Batch extract
mineru extract --list files.txt -o ./results/     # Batch from file list
mineru extract https://example.com/doc.pdf        # Extract from URL
cat doc.pdf | mineru extract --stdin -o ./out/    # From stdin

extract flags

Flag	Short	Default	Description
`--output`	`-o`	(stdout)	Output path (file or directory)
`--format`	`-f`	`md`	Output formats: `md`, `json`, `html`, `latex`, `docx` (comma-separated)
`--model`		(auto)	Model: `vlm`, `pipeline`, `html`
`--ocr`		`false`	Enable OCR for scanned documents
`--no-formula`		`false`	Disable formula recognition
`--no-table`		`false`	Disable table recognition
`--language`		`ch`	Document language
`--pages`		(all)	Page range, e.g. `1-10,15`
`--timeout`		`300`/`1800`	Timeout in seconds (single/batch)
`--list`			Read input list from file (one path per line)
`--stdin-list`		`false`	Read input list from stdin
`--stdin`		`false`	Read file content from stdin
`--stdin-name`		`stdin.pdf`	Filename hint for stdin mode
`--concurrency`		`0`	Batch concurrency (0 = server default)

crawl — Web page extraction

Fetch web pages and convert to Markdown.

mineru crawl https://example.com/article              # Markdown to stdout
mineru crawl https://example.com/article -f html      # HTML to stdout
mineru crawl https://example.com/article -o ./out/     # Save to file
mineru crawl url1 url2 -o ./pages/                     # Batch crawl
mineru crawl --list urls.txt -o ./pages/               # Batch from file list

crawl flags

Flag	Short	Default	Description
`--output`	`-o`	(stdout)	Output path
`--format`	`-f`	`md`	Output formats: `md`, `json`, `html` (comma-separated)
`--timeout`		`300`/`1800`	Timeout in seconds (single/batch)
`--list`			Read URL list from file (one per line)
`--stdin-list`		`false`	Read URL list from stdin
`--concurrency`		`0`	Batch concurrency

auth — Authentication management

mineru auth              # Interactive token setup
mineru auth --verify     # Verify current token is valid
mineru auth --show       # Show current token source and masked value

status — Async task status

Query the status of a previously submitted extraction task.

mineru status <task-id>                      # Check status once
mineru status <task-id> --wait               # Wait for completion
mineru status <task-id> --wait -o ./out/     # Wait and download results
mineru status <task-id> --wait --timeout 600 # Custom timeout

status flags

Flag	Short	Default	Description
`--wait`		`false`	Wait for task completion
`--output`	`-o`		Download results to directory when done
`--timeout`		`300`	Max wait time in seconds

version — Version info

mineru version    # Show version, commit, build date, Go version, OS/arch

Global flags

These flags apply to all commands:

Flag	Short	Description
`--token`		API token (overrides env and config)
`--base-url`		API base URL (for private deployments)
`--verbose`	`-v`	Verbose mode, print HTTP details

Output behavior

No -o flag: result goes to stdout; status/progress messages go to stderr
With -o flag: result saved to file/directory; progress messages on stderr
Batch mode: requires -o to specify output directory
Binary formats (docx): cannot output to stdout, must use -o
Markdown output includes extracted images saved alongside the .md file

Examples

Single PDF extraction

mineru extract report.pdf -o ./output/
# Output: ./output/report.md + ./output/images/

Extract with OCR and specific pages

mineru extract scanned.pdf --ocr --pages "1-5" -o ./out/

Multi-format output

mineru extract paper.pdf -f md,html,docx -o ./out/
# Output: ./out/paper.md, ./out/paper.html, ./out/paper.docx

Batch processing from file list

# files.txt contains one path per line
mineru extract --list files.txt -o ./results/

Extract to LaTeX

mineru extract paper.pdf -f latex -o ./out/
# Output: ./out/paper.tex

English document with specific language

mineru extract english-report.pdf --language en -o ./out/

Extract Word document to Markdown

mineru extract resume.docx -o ./out/
# Output: ./out/resume.md

Pipe workflow

# Download and extract in one pipeline
curl -sL https://example.com/doc.pdf | mineru extract --stdin --stdin-name doc.pdf

Web crawling

mineru crawl https://example.com/docs/guide -o ./docs/

Batch crawl with URL list

echo -e "https://example.com/page1\nhttps://example.com/page2" | mineru crawl --stdin-list -o ./pages/

Use with other tools

# Extract and pipe to another tool
mineru extract report.pdf | wc -w              # Word count
mineru extract report.pdf | grep "keyword"     # Search content
mineru extract report.pdf -f json | jq '.[]'   # Parse structured output

Agent guidelines

When using this skill on behalf of the user:

Always ask for the file path if the user didn't specify one. Never guess or fabricate a filename.
Quote file paths that contain spaces or special characters with double quotes in commands. Example: mineru extract "report 01.pdf", NOT mineru extract report 01.pdf.
Don't run commands blindly on errors — if the user asks "提取失败了怎么办", explain the exit code and troubleshooting steps instead of re-running the command.
Installation questions ("mineru 怎么安装") should be answered with the install instructions, not by running mineru extract.
DOCX as input is supported — if the user asks "这个 Word 文档能转 Markdown 吗", use mineru extract file.docx.
Table extraction — tables are extracted by default as part of the Markdown output. There is no "tables only" mode; the full document is always extracted.
For stdout mode (no -o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.

Default output directory

When the user does NOT specify an output path (-o), the agent MUST generate a default output directory to prevent file overwrites. Use:

~/MinerU-Skill/<name>_<hash>/

Naming rules:

<name>: derived from the source, then sanitized for safe directory names.
- For URLs: last path segment (e.g. https://arxiv.org/pdf/2509.22186 → 2509.22186)
- For local files: filename without extension (e.g. report.pdf → report)
- Sanitization: replace spaces and shell-unsafe characters (space, (, ), [, ], &, ', ", !, #, $, `) with _. Collapse consecutive _ into one. Keep alphanumeric, -, _, ., and CJK characters.
<hash>: first 6 characters of the MD5 hash of the full original source path or URL (before sanitization). This ensures:
- Different URLs with similar basenames get unique directories
- Re-running the same source reuses the same directory (idempotent)

Examples:

Source	`<name>`	Output directory
`https://arxiv.org/pdf/2509.22186`	`2509.22186`	`~/MinerU-Skill/2509.22186_a3f2b1/`
`https://arxiv.org/pdf/2509.200`	`2509.200`	`~/MinerU-Skill/2509.200_c7e9d4/`
`./report.pdf`	`report`	`~/MinerU-Skill/report_8b1a3f/`
`./report 01.pdf`	`report_01`	`~/MinerU-Skill/report_01_f4a1c2/`
`./My Doc (final).pdf`	`My_Doc_final`	`~/MinerU-Skill/My_Doc_final_b9e3d7/`
`./个人简介.docx`	`个人简介`	`~/MinerU-Skill/个人简介_d2a8f5/`

How the agent should generate the hash:

echo -n "https://arxiv.org/pdf/2509.22186" | md5sum | cut -c1-6

Or on macOS:

echo -n "https://arxiv.org/pdf/2509.22186" | md5 | cut -c1-6

When the user specifies -o: use the user's path as-is, do NOT override with the default directory.

Exit codes

Code	Meaning	Recovery
0	Success	—
1	General API or unknown error	Check network connectivity; retry; use `--verbose` for details
2	Invalid parameters / usage error	Check command syntax and flag values
3	Authentication error	Run `mineru auth` to reconfigure token, or check token expiration
4	File too large or page limit exceeded	Split the file or use `--pages` to extract a subset
5	Extraction failed	The document may be corrupted or unsupported; try a different `--model`
6	Timeout	Increase with `--timeout`; large files may need 600+ seconds
7	Quota exceeded	Check API quota at https://mineru.net; wait or upgrade plan

Troubleshooting

"no API token found": Run mineru auth or set MINERU_TOKEN env variable
Timeout on large files: Increase with --timeout 600 (seconds)
Batch fails partially: Check stderr for per-file status; succeeded files are still saved
Binary format to stdout: Use -o flag; docx cannot stream to stdout
Private deployment: Use --base-url https://your-server.com/api
Extraction quality is poor: Try --model vlm for complex layouts, or --ocr for scanned documents
Formula not recognized: Ensure --no-formula is NOT set; try --model vlm for better formula support

Notes

All status/progress messages go to stderr; only document content goes to stdout
Batch mode automatically polls the API with exponential backoff
Token is stored in ~/.mineru/config.yaml after mineru auth
The CLI wraps the MinerU Open SDK (github.com/OpenDataLab/mineru-open-sdk)