AI Document Parsing with MinerU
Intelligent document extraction powered by AI — parse any document format into clean, structured output.
Installation
npm install -g mineru-open-api
Or via Go (macOS/Linux):
go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest
Verify installation
mineru-open-api version
Two extraction modes
flash-extract | extract | |
|---|---|---|
| Token required | No | Yes (mineru-open-api auth) |
| Speed | Fast | Normal |
| Table recognition | No | Yes |
| Formula recognition | No | Yes |
| OCR | Yes | Yes |
| Output formats | Markdown only | md, html, latex, docx, json |
| Batch mode | No | Yes |
| Model selection | pipeline | Yes (vlm, pipeline, MinerU-HTML) |
| File size limit | 10 MB | Much higher |
| Page limit | 20 pages | Much higher |
| Rate limit | Per-IP per-minute cap | Based on API plan |
| Best for | Quick start, small/simple docs | Large docs, tables, production |
flash-extract limits
| Limit | Value |
|---|---|
| File size | Max 10 MB |
| Page count | Max 20 pages |
| Supported types | PDF, Images (png/jpg/jpeg/jp2/webp/gif/bmp), Docx, PPTx |
| IP rate limit | Per-minute request caps (HTTP 429 when exceeded) |
When any limit is exceeded, the agent should suggest switching to extract with a token (create at https://mineru.net/apiManage/token), which has significantly higher limits.
Core workflow
- Start fast (no token):
mineru-open-api flash-extract <file>for quick Markdown conversion - Need more? Create token at https://mineru.net/apiManage/token, run
mineru-open-api auth, then usemineru-open-api extractfor tables, formulas, OCR, multi-format, and batch - Web pages:
mineru-open-api crawl <url>to convert web content - Check results: output goes to stdout (default) or
-odirectory
Authentication
Only required for extract and crawl. Not needed for flash-extract.
Configure your API token (create one at https://mineru.net/apiManage/token):
mineru-open-api auth # Interactive token setup
export MINERU_TOKEN="your-token" # Or set via environment variable
Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
Supported input formats
| Format | flash-extract | extract |
|---|---|---|
PDF (.pdf) | Yes | Yes |
Images (.png, .jpg, .jpeg, .jp2, .webp, .gif, .bmp) | Yes | Yes |
Word (.docx) | Yes | Yes |
Word (.doc) | No | Yes |
PowerPoint (.pptx) | Yes | Yes |
PowerPoint (.ppt) | No | Yes |
HTML (.html) | No | Yes |
| URLs (remote files) | Yes | Yes |
The crawl command accepts any HTTP/HTTPS URL and extracts web page content.
Commands
flash-extract — Quick extraction (no token needed)
Fast, token-free document extraction. Outputs Markdown only. No table recognition. Limited to 10 MB / 20 pages per file, with IP-based rate limiting.
mineru-open-api flash-extract report.pdf # Markdown to stdout
mineru-open-api flash-extract report.pdf -o ./out/ # Save to file
mineru-open-api flash-extract https://example.com/doc.pdf # URL mode
mineru-open-api flash-extract report.pdf --language en # Specify language
mineru-open-api flash-extract report.pdf --pages 1-10 # Page range
flash-extract flags
| Flag | Short | Default | Description |
|---|---|---|---|
--output | -o | (stdout) | Output path (file or directory) |
--language | ch | Document language | |
--pages | (all) | Page range, e.g. 1-10 | |
--timeout | 900 | Timeout in seconds |
extract — Precision extraction (token required)
Convert PDFs, images, and other documents to Markdown or other formats. Supports table/formula recognition, OCR, multiple output formats, and batch mode.
mineru-open-api extract report.pdf # Markdown to stdout
mineru-open-api extract report.pdf -f html # HTML to stdout
mineru-open-api extract report.pdf -o ./out/ # Save to directory
mineru-open-api extract report.pdf -o ./out/ -f md,docx # Multiple formats
mineru-open-api extract *.pdf -o ./results/ # Batch extract
mineru-open-api extract --list files.txt -o ./results/ # Batch from file list
mineru-open-api extract https://example.com/doc.pdf # Extract from URL
cat doc.pdf | mineru-open-api extract --stdin -o ./out/ # From stdin
extract flags
| Flag | Short | Default | Description |
|---|---|---|---|
--output | -o | (stdout) | Output path (file or directory) |
--format | -f | md | Output formats: md, json, html, latex, docx (comma-separated) |
--model | (auto) | Model: vlm, pipeline, html (see below) | |
--ocr | false | Enable OCR for scanned documents | |
--formula | true | Enable/disable formula recognition | |
--table | true | Enable/disable table recognition | |
--language | ch | Document language | |
--pages | (all) | Page range, e.g. 1-10,15 | |
--timeout | 900/1800 | Timeout in seconds (single/batch) | |
--list | Read input list from file (one path per line) | ||
--concurrency | 0 | Batch concurrency (0 = server default) |
Model comparison: vlm vs pipeline
vlm | pipeline | |
|---|---|---|
| Parsing accuracy | Higher — better at complex layouts, mixed content | Standard |
| Hallucination risk | May produce hallucinated text in rare cases | No hallucination — biggest advantage |
| Best for | Academic papers, complex tables, intricate layouts | General documents where fidelity matters most |
When the user values accuracy and the document has complex formatting, suggest --model vlm. When the user prioritizes reliability and no-hallucination guarantee, suggest --model pipeline (or omit --model to use auto).
crawl — Web page extraction (token required)
Fetch web pages and convert to Markdown.
mineru-open-api crawl https://example.com/article # Markdown to stdout
mineru-open-api crawl https://example.com/article -f html # HTML to stdout
mineru-open-api crawl https://example.com/article -o ./out/ # Save to file
mineru-open-api crawl url1 url2 -o ./pages/ # Batch crawl
mineru-open-api crawl --list urls.txt -o ./pages/ # Batch from file list
crawl flags
| Flag | Short | Default | Description |
|---|---|---|---|
--output | -o | (stdout) | Output path |
--format | -f | md | Output formats: md, json, html (comma-separated) |
--timeout | 900/1800 | Timeout in seconds (single/batch) | |
--list | Read URL list from file (one per line) | ||
--stdin-list | false | Read URL list from stdin | |
--concurrency | 0 | Batch concurrency |
auth — Authentication management
mineru-open-api auth # Interactive token setup
mineru-open-api auth --verify # Verify current token is valid
mineru-open-api auth --show # Show current token source and masked value
Supported --language values
The --language flag accepts the following values (default: ch). Used by both flash-extract and extract. Values are organized by script/language family — each value covers all languages listed in its group.
Standalone language packs
| Value | Included languages |
|---|---|
ch | Chinese, English, Chinese Traditional |
ch_server | Chinese, English, Chinese Traditional, Japanese |
en | English |
japan | Chinese, English, Chinese Traditional, Japanese |
korean | Korean, English |
chinese_cht | Chinese, English, Chinese Traditional, Japanese |
ta | Tamil, English |
te | Telugu, English |
ka | Kannada |
el | Greek, English |
th | Thai, English |
Language family packs
| Value | Script/Family | Included languages |
|---|---|---|
latin | Latin script | French, German, Italian, Spanish, Portuguese, Dutch, Swedish, and 40+ more |
arabic | Arabic script | Arabic, Persian, Uyghur, Urdu, Pashto, Kurdish, and more |
cyrillic | Cyrillic script | Russian, Ukrainian, Bulgarian, Serbian, Kazakh, and 20+ more |
east_slavic | East Slavic | Russian, Belarusian, Ukrainian, English |
devanagari | Devanagari script | Hindi, Marathi, Nepali, Sanskrit, and more |
Output behavior
- No
-oflag: result goes to stdout; status/progress messages go to stderr - With
-oflag: result saved to file/directory; progress messages on stderr - Batch mode (
extract/crawlonly): requires-oto specify output directory - Binary formats (
docx,extractonly): cannot output to stdout, must use-o - Markdown output includes extracted images saved alongside the
.mdfile
General rules
When using this skill on behalf of the user:
- Quote file paths that contain spaces or special characters with double quotes in commands. Example:
mineru-open-api extract "report 01.pdf", NOTmineru-open-api extract report 01.pdf. - Don't run commands blindly on errors — if the user asks "提取失败了怎么办", explain the exit code and troubleshooting steps instead of re-running the command.
- Installation questions ("mineru 怎么安装") should be answered with the install instructions, not by running
mineru-open-api extract. - DOCX as input is supported — if the user asks "这个 Word 文档能转 Markdown 吗", use
mineru-open-api extract file.docxormineru-open-api flash-extract file.docx. Note:.docformat is only supported byextract, notflash-extract. - Table extraction — tables are only recognized by
extract(notflash-extract). If the user mentions tables, useextract. - For stdout mode (no
-o), only one text format can be output at a time. If the user wants multiple formats, suggest adding-o.
Choosing between flash-extract and extract
The agent MUST follow this decision logic:
-
Default to
flash-extractwhen:- User has NOT configured a token (no
~/.mineru/config.yaml, noMINERU_TOKENenv) - User wants a quick/simple extraction without mentioning tables, formulas, OCR, or specific formats
- File is under 10 MB and under 20 pages
- User is trying the tool for the first time
- User has NOT configured a token (no
-
Use
extractwhen:- User explicitly asks for table recognition, formula recognition, or OCR
- User requests non-Markdown output formats (html, latex, docx, json)
- User needs batch processing (multiple files)
- File is over 10 MB or over 20 pages (exceeds flash-extract limits)
- User has a token configured and wants precision-quality extraction
-
If unsure, prefer
flash-extract— it's faster and requires no setup, but check file size first. -
When the user does NOT specify an output path (
-o), the agent MUST generate a default output directory to prevent file overwrites. Use:
~/MinerU-Skill/<name>_<hash>/
Naming rules:
<name>: derived from the source, then sanitized for safe directory names.- For URLs: last path segment (e.g.
https://arxiv.org/pdf/2509.22186→2509.22186) - For local files: filename without extension (e.g.
report.pdf→report) - Sanitization: replace spaces and shell-unsafe characters with
_. Collapse consecutive_into one. Keep alphanumeric,-,_,., and CJK characters.
- For URLs: last path segment (e.g.
<hash>: first 6 characters of the MD5 hash of the full original source path or URL (before sanitization).
echo -n "https://arxiv.org/pdf/2509.22186" | md5sum | cut -c1-6
- When the user asks to upgrade or update this skill, the agent MUST re-install the CLI binary:
npm install -g mineru-open-api@latest
flash-extract limit handling
When flash-extract fails due to file limits or rate limiting, the agent MUST provide a clear explanation and suggest extract as the upgrade path.
Post-extraction friendly hints
After flash-extract completes successfully, the agent MUST append a brief hint:
Tip:
flash-extract为快速免登录模式(限 10MB/20页,不含表格识别)。如需解析更大文件、表格/公式识别或多格式导出,请前往 https://mineru.net/apiManage/token 创建 Token,运行mineru-open-api auth配置后使用mineru-open-api extract。
Keep the hint to ONE short sentence. Do NOT repeat the hint if the user has already seen it in this session.
Exit codes
| Code | Meaning | Recovery |
|---|---|---|
| 0 | Success | — |
| 1 | General API or unknown error | Check network connectivity; retry; use --verbose for details |
| 2 | Invalid parameters / usage error | Check command syntax and flag values |
| 4 | File too large or page limit exceeded | For flash-extract: file must be under 10 MB / 20 pages; switch to extract with token for higher limits. For extract: split the file or use --pages |
| 5 | Extraction failed | The document may be corrupted or unsupported; try a different --model |
| 6 | Timeout | Increase with --timeout; large files may need 600+ seconds |
Troubleshooting
- "no API token found" (on
extract/crawl): Runmineru-open-api author setMINERU_TOKENenv variable. Or useflash-extractwhich needs no token. - Timeout on large files: Increase with
--timeout 1600(seconds) - Batch fails partially: Check stderr for per-file status; succeeded files are still saved
- Binary format to stdout: Use
-oflag;docxcannot stream to stdout - Private deployment: Use
--base-url https://your-server.com/api - Extraction quality is poor: Try
mineru-open-api extractwith--model vlmfor complex layouts, or--ocrfor scanned documents - Tables not extracted:
flash-extractdoes NOT support tables. Usemineru-open-api extractwith a token. - HTTP 429 on flash-extract: IP rate limit hit. Wait a few minutes or switch to
mineru-open-api extractwith token.
Reporting Issues
- Skill issues: Open an issue at https://github.com/opendatalab/MinerU-Ecosystem/tree/main/cli
- CLI issues: Open an issue at https://github.com/MinerU-Extract/mineru-ai