mineru-cli

MinerU document extraction CLI that converts PDFs, images, Word documents, and web pages into Markdown, JSON, HTML, LaTeX, or DOCX via the MinerU API. Supports single and batch extraction, web crawling, async tasks, and piped workflows.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "mineru-cli" with this command: npx skills add decrystal/ade-mineru-api-skills

Document Extraction with mineru

Installation

Linux / macOS

curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh

Windows (PowerShell)

irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex

Verify installation

mineru version

Authentication

Before using, configure your API token (get one from https://mineru.net):

mineru auth                    # Interactive token setup
export MINERU_TOKEN="your-token"  # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
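
The precedence above can be sketched as a small shell function. This is only an illustration of the documented order, not the CLI's actual implementation: the function name resolve_token_source and its two arguments are hypothetical, and reading the token out of ~/.mineru/config.yaml is left to the caller because the file's layout is not described here.

```shell
# Hypothetical helper mirroring the documented precedence:
#   --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml
# $1 = value passed via --token (may be empty)
# $2 = token already read from the config file (may be empty)
# Prints which source would win: flag, env, config, or none.
resolve_token_source() {
  flag_token=$1
  config_token=$2
  if [ -n "$flag_token" ]; then
    echo "flag"
  elif [ -n "${MINERU_TOKEN:-}" ]; then
    echo "env"
  elif [ -n "$config_token" ]; then
    echo "config"
  else
    echo "none"
  fi
}
```

To see which source the CLI itself picked, mineru auth --show reports the current token source and a masked value.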

Supported input formats

The extract command accepts the following input types:

  • PDF (.pdf) — primary use case, supports scanned and digital PDFs
  • Images (.png, .jpg, .jpeg, .webp, .gif, .bmp) — use --ocr for best results on scanned content
  • DOCX (.docx) — Microsoft Word documents
  • URLs — remote files are downloaded automatically

The crawl command accepts any HTTP/HTTPS URL and extracts web page content.

Default behavior

  • Table recognition: ON by default. Tables in documents are extracted and converted to Markdown tables. Use --no-table to disable.
  • Formula recognition: ON by default. Mathematical formulas are extracted as LaTeX. Use --no-formula to disable.
  • Language: defaults to ch (Chinese). Use --language en for English documents.
  • Model: auto-selected. Use --model vlm for complex layouts, --model pipeline for speed.

Quick start

mineru extract report.pdf                    # PDF → Markdown to stdout
mineru extract report.pdf -o ./out/          # Save to directory
mineru extract report.pdf -f md,docx         # Multiple formats
mineru crawl https://example.com/article     # Web page → Markdown

Core workflow

  1. Authenticate: mineru auth or set MINERU_TOKEN
  2. Extract: mineru extract <file-or-url> for documents
  3. Crawl: mineru crawl <url> for web pages
  4. Check results: output goes to stdout (default) or -o directory

Commands

extract — Document extraction

Convert PDFs, images, and other documents to Markdown or other formats.

mineru extract report.pdf                         # Markdown to stdout
mineru extract report.pdf -f html                 # HTML to stdout
mineru extract report.pdf -o ./out/               # Save to directory
mineru extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru extract *.pdf -o ./results/                # Batch extract
mineru extract --list files.txt -o ./results/     # Batch from file list
mineru extract https://example.com/doc.pdf        # Extract from URL
cat doc.pdf | mineru extract --stdin -o ./out/    # From stdin

extract flags

| Flag | Short | Default | Description |
|---|---|---|---|
| --output | -o | (stdout) | Output path (file or directory) |
| --format | -f | md | Output formats: md, json, html, latex, docx (comma-separated) |
| --model | | (auto) | Model: vlm, pipeline, html |
| --ocr | | false | Enable OCR for scanned documents |
| --no-formula | | false | Disable formula recognition |
| --no-table | | false | Disable table recognition |
| --language | | ch | Document language |
| --pages | | (all) | Page range, e.g. 1-10,15 |
| --timeout | | 300/1800 | Timeout in seconds (single/batch) |
| --list | | | Read input list from file (one path per line) |
| --stdin-list | | false | Read input list from stdin |
| --stdin | | false | Read file content from stdin |
| --stdin-name | | stdin.pdf | Filename hint for stdin mode |
| --concurrency | | 0 | Batch concurrency (0 = server default) |

crawl — Web page extraction

Fetch web pages and convert to Markdown.

mineru crawl https://example.com/article              # Markdown to stdout
mineru crawl https://example.com/article -f html      # HTML to stdout
mineru crawl https://example.com/article -o ./out/     # Save to directory
mineru crawl url1 url2 -o ./pages/                     # Batch crawl
mineru crawl --list urls.txt -o ./pages/               # Batch from file list

crawl flags

| Flag | Short | Default | Description |
|---|---|---|---|
| --output | -o | (stdout) | Output path |
| --format | -f | md | Output formats: md, json, html (comma-separated) |
| --timeout | | 300/1800 | Timeout in seconds (single/batch) |
| --list | | | Read URL list from file (one per line) |
| --stdin-list | | false | Read URL list from stdin |
| --concurrency | | 0 | Batch concurrency |

auth — Authentication management

mineru auth              # Interactive token setup
mineru auth --verify     # Verify current token is valid
mineru auth --show       # Show current token source and masked value

status — Async task status

Query the status of a previously submitted extraction task.

mineru status <task-id>                      # Check status once
mineru status <task-id> --wait               # Wait for completion
mineru status <task-id> --wait -o ./out/     # Wait and download results
mineru status <task-id> --wait --timeout 600 # Custom timeout

status flags

| Flag | Short | Default | Description |
|---|---|---|---|
| --wait | | false | Wait for task completion |
| --output | -o | | Download results to directory when done |
| --timeout | | 300 | Max wait time in seconds |

version — Version info

mineru version    # Show version, commit, build date, Go version, OS/arch

Global flags

These flags apply to all commands:

| Flag | Short | Description |
|---|---|---|
| --token | | API token (overrides env and config) |
| --base-url | | API base URL (for private deployments) |
| --verbose | -v | Verbose mode, print HTTP details |

Output behavior

  • No -o flag: result goes to stdout; status/progress messages go to stderr
  • With -o flag: result saved to file/directory; progress messages on stderr
  • Batch mode: requires -o to specify output directory
  • Binary formats (docx): cannot output to stdout, must use -o
  • Markdown output includes extracted images saved alongside the .md file
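
A pre-flight check along these lines can tell a script or agent whether -o is needed before it calls the CLI. It is a minimal sketch under stated assumptions: the function name requires_output_flag is hypothetical, and the multi-format case follows the agent guideline that stdout carries only one text format at a time.

```shell
# Hypothetical pre-flight check based on the output rules above.
# $1 = number of inputs
# $2 = comma-separated formats, exactly as they would be passed to -f
# Prints "yes" when -o is required, "no" when stdout is acceptable.
requires_output_flag() {
  input_count=$1
  formats=$2
  # Batch mode requires an output directory.
  if [ "$input_count" -gt 1 ]; then
    echo "yes"
    return
  fi
  # Binary formats (docx) cannot be written to stdout.
  case ",$formats," in
    *,docx,*) echo "yes"; return ;;
  esac
  # More than one format: stdout can carry only one text format.
  case "$formats" in
    *,*) echo "yes"; return ;;
  esac
  echo "no"
}
```

For example, requires_output_flag 1 md prints no, while requires_output_flag 1 md,docx prints yes.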

Examples

Single PDF extraction

mineru extract report.pdf -o ./output/
# Output: ./output/report.md + ./output/images/

Extract with OCR and specific pages

mineru extract scanned.pdf --ocr --pages "1-5" -o ./out/

Multi-format output

mineru extract paper.pdf -f md,html,docx -o ./out/
# Output: ./out/paper.md, ./out/paper.html, ./out/paper.docx

Batch processing from file list

# files.txt contains one path per line
mineru extract --list files.txt -o ./results/
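
One way to produce such a list is sketched below. The helper name build_pdf_list and the directory layout are assumptions for illustration; the only thing taken from this document is the one-path-per-line format that --list reads.

```shell
# Hypothetical helper: collect every .pdf under a directory into a list
# file, one path per line, sorted for a stable order.
# $1 = directory to scan, $2 = list file to write
# Note: the match is case-sensitive, so files ending in .PDF are skipped.
build_pdf_list() {
  src_dir=$1
  list_file=$2
  find "$src_dir" -type f -name '*.pdf' | sort > "$list_file"
}
```

Typical use would be build_pdf_list ./papers files.txt followed by mineru extract --list files.txt -o ./results/.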

Extract to LaTeX

mineru extract paper.pdf -f latex -o ./out/
# Output: ./out/paper.tex

English document with specific language

mineru extract english-report.pdf --language en -o ./out/

Extract Word document to Markdown

mineru extract resume.docx -o ./out/
# Output: ./out/resume.md

Pipe workflow

# Download and extract in one pipeline
curl -sL https://example.com/doc.pdf | mineru extract --stdin --stdin-name doc.pdf

Web crawling

mineru crawl https://example.com/docs/guide -o ./docs/

Batch crawl with URL list

echo -e "https://example.com/page1\nhttps://example.com/page2" | mineru crawl --stdin-list -o ./pages/

Use with other tools

# Extract and pipe to another tool
mineru extract report.pdf | wc -w              # Word count
mineru extract report.pdf | grep "keyword"     # Search content
mineru extract report.pdf -f json | jq '.[]'   # Parse structured output

Agent guidelines

When using this skill on behalf of the user:

  • Always ask for the file path if the user didn't specify one. Never guess or fabricate a filename.
  • Quote file paths that contain spaces or special characters with double quotes in commands. Example: mineru extract "report 01.pdf", NOT mineru extract report 01.pdf.
  • Don't run commands blindly on errors — if the user asks "What should I do if extraction failed?", explain the exit code and troubleshooting steps instead of re-running the command.
  • Installation questions ("How do I install mineru?") should be answered with the install instructions, not by running mineru extract.
  • DOCX as input is supported — if the user asks "Can this Word document be converted to Markdown?", use mineru extract file.docx.
  • Table extraction — tables are extracted by default as part of the Markdown output. There is no "tables only" mode; the full document is always extracted.
  • For stdout mode (no -o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.

Default output directory

When the user does NOT specify an output path (-o), the agent MUST generate a default output directory to prevent file overwrites. Use:

~/MinerU-Skill/<name>_<hash>/

Naming rules:

  • <name>: derived from the source, then sanitized for safe directory names.
    • For URLs: last path segment (e.g. https://arxiv.org/pdf/2509.22186 → 2509.22186)
    • For local files: filename without extension (e.g. report.pdf → report)
    • Sanitization: replace shell-unsafe characters (space, (, ), [, ], &, ', ", !, #, $, `) with _, collapse consecutive _ into one, and trim any leading or trailing _ (as the My Doc (final).pdf example below shows). Keep alphanumeric, -, _, ., and CJK characters.
  • <hash>: first 6 characters of the MD5 hash of the full original source path or URL (before sanitization). This ensures:
    • Different URLs with similar basenames get unique directories
    • Re-running the same source reuses the same directory (idempotent)

Examples:

| Source | <name> | Output directory |
|---|---|---|
| https://arxiv.org/pdf/2509.22186 | 2509.22186 | ~/MinerU-Skill/2509.22186_a3f2b1/ |
| https://arxiv.org/pdf/2509.200 | 2509.200 | ~/MinerU-Skill/2509.200_c7e9d4/ |
| ./report.pdf | report | ~/MinerU-Skill/report_8b1a3f/ |
| ./report 01.pdf | report_01 | ~/MinerU-Skill/report_01_f4a1c2/ |
| ./My Doc (final).pdf | My_Doc_final | ~/MinerU-Skill/My_Doc_final_b9e3d7/ |
| ./个人简介.docx | 个人简介 | ~/MinerU-Skill/个人简介_d2a8f5/ |

How the agent should generate the hash (printf is used because echo -n is not portable across shells):

printf '%s' "https://arxiv.org/pdf/2509.22186" | md5sum | cut -c1-6

Or on macOS:

printf '%s' "https://arxiv.org/pdf/2509.22186" | md5 | cut -c1-6

When the user specifies -o: use the user's path as-is, do NOT override with the default directory.
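
The naming rules can be combined into one helper, sketched here under a few assumptions: the function names are hypothetical, the hash is taken over the source string exactly as the user supplied it, leading and trailing _ are trimmed (implied by the My_Doc_final example, not stated as a rule), and URLs with query strings or a trailing slash are not handled.

```shell
# Derive <name>: last path segment for URLs, filename without extension
# for local files, then sanitize (replace unsafe characters with _,
# collapse runs of _, trim leading/trailing _).
derive_name() {
  src=$1
  base=${src##*/}                  # last path segment
  case "$src" in
    http://*|https://*) ;;         # URLs: keep the segment as-is
    *) base=${base%.*} ;;          # local files: drop the extension
  esac
  printf '%s\n' "$base" |
    sed -e "s/[][ ()&'\"!#\$\`]/_/g" -e 's/__*/_/g' -e 's/^_//' -e 's/_$//'
}

# First 6 hex characters of the MD5 of the original source string.
source_hash() {
  if command -v md5sum >/dev/null 2>&1; then
    printf '%s' "$1" | md5sum | cut -c1-6
  else
    printf '%s' "$1" | md5 | cut -c1-6    # macOS
  fi
}

# Full default directory: ~/MinerU-Skill/<name>_<hash>/
default_output_dir() {
  printf '%s/MinerU-Skill/%s_%s/\n' "$HOME" "$(derive_name "$1")" "$(source_hash "$1")"
}
```

An agent could then run mineru extract "./report 01.pdf" -o "$(default_output_dir './report 01.pdf')". Because only the listed ASCII characters are replaced, CJK names such as 个人简介 pass through unchanged.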

Exit codes

| Code | Meaning | Recovery |
|---|---|---|
| 0 | Success | |
| 1 | General API or unknown error | Check network connectivity; retry; use --verbose for details |
| 2 | Invalid parameters / usage error | Check command syntax and flag values |
| 3 | Authentication error | Run mineru auth to reconfigure token, or check token expiration |
| 4 | File too large or page limit exceeded | Split the file or use --pages to extract a subset |
| 5 | Extraction failed | The document may be corrupted or unsupported; try a different --model |
| 6 | Timeout | Increase with --timeout; large files may need 600+ seconds |
| 7 | Quota exceeded | Check API quota at https://mineru.net; wait or upgrade plan |
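
A script or agent can turn these codes into a readable hint instead of retrying blindly. The helper below is a minimal sketch: the name explain_exit_code is hypothetical and the messages simply restate the table.

```shell
# Hypothetical helper: map a mineru exit code to its meaning and
# recovery hint from the table above.
# $1 = exit code (for example, $? right after a mineru command)
explain_exit_code() {
  case "$1" in
    0) echo "Success" ;;
    1) echo "General API or unknown error: check network connectivity, retry, or use --verbose for details" ;;
    2) echo "Invalid parameters or usage error: check command syntax and flag values" ;;
    3) echo "Authentication error: run mineru auth to reconfigure the token, or check token expiration" ;;
    4) echo "File too large or page limit exceeded: split the file or use --pages to extract a subset" ;;
    5) echo "Extraction failed: the document may be corrupted or unsupported; try a different --model" ;;
    6) echo "Timeout: increase with --timeout; large files may need 600+ seconds" ;;
    7) echo "Quota exceeded: check API quota at https://mineru.net; wait or upgrade plan" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}
```

Capture the code immediately after the command, for example mineru extract report.pdf -o ./out/ and then explain_exit_code $? on the next line, since any intervening command overwrites $?.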

Troubleshooting

  • "no API token found": Run mineru auth or set MINERU_TOKEN env variable
  • Timeout on large files: Increase with --timeout 600 (seconds)
  • Batch fails partially: Check stderr for per-file status; succeeded files are still saved
  • Binary format to stdout: Use -o flag; docx cannot stream to stdout
  • Private deployment: Use --base-url https://your-server.com/api
  • Extraction quality is poor: Try --model vlm for complex layouts, or --ocr for scanned documents
  • Formula not recognized: Ensure --no-formula is NOT set; try --model vlm for better formula support

Notes

  • All status/progress messages go to stderr; only document content goes to stdout
  • Batch mode automatically polls the API with exponential backoff
  • Token is stored in ~/.mineru/config.yaml after mineru auth
  • The CLI wraps the MinerU Open SDK (github.com/OpenDataLab/mineru-open-sdk)
