mineru-precision-extract

MinerU precision extract: high-accuracy document extraction with the full feature set. Converts PDFs, scanned documents, images, Word (DOC/DOCX), PowerPoint (PPT/PPTX), and HTML files into Markdown, HTML, LaTeX, DOCX, or JSON, with table recognition, formula recognition (LaTeX), and advanced OCR. Choose the vlm model for the highest accuracy on complex layouts, academic papers, and intricate tables, or the pipeline model for reliable, zero-hallucination extraction. Supports batch processing of hundreds of files, web page crawling to Markdown, and multi-format output in a single command.

Use this skill when you need to: extract tables from PDFs, recognize formulas in academic papers, convert PDF to HTML or LaTeX, batch process document files, OCR scanned documents with high precision, convert documents to DOCX format, crawl web pages to structured Markdown, or process documents with complex layouts.

Supports 80+ languages across Latin, Arabic, Cyrillic, Devanagari, CJK, and other script families. Accepts much larger files and page counts than quick extraction modes. Built for researchers, data engineers, academic institutions, and production document pipelines that demand accuracy and reliability. Works as a Claude Code skill, MCP tool, or standalone CLI.

Keywords: high-precision PDF extraction, table recognition, formula recognition, PDF to HTML, PDF to LaTeX, PDF to DOCX, batch PDF processing, scanned-document OCR, academic paper parsing, multi-format document conversion. Supports the high-accuracy VLM model and the zero-hallucination Pipeline model, with 80+ languages, for academic research, data engineering, and production document processing.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "mineru-precision-extract" with this command: npx skills add mineru-extract/mineru-precision-extract

Precision Document Extraction with mineru-open-api

Full-featured document extraction with table/formula recognition, OCR, multi-format output, batch processing, and web crawling.

Why use extract?

  • Table recognition — accurately extracts tables from PDFs and images
  • Formula recognition — preserves mathematical formulas as LaTeX
  • Multi-format output — Markdown, HTML, LaTeX, DOCX, JSON
  • Model selection — choose vlm for highest accuracy or pipeline for zero-hallucination
  • Batch processing — process hundreds of files in one command
  • Web crawling — convert web pages to structured Markdown
  • All file formats — PDF, images, DOC, DOCX, PPT, PPTX, HTML
  • Higher limits — much larger file size and page count than quick mode
  • 80+ languages — full language coverage across all script families

Installation

npm install -g mineru-open-api

Or via Go (macOS/Linux):

go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Verify installation

mineru-open-api version

Authentication

Create a token at https://mineru.net/apiManage/token, then configure:

mineru-open-api auth                         # Interactive token setup
export MINERU_TOKEN="your-token"             # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
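
The precedence above can be illustrated with a small shell sketch. The helper name token_source and its arguments are hypothetical and not part of the CLI, which resolves the token on its own. The sketch only reports which source would win, and it assumes that any non-empty config file contains a token.

```shell
# Hypothetical helper, not part of mineru-open-api.
# Usage: token_source [FLAG_VALUE] [CONFIG_PATH]
# Prints which source would supply the token: flag, env, config, or none.
token_source() {
  flag_value="${1:-}"
  config_path="${2:-$HOME/.mineru/config.yaml}"
  if [ -n "$flag_value" ]; then
    echo "flag"      # a value passed via --token always wins
  elif [ -n "${MINERU_TOKEN:-}" ]; then
    echo "env"       # next, the MINERU_TOKEN environment variable
  elif [ -s "$config_path" ]; then
    echo "config"    # assumption: a non-empty config file holds a token
  else
    echo "none"
  fi
}
```

For example, token_source "" prints env when MINERU_TOKEN is set, even if a config file also exists.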

Quick start

mineru-open-api extract report.pdf                         # Markdown to stdout
mineru-open-api extract report.pdf -o ./out/               # Save to directory
mineru-open-api extract report.pdf -f md,html,docx -o ./   # Multi-format
mineru-open-api extract report.pdf --model vlm -o ./out/   # High-accuracy mode
mineru-open-api extract *.pdf -o ./results/                # Batch extract
mineru-open-api crawl https://example.com/article          # Web page → Markdown

Supported input formats

| Format | Supported |
| --- | --- |
| PDF (.pdf) | Yes |
| Images (.png, .jpg, .jpeg, .jp2, .webp, .gif, .bmp) | Yes |
| Word (.doc, .docx) | Yes |
| PowerPoint (.ppt, .pptx) | Yes |
| HTML (.html) | Yes |
| URLs (remote files) | Yes |

Commands

extract — Precision extraction

mineru-open-api extract <file-or-url> [...] [flags]

Examples

mineru-open-api extract report.pdf                         # Markdown to stdout
mineru-open-api extract report.pdf -f html                 # HTML to stdout
mineru-open-api extract report.pdf -o ./out/               # Save to directory
mineru-open-api extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru-open-api extract report.pdf -f latex -o ./out/      # LaTeX output
mineru-open-api extract report.pdf --model vlm -o ./out/   # High-accuracy mode
mineru-open-api extract report.pdf --ocr -o ./out/         # OCR for scanned docs
mineru-open-api extract report.pdf --language en -o ./out/ # Specify language
mineru-open-api extract report.pdf --pages "1-10" -o ./out/  # Page range
mineru-open-api extract *.pdf -o ./results/                # Batch extract
mineru-open-api extract --list files.txt -o ./results/     # Batch from file list
mineru-open-api extract https://example.com/doc.pdf        # Extract from URL
cat doc.pdf | mineru-open-api extract --stdin -o ./out/    # From stdin

extract flags

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --output | -o | (stdout) | Output path (file or directory) |
| --format | -f | md | Output formats: md, json, html, latex, docx (comma-separated) |
| --model | | (auto) | Model: vlm, pipeline, html (see below) |
| --ocr | | false | Enable OCR for scanned documents |
| --formula | | true | Enable/disable formula recognition |
| --table | | true | Enable/disable table recognition |
| --language | | ch | Document language |
| --pages | | (all) | Page range, e.g. 1-10,15 |
| --timeout | | 900/1800 | Timeout in seconds (single/batch) |
| --list | | | Read input list from file (one path per line) |
| --concurrency | | 0 | Batch concurrency (0 = server default) |
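
Since --list expects one path per line, a list file can be generated with standard tools. The sketch below is one way to do it; build_pdf_list is a hypothetical helper name, and the directory and file names in the usage comment are placeholders. Paths that contain newlines are not handled.

```shell
# Hypothetical helper: write every PDF under SRC_DIR to LIST_FILE, one path per line.
# Usage: build_pdf_list SRC_DIR LIST_FILE
build_pdf_list() {
  src_dir="$1"
  list_file="$2"
  # -iname matches .pdf and .PDF; sort keeps the batch order stable between runs.
  find "$src_dir" -type f -iname '*.pdf' | sort > "$list_file"
}

# Example (requires the CLI and a configured token):
# build_pdf_list ./papers files.txt
# mineru-open-api extract --list files.txt -o ./results/
```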

Model comparison: vlm vs pipeline

| | vlm | pipeline |
| --- | --- | --- |
| Parsing accuracy | Higher; better at complex layouts and mixed content | Standard |
| Hallucination risk | May produce hallucinated text in rare cases | No hallucination (its biggest advantage) |
| Best for | Academic papers, complex tables, intricate layouts | General documents where fidelity matters most |

When the user values accuracy and the document has complex formatting, suggest --model vlm. When the user prioritizes reliability and no-hallucination guarantee, suggest --model pipeline (or omit --model to use auto).

crawl — Web page extraction

Fetch web pages and convert to structured Markdown.

mineru-open-api crawl https://example.com/article              # Markdown to stdout
mineru-open-api crawl https://example.com/article -f html      # HTML to stdout
mineru-open-api crawl https://example.com/article -o ./out/    # Save to file
mineru-open-api crawl url1 url2 -o ./pages/                    # Batch crawl
mineru-open-api crawl --list urls.txt -o ./pages/              # Batch from file list

crawl flags

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --output | -o | (stdout) | Output path |
| --format | -f | md | Output formats: md, json, html (comma-separated) |
| --timeout | | 900/1800 | Timeout in seconds (single/batch) |
| --list | | | Read URL list from file (one per line) |
| --stdin-list | | false | Read URL list from stdin |
| --concurrency | | 0 | Batch concurrency |

auth — Authentication management

mineru-open-api auth              # Interactive token setup
mineru-open-api auth --verify     # Verify current token is valid
mineru-open-api auth --show       # Show current token source and masked value

Supported --language values

Values are organized by script/language family — each value covers all languages in its group.

Standalone language packs

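| Value | Included languages | Notes |
| --- | --- | --- |
| ch | Chinese, English, Chinese Traditional | Chinese and English (default) |
| ch_server | Chinese, English, Chinese Traditional, Japanese | Traditional Chinese, handwriting |
| en | English | English only |
| japan | Chinese, English, Chinese Traditional, Japanese | Primarily Japanese |
| korean | Korean, English | Korean |
| chinese_cht | Chinese, English, Chinese Traditional, Japanese | Primarily Traditional Chinese |
| ta | Tamil, English | Tamil |
| te | Telugu, English | Telugu |
| ka | Kannada | Kannada |
| el | Greek, English | Greek |
| th | Thai, English | Thai |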

Language family packs

| Value | Script/Family | Included languages |
| --- | --- | --- |
| latin | Latin script | French, German, Afrikaans, Italian, Spanish, Bosnian, Portuguese, Czech, Welsh, Danish, Estonian, Irish, Croatian, Uzbek, Hungarian, Serbian (Latin), Indonesian, Occitan, Icelandic, Lithuanian, Maori, Malay, Dutch, Norwegian, Polish, Slovak, Slovenian, Albanian, Swedish, Swahili, Tagalog, Turkish, Latin, Azerbaijani, Kurdish, Latvian, Maltese, Pali, Romanian, Vietnamese, Finnish, Basque, Galician, Luxembourgish, Romansh, Catalan, Quechua |
| arabic | Arabic script | Arabic, Persian, Uyghur, Urdu, Pashto, Kurdish, Sindhi, Balochi, English |
| cyrillic | Cyrillic script | Russian, Belarusian, Ukrainian, Serbian (Cyrillic), Bulgarian, Mongolian, Abkhazian, Adyghe, Kabardian, Avar, Dargin, Ingush, Chechen, Lak, Lezgin, Tabasaran, Kazakh, Kyrgyz, Tajik, Macedonian, Tatar, Chuvash, Bashkir, Malian, Moldovan, Udmurt, Komi, Ossetian, Buryat, Kalmyk, Tuvan, Sakha, Karakalpak, English |
| east_slavic | East Slavic | Russian, Belarusian, Ukrainian, English |
| devanagari | Devanagari script | Hindi, Marathi, Nepali, Bihari, Maithili, Angika, Bhojpuri, Magahi, Santali, Newari, Konkani, Sanskrit, Haryanvi, English |

Global flags

| Flag | Short | Description |
| --- | --- | --- |
| --token | | API token (overrides env and config) |
| --base-url | | API base URL (for private deployments) |
| --verbose | -v | Verbose mode, print HTTP details |

Output behavior

  • No -o flag: result goes to stdout; status/progress messages go to stderr
  • With -o flag: result saved to file/directory; progress messages on stderr
  • Batch mode (extract/crawl): requires -o to specify output directory
  • Binary formats (docx): cannot output to stdout, must use -o
  • Markdown output includes extracted images saved alongside the .md file

Agent guidelines

When using this skill on behalf of the user:

  • Quote file paths that contain spaces or special characters with double quotes. Example: mineru-open-api extract "report 01.pdf".
  • Don't blindly re-run a command after an error; explain the exit code and suggest troubleshooting steps.
  • Installation questions (for example "how do I install mineru?", in any language) should be answered with the install instructions above.
  • For stdout mode (no -o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.
  • If the user hasn't authenticated yet, guide them to create a token at https://mineru.net/apiManage/token and run mineru-open-api auth.
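
The output rules and the stdout guideline above can be folded into a single check that an agent might run before composing a command. This is only a sketch: needs_output_flag is a hypothetical helper, and it treats docx as the only binary format because that is the only one the rules name.

```shell
# Hypothetical helper: decide whether a planned call must include -o.
# Usage: needs_output_flag FORMATS INPUT_COUNT
#   FORMATS is the comma-separated value intended for -f, e.g. "md,html".
needs_output_flag() {
  formats="$1"
  input_count="$2"
  case "$formats" in
    *,*) echo "yes"; return 0 ;;   # stdout carries only one text format
  esac
  if [ "$formats" = "docx" ]; then
    echo "yes"; return 0           # binary output cannot go to stdout
  fi
  if [ "$input_count" -gt 1 ]; then
    echo "yes"; return 0           # batch mode needs an output directory
  fi
  echo "no"
}
```

A single Markdown extraction of one file is the only case here that can stream to stdout; every other combination prints yes.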

Default output directory

When the user does NOT specify -o, generate a default output directory:

~/MinerU-Skill/<name>_<hash>/
  • <name>: derived from the source, then sanitized (replace spaces and shell-unsafe characters with _, collapse consecutive _).
    • For URLs: last path segment (e.g. https://arxiv.org/pdf/2509.22186 → 2509.22186)
    • For local files: filename without extension (e.g. report.pdf → report)
  • <hash>: first 6 characters of MD5 hash of the full original source.
echo -n "source" | md5sum | cut -c1-6   # Linux
echo -n "source" | md5 | cut -c1-6      # macOS

When the user specifies -o: use the user's path as-is.
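
The naming rules above can be combined into one function. The following is a sketch under stated assumptions: default_out_dir is a hypothetical name, and "shell-unsafe" is interpreted as anything outside letters, digits, dot, underscore, and hyphen, which the rules do not spell out. Under that interpretation non-ASCII characters are also replaced with _.

```shell
# Hypothetical helper: print the default output directory for a source.
# Usage: default_out_dir SOURCE   (SOURCE is a local path or a URL)
default_out_dir() {
  src="$1"
  base="${src%/}"        # drop one trailing slash, if any
  base="${base##*/}"     # keep the last path segment
  case "$src" in
    http://*|https://*) name="$base" ;;   # URL: last path segment as-is
    *) name="${base%.*}" ;;               # local file: drop the extension
  esac
  # Assumption: only [A-Za-z0-9._-] is kept; everything else becomes "_",
  # then runs of "_" are collapsed to a single "_".
  name=$(printf '%s' "$name" | sed -e 's/[^A-Za-z0-9._-]/_/g' -e 's/__*/_/g')
  # First 6 hex characters of the MD5 of the full original source.
  if command -v md5sum >/dev/null 2>&1; then
    hash=$(printf '%s' "$src" | md5sum | cut -c1-6)   # Linux
  else
    hash=$(printf '%s' "$src" | md5 | cut -c1-6)      # macOS
  fi
  printf '%s/MinerU-Skill/%s_%s/\n' "$HOME" "$name" "$hash"
}

# Example (requires the CLI and a configured token):
# mineru-open-api extract "report 01.pdf" -o "$(default_out_dir "report 01.pdf")"
```

printf '%s' is used instead of echo -n so the hashed bytes are the same across shells.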

Skill upgrade = CLI upgrade

When the user asks to upgrade this skill, re-install the CLI first:

npm install -g mineru-open-api@latest

Exit codes

| Code | Meaning | Recovery |
| --- | --- | --- |
| 0 | Success | |
| 1 | General API or unknown error | Check network; retry; use --verbose |
| 2 | Invalid parameters / usage error | Check command syntax and flag values |
| 3 | Authentication error | Create or refresh token at https://mineru.net/apiManage/token, then run mineru-open-api auth |
| 4 | File too large or page limit exceeded | Split the file or use --pages |
| 5 | Extraction failed | Document may be corrupted; try a different --model |
| 6 | Timeout | Increase with --timeout; large files may need 1600+ seconds |
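
A wrapper script can turn these codes into readable hints. The sketch below simply restates the table as a case statement; explain_exit is a hypothetical helper name, not a CLI command.

```shell
# Hypothetical helper: map a mineru-open-api exit code to its meaning and recovery hint.
# Usage: explain_exit CODE
explain_exit() {
  case "$1" in
    0) echo "Success" ;;
    1) echo "General API or unknown error: check network, retry, use --verbose" ;;
    2) echo "Invalid parameters: check command syntax and flag values" ;;
    3) echo "Authentication error: create or refresh the token, then run mineru-open-api auth" ;;
    4) echo "File too large or page limit exceeded: split the file or use --pages" ;;
    5) echo "Extraction failed: document may be corrupted, try a different --model" ;;
    6) echo "Timeout: increase with --timeout" ;;
    *) echo "Unrecognized exit code: $1" ;;
  esac
}

# Example (requires the CLI and a configured token):
# mineru-open-api extract report.pdf -o ./out/ || explain_exit "$?"
```

In the usage line, $? on the right-hand side of || still holds the exit status of the failed extract command.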

Troubleshooting

  • "no API token found": Run mineru-open-api auth or set MINERU_TOKEN env variable. Create token at https://mineru.net/apiManage/token.
  • Timeout on large files: Increase with --timeout 1600
  • Batch fails partially: Check stderr for per-file status; succeeded files are still saved
  • Binary format to stdout: Use -o flag; docx cannot stream to stdout
  • Private deployment: Use --base-url https://your-server.com/api
  • Extraction quality is poor: Try --model vlm for complex layouts, or --ocr for scanned documents
  • Tables not extracted correctly: Try --model vlm for better table recognition
