html-parse

Parse HTML documents into structured Markdown using MinerU. The skill analyzes HTML structure and converts it into well-organized Markdown, preserving hierarchy and formatting.

Features:

  • Structural HTML parsing
  • Preserves headings, lists, tables, and nested elements
  • Converts HTML to structured Markdown
  • Works with local files and URLs

Use when you need to: parse HTML structure, convert HTML to structured Markdown, analyze HTML document layout, or extract structured content from web pages.

Use when asked: 'how do I parse this HTML', 'convert HTML to Markdown with structure', 'I want to understand this HTML layout', 'can my agent parse HTML files', 'is there a skill for HTML parsing'.

Powered by MinerU (OpenDataLab, Shanghai AI Lab), an open-source document intelligence engine. Ideal for developers, content managers, and data pipelines that need to parse and restructure HTML content.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "html-parse" with this command: npx skills add mzlzyca/html-parse

HTML Parse

Parse local HTML files or remote HTML URLs into structured Markdown using MinerU, preserving document hierarchy. For live web pages, use mineru-open-api crawl.

Install

npm install -g mineru-open-api
# or via Go (macOS/Linux):
go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Quick Start

# Parse a local HTML file (requires token)
mineru-open-api extract page.html -o ./out/

# Parse a remote HTML URL (requires token)
mineru-open-api extract https://example.com/page.html -o ./out/

# Parse a live web page (requires token)
mineru-open-api crawl https://example.com/article -o ./out/

Authentication

Token required:

mineru-open-api auth             # Interactive token setup
export MINERU_TOKEN="your-token" # Or via environment variable

Create token at: https://mineru.net/apiManage/token
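
Since every HTML command needs a token, a script may want to fail early when none is configured. The following is a minimal sketch, not part of the CLI: `require_mineru_token` is a hypothetical helper, and it only inspects the `MINERU_TOKEN` environment variable, so a token stored through the interactive `mineru-open-api auth` flow would not be detected by it.

```shell
# Hypothetical helper: stop early when MINERU_TOKEN is missing.
# Assumption: the token is supplied through the environment variable;
# a token saved by `mineru-open-api auth` is not visible to this check.
require_mineru_token() {
  if [ -z "${MINERU_TOKEN:-}" ]; then
    echo "MINERU_TOKEN is not set; create one at https://mineru.net/apiManage/token" >&2
    return 1
  fi
}
```

A script could then chain it in front of a call, for example `require_mineru_token && mineru-open-api extract page.html -o ./out/`.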

Capabilities

  • Supported input: local .html file or remote HTML URL
  • HTML requires extract or crawl (token required)
  • HTML is NOT supported by flash-extract
  • Language hint with --language (default: ch, use en for English)
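
The choice between extract and crawl can be sketched as a small dispatch function. This is only an illustrative heuristic: `pick_mineru_subcommand` is a hypothetical name, and the rule "URLs ending in .html or .htm are static HTML, every other URL is a live page" is an assumption, not something the CLI itself enforces.

```shell
# Hypothetical helper: choose a subcommand from the shape of the input.
# Assumption: a URL ending in .html/.htm is a static HTML document;
# any other http(s) URL is treated as a live web page.
pick_mineru_subcommand() {
  case "$1" in
    http://*.html|https://*.html|http://*.htm|https://*.htm) echo extract ;;
    http://*|https://*) echo crawl ;;
    *) echo extract ;;   # anything else is treated as a local file
  esac
}
```

For instance, `mineru-open-api "$(pick_mineru_subcommand "$src")" "$src" -o ./out/` would route `page.html` to extract and `https://example.com/article` to crawl.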

Notes

  • For live web pages with dynamic content, use crawl instead of extract
  • Output goes to stdout by default; use -o <dir> to save to a file or directory
  • All progress/status messages go to stderr; document content goes to stdout
  • MinerU is open-source by OpenDataLab (Shanghai AI Lab): https://github.com/opendatalab/MinerU
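
Because document content and status messages travel on separate streams, they can be captured independently with ordinary shell redirection. The wrapper below is a sketch under that assumption; `extract_to_file` and its three arguments are hypothetical names chosen for illustration.

```shell
# Hypothetical wrapper (function name and arguments are illustrative).
# Relies on the behavior described in the notes above: with no -o flag,
# Markdown goes to stdout and progress messages go to stderr.
extract_to_file() {
  input="$1"      # local .html file or remote HTML URL
  markdown="$2"   # destination for the converted Markdown
  log="$3"        # destination for progress/status messages
  mineru-open-api extract "$input" > "$markdown" 2> "$log"
}
```

Calling `extract_to_file page.html page.md extract.log` would leave the Markdown in `page.md` and the progress output in `extract.log`, which keeps the document clean when it is piped into other tools.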

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
