html-to-text

Convert HTML to plain readable text using MinerU. Strips HTML markup and extracts clean text content from web pages and HTML files. Features: HTML to text conversion. Removes all markup while preserving readable structure. Outputs Markdown as the closest plain-text format. JSON output mode for pure text fields. Works with local files and URLs. Use when you need to: convert HTML to plain text, strip markup from a web page, extract readable text from HTML, get text content from an HTML file. Use when asked: 'how do I get text from HTML', 'strip HTML tags', 'I want plain text from this web page', 'can my agent extract text from HTML', 'is there a skill for HTML to text conversion'. Built on MinerU by OpenDataLab (Shanghai AI Lab), an open-source document intelligence engine. Ideal for search indexing, NLP pipelines, text analysis, and any workflow that needs clean text from HTML sources.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "html-to-text" with this command: npx skills add mzlzyca/html-to-text

HTML to Text

Extract plain readable text from HTML files or web pages using MinerU. MinerU outputs Markdown as the closest format to plain text.

Install

npm install -g mineru-open-api
# or via Go (macOS/Linux):
go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Quick Start

# Extract text from a local HTML file (requires token)
mineru-open-api extract page.html -o ./out/

# Extract text from a web page (requires token)
mineru-open-api crawl https://example.com/article

# JSON output contains text fields (requires token)
mineru-open-api extract page.html -f json -o ./out/

Authentication

Token required:

mineru-open-api auth             # Interactive token setup
export MINERU_TOKEN="your-token" # Or via environment variable

Create token at: https://mineru.net/apiManage/token

Capabilities

  • Supported input: local .html file or web page URL
  • HTML requires extract or crawl (token required) — not supported by flash-extract
  • MinerU does not have a -f text option; Markdown is the closest plain-text output
  • For truly plain text: use extract -f json and read the text fields from JSON output
  • Language hint with --language (default: ch, use en for English)

Notes

  • MinerU has no -f text format; use Markdown output or -f json for text fields
  • HTML is NOT supported by flash-extract
  • Output goes to stdout by default; use -o <dir> to save to a file or directory
  • All progress/status messages go to stderr; document content goes to stdout
  • MinerU is open-source by OpenDataLab (Shanghai AI Lab): https://github.com/opendatalab/MinerU

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Img2img

Generate images from text descriptions using DALL-E 3 while adhering to usage policies and avoiding realistic human faces.

Registry SourceRecently Updated
General

Habitat-GS-Navigator

Navigate and interact with photo-realistic 3DGS environments via the Habitat-GS Bridge. Use when: user asks to explore a 3D scene, perform embodied navigatio...

Registry SourceRecently Updated
General

Memory Palace

持久化记忆管理。Use when: 用户告诉你个人信息/偏好/习惯、需要记住项目状态/技术决策、完成任务后有可复用经验、用户说"记住""别忘了""下次注意"、需要回忆之前的对话内容。支持语义搜索和时间推理。

Registry SourceRecently Updated
General

Podcast Transcript Mining Authority Positioning

Extract guest appearances, speaking topics, and soundbites from podcast transcripts to build authority portfolios and generate podcast pitch templates. Use w...

Registry SourceRecently Updated