markitdown

Convert documents in many formats (PDF, Word, PowerPoint, Excel, images, audio, HTML, etc.) to Markdown using Microsoft's markitdown tool. Supports OCR, audio transcription, and YouTube transcript extraction.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "markitdown" with this command: npx skills add lanyasheng/ms-markitdown

MarkItDown

Microsoft's lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.

Installation

Prerequisites

  • Python 3.10+
  • Java 11+ (for some converters)

Install via pipx (recommended)

pipx install 'markitdown[all]'

Install via pip

pip install 'markitdown[all]'

Minimal install (specific formats only)

pip install 'markitdown[pdf,docx,pptx]'
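
After installing, it can help to confirm the environment from Python. The following is a minimal sketch and not part of MarkItDown: the helper name `check_markitdown_install` is hypothetical, and it uses only the standard library to check the Python 3.10+ prerequisite listed above, whether the `markitdown` package can be located, and whether a `markitdown` executable is on `PATH`.

```python
import importlib.util
import shutil
import sys


def check_markitdown_install() -> dict:
    """Report basic install status (hypothetical helper, stdlib only)."""
    return {
        # Prerequisite from this document: Python 3.10 or newer.
        "python_ok": sys.version_info >= (3, 10),
        # True if the `markitdown` package can be located for import.
        "module_found": importlib.util.find_spec("markitdown") is not None,
        # True if a `markitdown` executable is on PATH (for example via pipx).
        "cli_found": shutil.which("markitdown") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_markitdown_install().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

With a pipx install, `cli_found` may be true while `module_found` is false, because pipx keeps the package in its own isolated environment.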

Supported Formats

| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Preserves structure, tables, links |
| Word | .docx | Headings, lists, tables |
| PowerPoint | .pptx | Slides to Markdown |
| Excel | .xlsx, .xls | Table data |
| Images | .png, .jpg, etc. | EXIF metadata + OCR |
| Audio | .wav, .mp3 | Speech transcription |
| HTML | .html, .htm | Web content |
| YouTube | URL | Video transcription |
| ZIP | .zip | Iterates over contents |
| EPub | .epub | E-books |
| Text | .csv, .json, .xml | Text-based formats |
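
When filtering a folder before conversion, the list of formats can be turned into a simple extension check. This is a sketch under stated assumptions: the set below is transcribed by hand from the table, not queried from the library, the "etc." for images is left unexpanded, and the names `TABLE_EXTENSIONS` and `listed_in_table` are hypothetical.

```python
from pathlib import Path

# Extensions transcribed from the table in this document (an assumption
# about coverage, not something reported by markitdown itself).
TABLE_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".png", ".jpg",
    ".wav", ".mp3", ".html", ".htm", ".zip", ".epub",
    ".csv", ".json", ".xml",
}


def listed_in_table(path: str) -> bool:
    """Return True if the file's extension appears in the table."""
    return Path(path).suffix.lower() in TABLE_EXTENSIONS


print(listed_in_table("Report.PDF"))  # True (comparison is case-insensitive)
print(listed_in_table("notes.txt"))   # False (not listed in the table)
```

A `False` result only means the extension is not in the table; the installed version may still accept the file.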

CLI Usage

Basic Conversion

# PDF to Markdown
markitdown document.pdf > output.md

# Word to Markdown
markitdown document.docx -o output.md

# PowerPoint to Markdown
markitdown presentation.pptx -o output.md

Pipe Input

cat document.pdf | markitdown

Image OCR

markitdown screenshot.png -o text.md

YouTube Video

markitdown "https://youtube.com/watch?v=..." -o transcript.md
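
The same CLI calls can be driven from a script. The sketch below assumes the `markitdown` executable is on `PATH` and uses only the `-o` option shown above; `build_command` and `run_markitdown` are hypothetical helper names.

```python
import subprocess


def build_command(source: str, output: str) -> list:
    """Build the argument list for `markitdown <source> -o <output>`."""
    return ["markitdown", source, "-o", output]


def run_markitdown(source: str, output: str) -> None:
    """Run the CLI once.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    FileNotFoundError if the executable is not on PATH.
    """
    subprocess.run(build_command(source, output), check=True)


if __name__ == "__main__":
    print(build_command("document.pdf", "output.md"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.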

Python API Usage

from markitdown import MarkItDown

# Initialize
md = MarkItDown()

# Convert file
result = md.convert("document.pdf")
print(result.text_content)

# Convert from stream
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f)
    print(result.text_content)
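
The two calls above can be wrapped in one helper that accepts either a path or in-memory bytes. This is a minimal sketch, not the library's own API: `to_markdown` and its `converter` parameter are hypothetical, and the only MarkItDown calls used are `convert`, `convert_stream`, and `text_content` as shown above. How reliably a stream's format is detected without a file name may depend on the installed version.

```python
import io


def to_markdown(source, converter=None) -> str:
    """Convert a file path or raw bytes to Markdown text.

    `converter` is any object exposing `convert(path)` and
    `convert_stream(binary_stream)`; by default a MarkItDown instance.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        from markitdown import MarkItDown
        converter = MarkItDown()

    if isinstance(source, (bytes, bytearray)):
        # In-memory data: wrap it in a binary stream, like open(..., "rb").
        result = converter.convert_stream(io.BytesIO(source))
    else:
        result = converter.convert(str(source))
    return result.text_content
```

For example, `to_markdown("document.pdf")` reads from disk, while `to_markdown(downloaded_bytes)` converts data that never touches the file system (`downloaded_bytes` is a placeholder name).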

Options

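Available flags depend on the installed version; run markitdown --help to confirm which of these your release supports.

| Option | Description | Example |
|---|---|---|
| -o, --output | Output file | -o output.md |
| --format | Output format (default: markdown) | --format json |
| --pages | Specific pages | --pages "1,3,5-7" |
| --image-output | Image handling | --image-output external |
| --quiet | Suppress output | --quiet |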

MCP Server

MarkItDown provides an MCP (Model Context Protocol) server for integration with LLM applications:

pip install markitdown-mcp

Best Practices

  1. Batch processing: Convert multiple files in one script run, reusing a single MarkItDown instance, rather than starting the tool once per file
  2. Format selection: Use the minimal install when you only need specific formats
  3. OCR quality: Scan documents at 300 DPI or higher
  4. Output review: Always check the Markdown output for complex documents (tables, multi-column layouts)
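
The batch-processing and output-review points can be sketched together as a folder conversion that records failures for later inspection. This is an illustration under assumptions, not a MarkItDown feature: `convert_folder`, `pattern`, and the `converter` parameter are hypothetical names, and the only library call relied on is `MarkItDown().convert(path)` with its `text_content` result.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, pattern="*.pdf", converter=None):
    """Convert every file matching `pattern` in src_dir.

    Writes one .md file per input into out_dir and returns a dict that
    maps each failed file name to its error message.
    """
    if converter is None:
        from markitdown import MarkItDown
        converter = MarkItDown()  # one instance reused for all files

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = {}
    for path in sorted(Path(src_dir).glob(pattern)):
        try:
            result = converter.convert(str(path))
        except Exception as exc:  # keep going; report the file afterwards
            failures[path.name] = str(exc)
            continue
        target = out / f"{path.stem}.md"
        target.write_text(result.text_content, encoding="utf-8")
    return failures
```

Returning the failures instead of raising on the first error lets a long batch finish, and the returned dict gives a short list of files to review by hand.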

Troubleshooting

Java not found

Install Java 11+:

# macOS
brew install openjdk@17

# Ubuntu
sudo apt install openjdk-17-jdk

Permission denied

Use pipx or a virtual environment:

python3 -m venv ~/.venvs/markitdown
source ~/.venvs/markitdown/bin/activate
pip install 'markitdown[all]'

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
