Document to Markdown Conversion
Overview
Convert various document formats to clean Markdown using Microsoft's MarkItDown tool. Optimized for LLM processing, content extraction, and document analysis workflows.
Supported Formats: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xls), Images (with OCR/LLM), HTML, Audio (with transcription), CSV, JSON, XML, ZIP archives, EPubs
Quick Start
Basic Usage
from markitdown import MarkItDown
md = MarkItDown() result = md.convert("document.pdf") print(result.text_content)
Command Line
Convert single file
markitdown document.pdf > output.md markitdown document.pdf -o output.md
Pipe input
cat document.pdf | markitdown
🔒 Security Considerations
Before using in production:
-
✅ Validate file types (MIME, not extension)
-
✅ Limit file sizes (prevent DoS)
-
✅ Sanitize file paths (prevent traversal)
-
✅ Protect API keys (never hardcode)
-
✅ Consider data privacy (external services)
See patterns.md for implementation details.
API Key Security
❌ NEVER:
-
Hardcode keys in code
-
Commit .env files to git
-
Log environment variables
✅ ALWAYS:
- Use environment variables: export OPENAI_API_KEY="sk-..."
pragma: allowlist secret
-
Use secret management (AWS Secrets Manager, Azure Key Vault)
-
Rotate keys regularly
Common Patterns
PDF Documents
Basic PDF conversion
md = MarkItDown() result = md.convert("report.pdf")
With Azure Document Intelligence (better quality)
md = MarkItDown(docintel_endpoint="<your-endpoint>") result = md.convert("report.pdf")
Office Documents
Word documents - preserves structure
result = md.convert("document.docx")
Excel - converts tables to markdown tables
result = md.convert("spreadsheet.xlsx")
PowerPoint - extracts slide content
result = md.convert("presentation.pptx")
Images with Descriptions
✅ SECURE: Using environment variables for API keys
import os from openai import OpenAI
api_key = os.getenv("OPENAI_API_KEY") if not api_key: raise RuntimeError("OPENAI_API_KEY not set")
client = OpenAI(api_key=api_key) md = MarkItDown(llm_client=client, llm_model="gpt-4o") result = md.convert("diagram.jpg") # Gets AI-generated description
Batch Processing
from pathlib import Path
md = MarkItDown() documents = Path(".").glob("*.pdf")
for doc in documents: result = md.convert(str(doc)) output_path = doc.with_suffix(".md") output_path.write_text(result.text_content)
Installation
Full installation (all features)
pip install 'markitdown[all]'
Selective features
pip install 'markitdown[pdf, docx, pptx]'
Requirements: Python 3.10 or higher
Key Features
-
Structure Preservation: Maintains headings, lists, tables, links
-
Plugin System: Extend with custom converters
-
Docker Support: Containerized deployments
-
MCP Integration: Model Context Protocol server for LLM apps
When to Read Supporting Files
reference.md - Read when you need:
-
Complete API reference and all configuration options
-
Azure Document Intelligence integration details
-
Plugin development guide
-
Docker and MCP server setup
-
Troubleshooting and error handling
examples.md - Read when you need:
-
Working examples for specific file types
-
Batch processing workflows
-
Error handling patterns
-
Integration with existing pipelines
patterns.md - Read when you need:
-
Production deployment patterns
-
Performance optimization strategies
-
Security considerations
-
Anti-patterns to avoid
Quick Reference
File Type Use Case Command
PDF Reports, papers md.convert("file.pdf")
Word Documents md.convert("file.docx")
Excel Data tables md.convert("file.xlsx")
PowerPoint Presentations md.convert("file.pptx")
Images Diagrams with OCR md = MarkItDown(llm_client=client); md.convert("img.jpg")
HTML Web pages md.convert("page.html")
ZIP Archives md.convert("archive.zip")
- processes contents
⚠️ Common Mistakes to Avoid
Anti-Pattern 1: Hardcoded API Keys
❌ NEVER DO THIS
md = MarkItDown(llm_client=OpenAI(api_key="sk-hardcoded-key"))
✅ ALWAYS DO THIS
api_key = os.getenv("OPENAI_API_KEY") md = MarkItDown(llm_client=OpenAI(api_key=api_key))
Anti-Pattern 2: Unvalidated File Paths
❌ Vulnerable to path traversal
user_input = "../../../etc/passwd" md.convert(user_input)
✅ Validate and sanitize
from pathlib import Path safe_path = Path(user_input).resolve() if not safe_path.is_relative_to(allowed_dir): raise ValueError("Invalid path") md.convert(str(safe_path))
Anti-Pattern 3: Ignoring File Size Limits
❌ Can cause DoS
md.convert("huge_file.pdf") # No size check
✅ Check size first
max_size = 50 * 1024 * 1024 # 50MB if Path("file.pdf").stat().st_size > max_size: raise ValueError("File too large")
Common Issues
Import Error: Ensure Python >= 3.10 and markitdown installed Missing Dependencies: Install with pip install 'markitdown[all]'
Image Descriptions Not Working: Requires LLM client (OpenAI or compatible)
For detailed troubleshooting, see reference.md.