# Everything2Markdown - Convert Anything to Markdown
A powerful document conversion tool based on Microsoft MarkItDown, specifically optimized for AGENT and LLM workflows.
## Core Features
- ✅ Universal Support: PDF, DOCX, PPTX, XLSX, EPUB, HTML, CSV, JSON, XML
- ✅ Rich Media: Image OCR, audio transcription, YouTube subtitle extraction
- ✅ Structure Preservation: Headings, tables, lists, links maintained
- ✅ Metadata Extraction: Author, creation date, page count, etc.
- ✅ AGENT Optimized: Clean Markdown output perfect for LLM processing
## Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Full text extraction with structure |
| Word | .docx, .doc | Preserves headings and tables |
| PowerPoint | .pptx, .ppt | Slide-by-slide conversion |
| Excel | .xlsx, .xls | Sheet-to-table conversion |
| EPUB | .epub | E-book format |
| HTML | .html, .htm | Web page conversion |
| Images | .png, .jpg, .gif | OCR text extraction |
| Audio | .mp3, .wav, .m4a | Speech-to-text |
| Archives | .zip | Iterates contents |
| YouTube | URL | Subtitle and metadata |
## Quick Start

### Single File Conversion

```bash
markitdown document.pdf -o output.md
```
### Batch Conversion

```bash
# Convert all PDFs in the directory
for f in *.pdf; do
    markitdown "$f" -o "${f%.pdf}.md"
done

# Or with find
find . -name "*.pdf" -o -name "*.docx" | while IFS= read -r f; do
    out="${f%.*}.md"
    markitdown "$f" -o "$out"
    echo "✓ Converted: $f → $out"
done
```
### Advanced Options

```bash
# Verbose output
markitdown document.pdf -o output.md --verbose

# Keep intermediate files
markitdown document.pdf -o output.md --keep-temp

# Specify encoding
markitdown document.pdf -o output.md --encoding utf-8
```
## Python API

### Basic Usage

```python
from markitdown import MarkItDown

# Initialize
md = MarkItDown()

# Convert a file
result = md.convert("document.pdf")

# Access the content
print(result.text_content)  # Markdown text
print(result.metadata)      # Document metadata
```
### Advanced API

```python
from markitdown import MarkItDown
from markitdown.converters import DocumentConverter

# Custom configuration; MyCustomConverter stands for a user-defined
# DocumentConverter subclass
md = MarkItDown(
    enable_plugins=True,
    custom_converters=[MyCustomConverter()]
)

# Convert with options
result = md.convert(
    "document.pdf",
    keep_formatting=True,
    extract_images=True
)

# Access structured data
print(f"Title: {result.metadata.get('title')}")
print(f"Author: {result.metadata.get('author')}")
print(f"Pages: {result.metadata.get('pages')}")
print(f"Word count: {len(result.text_content.split())}")
```
## AGENT Workflow Best Practices

### 1. Document Preprocessing Pipeline

```python
import re

from markitdown import MarkItDown

def preprocess_for_llm(file_path):
    """
    Preprocess a document for optimal LLM consumption.
    Cleans noise, normalizes formatting, extracts structure.
    """
    # Convert to Markdown
    md = MarkItDown()
    result = md.convert(file_path)
    text = result.text_content

    # Clean excessive formatting
    text = re.sub(r'\*{4,}', '***', text)     # Limit runs of asterisks
    text = re.sub(r'-{4,}', '---', text)      # Normalize horizontal rules
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Limit blank lines

    # Cap heading depth at six levels
    text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)

    return {
        'content': text,
        'metadata': result.metadata,
        'original_length': len(result.text_content),
        'processed_length': len(text)
    }
```
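The returned dictionary drops straight into a prompt. A minimal usage sketch, where `report.pdf` is a placeholder file name:

```python
# Illustrative usage; "report.pdf" is a placeholder path.
doc = preprocess_for_llm("report.pdf")
print(f"Reduced {doc['original_length']} -> {doc['processed_length']} chars")

# Feed the cleaned Markdown to an LLM prompt.
prompt = f"Summarize the following document:\n\n{doc['content']}"
```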
### 2. Structured Section Extraction

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    level: int
    title: str
    content: str
    start_line: int
    end_line: int

def extract_sections(md_text: str) -> List[Section]:
    """
    Extract hierarchical sections from Markdown.
    Preserves document structure for AGENT processing.
    """
    lines = md_text.split('\n')
    sections = []

    # Find all headers
    header_pattern = re.compile(r'^(#{1,6})\s+(.+)$')
    headers = []
    for i, line in enumerate(lines):
        match = header_pattern.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            headers.append({'level': level, 'title': title, 'line': i})

    # Extract the content between consecutive headers
    for i, header in enumerate(headers):
        start_line = header['line']
        end_line = headers[i + 1]['line'] if i + 1 < len(headers) else len(lines)
        content = '\n'.join(lines[start_line + 1:end_line]).strip()
        sections.append(Section(
            level=header['level'],
            title=header['title'],
            content=content,
            start_line=start_line,
            end_line=end_line
        ))

    return sections
```
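A quick way to sanity-check the extraction is to print the document outline. This sketch reuses the `doc` dictionary from the preprocessing example above:

```python
# Print an indented outline of the extracted sections.
sections = extract_sections(doc['content'])
for s in sections:
    indent = '  ' * (s.level - 1)
    print(f"{indent}{s.title} (lines {s.start_line}-{s.end_line}, {len(s.content)} chars)")
```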
### 3. RAG-Optimized Chunking

```python
import hashlib
from typing import Dict, List

def chunk_for_rag(
    md_text: str,
    max_chunk_size: int = 1500,
    overlap: int = 200,
    preserve_headers: bool = True  # reserved; not used by this simple splitter
) -> List[Dict]:
    """
    Chunk Markdown for optimal RAG retrieval.
    Preserves semantic boundaries and provides rich metadata.
    """
    chunks = []
    current_chunk = []
    current_size = 0
    chunk_index = 0

    # Split at natural paragraph boundaries
    paragraphs = md_text.split('\n\n')

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        para_size = len(para)

        # If adding this paragraph would exceed the limit, flush the current chunk
        if current_size + para_size > max_chunk_size and current_chunk:
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(create_chunk_dict(chunk_text, chunk_index, md_text))

            # Start a new chunk, carrying trailing context over as overlap
            if overlap > 0:
                overlap_text = ('\n\n'.join(current_chunk[-2:])
                                if len(current_chunk) >= 2
                                else chunk_text[-overlap:])
                current_chunk = [overlap_text, para]
                current_size = len(overlap_text) + para_size
            else:
                current_chunk = [para]
                current_size = para_size
            chunk_index += 1
        else:
            current_chunk.append(para)
            current_size += para_size + 2  # +2 for the joining newlines

    # Don't forget the last chunk
    if current_chunk:
        chunk_text = '\n\n'.join(current_chunk)
        chunks.append(create_chunk_dict(chunk_text, chunk_index, md_text))

    return chunks

def create_chunk_dict(text: str, index: int, source: str) -> Dict:
    """Create a chunk dictionary with metadata."""
    # Locate the chunk in the source to estimate its relative position;
    # overlap-joined chunks may not occur verbatim, so fall back to 0.
    pos = source.find(text)
    return {
        'index': index,
        'text': text,
        'length': len(text),
        'hash': hashlib.md5(text.encode()).hexdigest()[:8],
        'source_length': len(source),
        'percent_start': round((pos / len(source)) * 100, 1) if pos >= 0 else 0.0
    }
```
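Usage sketch, again reusing `doc` from the preprocessing example; the size and overlap values are arbitrary:

```python
# Chunk the cleaned Markdown and inspect the first few chunks.
chunks = chunk_for_rag(doc['content'], max_chunk_size=1000, overlap=150)
for c in chunks[:3]:
    print(f"#{c['index']} [{c['hash']}] {c['length']} chars, "
          f"starts at {c['percent_start']}% of source")
```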
### 4. Document Analysis Pipeline

```python
import re
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

from markitdown import MarkItDown

@dataclass
class DocumentAnalysis:
    """Complete document analysis for AGENT consumption."""
    file_path: str
    file_type: str
    word_count: int
    char_count: int
    section_count: int
    heading_levels: Dict[int, int]  # level -> count
    has_tables: bool
    has_images: bool
    has_links: bool
    estimated_reading_time: int  # minutes
    summary: str
    keywords: list
    metadata: Dict[str, Any]

class DocumentAnalyzer:
    """Analyze documents for AGENT workflows."""

    def __init__(self):
        self.md = MarkItDown()

    def analyze(self, file_path: str) -> DocumentAnalysis:
        """Run the complete document analysis."""
        # Convert to Markdown
        result = self.md.convert(file_path)
        text = result.text_content

        # Basic stats
        word_count = len(text.split())
        char_count = len(text)

        # Section analysis (Section and extract_sections come from section 2 above)
        sections = extract_sections(text)
        section_count = len(sections)

        # Count headings per level
        heading_levels = {}
        for s in sections:
            heading_levels[s.level] = heading_levels.get(s.level, 0) + 1

        # Feature detection (cheap heuristics)
        has_tables = '|' in text and '---' in text
        has_images = '![' in text
        has_links = 'http' in text or '[' in text

        # Reading time (assuming an average of 200 wpm)
        reading_time = max(1, word_count // 200)

        # Extract keywords (simple frequency count over longer words)
        words = re.findall(r'\b[A-Za-z][a-z]{4,}\b', text)
        word_freq = {}
        for w in words:
            w = w.lower()
            if w not in ['would', 'could', 'should', 'there', 'their', 'about']:
                word_freq[w] = word_freq.get(w, 0) + 1
        keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]

        # Generate a brief summary
        summary = self._generate_summary(text, sections)

        return DocumentAnalysis(
            file_path=file_path,
            file_type=file_path.split('.')[-1].lower(),
            word_count=word_count,
            char_count=char_count,
            section_count=section_count,
            heading_levels=heading_levels,
            has_tables=has_tables,
            has_images=has_images,
            has_links=has_links,
            estimated_reading_time=reading_time,
            summary=summary,
            keywords=[k[0] for k in keywords],
            metadata=result.metadata
        )

    def _generate_summary(self, text: str, sections: List[Section]) -> str:
        """Generate a brief document summary."""
        if not sections:
            return text[:500] + "..." if len(text) > 500 else text

        # Prefer listing the main (level 1-2) sections
        main_sections = [s for s in sections if s.level <= 2]
        if main_sections:
            section_list = ", ".join([s.title for s in main_sections[:5]])
            return f"Document covers: {section_list}"

        return text[:300] + "..." if len(text) > 300 else text
```
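Putting it together, a sketch that runs the analyzer and serializes the result for an agent (`document.pdf` is a placeholder):

```python
import json

# Analyze a document and emit the result as JSON.
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze("document.pdf")
print(json.dumps(asdict(analysis), indent=2, default=str))
```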
## Integration with OpenClaw

### Skill Definition

```yaml
# SKILL.md
name: doc2markdown
metadata:
  emoji: 📝
requires:
  python_packages: [markitdown]
install:
  - command: pip3 install 'markitdown[all]'
```
### Tool Usage

```python
# In an OpenClaw agent
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("/path/to/document.pdf")

# Use in a RAG pipeline (chunk_for_rag is defined in section 3 above;
# store_in_vector_db is sketched below)
chunks = chunk_for_rag(result.text_content)
store_in_vector_db(chunks)
```
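`store_in_vector_db` above is a placeholder. One possible implementation, sketched against the chromadb client API; the collection name `documents` and the ID scheme are assumptions:

```python
import chromadb

# Assumed backing store: a local Chroma collection named "documents".
client = chromadb.Client()
collection = client.get_or_create_collection("documents")

def store_in_vector_db(chunks):
    """Store chunk_for_rag output in the Chroma collection."""
    collection.add(
        ids=[f"{c['hash']}-{c['index']}" for c in chunks],  # hypothetical ID scheme
        documents=[c['text'] for c in chunks],
        metadatas=[{'index': c['index'], 'length': c['length']} for c in chunks],
    )
```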
## Performance Tips

- Batch Processing: Process multiple files in parallel (see the sketch below)
- Memory Management: For very large documents, process the content in streams or chunks
- Caching: Cache converted documents to avoid reprocessing (also covered in the sketch)
- Selective Conversion: Only convert the sections you need
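A minimal sketch combining the first and third tips, assuming a pool of four worker threads and an on-disk cache keyed by an MD5 hash of the file contents; the `.md_cache` directory is an arbitrary choice:

```python
import concurrent.futures
import hashlib
from pathlib import Path

from markitdown import MarkItDown

CACHE_DIR = Path(".md_cache")  # assumed cache location
CACHE_DIR.mkdir(exist_ok=True)

def convert_cached(path: str) -> str:
    """Convert one file, reusing the cached result when the content is unchanged."""
    key = hashlib.md5(Path(path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.md"
    if cache_file.exists():
        return cache_file.read_text()
    text = MarkItDown().convert(path).text_content
    cache_file.write_text(text)
    return text

def convert_many(paths: list) -> dict:
    """Convert several files in parallel worker threads."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        return dict(zip(paths, pool.map(convert_cached, paths)))
```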
## Security Considerations

- Sanitize file paths before processing (see the validation sketch below)
- Validate file types to prevent injection
- Handle sensitive content appropriately
- Enforce file size limits to prevent DoS
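A minimal sketch of the path, type, and size checks; the allowed extensions, base directory, and 50 MB cap are illustrative assumptions (requires Python 3.9+ for `Path.is_relative_to`):

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".epub", ".html"}  # assumed allowlist
BASE_DIR = Path("/data/inbox").resolve()   # assumed input directory
MAX_BYTES = 50 * 1024 * 1024               # assumed 50 MB limit

def validate_input(path: str) -> Path:
    """Reject paths outside BASE_DIR, unexpected file types, and oversized files."""
    p = Path(path).resolve()
    if not p.is_relative_to(BASE_DIR):
        raise ValueError(f"Path escapes base directory: {p}")
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"File too large: {p.stat().st_size} bytes")
    return p
```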
## Troubleshooting

| Issue | Solution |
|---|---|
| ImportError | Ensure `pip install 'markitdown[all]'` has been run |
| OCR fails | Install Tesseract: `apt install tesseract-ocr` |
| Audio fails | Install ffmpeg: `apt install ffmpeg` |
| Memory error | Process the document in smaller chunks |
## License

MIT License - see the LICENSE file for details.