Office Document Extractor
Converter for Microsoft Office documents with no external dependencies. Extracts text and structure from DOCX, XLSX, and PPTX files into clean Markdown.
Quick Start
# Single file
python3 scripts/main.py report.docx -o report.md
# Batch convert a directory
python3 scripts/main.py ./documents --batch -o ./markdown
Supported Formats
| Format | Extension | Output |
|---|---|---|
| Word | .docx | Headings, paragraphs |
| Excel | .xlsx | Tables (one per sheet) |
| PowerPoint | .pptx | Slides as sections |
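To make the "Tables (one per sheet)" output concrete, here is a minimal sketch of how one sheet's rows might be rendered as a Markdown table. The function name `rows_to_markdown` is hypothetical and is not taken from xlsx_extractor.py. Treating the first row as the header and escaping pipe characters are assumptions made for illustration.

```python
def rows_to_markdown(rows):
    """Render a list of rows (lists of cell values) as a Markdown table.

    Assumption: the first row is the header. Empty cells (None) become
    empty strings, and short rows are padded to the widest row.
    """
    rows = [["" if cell is None else str(cell) for cell in row] for row in rows]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    def fmt(cells):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        safe = (c.replace("|", "\\|").replace("\n", " ") for c in cells)
        return "| " + " | ".join(safe) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown([["Name", "Qty"], ["Widget", 3], ["Gadget", None]]))
```

With openpyxl, rows in this shape could come from `worksheet.iter_rows(values_only=True)`, called once per sheet.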
How It Works
- DOCX: Parses the ZIP archive's XML directly using Python's zipfile and xml.etree
- XLSX: Uses bundled openpyxl (pure Python, no C extensions)
- PPTX: Parses the ZIP archive's slide XML directly
No external commands, no network calls, no pip install required.
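The DOCX approach described above can be sketched with the standard library alone. This is an illustration, not the code in docx_extractor.py. The function names are hypothetical, and the heading rule is an assumption: it treats paragraph style IDs of the form `Heading1` to `Heading6`, as written by English-language Word, as Markdown headings.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def docx_paragraphs(source):
    """Yield (style_id, text) for each non-empty paragraph in a DOCX file.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(f"{{{W_NS}}}p"):
        style_el = para.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W_NS}}}val") if style_el is not None else None
        # A paragraph's text is split across runs; join every <w:t> element.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            yield style, text


def to_markdown(paragraphs):
    """Join paragraphs into Markdown, mapping HeadingN styles to '#' prefixes."""
    blocks = []
    for style, text in paragraphs:
        if style and style.startswith("Heading") and style[7:].isdigit():
            level = min(int(style[7:]), 6)  # Markdown has six heading levels
            blocks.append("#" * level + " " + text)
        else:
            blocks.append(text)
    return "\n\n".join(blocks)
```

Calling `to_markdown(docx_paragraphs("report.docx"))` would return the document body as a single Markdown string. A PPTX file could be handled the same way by reading the slide XML parts from the archive instead of `word/document.xml`.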
Usage
Single File
python3 scripts/main.py <input_file> [-o <output.md>]
The format is auto-detected from the file extension. If -o is omitted, output is written to <input>.md.
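Extension-based detection might look like the sketch below. The `FORMATS` mapping and the `detect_format` function are hypothetical names, and matching case-insensitively is an assumption. The actual dispatch in main.py may differ.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
FORMATS = {".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx"}


def detect_format(path):
    """Return the format label for a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()  # assumption: ".DOCX" is accepted too
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported file extension: {suffix or '(none)'}"
        ) from None
```

Failing loudly on an unknown extension keeps a misnamed file from being passed to the wrong extractor.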
Batch Conversion
python3 scripts/main.py <input_directory> --batch [-o <output_directory>]
Converts all .docx, .xlsx, and .pptx files in the directory. Results are saved to markdown_output/ by default.
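As a rough sketch of how a batch run could pair inputs with outputs: `plan_batch` is a hypothetical helper, not part of main.py. It assumes a non-recursive scan, case-insensitive extension matching, and output files named after the input's stem, none of which the text above confirms.

```python
from pathlib import Path

SUPPORTED = {".docx", ".xlsx", ".pptx"}


def plan_batch(input_dir, output_dir="markdown_output"):
    """Return (source, destination) path pairs for every supported file.

    Assumptions: only the top level of input_dir is scanned, and each
    output is named <stem>.md inside output_dir.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for src in sorted(input_dir.iterdir()):
        if src.is_file() and src.suffix.lower() in SUPPORTED:
            jobs.append((src, output_dir / (src.stem + ".md")))
    return jobs
```

Under this naming scheme, report.docx and report.xlsx would both map to report.md, so a real implementation would need some way to tell them apart.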
Resources
scripts/
- main.py — Unified CLI for single-file and batch conversion
- docx_extractor.py — DOCX → Markdown (standard library only)
- xlsx_extractor.py — XLSX → Markdown tables (bundled openpyxl)
- pptx_extractor.py — PPTX → Markdown (standard library only)
Bundled Dependencies
- openpyxl/ — Pure Python Excel library (v3.1.5)
- et_xmlfile/ — openpyxl dependency (pure Python)
Limitations
- Does not extract images or embedded objects (text only)
- Does not preserve complex formatting (colors, fonts, layouts)
- Does not handle encrypted/password-protected files
- No OCR for scanned documents (use OpenClaw's native pdf tool for that)
Why This Skill?
Existing markitdown-based skills require pip install or external CLI tools, which trigger ClawHub security warnings. This skill is fully self-contained: install it and use it immediately, even offline.