office-doc-extractor

Convert Microsoft Office documents (DOCX, XLSX, PPTX) to Markdown without installing any external dependencies. Use when the user needs to extract text from Word documents, Excel spreadsheets, or PowerPoint presentations for analysis, indexing, or LLM processing. Pure Python implementation: no pip install, no subprocess calls, and no network downloads required. Works offline.

Office Document Extractor

Zero-dependency converter for Microsoft Office documents. Extracts text and structure from DOCX, XLSX, and PPTX files into clean Markdown.

Quick Start

# Single file
python3 scripts/main.py report.docx -o report.md

# Batch convert a directory
python3 scripts/main.py ./documents --batch -o ./markdown

Supported Formats

Format	Extension	Output
Word	.docx	Headings, paragraphs
Excel	.xlsx	Tables (one per sheet)
PowerPoint	.pptx	Slides as sections
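
To make the "Tables (one per sheet)" output more concrete, here is a minimal sketch of how a sheet's rows might be rendered as a Markdown table. The helper name `rows_to_markdown_table` is hypothetical and is not taken from this skill's scripts; treating the first row as the header is also an assumption.

```python
def rows_to_markdown_table(rows):
    """Render a list of rows as a Markdown table (first row = header).

    Hypothetical helper for illustration; not the skill's actual code.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    if width == 0:
        return ""

    def fmt(row):
        # Blank out None, flatten newlines, escape pipes that would break the table
        cells = [
            "" if cell is None else str(cell).replace("|", "\\|").replace("\n", " ")
            for cell in row
        ]
        # Pad short rows so every line has the same number of columns
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

With openpyxl, the rows could plausibly come from `worksheet.iter_rows(values_only=True)`, with one such table emitted per sheet.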

How It Works

DOCX: Parses the ZIP archive's XML directly using Python's zipfile and xml.etree
XLSX: Uses bundled openpyxl (pure Python, no C extensions)
PPTX: Parses the ZIP archive's slide XML directly

No external commands, no network calls, no pip install required.
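
As a rough illustration of the DOCX approach described above, the sketch below reads `word/document.xml` from the archive with `zipfile` and walks its paragraphs with `xml.etree.ElementTree`. The function name `docx_to_markdown` and the rule that maps `HeadingN` style ids to Markdown heading levels are assumptions made for this example; the skill's own `docx_extractor.py` may work differently.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_markdown(source):
    """Return Markdown for the paragraphs of a .docx file.

    `source` is a path or a binary file-like object.
    Hypothetical helper for illustration; not the skill's actual code.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        # Join all text runs that belong to this paragraph
        text = "".join(t.text or "" for t in para.iter(W + "t")).strip()
        if not text:
            continue
        # Style id such as "Heading1"; missing for plain paragraphs
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        if style_id.startswith("Heading") and style_id[7:].isdigit():
            # Assumption: HeadingN maps to N '#' characters, clamped to 1..6
            level = max(1, min(int(style_id[7:]), 6))
            lines.append("#" * level + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```

Because a DOCX file is an ordinary ZIP archive, nothing beyond the standard library is needed for this step.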

Usage

Single File

python3 scripts/main.py <input_file> [-o <output.md>]

The format is auto-detected from the file extension. If -o is omitted, the output is written to <input>.md.
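
Extension-based detection and the default output name could be sketched as follows. `detect_format` and `default_output_path` are hypothetical names, and the reading of `<input>.md` as "the input path with its extension replaced by `.md`" is an assumption; the actual script may name the file differently.

```python
from pathlib import Path

# Extension -> format name, limited to the three formats this document lists
FORMATS = {".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx"}


def detect_format(path):
    """Return 'docx', 'xlsx' or 'pptx' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    return FORMATS[suffix]


def default_output_path(path):
    """Return the output path used when -o is not given.

    Assumption: report.docx -> report.md (extension replaced, same directory).
    """
    return Path(path).with_suffix(".md")
```

Lower-casing the suffix lets files such as `REPORT.DOCX` be recognized as well.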

Batch Conversion

python3 scripts/main.py <input_directory> --batch [-o <output_directory>]

Converts all .docx, .xlsx, and .pptx files in the directory. Results are saved to markdown_output/ by default.
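
One way to picture batch mode is as a planning step that pairs each supported input file with a target path under the output directory. The sketch below is only an illustration: `plan_batch` is a hypothetical name, and both the non-recursive directory scan and the skipping of Office lock files (names starting with `~$`) are assumptions rather than documented behavior.

```python
from pathlib import Path

SUPPORTED = {".docx", ".xlsx", ".pptx"}


def plan_batch(input_dir, output_dir="markdown_output"):
    """Return sorted (source, target) path pairs for a batch run.

    Hypothetical helper for illustration; not the skill's actual code.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for source in sorted(input_dir.iterdir()):
        # Skip subdirectories and files with unsupported extensions
        if not source.is_file() or source.suffix.lower() not in SUPPORTED:
            continue
        # Skip temporary lock files that Office creates next to open documents
        if source.name.startswith("~$"):
            continue
        pairs.append((source, output_dir / (source.stem + ".md")))
    return pairs
```

Note that under this naming scheme `report.docx` and `report.xlsx` would both map to `report.md`, so a real implementation would need some rule for such collisions.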

Resources

scripts/

main.py — Unified CLI for single-file and batch conversion
docx_extractor.py — DOCX → Markdown (standard library only)
xlsx_extractor.py — XLSX → Markdown tables (bundled openpyxl)
pptx_extractor.py — PPTX → Markdown (standard library only)
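
For orientation, a standard-library PPTX extraction along the lines of pptx_extractor.py might look like the sketch below. It is not the script's actual contents: `pptx_to_markdown` is a hypothetical name, the `## Slide N` heading format is an assumption, and slides are ordered by the number in their file name, which only approximates the presentation order recorded in `ppt/presentation.xml`.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text lives in a:t elements inside a:p paragraphs
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_markdown(source):
    """Return Markdown with one section per slide of a .pptx file.

    `source` is a path or a binary file-like object.
    """
    sections = []
    with zipfile.ZipFile(source) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Sort numerically so slide10 comes after slide2
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(A + "p"):
                text = "".join(t.text or "" for t in para.iter(A + "t")).strip()
                if text:
                    paragraphs.append(text)
            sections.append("\n\n".join([f"## Slide {number}"] + paragraphs))
    return "\n\n".join(sections)
```

Sorting on the parsed integer rather than the file name avoids the lexicographic trap where `slide10.xml` sorts before `slide2.xml`.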

Bundled Dependencies

openpyxl/ — Pure Python Excel library (v3.1.5)
et_xmlfile/ — openpyxl dependency (pure Python)

Limitations

Does not extract images or embedded objects (text only)
Does not preserve complex formatting (colors, fonts, layouts)
Does not handle encrypted/password-protected files
No OCR for scanned documents (use OpenClaw's native pdf tool for that)

Why This Skill?

Existing markitdown-based skills require pip install or external CLI tools, both of which trigger ClawHub security warnings. This skill is fully self-contained: install it and use it immediately, even offline.
