docx-toolkit

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication and filtering for extracted images.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

To install the skill, run:

npx skills add zacjiang/docx-toolkit

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
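As a rough sketch of consuming that output, the rows can be split back into cells as shown here. It assumes each table row is written as a single line with cells separated by `|`; the exact spacing and any leading or trailing pipes are assumptions, so check a real output file first. `parse_pipe_rows` is a hypothetical helper, not part of the toolkit.

```python
from typing import List


def parse_pipe_rows(text: str) -> List[List[str]]:
    """Split pipe-delimited lines into lists of trimmed cell strings.

    Assumption: any line containing '|' is a table row; other lines are
    ordinary paragraphs and are skipped.
    """
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # paragraph text, not a table row
        # Drop optional leading/trailing pipes, then split on the rest.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)
    return rows


sample = "Quarterly report\nName | Qty | Price\nWidget | 2 | 9.50\n"
print(parse_pipe_rows(sample))
```

A paragraph that happens to contain a literal `|` would be misread as a row, so a stricter check may be needed for real documents.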

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
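To illustrate the general idea, here is a minimal sketch that reads the WordDocument stream with olefile and pulls out runs of UTF-16LE text. It is a crude heuristic and not necessarily what extract_doc_text.py does: it ignores Word's piece table, so it can miss 8-bit text and may pick up stray bytes. The function names `utf16_text_runs` and `read_word_stream` are hypothetical.

```python
import re


def utf16_text_runs(data: bytes, min_chars: int = 4):
    """Return runs of printable text found by decoding bytes as UTF-16LE.

    Heuristic only: control characters (including the carriage returns
    Word uses as paragraph marks) end a run; tabs are kept.
    """
    decoded = data.decode("utf-16-le", errors="ignore")
    pattern = r"[^\x00-\x08\x0a-\x1f\x7f-\x9f\ufffe\uffff]{%d,}" % min_chars
    return re.findall(pattern, decoded)


def read_word_stream(path: str) -> bytes:
    """Read the raw WordDocument stream from a legacy .doc file."""
    import olefile  # third-party; imported lazily so the helper above has no dependency

    if not olefile.isOleFile(path):
        raise ValueError("%s is not an OLE2 file" % path)
    ole = olefile.OleFileIO(path)
    try:
        return ole.openstream("WordDocument").read()
    finally:
        ole.close()


# Synthetic bytes standing in for a stream: two paragraphs, one of them CJK.
sample = ("Hello, world".encode("utf-16-le") + b"\r\x00"
          + "你好世界".encode("utf-16-le"))
print(utf16_text_runs(sample))
```

On a real file, the two pieces would be combined as `utf16_text_runs(read_word_stream("input.doc"))`.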

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

Automatic deduplication (MD5 hash comparison)
Size filtering (skips tiny icons <5KB by default)
Sequential renaming (img_001.png, img_002.jpg, etc.)
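The three steps above can be sketched with the standard library alone, since a .docx file is a ZIP archive that stores embedded images under word/media/. This is an illustration under assumptions, not the toolkit's actual script: the function name `extract_images`, the interpretation of 5KB as 5 * 1024 bytes, and the choice to number images in archive order are all hypothetical.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique, non-tiny images out of a .docx into out_dir.

    Returns the list of file names written, in archive order.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen_hashes = set()
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = archive.read(name)
            if len(data) < min_bytes:
                continue  # size filter: likely an icon or bullet glyph
            digest = hashlib.md5(data).hexdigest()
            if digest in seen_hashes:
                continue  # byte-identical duplicate of an earlier image
            seen_hashes.add(digest)
            # Keep the original extension so the format is preserved.
            ext = os.path.splitext(name)[1].lower() or ".bin"
            new_name = "img_%03d%s" % (len(written) + 1, ext)
            with open(os.path.join(out_dir, new_name), "wb") as handle:
                handle.write(data)
            written.append(new_name)
    return written
```

Because the comparison is on an MD5 hash of the raw bytes, the same picture saved at a different size or quality produces a different hash and is kept as a separate image.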

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

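Batch resizes and compresses images before API processing. Smaller images cost less to send to vision APIs; the stated saving is 50-70%, though the actual figure depends on the source images and the API's pricing.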
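A minimal sketch of the resize step, assuming `--max-width` caps the width and the height is scaled to keep the aspect ratio (the script's real behavior, resampling filter, and compression settings are not documented here). `scaled_size` and `resize_image` are hypothetical names.

```python
def scaled_size(width, height, max_width=1024):
    """Return (w, h) shrunk to fit max_width, preserving aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale
    ratio = max_width / float(width)
    return max_width, max(1, int(round(height * ratio)))


def resize_image(src_path, dst_path, max_width=1024):
    """Resize one image with Pillow and save it to dst_path."""
    from PIL import Image  # third-party (Pillow), imported lazily

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, max_width)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst_path)


print(scaled_size(2048, 1536))  # a 2048x1536 image fits into 1024x768
```

Halving both dimensions leaves a quarter of the pixels, which is where most of the size reduction comes from.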

Dependencies

Python 3.6+
python-docx — for .docx processing
olefile — for legacy .doc processing
Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

Document analysis: Extract text for AI review/summarization
Migration: Pull content from Word docs into other formats
Image audit: Extract and review all embedded images
Cost optimization: Compress images before sending to vision APIs
Batch processing: Process multiple documents in a pipeline
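For the batch-processing case, one possible sketch is a small driver that picks the right extractor per file extension and runs it. The script names come from the commands above; `base_dir` stands for the resolved {baseDir} placeholder, and the helper names and the `<name>.txt` output naming are assumptions.

```python
import os
import subprocess


def build_commands(base_dir, input_dir, output_dir):
    """Return one extractor command (as an argv list) per Word file."""
    scripts = {".docx": "extract_text.py", ".doc": "extract_doc_text.py"}
    commands = []
    for name in sorted(os.listdir(input_dir)):
        ext = os.path.splitext(name)[1].lower()
        script = scripts.get(ext)
        if script is None:
            continue  # not a Word document
        commands.append([
            "python3",
            os.path.join(base_dir, "scripts", script),
            os.path.join(input_dir, name),
            # Keep the full file name so a.doc and a.docx do not collide.
            os.path.join(output_dir, name + ".txt"),
        ])
    return commands


def run_all(base_dir, input_dir, output_dir):
    """Run every extractor command, stopping at the first failure."""
    os.makedirs(output_dir, exist_ok=True)
    for command in build_commands(base_dir, input_dir, output_dir):
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before anything is executed.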

Notes

Large .doc files (>200MB) may require significant RAM for olefile processing
Image extraction preserves original format (png/jpg/gif/etc.)
Deduplication catches exact duplicates; near-duplicates still pass through
CJK (Chinese/Japanese/Korean) text is fully supported in both extractors

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
