docx-toolkit

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication and filtering for extracted images.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "docx-toolkit" with this command: npx skills add zacjiang/docx-toolkit

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
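The pipe-delimited table output described above can be sketched as follows. This is a minimal illustration using python-docx, not the contents of extract_text.py: the function names `format_row` and `extract_docx` are hypothetical, and unlike the real script this sketch emits all paragraphs first and all tables afterwards rather than preserving document order.

```python
def format_row(cells):
    """Join cell texts into one pipe-delimited line.

    Whitespace inside each cell (including newlines) is collapsed to single
    spaces so that every table row stays on one output line.
    """
    cleaned = [" ".join(str(cell).split()) for cell in cells]
    return " | ".join(cleaned)


def extract_docx(path):
    """Return the non-empty paragraphs, then the table rows, of a .docx file."""
    # Imported lazily so format_row can be used without python-docx installed.
    from docx import Document

    doc = Document(path)
    lines = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            lines.append(format_row(cell.text for cell in row.cells))
    return lines
```

A downstream parser can recover the cells of a row with `line.split(" | ")`, provided no cell text itself contains that separator.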

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
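As a rough sketch of the approach described above, the WordDocument stream can be read with olefile and scanned for runs of UTF-16LE text. The scanning heuristic here is an assumption made for illustration, not necessarily what extract_doc_text.py does: it only recognizes ASCII-range characters and code units whose high byte falls in 0x4E-0x9F (roughly the CJK Unified Ideographs block), and it ignores Word's piece table, so it can miss text or pick up binary noise. The function names are hypothetical.

```python
import re

# A run is 4 or more consecutive UTF-16LE code units that are either printable
# ASCII (low byte, then 0x00) or have a high byte in 0x4E-0x9F (mostly CJK).
_TEXT_RUN = re.compile(rb"(?:[\x20-\x7e\r\n\t]\x00|[\x00-\xff][\x4e-\x9f]){4,}")


def utf16_text_runs(data):
    """Return the decoded text runs found in a bytes buffer."""
    return [m.group().decode("utf-16-le", errors="ignore")
            for m in _TEXT_RUN.finditer(data)]


def read_word_stream(path):
    """Return the raw bytes of the WordDocument stream of a legacy .doc file."""
    # Imported lazily so utf16_text_runs works without olefile installed.
    import olefile

    if not olefile.isOleFile(path):
        raise ValueError("not an OLE2 file: %s" % path)
    ole = olefile.OleFileIO(path)
    try:
        return ole.openstream("WordDocument").read()
    finally:
        ole.close()
```

Typical use would be `"\n".join(utf16_text_runs(read_word_stream("input.doc")))`, written to the output file with UTF-8 encoding.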

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

  • Automatic deduplication (MD5 hash comparison)
  • Size filtering (skips tiny icons <5KB by default)
  • Sequential renaming (img_001.png, img_002.jpg, etc.)
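The three behaviors listed above can be sketched with the standard library alone, since a .docx file is a ZIP archive that usually stores embedded images under word/media/. This is an illustrative sketch rather than the actual extract_images.py: the function name `extract_images`, the `min_bytes` parameter, and the choice to process entries in sorted name order are assumptions.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique, non-tiny images out of a .docx and return the new names."""
    os.makedirs(out_dir, exist_ok=True)
    seen_hashes = set()
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = zf.read(name)
            # Size filter: skip small icons and decorations.
            if len(data) < min_bytes:
                continue
            # Deduplication: identical bytes give an identical MD5 digest.
            digest = hashlib.md5(data).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            # Sequential renaming that keeps the original extension.
            ext = os.path.splitext(name)[1].lower()
            out_name = "img_%03d%s" % (len(written) + 1, ext)
            with open(os.path.join(out_dir, out_name), "wb") as fh:
                fh.write(data)
            written.append(out_name)
    return written
```

MD5 is used here only as a fast content fingerprint, not for security, which is why byte-identical copies are caught while re-encoded or resized copies are not.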

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

Resizes and compresses a directory of images in one pass before they are sent to an API. The listing claims this saves 50-70% on vision API costs.
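A width-capped resize along these lines might look like the sketch below, which uses Pillow. It is an assumption about how a `--max-width` option could behave, not the code of resize_images.py; `target_size` and `resize_image` are hypothetical names, and images already narrower than the cap are left at their original size.

```python
def target_size(width, height, max_width=1024):
    """Return (width, height) scaled down to max_width, keeping aspect ratio."""
    if width <= max_width:
        return (width, height)
    new_height = max(1, round(height * max_width / width))
    return (max_width, new_height)


def resize_image(src_path, dst_path, max_width=1024):
    """Resize one image file so that it is at most max_width pixels wide."""
    # Imported lazily so target_size can be used without Pillow installed.
    from PIL import Image

    with Image.open(src_path) as img:
        size = target_size(img.width, img.height, max_width)
        resized = img if size == img.size else img.resize(size)
        # Pillow infers the output format from the dst_path extension.
        resized.save(dst_path)
```

Halving the width of an image at a fixed aspect ratio cuts its pixel count to a quarter, which is the lever behind any cost reduction for APIs that price by image size.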

Dependencies

  • Python 3.6+
  • python-docx — for .docx processing
  • olefile — for legacy .doc processing
  • Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow
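Note that the pip package names differ from the module names for two of these dependencies: python-docx is imported as `docx` and Pillow as `PIL`. A small pre-flight check, sketched here with the hypothetical helper `missing_packages`, can report which packages still need installing before any script runs.

```python
import importlib.util

# pip package name -> importable module name
REQUIRED = {"python-docx": "docx", "olefile": "olefile", "Pillow": "PIL"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip3 install " + " ".join(missing))
    else:
        print("All dependencies are available.")
```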

Use Cases

  • Document analysis: Extract text for AI review/summarization
  • Migration: Pull content from Word docs into other formats
  • Image audit: Extract and review all embedded images
  • Cost optimization: Compress images before sending to vision APIs
  • Batch processing: Process multiple documents in a pipeline
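For the batch-processing case, one possible pipeline driver picks the extraction script by file extension and runs it once per document. This sketch assumes the script names and argument order shown under Capabilities; `build_command` and `run_batch` are hypothetical helpers, and `base_dir` stands in for the {baseDir} placeholder.

```python
import os
import subprocess

# File extension -> extraction script, as listed under Capabilities.
SCRIPTS = {".docx": "extract_text.py", ".doc": "extract_doc_text.py"}


def build_command(base_dir, doc_path, out_dir):
    """Return the command line for one document, or None if unsupported."""
    stem, ext = os.path.splitext(os.path.basename(doc_path))
    script = SCRIPTS.get(ext.lower())
    if script is None:
        return None
    return ["python3", os.path.join(base_dir, "scripts", script),
            doc_path, os.path.join(out_dir, stem + ".txt")]


def run_batch(base_dir, doc_paths, out_dir):
    """Run the matching script for each document; return the paths that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for path in doc_paths:
        cmd = build_command(base_dir, path, out_dir)
        if cmd is None:
            continue  # not a Word document, skip it
        if subprocess.run(cmd).returncode != 0:
            failed.append(path)
    return failed
```

Running each document in its own process keeps one corrupt file from stopping the rest of the batch, at the cost of some start-up overhead per document.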

Notes

  • Large .doc files (>200MB) may require significant RAM for olefile processing
  • Image extraction preserves original format (png/jpg/gif/etc.)
  • Deduplication catches exact duplicates; near-duplicates still pass through
  • CJK (Chinese/Japanese/Korean) text is fully supported in both extractors

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

DOCX Formatter

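Generates Word documents that conform to Chinese official-document formatting standards, with support for headings, body text styles, automatic layout, and Chinese quotation mark pairing.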

General

Docx Cn

Word Document Processing. Create, read, edit Word documents. Supports .docx format, formatting, tables...

General

Word Reader

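Reads Word documents (.docx and .doc formats) and extracts their text content. Supports document parsing, table extraction, image processing, and more. Use when the user needs to analyze Word document content, extract text, or batch-process documents.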
