DOCX Toolkit
A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).
Capabilities
1. Text + Table Extraction (.docx)
python3 {baseDir}/scripts/extract_text.py input.docx output.txt
Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
2. Text Extraction (Legacy .doc)
python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt
Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
3. Image Extraction (.docx)
python3 {baseDir}/scripts/extract_images.py input.docx output_dir/
Extracts all embedded images with:
- Automatic deduplication (MD5 hash comparison)
- Size filtering (skips tiny icons <5KB by default)
- Sequential renaming (img_001.png, img_002.jpg, etc.)
4. Image Compression
python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]
Batch resize/compress images for API processing (saves 50-70% on vision API costs).
Dependencies
- Python 3.6+
python-docx— for .docx processingolefile— for legacy .doc processingPillow— for image resizing (optional, only needed for resize script)
Install:
pip3 install python-docx olefile Pillow
Use Cases
- Document analysis: Extract text for AI review/summarization
- Migration: Pull content from Word docs into other formats
- Image audit: Extract and review all embedded images
- Cost optimization: Compress images before sending to vision APIs
- Batch processing: Process multiple documents in a pipeline
Notes
- Large .doc files (>200MB) may require significant RAM for olefile processing
- Image extraction preserves original format (png/jpg/gif/etc.)
- Deduplication catches exact duplicates; near-duplicates still pass through
- CJK (Chinese/Japanese/Korean) text is fully supported in both extractors