🐧 HWP Reader - Read & Analyze Korean HWP/HWPX Documents
Author: 무펭이 🐧 | v1.0.0
Description
Read and extract text content from Korean HWP (한글) and HWPX files. Supports both the legacy HWP format (via pyhwp) and the modern HWPX format (ZIP-based XML).
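Because the two formats need different extraction paths, it can help to check the file contents rather than trust the extension. The sketch below is one way to do that, assuming the legacy file is an HWP 5.x document stored as an OLE2 compound file; the function name `detect_hwp_format` is hypothetical and not part of pyhwp.

```python
import zipfile

# Signature of an OLE2 compound file, the container assumed here for HWP 5.x.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def detect_hwp_format(path):
    """Return 'hwpx', 'hwp', or 'unknown' based on the file contents."""
    # HWPX is a ZIP archive, so the standard library can recognize it.
    if zipfile.is_zipfile(path):
        return "hwpx"
    # Otherwise compare the first bytes against the OLE2 signature.
    with open(path, "rb") as f:
        head = f.read(len(OLE2_MAGIC))
    return "hwp" if head == OLE2_MAGIC else "unknown"
```

Note that this reports any ZIP archive as `hwpx`; a stricter check could also look for `Contents/section` entries inside the archive.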
When to Use
- User asks to read/analyze a .hwp or .hwpx file
- Government support application forms (정부지원사업 신청서)
- Any Korean document in Hangul Word Processor format
How It Works
HWP Files (Legacy Format)
python3 -c "
from hwp5.hwp5txt import main
import sys
sys.argv = ['hwp5txt', 'FILE_PATH']
main()
"
HWPX Files (Modern XML Format)
python3 -c "
import zipfile
import xml.etree.ElementTree as ET

with zipfile.ZipFile('FILE_PATH') as z:
    names = z.namelist()
    # Quick preview text (may be truncated)
    if 'Preview/PrvText.txt' in names:
        print(z.read('Preview/PrvText.txt').decode('utf-8'))
    # Full content from the section XMLs
    for name in sorted(names):
        if name.startswith('Contents/section') and name.endswith('.xml'):
            root = ET.fromstring(z.read(name))
            # Print the non-empty text of every element in the section
            for elem in root.iter():
                if elem.text and elem.text.strip():
                    print(elem.text.strip())
"
Capabilities
| Feature | HWP | HWPX |
|---|---|---|
| Text extraction | ✅ pyhwp | ✅ ZIP+XML |
| Table detection | ⚠️ <표> markers | ✅ XML tags |
| Image extraction | ❌ | ✅ from BinData/ |
| Metadata | ✅ via hwp5 | ✅ from version.xml |
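The HWPX column of the table can be illustrated with the standard library alone. The following is a minimal sketch, assuming embedded images sit under `BinData/` and that `version.xml` lives at the archive root, as the table states; the helper names `extract_bindata` and `version_attributes` are hypothetical, and no particular attribute names in `version.xml` are assumed.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_bindata(path, out_dir):
    """Extract every file under BinData/ and return the written paths."""
    written = []
    with zipfile.ZipFile(path) as z:
        for info in z.infolist():
            if info.filename.startswith("BinData/") and not info.is_dir():
                # ZipFile.extract returns the path of the file it created.
                written.append(Path(z.extract(info, out_dir)))
    return written


def version_attributes(path):
    """Return the attributes of the version.xml root element, or {} if absent."""
    with zipfile.ZipFile(path) as z:
        if "version.xml" not in z.namelist():
            return {}
        return dict(ET.fromstring(z.read("version.xml")).attrib)
```

Returning the raw attribute dictionary keeps the sketch independent of whichever fields a given HWPX writer chooses to emit.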
Dependencies
- pyhwp (pip install pyhwp), installed at /Users/mupeng/Library/Python/3.9/lib/python/site-packages/hwp5/
- Python 3.9+ with the standard-library modules zipfile and xml.etree.ElementTree
Limitations
- HWP text extraction loses table structure (shows a <표> placeholder)
- HWPX Preview/PrvText.txt is truncated to ~1KB; use section XMLs for full content
- Complex formatting (colors, fonts, page layout) not preserved in text mode
- Encrypted/password-protected HWP files not supported
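Since the preview text is truncated, full HWPX content has to come from the section XMLs. The sketch below wraps that approach in a reusable function and orders sections by their numeric index, so that `section10.xml` does not sort ahead of `section2.xml`. It assumes section files are named `Contents/section<N>.xml`; the function names are hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Assumed naming scheme for section files inside the archive.
SECTION_RE = re.compile(r"Contents/section(\d+)\.xml$")


def section_names(names):
    """Return section XML names ordered by their numeric index."""
    matched = [(int(m.group(1)), n) for n in names if (m := SECTION_RE.match(n))]
    return [n for _, n in sorted(matched)]


def hwpx_text(path):
    """Collect the non-empty text of every element in every section XML."""
    lines = []
    with zipfile.ZipFile(path) as z:
        for name in section_names(z.namelist()):
            root = ET.fromstring(z.read(name))
            for elem in root.iter():
                if elem.text and elem.text.strip():
                    lines.append(elem.text.strip())
    return "\n".join(lines)
```

Like the one-liner it is based on, this ignores XML namespaces and formatting, so table cells and paragraphs all come out as plain lines.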
Usage Examples
Read a government application form
"Read this HWP file: /path/to/신청서.hwp"
→ Extract text → Analyze structure → Summarize sections
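For a legacy .hwp path like the one in this request, the extract step can be driven from Python by running the same `hwp5.hwp5txt` entry point in a child interpreter and capturing its output. This is a sketch under assumptions: that pyhwp is importable by the interpreter in `sys.executable`, and that the tool writes UTF-8 text to standard output. The names `hwp_to_text` and `build_hwp5txt_command` and the injectable `run` parameter are hypothetical.

```python
import subprocess
import sys

# Same call as the shell one-liner, with the path passed as an argument
# instead of being pasted into the code string.
_SNIPPET = (
    "import sys\n"
    "from hwp5.hwp5txt import main\n"
    "sys.argv = ['hwp5txt', sys.argv[1]]\n"
    "main()\n"
)


def build_hwp5txt_command(path):
    """Build the argv list for a child interpreter that runs hwp5txt."""
    return [sys.executable, "-c", _SNIPPET, str(path)]


def hwp_to_text(path, run=subprocess.run):
    """Run hwp5txt on `path` and return its output as a string."""
    result = run(build_hwp5txt_command(path), capture_output=True, check=True)
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the path as a separate argument avoids quoting problems with file names that contain spaces or quotes.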
Compare two versions
"Analyze the differences between v1.hwp and v2.hwp"
→ Extract both → Diff content → Report changes
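Once both versions have been extracted to plain text, the diff and report steps can be sketched with the standard `difflib` module. The function names and the default labels below are illustrative only.

```python
import difflib


def diff_text(old_text, new_text, old_name="v1.hwp", new_name="v2.hwp"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="",
    ))


def summarize_changes(diff_lines):
    """Count added and removed lines in a unified diff."""
    # The first two lines are the '---' and '+++' file headers.
    body = diff_lines[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return {"added": added, "removed": removed}
```

A line-based diff suits extracted text because each paragraph or table cell typically ends up on its own line; reflowed paragraphs will still show up as a removal plus an addition.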
Fill in a template
"Fill in this form with our business details"
→ Read template → Identify blanks → Generate content suggestions
🐧 무펭이 - Making Korean documents accessible to AI agents