Tesseract OCR Skill
Extract text from images by calling the Tesseract OCR engine directly from the command line.
Features
- Extract text from image files using native tesseract CLI
- Support multi-language recognition (Chinese, English, etc.)
- No Python dependencies required
- Simple and fast
Dependencies
Install the Tesseract OCR system package together with the language packs you need:
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim
# macOS:
brew install tesseract tesseract-lang
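After installing, it can help to confirm that the language data you expect is actually present. The sketch below is one way to do that, assuming a POSIX shell; `missing_langs` is a hypothetical helper name, not part of Tesseract. It reads the output of `tesseract --list-langs` and prints every requested code that is not listed.

```shell
# Hypothetical helper: prints each requested language code that does not
# appear as a whole line in the installed-language list read from stdin.
missing_langs() {
    installed=$(cat)
    for lang in "$@"; do
        printf '%s\n' "$installed" | grep -qxF "$lang" || printf '%s\n' "$lang"
    done
}

if command -v tesseract >/dev/null 2>&1; then
    # 2>&1 because some older releases print the list on stderr.
    tesseract --list-langs 2>&1 | missing_langs chi_sim eng
else
    echo "tesseract is not installed" >&2
fi
```

No output means both `chi_sim` and `eng` are available; any code printed still needs its language pack.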
Usage
Basic Usage
# Use default language (English)
tesseract /path/to/image.png stdout
# Specify language (Chinese + English)
tesseract /path/to/image.png stdout -l chi_sim+eng
# Save to file (tesseract appends .txt to the output base, so this writes output.txt)
tesseract /path/to/image.png output -l chi_sim+eng
# More than two languages (join the codes with +)
tesseract /path/to/image.png stdout -l chi_sim+eng+jpn
Common Language Codes
| Language | Code |
|---|---|
| Simplified Chinese | chi_sim |
| Traditional Chinese | chi_tra |
| English | eng |
| Japanese | jpn |
| Korean | kor |
| Simplified Chinese + English (codes joined with +) | chi_sim+eng |
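
The value passed to `-l` is just the individual codes joined with `+`. When the set of languages comes from a script rather than being typed by hand, a small helper can build that string. This is a minimal sketch; `join_langs` is a hypothetical name introduced here for illustration.

```shell
# Hypothetical helper: join its arguments with "+" to form a -l value.
# The subshell keeps the IFS change from leaking into the caller.
join_langs() {
    ( IFS=+; printf '%s\n' "$*" )
}

join_langs chi_sim eng jpn
```

The result can be used directly, for example `tesseract image.png stdout -l "$(join_langs chi_sim eng)"`.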
Quick Examples
# OCR with Chinese support
tesseract image.jpg stdout -l chi_sim
# OCR with mixed Chinese and English
tesseract image.png stdout -l chi_sim+eng
# Save to file instead of stdout
tesseract document.png result -l chi_sim+eng
# Creates result.txt
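
The save-to-file form extends naturally to a whole folder of images. The following is a rough sketch rather than a definitive script: `output_base` and `ocr_dir` are hypothetical helper names, and it assumes every image file name has an extension and that only `.png` and `.jpg` files matter.

```shell
# Hypothetical helper: strip the final extension so tesseract can append ".txt".
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: OCR every PNG and JPG directly inside a directory,
# writing each result next to its image. Language defaults to eng.
ocr_dir() {
    dir=$1
    langs=${2:-eng}
    for img in "$dir"/*.png "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # skip glob patterns that matched nothing
        tesseract "$img" "$(output_base "$img")" -l "$langs"
    done
}
```

Calling `ocr_dir ./scans chi_sim+eng` would then write `./scans/page1.txt` for `./scans/page1.png`, and so on.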
Notes
- OCR accuracy depends on image quality; use clear images for best results
- Complex layouts (tables, multi-column) may require post-processing
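- Simplified Chinese recognition requires the chi_sim language data (the tesseract-ocr-chi-sim package on Ubuntu/Debian; included in tesseract-lang on Homebrew)
- Language packs are installed separately from the engine; run tesseract --list-langs to see which ones are available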
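
For the layout issue noted above, one thing worth trying before post-processing is Tesseract's page segmentation mode. The sketch below assumes Tesseract 4 or later, where the flag is spelled `--psm` and accepts modes 0 to 13; `ocr_with_psm` is a hypothetical wrapper name, not part of the CLI.

```shell
# Hypothetical wrapper: run tesseract with an explicit page segmentation mode.
# Assumes Tesseract 4+ (flag --psm, modes 0-13). Language defaults to eng.
ocr_with_psm() {
    img=$1
    psm=$2
    langs=${3:-eng}
    case $psm in
        [0-9]|1[0-3]) ;;   # accept 0-9 and 10-13
        *) echo "psm must be an integer from 0 to 13" >&2; return 2 ;;
    esac
    tesseract "$img" stdout -l "$langs" --psm "$psm"
}
```

For example, `ocr_with_psm table.png 6 chi_sim+eng` treats the page as a single uniform block of text, while mode 4 assumes a single column of text of variable sizes. Which mode works best depends on the image, so treat these as starting points.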