Document OCR Processing

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "document-ocr-processing" with this command: npx skills add findinfinitelabs/chuuk/findinfinitelabs-chuuk-document-ocr-processing

Overview

Specialized OCR processing for documents containing Chuukese text, with enhanced accuracy for accented characters, traditional formatting patterns, and multilingual content. Designed to handle the unique challenges of digitizing historical and contemporary Chuukese documents.

Capabilities

  • Chuukese-Aware OCR: Enhanced recognition of accented characters (á, é, í, ó, ú, ā, ē, ī, ō, ū)

  • Traditional Format Recognition: Handle traditional document layouts and formatting

  • Multilingual Processing: Process documents with both Chuukese and English text

  • Quality Enhancement: Post-processing to improve OCR accuracy

  • Batch Processing: Efficiently process multiple documents

  • Format Preservation: Maintain original document structure and layout

Core Components

  1. OCR Engine Setup

import cv2
import numpy as np
import pytesseract
from PIL import Image


class ChuukeseOCRProcessor:
    def __init__(self):
        # Tesseract configurations for different document types.
        # Note: pytesseract tokenizes the config string shell-style, so the
        # quote and space characters at the end of the whitelist may need
        # additional escaping before this config is passed to Tesseract.
        self.tesseract_config = {
            'chuukese_optimized': (
                '--oem 3 --psm 6 -c tessedit_char_whitelist='
                'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
                'áéíóúāēīōū0123456789.,!?;:()-"\' '
            ),
            'multilingual': '--oem 3 --psm 6',
            'preserve_structure': '--oem 3 --psm 1',
        }

        # Chuukese character mappings for OCR corrections
        self.ocr_corrections = {
            # Common OCR mistakes for accented characters
            'a´': 'á', 'a`': 'à', 'a¯': 'ā',
            'e´': 'é', 'e`': 'è', 'e¯': 'ē',
            'i´': 'í', 'i`': 'ì', 'i¯': 'ī',
            'o´': 'ó', 'o`': 'ò', 'o¯': 'ō',
            'u´': 'ú', 'u`': 'ù', 'u¯': 'ū',

            # Common character confusions. Apply these only in context
            # (for example, inside alphabetic words); a blanket replacement
            # would also rewrite legitimate digits and letter pairs.
            '0': 'o', '1': 'l', '5': 's',
            'rn': 'm', 'cl': 'd', 'ck': 'ch'
        }

    def preprocess_image(self, image_path):
        """Preprocess image for better OCR accuracy"""
        # Load image (cv2.imread returns None if the file cannot be read)
        image = cv2.imread(image_path)
        if image is None:
            raise FileNotFoundError(f"Could not read image: {image_path}")

        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Noise removal
        denoised = cv2.medianBlur(gray, 3)

        # Contrast enhancement
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        enhanced = clahe.apply(denoised)

        # Binarization (Otsu's method picks the threshold automatically)
        _, binary = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        return binary
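The accent entries in ocr_corrections all follow one pattern: a vowel followed by a detached accent mark. As an alternative to listing every pair, the sketch below maps each detached mark to its Unicode combining form and lets NFC normalization merge it into the preceding vowel. The function name compose_detached_accents and the choice of marks are assumptions for illustration, not part of the skill's code.

```python
import unicodedata

# Spacing accent marks that OCR may emit as separate characters, mapped to
# the combining marks that NFC normalization can merge into a base letter.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent -> combining acute
    "`": "\u0300",       # grave accent -> combining grave
    "\u00af": "\u0304",  # macron       -> combining macron
}

VOWELS = set("aeiouAEIOU")


def compose_detached_accents(text):
    """Merge a vowel followed by a detached accent into one character.

    Hypothetical helper: marks that do not follow a vowel are left as-is.
    """
    out = []
    for ch in text:
        if ch in SPACING_TO_COMBINING and out and out[-1] in VOWELS:
            out.append(SPACING_TO_COMBINING[ch])
        else:
            out.append(ch)
    # NFC composes base letter + combining mark into a single code point
    return unicodedata.normalize("NFC", "".join(out))
```

Because normalization runs over the whole string, any already-decomposed characters elsewhere in the text are composed as well, which is usually the desired result for downstream matching.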

  2. Post-Processing for Chuukese Text

import re


class ChuukeseOCRPostProcessor:
    def __init__(self, dictionary_path=None):
        self.dictionary = {}
        if dictionary_path:
            # load_chuukese_dictionary is not shown in this excerpt
            self.load_chuukese_dictionary(dictionary_path)

        # Common OCR error patterns for Chuukese
        self.error_patterns = {
            # Accent corrections: a vowel followed by an apostrophe,
            # backtick, or detached acute mark becomes the acute-accented vowel
            r"a['`´]": 'á',
            r"e['`´]": 'é',
            r"i['`´]": 'í',
            r"o['`´]": 'ó',
            r"u['`´]": 'ú',

            # Common character substitutions
            r'\b0(?=[aeiou])': 'o',   # 0 at start of word, before a vowel -> o
            r'(?<=[aeiou])0\b': 'o',  # 0 at end of word, after a vowel -> o
            r'\brn(?=[aeiou])': 'm',  # rn at start of word, before a vowel -> m
        }

    def correct_ocr_errors(self, text):
        """Apply OCR error corrections specific to Chuukese"""
        corrected = text

        # Apply pattern-based corrections in insertion order
        for pattern, replacement in self.error_patterns.items():
            corrected = re.sub(pattern, replacement, corrected)

        return corrected
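The post-processor accepts an optional dictionary, but the listing does not show how it is used. One plausible use, sketched here as an assumption, is to snap near-miss tokens to known words with the standard library's difflib. The function snap_to_dictionary, the 0.8 cutoff, and the sample vocabulary in the usage note are all illustrative, not taken from the skill.

```python
import difflib
import re


def snap_to_dictionary(text, known_words, cutoff=0.8):
    """Replace tokens that nearly match a known word; leave others unchanged.

    Hypothetical helper: known_words is any iterable of correctly spelled words.
    """
    vocabulary = sorted({w.lower() for w in known_words})
    lookup = set(vocabulary)

    def fix(match):
        token = match.group(0)
        if token.lower() in lookup:
            return token  # already a known word
        candidates = difflib.get_close_matches(
            token.lower(), vocabulary, n=1, cutoff=cutoff
        )
        if not candidates:
            return token  # nothing close enough, keep the OCR output
        best = candidates[0]
        # Preserve a leading capital from the original token
        return best.capitalize() if token[:1].isupper() else best

    # \w+ covers letters (including accented ones) and digits, so tokens
    # containing digit confusions such as 0 for o are considered too
    return re.sub(r"\w+", fix, text)
```

For example, with an illustrative vocabulary of ["kinisou"], the token "kinis0u" is close enough to be replaced, while an unrelated token is left alone. A high cutoff keeps the correction conservative, which matters when the dictionary is small.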

Usage Examples

Process Single Document

# Initialize processor (BatchOCRProcessor is not defined in this listing)
processor = BatchOCRProcessor("output/ocr_results")

# Process single document
result = processor.process_document("scanned_chuukese_dictionary.jpg")

# Access extracted text
extracted_text = result['extracted_text']
dictionary_entries = result['document_structure']['dictionary_entries']

Batch Process Directory

# Process all images in a directory
batch_results = processor.process_batch(
    "scanned_documents/",
    file_patterns=['.jpg', '.png']
)

print(f"Processed {batch_results['successfully_processed']} documents")
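Since the batch processor itself is not shown, the following is a minimal sketch of what a directory walk with a per-file callback could look like. The names process_directory, process_one, and suffixes are hypothetical, and the returned keys other than successfully_processed are assumptions; this is not the BatchOCRProcessor implementation.

```python
from pathlib import Path


def process_directory(directory, process_one, suffixes=(".jpg", ".png")):
    """Run process_one on each matching file and collect a summary.

    process_one is any callable that takes a Path and returns a result.
    """
    wanted = {s.lower() for s in suffixes}
    results, failures = {}, {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        try:
            results[path.name] = process_one(path)
        except Exception as exc:
            # Record the error and continue so one bad scan
            # does not stop the rest of the batch
            failures[path.name] = str(exc)
    return {
        "successfully_processed": len(results),
        "failed": len(failures),
        "results": results,
        "failures": failures,
    }
```

Collecting failures separately, rather than raising on the first error, makes it easier to re-run only the documents that need attention.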

Best Practices

Image Preprocessing

  • Quality assessment: Check image quality before processing

  • Resolution optimization: Scan or resample to at least 300 DPI before OCR

  • Noise reduction: Apply appropriate filtering for cleaner text

  • Orientation correction: Detect and correct page rotation
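The resolution guideline above can be checked with simple arithmetic when the physical page width is known. The sketch below estimates the effective DPI of a scan and the resize factor needed to reach a target; the function name upscale_factor and its parameters are assumptions for illustration.

```python
def upscale_factor(pixel_width, physical_width_inches, target_dpi=300):
    """Return the resize factor needed to reach target_dpi.

    Returns 1.0 when the scan already meets or exceeds the target.
    """
    if pixel_width <= 0 or physical_width_inches <= 0:
        raise ValueError("width values must be positive")
    effective_dpi = pixel_width / physical_width_inches
    return max(1.0, target_dpi / effective_dpi)
```

A letter-width page (8.5 inches) scanned at 1275 pixels wide is 150 DPI, so the factor is 2.0. Upscaling does not recover detail that was never captured, so rescanning at a higher resolution is preferable when the original is available.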

OCR Accuracy

  • Language-specific tuning: Optimize for Chuukese character set

  • Confidence thresholds: Filter low-confidence results

  • Multiple engine comparison: Run more than one OCR engine and compare their output

  • Human validation: Sample-based quality checking
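Confidence filtering can be sketched against the word-level data that pytesseract.image_to_data returns when called with output_type=pytesseract.Output.DICT: a dictionary with parallel "text" and "conf" lists, where non-word rows carry a confidence of -1. The function name filter_by_confidence and the threshold of 60 are assumptions; a suitable threshold depends on the scans.

```python
def filter_by_confidence(ocr_data, min_conf=60.0):
    """Keep words whose confidence is at least min_conf.

    ocr_data is a dict with parallel "text" and "conf" lists, in the shape
    produced by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    kept = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        word = text.strip()
        # Confidence may arrive as int, float, or string depending on
        # the pytesseract version, so convert before comparing
        if word and float(conf) >= min_conf:
            kept.append(word)
    return kept
```

Words that fall below the threshold are good candidates for the sample-based human validation step rather than silent removal.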

Dependencies

  • pytesseract: Python wrapper for the Tesseract OCR engine (the Tesseract binary must be installed separately)

  • opencv-python: Image preprocessing

  • Pillow: Image handling and manipulation

  • numpy: Numerical operations for image processing
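Before running the pipeline, it can help to confirm that these packages and the Tesseract binary are available. The following is a small sketch using only the standard library; the function name check_environment and the layout of its result are assumptions.

```python
import importlib.util
import shutil

# Package name on PyPI -> module name used in import statements
REQUIRED_MODULES = {
    "pytesseract": "pytesseract",
    "opencv-python": "cv2",
    "Pillow": "PIL",
    "numpy": "numpy",
}


def check_environment():
    """Report which dependencies can be found, without importing them."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in REQUIRED_MODULES.items()
    }
    # pytesseract calls the tesseract executable, so it must be on PATH
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status
```

Printing the returned dictionary gives a quick True/False view per dependency, which is easier to act on than an ImportError partway through a batch.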

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
