invoice-extractor

Extract structured financial data from PDF invoices, receipts, and image files (JPG/PNG/WEBP/BMP) using any OpenAI-compatible Vision Language Model. Supports async concurrent batch processing. Outputs extraction results and token usage reports to a results/ directory.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "invoice-extractor" with this command: npx skills add ontos-ai/invoice-extract-skill/ontos-ai-invoice-extract-skill-invoice-extractor

Invoice Extractor Skill

What It Does

Automatically extracts the following fields from PDF invoices/receipts and image files (JPG/PNG/WEBP/BMP) into structured JSON/CSV/Excel:

  • Document: type, number, date, due date
  • Vendor: name, tax ID
  • Customer: name
  • Financials: currency, subtotal (excl. VAT), VAT amount, total (incl. VAT)
  • Line items: description, quantity, unit price, VAT rate, line total
  • Status: payment status, notes
  • Warnings: row-level flags for data inconsistencies (e.g. amount mismatches)
  • Usage report: per-file input/output tokens and API call duration

Prerequisites

  1. An API key for any OpenAI-compatible VLM provider (Qwen, DeepSeek, OpenAI, etc.)

  2. Install dependencies:

pip install -r requirements.txt

Configuration

Create a .env file in the project root:

API_KEY=your-api-key-here
BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MODEL=qwen-vl-max-latest

Swap providers by changing these three values:

Provider         MODEL                BASE_URL
Qwen (default)   qwen-vl-max-latest   https://dashscope.aliyuncs.com/compatible-mode/v1
DeepSeek         deepseek-chat        https://api.deepseek.com/v1
OpenAI           gpt-4o               https://api.openai.com/v1

All settings can also be overridden via CLI flags (--api-key, --base-url, --model).
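
The override order described here (CLI flag first, then environment variable) can be sketched as follows. This is a minimal illustration under stated assumptions, not the skill's actual code: `resolve_setting`, `build_parser`, and `resolve_config` are hypothetical helpers, the fallback defaults are copied from the Qwen row of the table, and the sketch assumes the values from .env have already been loaded into the process environment.

```python
import argparse
import os

# Fallback values copied from the Qwen (default) row of the provider table.
DEFAULTS = {
    "BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "MODEL": "qwen-vl-max-latest",
}


def resolve_setting(cli_value, env_name, environ=None):
    """Return the CLI value if given, else the environment value, else a default."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    return environ.get(env_name) or DEFAULTS.get(env_name)


def build_parser():
    # Only the three override flags named in the text are declared here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key")
    parser.add_argument("--base-url")
    parser.add_argument("--model")
    return parser


def resolve_config(argv, environ=None):
    """Hypothetical helper: merge CLI flags over environment variables."""
    args = build_parser().parse_args(argv)
    return {
        "api_key": resolve_setting(args.api_key, "API_KEY", environ),
        "base_url": resolve_setting(args.base_url, "BASE_URL", environ),
        "model": resolve_setting(args.model, "MODEL", environ),
    }
```

For example, `resolve_config(["--model", "gpt-4o"], {"API_KEY": "k", "MODEL": "x"})` would take the model from the flag, the key from the environment, and the base URL from the defaults.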

Usage

Basic

# Batch extract all files in a directory (PDF + images, 5 concurrent)
python cli.py -i "/path/to/invoices/"

# Single file (PDF or image)
python cli.py -i "invoice.pdf"
python cli.py -i "receipt.jpg"

Results are automatically saved to results/:

  • extraction_<timestamp>.csv — Extracted invoice data
  • usage_report_<timestamp>.json — Token usage and timing report

Output Format

Choose your preferred format with --format:

python cli.py -i "folder/" --format json
python cli.py -i "folder/" --format xlsx
python cli.py -i "folder/" --format csv   # default

Custom CSV Columns

Output only the columns you need. Omit --columns to get every default column.

python cli.py -i "folder/" --format csv \
    --columns item_seq,source_file,vendor_name,invoice_number,total_incl_vat,warnings

Available columns: item_seq, source_file, document_type, invoice_number, invoice_date, due_date, currency, vendor_name, vendor_tax_id, customer_name, subtotal_excl_vat, total_vat_amount, total_incl_vat, item_description, item_quantity, item_unit_price, item_vat_rate, item_line_total, payment_status, notes, warnings
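
Column selection of this kind can be sketched with the standard library. The snippet below is an assumption about how such filtering might work, not the skill's exporter: `write_selected_columns` is a hypothetical helper and the sample row is made up for illustration. Requested columns that a row lacks are written as empty cells, and fields that were not requested are dropped.

```python
import csv
import io


def write_selected_columns(rows, columns, stream):
    """Write only the requested columns, in the requested order."""
    writer = csv.DictWriter(
        stream,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop fields that were not requested
        restval="",             # leave missing fields blank
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up sample row using column names from the list above.
rows = [
    {
        "item_seq": 1,
        "source_file": "invoice_001.pdf",
        "vendor_name": "Example BV",
        "currency": "EUR",
        "total_incl_vat": "121.00",
    }
]
buffer = io.StringIO()
write_selected_columns(
    rows, ["source_file", "vendor_name", "total_incl_vat", "warnings"], buffer
)
print(buffer.getvalue())
```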

Custom Output Directory

python cli.py -i "folder/" -o my_output/

Using a Different Provider

# DeepSeek
python cli.py -i "folder/" \
    --api-key sk-xxx --base-url https://api.deepseek.com/v1 --model deepseek-chat

# OpenAI
python cli.py -i "folder/" \
    --api-key sk-xxx --base-url https://api.openai.com/v1 --model gpt-4o

Advanced Options

python cli.py -i "folder/" --format xlsx \
    --dpi 300 \
    --max-pages 10 \
    -c 10 \
    -v  # verbose logging

  • -c, --concurrency N: Max concurrent VLM API calls during batch processing (default: 5)
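
A concurrency cap like the one set by -c is commonly implemented with a semaphore. The sketch below shows that general pattern with asyncio; it is not taken from the skill's source. `run_bounded` and `fake_vlm_call` are hypothetical names, and the fake call only sleeps instead of contacting a VLM.

```python
import asyncio


async def run_bounded(items, worker, concurrency=5):
    """Run worker(item) for every item with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_vlm_call(path):
    # Stand-in for a real VLM request: sleeps briefly and echoes the path.
    await asyncio.sleep(0.01)
    return f"extracted:{path}"


results = asyncio.run(
    run_bounded(["a.pdf", "b.jpg", "c.png"], fake_vlm_call, concurrency=2)
)
print(results)
```

Raising the limit increases throughput only until the provider's rate limits are reached, so a value that works for one provider may not suit another.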

Environment Check

python scripts/check_env.py

Verifies: Python version, required packages, API key configuration.

Output

All outputs are saved to the results/ directory (or custom --output-dir):

results/
├── extraction_20260207_222600.csv       # Extracted invoice data
└── usage_report_20260207_222600.json    # Token usage report

Usage Report Format

{
  "summary": {
    "total_files": 3,
    "success_count": 3,
    "failure_count": 0,
    "total_input_tokens": 15234,
    "total_output_tokens": 2456,
    "total_tokens": 17690,
    "total_duration_seconds": 32.5
  },
  "per_file": [
    {
      "source_file": "invoice_001.pdf",
      "input_tokens": 5078,
      "output_tokens": 819,
      "total_tokens": 5897,
      "duration_seconds": 10.8
    }
  ]
}
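
A report in this shape is straightforward to post-process. The following sketch assumes only the keys shown in the sample above; `load_usage_report`, `totals_from_per_file`, and `slowest_file` are hypothetical helpers, not part of the skill.

```python
import json


def load_usage_report(path):
    """Read a usage report JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def totals_from_per_file(report):
    """Recompute token totals from the per-file entries."""
    entries = report["per_file"]
    return {
        "total_input_tokens": sum(e["input_tokens"] for e in entries),
        "total_output_tokens": sum(e["output_tokens"] for e in entries),
        "total_tokens": sum(e["total_tokens"] for e in entries),
    }


def slowest_file(report):
    """Return the source file with the longest API call duration."""
    return max(report["per_file"], key=lambda e: e["duration_seconds"])["source_file"]
```

Comparing the recomputed totals with the `summary` block is a quick sanity check when a batch includes failed files.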

Programmatic Integration

from invoice_extractor.extractor import InvoiceExtractor
from invoice_extractor.exporter import to_csv, to_excel, export_usage_report

# Initialize with any OpenAI-compatible API
extractor = InvoiceExtractor(
    api_key="your-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    model="qwen-vl-max-latest",
)

# Extract single file
result = extractor.extract_single("invoice.pdf")
print(result.data.model_dump_json(indent=2))
if result.warnings:
    print(f"Warnings: {result.warnings}")
if result.usage:
    print(f"Tokens: in={result.usage.input_tokens}, out={result.usage.output_tokens}")

# Batch extract → custom CSV + usage report
batch = extractor.extract_batch("/path/to/folder/")
to_csv(batch, "results/output.csv", columns=[
    "item_seq", "source_file", "vendor_name",
    "invoice_number", "total_incl_vat", "warnings",
])
export_usage_report(batch, "results/usage_report.json")

# Export the same batch to Excel
to_excel(batch, "results/output.xlsx")

Warning System

  • Amount mismatch: Flags when subtotal + VAT ≠ total (tolerance: 0.05)
  • Missing fields: Left blank in CSV (no "null" or "None" strings)
  • Extraction failure: Entire row shows the failure reason in the warnings column
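
The amount-mismatch rule can be illustrated with a short check. This is a sketch of the stated rule, not the skill's implementation: `amount_mismatch` is a hypothetical function, and it assumes the 0.05 tolerance is an absolute amount in the invoice currency and that a difference of exactly 0.05 still passes, neither of which the text specifies.

```python
from decimal import Decimal

# Tolerance from the rule above, assumed to be in currency units.
TOLERANCE = Decimal("0.05")


def amount_mismatch(subtotal, vat, total, tolerance=TOLERANCE):
    """Return a warning string when subtotal + VAT differs from total
    by more than the tolerance, otherwise None."""
    # Convert via str so float inputs do not carry binary rounding noise.
    difference = abs(Decimal(str(subtotal)) + Decimal(str(vat)) - Decimal(str(total)))
    if difference > tolerance:
        return f"amount mismatch: subtotal + VAT differs from total by {difference}"
    return None
```

Using Decimal rather than float avoids flagging invoices over rounding artifacts such as 0.1 + 0.2.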

Supported Document Types

Type                              Languages     Examples
Digital invoices (PDF)            NL/DE/FR/EN   Standard electronic invoice PDFs
Scanned receipts (PDF/JPG/PNG)    Any           Photographed or scanned paper receipts
Credit notes                      NL/DE/FR/EN   Creditnota / Gutschrift
Payment confirmations             NL/DE/FR/EN   Betalingsbevestiging
Image files (JPG/PNG/WEBP/BMP)    Any           Direct photo of invoice/receipt

Notes

  • Compatible with any OpenAI-compatible VLM API (default: Qwen qwen-vl-max-latest)
  • Supports PDF and image files (JPG, PNG, WEBP, BMP) as input
  • Async concurrent batch processing (default: 5 concurrent, configurable via -c)
  • European number formats auto-handled (comma = decimal, period = thousands separator)
  • Basic amount consistency validation (subtotal + tax = total)
  • Config priority: CLI flags > environment variables > config.yaml
  • Token usage tracked per file; usage report saved as JSON alongside results
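
The European number handling mentioned above can be sketched as a simple normalization step. This is an assumption about one possible approach, not the skill's code: `parse_european_amount` is a hypothetical helper, and it treats every period as a thousands separator, so it would misread a value written with a decimal period.

```python
from decimal import Decimal


def parse_european_amount(text):
    """Parse amounts such as '1.234,56' (period = thousands, comma = decimal)."""
    cleaned = (
        text.strip()
        .replace(" ", "")   # drop spaces sometimes used for grouping
        .replace(".", "")   # remove thousands separators
        .replace(",", ".")  # convert the decimal comma to a decimal point
    )
    return Decimal(cleaned)


print(parse_european_amount("1.234,56"))
```

A production parser would also need to decide how to treat inputs like "12.50", where the period is probably a decimal point rather than a thousands separator.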

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
