Invoice Extractor Skill

What It Does

Automatically extracts the following fields from PDF invoices/receipts and image files (JPG/PNG/WEBP/BMP) into structured JSON/CSV/Excel:

Document: type, number, date, due date
Vendor: name, tax ID
Customer: name
Financials: currency, subtotal (excl. VAT), VAT amount, total (incl. VAT)
Line items: description, quantity, unit price, VAT rate, line total
Status: payment status, notes
Warnings: row-level flags for data inconsistencies (e.g. amount mismatches)
Usage report: per-file input/output tokens and API call duration

Prerequisites

An API key for any OpenAI-compatible VLM provider (Qwen, DeepSeek, OpenAI, etc.)
Install dependencies:

pip install -r requirements.txt

Configuration

Create a .env file in the project root:

API_KEY=your-api-key-here
BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MODEL=qwen-vl-max-latest

Swap providers by changing these three values:

Provider	MODEL	BASE_URL
Qwen (default)	`qwen-vl-max-latest`	`https://dashscope.aliyuncs.com/compatible-mode/v1`
DeepSeek	`deepseek-chat`	`https://api.deepseek.com/v1`
OpenAI	`gpt-4o`	`https://api.openai.com/v1`

All settings can also be overridden via CLI flags (--api-key, --base-url, --model).
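The override order described above (CLI flag first, then the value from .env or the environment) can be sketched as follows. This is a minimal illustration, not the tool's actual code: `resolve_setting` and `DEFAULTS` are hypothetical names, and the fallback defaults simply mirror the .env example.

```python
import os

# Hypothetical fallback values, copied from the .env example above.
DEFAULTS = {
    "API_KEY": None,
    "BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "MODEL": "qwen-vl-max-latest",
}


def resolve_setting(name, cli_value=None, env=None):
    """Pick a setting: CLI flag first, then environment, then the default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value          # e.g. the value passed via --model
    if env.get(name):
        return env[name]          # e.g. MODEL=... from .env
    return DEFAULTS[name]
```

For example, `resolve_setting("MODEL", cli_value="gpt-4o")` returns `"gpt-4o"` regardless of what MODEL is set to in the environment.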

Usage

Basic

# Batch extract all files in a directory (PDF + images, 5 concurrent)
python cli.py -i "/path/to/invoices/"

# Single file (PDF or image)
python cli.py -i "invoice.pdf"
python cli.py -i "receipt.jpg"

Results are automatically saved to results/:

extraction_<timestamp>.csv — Extracted invoice data
usage_report_<timestamp>.json — Token usage and timing report

Output Format

Choose your preferred format with --format:

python cli.py -i "folder/" --format json
python cli.py -i "folder/" --format xlsx
python cli.py -i "folder/" --format csv   # default

Custom CSV Columns

Output only the columns you need. Omit --columns to get all default columns (the 21 listed below).

python cli.py -i "folder/" --format csv \
    --columns item_seq,source_file,vendor_name,invoice_number,total_incl_vat,warnings

Available columns: item_seq, source_file, document_type, invoice_number, invoice_date, due_date, currency, vendor_name, vendor_tax_id, customer_name, subtotal_excl_vat, total_vat_amount, total_incl_vat, item_description, item_quantity, item_unit_price, item_vat_rate, item_line_total, payment_status, notes, warnings
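To illustrate what column selection amounts to, here is a small sketch using the standard library's `csv.DictWriter`. It is an assumption about the behaviour, not the exporter's real implementation; `write_selected_columns` is a hypothetical helper, and it writes missing values as empty cells, as described in the Warning System section.

```python
import csv
import io


def write_selected_columns(rows, columns):
    """Render rows (dicts) as CSV text containing only the requested columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",   # silently drop keys that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        # None becomes an empty cell rather than the string "None".
        writer.writerow({c: "" if row.get(c) is None else row[c] for c in columns})
    return buffer.getvalue()
```

Columns are written in the order given, so the order of names passed to --columns would determine the column order in this sketch.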

Custom Output Directory

python cli.py -i "folder/" -o my_output/

Using a Different Provider

# DeepSeek
python cli.py -i "folder/" \
    --api-key sk-xxx --base-url https://api.deepseek.com/v1 --model deepseek-chat

# OpenAI
python cli.py -i "folder/" \
    --api-key sk-xxx --base-url https://api.openai.com/v1 --model gpt-4o

Advanced Options

python cli.py -i "folder/" --format xlsx \
    --dpi 300 \
    --max-pages 10 \
    -c 10 \
    -v  # verbose logging

-c, --concurrency N: Max concurrent VLM API calls during batch processing (default: 5)
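A common way to cap concurrent API calls is an `asyncio.Semaphore`. The sketch below shows that pattern under the assumption that each file is handled by an async worker; `process_all` and `worker` are hypothetical names, and the tool's own batch logic may differ.

```python
import asyncio


async def process_all(paths, worker, concurrency=5):
    """Run worker(path) for every path, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(path):
        # Wait for a free slot before starting the (hypothetical) VLM call.
        async with semaphore:
            return await worker(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))
```

Raising the limit speeds up large batches only until the provider's rate limit is reached, so a higher -c value is not always faster.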

Environment Check

python scripts/check_env.py

Verifies: Python version, required packages, API key configuration.

Output

All outputs are saved to the results/ directory by default, or to the directory passed with -o:

results/
├── extraction_20260207_222600.csv       # Extracted invoice data
└── usage_report_20260207_222600.json    # Token usage report
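The file names above appear to follow a `YYYYMMDD_HHMMSS` timestamp. Assuming that pattern, matching paths for a run could be built like this; `output_paths` is a hypothetical helper, not part of the package.

```python
from datetime import datetime
from pathlib import Path


def output_paths(output_dir="results", now=None, fmt="csv"):
    """Return (extraction_path, usage_report_path) sharing one timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    base = Path(output_dir)
    return (
        base / f"extraction_{stamp}.{fmt}",
        base / f"usage_report_{stamp}.json",
    )
```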

Usage Report Format

{
  "summary": {
    "total_files": 3,
    "success_count": 3,
    "failure_count": 0,
    "total_input_tokens": 15234,
    "total_output_tokens": 2456,
    "total_tokens": 17690,
    "total_duration_seconds": 32.5
  },
  "per_file": [
    {
      "source_file": "invoice_001.pdf",
      "input_tokens": 5078,
      "output_tokens": 819,
      "total_tokens": 5897,
      "duration_seconds": 10.8
    }
  ]
}
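Because the report is plain JSON, it is easy to post-process. The sketch below checks that input and output tokens add up to the stated totals, using only the field names shown in the example above; `check_usage_report` is a hypothetical helper.

```python
import json


def check_usage_report(report):
    """Return a list of inconsistencies found in a usage report dict."""
    problems = []
    summary = report["summary"]
    if summary["total_input_tokens"] + summary["total_output_tokens"] != summary["total_tokens"]:
        problems.append("summary: input + output tokens do not equal total_tokens")
    for entry in report["per_file"]:
        if entry["input_tokens"] + entry["output_tokens"] != entry["total_tokens"]:
            problems.append(f"{entry['source_file']}: input + output tokens do not equal total_tokens")
    return problems


def load_usage_report(path):
    """Read a usage report JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

An empty list means the report is internally consistent.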

Programmatic Integration

from invoice_extractor.extractor import InvoiceExtractor
from invoice_extractor.exporter import to_csv, to_excel, export_usage_report

# Initialize with any OpenAI-compatible API
extractor = InvoiceExtractor(
    api_key="your-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    model="qwen-vl-max-latest",
)

# Extract single file
result = extractor.extract_single("invoice.pdf")
print(result.data.model_dump_json(indent=2))
if result.warnings:
    print(f"Warnings: {result.warnings}")
if result.usage:
    print(f"Tokens: in={result.usage.input_tokens}, out={result.usage.output_tokens}")

# Batch extract → custom CSV + usage report
batch = extractor.extract_batch("/path/to/folder/")
to_csv(batch, "results/output.csv", columns=[
    "item_seq", "source_file", "vendor_name",
    "invoice_number", "total_incl_vat", "warnings",
])
export_usage_report(batch, "results/usage_report.json")

# Batch extract → Excel
to_excel(batch, "results/output.xlsx")

Warning System

Amount mismatch: Flags when subtotal + VAT ≠ total (tolerance: 0.05)
Missing fields: Left blank in CSV (no "null" or "None" strings)
Extraction failure: Entire row shows the failure reason in the warnings column
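The amount mismatch rule can be sketched as below. This is an illustration under assumptions: `amount_warning` is a hypothetical function, the warning text is made up, and the comparison treats a difference of exactly 0.05 as acceptable, which the description above does not specify.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.05")  # tolerance stated in the Warning System section


def amount_warning(subtotal, vat, total, tolerance=TOLERANCE):
    """Return a warning string if subtotal + VAT differs from total, else None."""
    if subtotal is None or vat is None or total is None:
        return None  # nothing to compare when a field is missing
    # Go through str() so floats like 0.1 do not bring binary rounding noise.
    diff = abs(Decimal(str(subtotal)) + Decimal(str(vat)) - Decimal(str(total)))
    if diff > tolerance:
        return f"Amount mismatch: subtotal + VAT differs from total by {diff}"
    return None
```

`Decimal` is used instead of `float` so that small rounding differences on currency values are compared exactly.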

Supported Document Types

Type	Languages	Examples
Digital invoices (PDF)	NL/DE/FR/EN	Standard electronic invoice PDFs
Scanned receipts (PDF/JPG/PNG)	Any	Photographed or scanned paper receipts
Credit notes	NL/DE/FR/EN	Creditnota / Gutschrift
Payment confirmations	NL/DE/FR/EN	Betalingsbevestiging
Image files (JPG/PNG/WEBP/BMP)	Any	Direct photo of invoice/receipt

Notes

Compatible with any OpenAI-compatible VLM API (default: Qwen qwen-vl-max-latest)
Supports PDF and image files (JPG, PNG, WEBP, BMP) as input
Async concurrent batch processing (default: 5 concurrent, configurable via -c)
European number formats auto-handled (comma = decimal, period = thousands separator)
Basic amount consistency validation (subtotal + VAT = total, within a tolerance of 0.05)
Config priority: CLI flags > environment variables > config.yaml
Token usage tracked per file; usage report saved as JSON alongside results
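As an illustration of the European number handling mentioned in the notes, the following sketch converts a string such as "1.234,56" into a `Decimal`. It assumes the input is already known to use the European convention; `parse_european_amount` is a hypothetical helper and not necessarily how the tool normalizes amounts.

```python
from decimal import Decimal


def parse_european_amount(text):
    """Parse '1.234,56' style amounts (period = thousands, comma = decimal)."""
    cleaned = text.strip().replace("\u00a0", "").replace(" ", "")
    cleaned = cleaned.replace(".", "")   # drop thousands separators
    cleaned = cleaned.replace(",", ".")  # comma becomes the decimal point
    return Decimal(cleaned)
```

This sketch would misread an English-formatted value such as "12.50" as 1250, so a real implementation needs a way to detect which convention a document uses.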