kreuzberg

Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.

Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

  • Extracts text or metadata from documents
  • Performs OCR on scanned documents or images
  • Batch-processes multiple files
  • Configures extraction options (output format, chunking, OCR, language detection)
  • Implements custom plugins (post-processors, validators, OCR backends)

Installation

Python

pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR

Node.js

npm install @kreuzberg/node

Rust

# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake

CLI

# Download from GitHub releases, or:
cargo install kreuzberg-cli

Quick Start

Python (Async)

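import asyncio

from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("document.pdf")
    print(result.content)       # extracted text
    print(result.metadata)      # document metadata
    print(result.tables)        # extracted tables

asyncio.run(main())  # await is only valid inside a coroutine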

Python (Sync)

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)

Node.js (Async)

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);

Node.js (Sync)

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');

Rust (Async)

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}

Rust (Sync) — requires tokio-runtime feature

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

CLI

kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown

Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

Python (snake_case)

from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig, extract_file,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

# Call from inside an async function (or wrap with asyncio.run)
result = await extract_file("document.pdf", config=config)
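
To make the two chunking parameters concrete, here is a naive fixed-window chunker. It is not Kreuzberg's chunking algorithm, which may well split on smarter boundaries; `naive_chunks` is a hypothetical helper that only illustrates how a maximum chunk size and an overlap interact.

```python
def naive_chunks(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars characters, each
    sharing max_overlap characters with the previous window.

    Illustration only: this is NOT how Kreuzberg chunks internally.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = naive_chunks("a" * 2500, max_chars=1000, max_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With `max_chars=1000` and `max_overlap=200`, each window starts 800 characters after the previous one, so a 2500-character text yields three chunks.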

Node.js (camelCase)

import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    ocr: { backend: 'tesseract', language: 'eng' },
    pdfOptions: { passwords: ['secret123'] },
    chunking: { maxChars: 1000, maxOverlap: 200 },
    outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);

Rust (snake_case)

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

Config File (TOML)

output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]

# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
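
When the CLI is driven from a script, quoting the inline JSON by hand is error-prone. The sketch below builds the argument list for the `--config-json` form shown above from a plain dict. `build_extract_command` is a hypothetical helper, not part of Kreuzberg, and it only assembles the command; it does not run it.

```python
import json
import shlex


def build_extract_command(path: str, config: dict) -> list[str]:
    """Return an argv list for `kreuzberg extract` with inline JSON config."""
    # Compact separators keep the JSON short; keys stay snake_case.
    config_json = json.dumps(config, separators=(",", ":"))
    return ["kreuzberg", "extract", path, "--config-json", config_json]


argv = build_extract_command(
    "doc.pdf", {"ocr": {"backend": "tesseract", "language": "deu"}}
)
print(shlex.join(argv))  # shell-safe rendering for logs or copy-paste
```

Passing the list form to `subprocess.run(argv, check=True)` avoids shell quoting entirely; `shlex.join` is only needed when the command has to be shown as a single string.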

Batch Processing

Python

from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async (call from inside an async function)
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")
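
For very large file lists it can help to submit files in fixed-size groups rather than all at once. The following is a minimal sketch under that assumption: `in_groups` and `extract_in_groups` are hypothetical helpers, and `batch_fn` stands for any callable shaped like `batch_extract_files_sync` (a list of paths in, a list of results out). A stand-in function is used so the sketch runs without Kreuzberg installed.

```python
from typing import Callable, Iterator, Sequence


def in_groups(paths: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `paths` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def extract_in_groups(
    paths: Sequence[str],
    batch_fn: Callable[[list[str]], list],
    size: int = 50,
) -> list:
    """Call `batch_fn` once per group and concatenate the results."""
    results = []
    for group in in_groups(paths, size):
        results.extend(batch_fn(group))
    return results


def fake_batch(group: list[str]) -> list[str]:
    # Stand-in for batch_extract_files_sync so the sketch is runnable.
    return [f"extracted:{p}" for p in group]


paths = [f"doc{i}.pdf" for i in range(5)]
print(list(in_groups(paths, 2)))
print(extract_in_groups(paths, fake_batch, size=2))
```

With Kreuzberg installed, the call would look like `extract_in_groups(paths, batch_extract_files_sync)`; the group size of 50 is an arbitrary default, not a library recommendation.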

Node.js

import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);

Rust — requires tokio-runtime feature

use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;

CLI

kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown

OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

Backends

  • Tesseract (default): Built-in native binding. All Tesseract languages supported.
  • EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
  • PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass paddleocr_kwargs={"use_angle_cls": True}.
  • Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.

Language Codes

from kreuzberg import ExtractionConfig, OcrConfig

config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
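
Multi-language strings such as `"eng+deu"` are easy to get wrong when the codes come from user input. A small sketch of a builder follows; `ocr_language` is a hypothetical helper, not a Kreuzberg function, and it does not check that a code is actually installed.

```python
def ocr_language(*codes: str) -> str:
    """Join language codes with '+', dropping blanks and duplicates."""
    seen: list[str] = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)  # keep first-seen order
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


print(ocr_language("eng", "deu"))          # eng+deu
print(ocr_language("eng", " eng ", "fra"))  # eng+fra
```

The result can be passed as `OcrConfig(language=ocr_language("eng", "deu"))`.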

Force OCR

config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable

ExtractionResult Fields

| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | result.content | result.content | result.content | Extracted text (str/String) |
| MIME type | result.mime_type | result.mimeType | result.mime_type | Input document MIME type |
| Metadata | result.metadata | result.metadata | result.metadata | Document metadata (dict/object/HashMap) |
| Tables | result.tables | result.tables | result.tables | Extracted tables with cells + markdown |
| Languages | result.detected_languages | result.detectedLanguages | result.detected_languages | Detected languages (if enabled) |
| Chunks | result.chunks | result.chunks | result.chunks | Text chunks (if chunking enabled) |
| Images | result.images | result.images | result.images | Extracted images (if enabled) |
| Elements | result.elements | result.elements | result.elements | Semantic elements (if element_based format) |
| Pages | result.pages | result.pages | result.pages | Per-page content (if page extraction enabled) |
| Keywords | result.keywords | result.keywords | result.keywords | Extracted keywords (if enabled) |
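
Several of these fields are only populated when the matching feature is enabled, so code that reads them should tolerate missing values. The sketch below assumes that a disabled field is either `None` or empty (an assumption, not confirmed here); `summarize` is a hypothetical helper, and a `SimpleNamespace` stands in for a real result so the example runs without Kreuzberg.

```python
from types import SimpleNamespace

# Fields from the table that depend on an enabled feature.
OPTIONAL_FIELDS = (
    "detected_languages", "chunks", "images", "elements", "pages", "keywords",
)


def summarize(result) -> dict:
    """Return item counts per field, treating None/missing as zero."""
    summary = {
        "mime_type": result.mime_type,
        "chars": len(result.content),
        "tables": len(result.tables or []),
    }
    for name in OPTIONAL_FIELDS:
        value = getattr(result, name, None)
        summary[name] = len(value) if value else 0
    return summary


# Stand-in object with the Python attribute names from the table.
demo = SimpleNamespace(
    content="hello",
    mime_type="application/pdf",
    metadata={},
    tables=[],
    chunks=["he", "llo"],
    detected_languages=None,
)
print(summarize(demo))
```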

Error Handling

Python

from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
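
When looping over many files, one failure usually should not abort the run. Here is a minimal sketch that collects results and failures separately. `extract_many` is a hypothetical helper; the extractor and the exception type are passed in, so the example runs with a stand-in function and no Kreuzberg install.

```python
from typing import Callable, Iterable


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], object],
    error_type: type[BaseException] = Exception,
) -> tuple[dict, dict]:
    """Run `extract` on each path; return (results, failures) keyed by path."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except error_type as exc:
            # Record the error class so callers can tell parse/OCR issues apart.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(path: str) -> str:
    # Stand-in for extract_file_sync so the sketch is runnable.
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"text of {path}"


results, failures = extract_many(["a.pdf", "b.bad"], fake_extract, ValueError)
print(results)   # {'a.pdf': 'text of a.pdf'}
print(failures)  # {'b.bad': 'ValueError: cannot parse b.bad'}
```

With Kreuzberg, the equivalent call would be `extract_many(paths, extract_file_sync, KreuzbergError)`, which catches every library error listed above while letting unrelated exceptions propagate.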

Node.js

import {
    extractFile, KreuzbergError, ParsingError,
    OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
    const result = await extractFile('file.pdf');
} catch (e) {
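    if (e instanceof ParsingError) { /* ... */ }
    else if (e instanceof OcrError) { /* ... */ }
    else if (e instanceof ValidationError) { /* ... */ }
    else if (e instanceof MissingDependencyError) { /* ... */ }
    else if (e instanceof KreuzbergError) { /* ... */ }
    else { throw e; } // not a Kreuzberg error: rethrow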
}

Rust

use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}

Common Pitfalls

  1. Python ChunkingConfig fields: Use max_chars and max_overlap, NOT max_characters or overlap.
  2. Rust extract_file signature: Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.
  3. Rust feature gates: extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.
  4. Rust async context: extract_file is async. Use #[tokio::main] or call from an async context.
  5. CLI --format vs --output-format: --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).
  6. Node.js extractFile signature: extractFile(path, mimeType?, config?) — mimeType is the second arg (pass null to skip).
  7. Python detect_mime_type: The function for detecting from bytes is detect_mime_type(data). For paths use detect_mime_type_from_path(path).
  8. Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., max_chars, max_overlap, pdf_options).
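
Pitfall 8 implies that a config written with snake_case keys has to be renamed before it is handed to the Node.js binding. The sketch below does that renaming mechanically. It assumes the Node.js names are the plain camelCase form of the snake_case ones, as in the examples in this document; `to_camel` and `camelize` are hypothetical helpers, and the Rust struct fields (`max_characters`, `overlap`) do not follow this rule.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case identifier to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize(value):
    """Recursively rename dict keys; lists and scalars pass through."""
    if isinstance(value, dict):
        return {to_camel(key): camelize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize(item) for item in value]
    return value


snake_config = {
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
print(camelize(snake_config))
```

Note that every dict key is renamed, so free-form maps whose keys are data rather than option names would need to be excluded.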

Supported Formats (Summary)

| Category | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .odt |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods |
| Presentations | .pptx, .ppt, .ppsx |
| eBooks | .epub, .fb2 |
| Images | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif, .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm, .svg |
| Markup | .html, .htm, .xhtml, .xml |
| Data | .json, .yaml, .yml, .toml, .csv, .tsv |
| Text | .txt, .md, .markdown, .djot, .rst, .org, .rtf |
| Email | .eml, .msg |
| Archives | .zip, .tar, .tgz, .gz, .7z |
| Academic | .bib, .biblatex, .ris, .nbib, .enw, .csl, .tex, .latex, .typ, .jats, .ipynb, .docbook, .opml, .pod, .mdoc, .troff |

See references/supported-formats.md for the complete format reference with MIME types.
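
Before sending a directory listing to a batch call, it can be useful to separate files with a supported extension from the rest. A minimal sketch follows; `split_supported` is a hypothetical helper, and `SUPPORTED` holds only a small subset of the extensions in the table above, to be extended as needed.

```python
from pathlib import Path

# Subset of the extensions listed in the format table; not exhaustive.
SUPPORTED = {
    ".pdf", ".docx", ".odt", ".xlsx", ".pptx",
    ".html", ".txt", ".md", ".eml", ".zip",
}


def split_supported(paths: list[str]) -> tuple[list[str], list[str]]:
    """Partition paths into (supported, skipped) by lowercased extension."""
    supported, skipped = [], []
    for raw in paths:
        if Path(raw).suffix.lower() in SUPPORTED:
            supported.append(raw)
        else:
            skipped.append(raw)
    return supported, skipped


ok, skipped = split_supported(["a.PDF", "b.docx", "c.exe", "README"])
print(ok)       # ['a.PDF', 'b.docx']
print(skipped)  # ['c.exe', 'README']
```

An extension check is only a cheap pre-filter; it says nothing about whether the file content is actually valid for that format.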

Additional Resources

Detailed reference files for specific topics:

Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg
