langextract — LLM-Powered Structured Information Extraction

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.


Install the "langextract" skill with:

npx skills add akillness/oh-my-gods/akillness-oh-my-gods-langextract


Extract structured data from unstructured text with character-level provenance. Every extracted entity traces back to exact character offsets in the source document.

When to use this skill

  • Extracting entities, relationships, or facts from unstructured text

  • Processing clinical notes, legal documents, research papers, or reports

  • Building NLP pipelines that need citation-level traceability (not just extracted values)

  • Long-document extraction (chunking + parallel workers + multi-pass for recall)

  • Replacing fragile regex/rule-based extraction with LLM-driven schema enforcement

  • Generating interactive HTML visualizations of annotated text

  1. Installation

Standard install (Gemini backend — default)

pip install langextract

With OpenAI support

pip install langextract[openai]

Development

pip install -e ".[dev]"

API key setup:

export LANGEXTRACT_API_KEY="your-gemini-or-openai-key"

Gemini keys: https://aistudio.google.com/app/apikey

OpenAI keys: https://platform.openai.com/api-keys
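If the key is missing, extraction only fails once the first request is sent. A minimal pre-flight check can fail fast instead (a sketch; require_api_key is a hypothetical helper, not part of langextract):

```python
import os

def require_api_key(env_var: str = "LANGEXTRACT_API_KEY") -> str:
    """Return the configured API key, or fail fast with a clear message."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before calling lx.extract()."
        )
    return key
```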

  2. Core concepts

Source grounding: every extraction carries (start, end) character offsets into the original text

Controlled generation: Gemini uses schema-constrained decoding; no hallucinated field names

Few-shot examples: the schema is inferred from ExampleData objects; zero fine-tuning needed

Multi-pass extraction: extraction_passes=N runs N independent passes; results are merged

Parallel chunking: max_workers=N processes text chunks concurrently

  3. Basic extraction

import textwrap

import langextract as lx

prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Lady Juliet gazed longingly at the stars...",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

Access results

for extraction in result.extractions:
    print(f"[{extraction.extraction_class}] '{extraction.extraction_text}' "
          f"@ chars {extraction.start}–{extraction.end}")
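Because every extraction is grounded in character offsets, the result can be checked mechanically against the source text. A standalone sketch of that check (the Extraction dataclass below is a stand-in for langextract's own result objects, assuming start/end offsets as above):

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str
    start: int
    end: int

def offsets_are_grounded(source: str, extractions: list[Extraction]) -> bool:
    """True only if every extraction's offsets slice back to its exact text."""
    return all(source[e.start:e.end] == e.extraction_text for e in extractions)

source = "Lady Juliet gazed longingly at the stars..."
hits = [Extraction("character", "Lady Juliet", 0, 11)]
print(offsets_are_grounded(source, hits))  # True
```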

  4. Long-document extraction (URL input, multi-pass, parallel)

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # 3 independent runs, results merged
    max_workers=20,        # parallel chunk processing
    max_char_buffer=1000,  # smaller, focused context windows
)

Romeo & Juliet (147k chars / ~44k tokens) → 4,088 entities extracted
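Those run parameters imply a concrete request count, which is worth estimating before launching a long job. A back-of-envelope sketch (illustrative only: the real chunker respects text boundaries, so actual chunk counts can differ slightly):

```python
import math

def chunk_calls(total_chars: int, max_char_buffer: int, extraction_passes: int) -> int:
    """Rough number of model calls: chunks per pass times number of passes."""
    chunks = math.ceil(total_chars / max_char_buffer)
    return chunks * extraction_passes

# 147k chars at 1,000 chars per chunk is 147 chunks; 3 passes -> 441 calls,
# spread across max_workers=20 concurrent workers.
print(chunk_calls(147_000, 1_000, 3))  # 441
```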

  5. OpenAI backend

import os

import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    api_key=os.environ.get("OPENAI_API_KEY"),
    fence_output=True,
    use_schema_constraints=False,
)
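With fence_output=True the model is asked to wrap its JSON in a Markdown code fence rather than relying on Gemini-style constrained decoding, so the reply has to be unfenced before parsing. A minimal, illustrative stripper (not langextract's internal parser) shows the idea:

```python
import json
import re

# Matches an optional ```json ... ``` wrapper around the whole reply.
_FENCE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)

def parse_fenced_json(reply: str):
    """Strip a surrounding code fence, if any, then parse the JSON payload."""
    match = _FENCE.match(reply.strip())
    payload = match.group(1) if match else reply
    return json.loads(payload)
```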

  6. Local LLMs via Ollama

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False,
)

  7. Visualize results

lx.io.save_annotated_documents([result], output_name="results.jsonl", output_dir=".")

html_content = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content.data if hasattr(html_content, "data") else html_content)

Open visualization.html in a browser → color-coded annotations over the source text

  8. Key parameters reference

text_or_documents (str / URL / list): raw text, a URL to fetch, or a list of documents

prompt_description (str): natural-language extraction instructions

examples (list[ExampleData]): few-shot examples that define the schema

model_id (str): gemini-2.5-flash, gpt-4o, gemma2:2b, …

api_key (str): API key (overrides the LANGEXTRACT_API_KEY env var)

model_url (str): base URL for Ollama or custom endpoints

extraction_passes (int): independent extraction runs (default: 1)

max_workers (int): parallel chunk workers (default: 1)

max_char_buffer (int): characters per chunk

fence_output (bool): use JSON fencing instead of constrained decoding

use_schema_constraints (bool): controlled generation (Gemini default: True)

  9. Custom provider plugin

import langextract as lx


@lx.providers.registry.register(r'^mymodel', r'^custom')
class MyProviderLanguageModel(lx.inference.BaseLanguageModel):
    def __init__(self, model_id: str, api_key: str = None, **kwargs):
        self.client = MyProviderClient(api_key=api_key)

    def infer(self, batch_prompts, **kwargs):
        for prompt in batch_prompts:
            result = self.client.generate(prompt, **kwargs)
            yield [lx.inference.ScoredOutput(score=1.0, output=result)]

Package as a PyPI plugin with entry point:

[project.entry-points."langextract.providers"]
myprovider = "langextract_myprovider:MyProviderLanguageModel"

Disable all plugins: LANGEXTRACT_DISABLE_PLUGINS=1

  10. Use cases

Medical/clinical: medication names, dosages, and routes from clinical notes

Legal: clause extraction and party identification from contracts

Literary analysis: character, emotion, and relationship graphs

Finance: structured data extraction from earnings reports

Radiology: free-text radiology reports → structured format

Research: entity/relation extraction from academic papers

Best practices

  • Write precise prompts — specify "use exact text, do not paraphrase" to keep offsets accurate

  • Use few-shot examples — 2–3 examples covering edge cases dramatically improves accuracy

  • Tune max_char_buffer — smaller values (500–1000) give more focused context; larger values reduce API calls

  • Use extraction_passes=3 for long docs — independent runs catch entities missed in single pass

  • Set max_workers — parallelization dramatically speeds up long-document processing

  • Verify offsets — result.text[extraction.start:extraction.end] must equal extraction_text

  • Use visualization — HTML output makes it easy to spot extraction errors and coverage gaps

References

  • GitHub: google/langextract

  • PyPI: langextract

  • Google Developers Blog announcement

  • Provider Plugin README

  • Long-document Example

  • License: Apache 2.0
