langextract — LLM-Powered Structured Information Extraction

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.


Install the "langextract" skill with:

npx skills add akillness/oh-my-gods/akillness-oh-my-gods-langextract


Extract structured data from unstructured text with character-level provenance. Every extracted entity traces back to exact character offsets in the source document.

When to use this skill

  • Extracting entities, relationships, or facts from unstructured text

  • Processing clinical notes, legal documents, research papers, or reports

  • Building NLP pipelines that need citation-level traceability (not just extracted values)

  • Long-document extraction (chunking + parallel workers + multi-pass for recall)

  • Replacing fragile regex/rule-based extraction with LLM-driven schema enforcement

  • Generating interactive HTML visualizations of annotated text

  1. Installation

Standard install (Gemini backend — default)

pip install langextract

With OpenAI support

pip install langextract[openai]

Development

pip install -e ".[dev]"

API key setup:

export LANGEXTRACT_API_KEY="your-gemini-or-openai-key"

Gemini keys: https://aistudio.google.com/app/apikey

OpenAI keys: https://platform.openai.com/api-keys
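If the key is missing, extraction only fails once the first request is sent. A minimal pre-flight check can fail fast instead (a sketch; require_api_key is a hypothetical helper, not part of langextract):

```python
import os

def require_api_key(env_var: str = "LANGEXTRACT_API_KEY") -> str:
    """Return the configured API key, or fail fast with a clear message."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before calling lx.extract()."
        )
    return key
```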

  2. Core concepts

Source grounding: every extraction carries (start, end) character offsets into the original text

Controlled generation: Gemini uses schema-constrained decoding; no hallucinated field names

Few-shot examples: the schema is inferred from ExampleData objects; zero fine-tuning needed

Multi-pass extraction: extraction_passes=N runs N independent passes; results are merged

Parallel chunking: max_workers=N processes text chunks concurrently

  3. Basic extraction

import textwrap

import langextract as lx

prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Lady Juliet gazed longingly at the stars...",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

Access results

for extraction in result.extractions:
    print(f"[{extraction.extraction_class}] '{extraction.extraction_text}' "
          f"@ chars {extraction.start}–{extraction.end}")
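Because every extraction is grounded in character offsets, the result can be checked mechanically against the source text. A standalone sketch of that check (the Extraction dataclass below is a stand-in for langextract's own result objects, assuming start/end offsets as above):

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str
    start: int
    end: int

def offsets_are_grounded(source: str, extractions: list[Extraction]) -> bool:
    """True only if every extraction's offsets slice back to its exact text."""
    return all(source[e.start:e.end] == e.extraction_text for e in extractions)

source = "Lady Juliet gazed longingly at the stars..."
hits = [Extraction("character", "Lady Juliet", 0, 11)]
print(offsets_are_grounded(source, hits))  # True
```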

  4. Long-document extraction (URL input, multi-pass, parallel)

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # 3 independent runs, results merged
    max_workers=20,        # parallel chunk processing
    max_char_buffer=1000,  # smaller, focused context windows
)

Romeo & Juliet (147k chars / ~44k tokens) → 4,088 entities extracted
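Those run parameters imply a concrete request count, which is worth estimating before launching a long job. A back-of-envelope sketch (illustrative only: the real chunker respects text boundaries, so actual chunk counts can differ slightly):

```python
import math

def chunk_calls(total_chars: int, max_char_buffer: int, extraction_passes: int) -> int:
    """Rough number of model calls: chunks per pass times number of passes."""
    chunks = math.ceil(total_chars / max_char_buffer)
    return chunks * extraction_passes

# 147k chars at 1,000 chars per chunk is 147 chunks; 3 passes -> 441 calls,
# spread across max_workers=20 concurrent workers.
print(chunk_calls(147_000, 1_000, 3))  # 441
```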

  5. OpenAI backend

import os

import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    api_key=os.environ.get("OPENAI_API_KEY"),
    fence_output=True,
    use_schema_constraints=False,
)
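With fence_output=True the model is asked to wrap its JSON in a Markdown code fence rather than relying on Gemini-style constrained decoding, so the reply has to be unfenced before parsing. A minimal, illustrative stripper (not langextract's internal parser) shows the idea:

```python
import json
import re

# Matches an optional ```json ... ``` wrapper around the whole reply.
_FENCE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)

def parse_fenced_json(reply: str):
    """Strip a surrounding code fence, if any, then parse the JSON payload."""
    match = _FENCE.match(reply.strip())
    payload = match.group(1) if match else reply
    return json.loads(payload)
```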

  6. Local LLMs via Ollama

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False,
)

  7. Visualize results

lx.io.save_annotated_documents([result], output_name="results.jsonl", output_dir=".")

html_content = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content.data if hasattr(html_content, "data") else html_content)

Open visualization.html in a browser → color-coded annotations over the source text

  8. Key parameters reference

text_or_documents (str / URL / list): raw text, a URL to fetch, or a list of documents

prompt_description (str): natural-language extraction instructions

examples (list[ExampleData]): few-shot examples that define the schema

model_id (str): gemini-2.5-flash, gpt-4o, gemma2:2b, …

api_key (str): API key (overrides the LANGEXTRACT_API_KEY env var)

model_url (str): base URL for Ollama or custom endpoints

extraction_passes (int): independent extraction runs (default: 1)

max_workers (int): parallel chunk workers (default: 1)

max_char_buffer (int): characters per chunk

fence_output (bool): use JSON fencing instead of constrained decoding

use_schema_constraints (bool): controlled generation (Gemini default: True)

  9. Custom provider plugin

import langextract as lx


@lx.providers.registry.register(r'^mymodel', r'^custom')
class MyProviderLanguageModel(lx.inference.BaseLanguageModel):
    def __init__(self, model_id: str, api_key: str = None, **kwargs):
        self.client = MyProviderClient(api_key=api_key)

    def infer(self, batch_prompts, **kwargs):
        for prompt in batch_prompts:
            result = self.client.generate(prompt, **kwargs)
            yield [lx.inference.ScoredOutput(score=1.0, output=result)]

Package as a PyPI plugin with entry point:

[project.entry-points."langextract.providers"]
myprovider = "langextract_myprovider:MyProviderLanguageModel"

Disable all plugins: LANGEXTRACT_DISABLE_PLUGINS=1

  10. Use cases

Medical/clinical: medication names, dosages, and routes from clinical notes

Legal: clause extraction and party identification from contracts

Literary analysis: character, emotion, and relationship graphs

Finance: structured data extraction from earnings reports

Radiology: free-text radiology reports → structured format

Research: entity/relation extraction from academic papers

Best practices

  • Write precise prompts — specify "use exact text, do not paraphrase" to keep offsets accurate

  • Use few-shot examples — 2–3 examples covering edge cases dramatically improves accuracy

  • Tune max_char_buffer — smaller values (500–1000) give more focused context; larger values reduce API calls

  • Use extraction_passes=3 for long docs — independent runs catch entities missed in single pass

  • Set max_workers — parallelization dramatically speeds up long-document processing

  • Verify offsets — result.text[extraction.start:extraction.end] must equal extraction_text

  • Use visualization — HTML output makes it easy to spot extraction errors and coverage gaps

References

  • GitHub: google/langextract

  • PyPI: langextract

  • Google Developers Blog announcement

  • Provider Plugin README

  • Long-document Example

  • License: Apache 2.0
