hkgb

Use this skill when building hybrid Knowledge Graphs that integrate structured data (CSV, databases) with entities automatically extracted from unstructured documents (PDFs, text). The pattern establishes a reliable join key between domain graphs and lexical graphs, enabling GraphRAG, document ingestion with metadata enrichment, and Knowledge Graph construction from heterogeneous sources using the neo4j-graphrag SimpleKGPipeline.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "hkgb" with this command: npx skills add zpankz/mcp-skillset/zpankz-mcp-skillset-hkgb

Hybrid Knowledge Graph Bridge

Integration pattern for linking structured domain data to LLM-extracted lexical graphs

Problem

When building Knowledge Graphs from heterogeneous sources, two distinct graph types often need to coexist:

  1. Domain Graph — Structured, curated data from CSV/databases representing business entities and relationships
  2. Lexical Graph — Entities and relationships automatically extracted from unstructured documents via LLM

These graphs speak different languages: one is schema-driven and deterministic, the other is probabilistic and emergent. Without a deliberate bridge, they remain disconnected silos.

Solution

The solution establishes a reliable join key between both graphs through five steps.

Step 1: Specify the lexical graph schema

Before extraction, define the ontology that guides the LLM. This specification comprises three elements.

Node Types — The entities to extract. Some are simple labels, others are enriched with descriptions (to guide the LLM) and typed properties:

NODE_TYPES = [
    "Entity",           # Simple label
    "Concept",
    "Process",
    {                   # Enriched with description
        "label": "Outcome",
        "description": "A result, benefit, or consequence of a process or action."
    },
    {                   # With typed properties
        "label": "Reference",
        "description": "An external resource such as a document, article, or dataset.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True},
            {"name": "type", "type": "STRING"}
        ]
    },
]

Relationship Types — The possible verbs between entities:

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "REFERENCES"
]

Patterns — The valid combinations. The LLM can only extract conforming triplets:

PATTERNS = [
    ("Entity", "RELATED_TO", "Entity"),
    ("Concept", "RELATED_TO", "Entity"),
    ("Process", "PART_OF", "Entity"),
    ("Process", "LEADS_TO", "Outcome"),
    ("Reference", "REFERENCES", "Entity"),
]
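To make the pattern constraint concrete, here is a toy conformance check. The `conforms` helper is illustrative only; the actual filtering happens inside the extraction pipeline, not in user code:

```python
# Illustrative only: the pipeline enforces these patterns internally.
PATTERNS = [
    ("Entity", "RELATED_TO", "Entity"),
    ("Concept", "RELATED_TO", "Entity"),
    ("Process", "PART_OF", "Entity"),
    ("Process", "LEADS_TO", "Outcome"),
    ("Reference", "REFERENCES", "Entity"),
]

def conforms(source: str, rel: str, target: str) -> bool:
    """Return True if the candidate triplet matches an allowed pattern."""
    return (source, rel, target) in PATTERNS

# Direction matters: (Process)-[:LEADS_TO]->(Outcome) conforms,
# but the reversed triplet does not.
```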

Step 2: Configure the extraction pipeline

The pipeline assembles the LLM, embedder, text splitter, and schema:

from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

The pipeline performs: PDF → chunks → schema-guided LLM extraction → node/relationship creation → embeddings.
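As a rough sketch of what fixed-size splitting with overlap does, consider this simplified stand-in (not the actual FixedSizeSplitter implementation, whose details may differ):

```python
def split_fixed(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed("x" * 1200)
# 1200 chars with step 400 -> chunks starting at offsets 0, 400, 800
```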

Step 3: Transform the structured source into a dictionary

Each row of the CSV (representing the domain graph) becomes a Python dictionary:

with open(os.path.join(data_path, "metadata.csv"), encoding="utf8", newline="") as f:
    records = list(csv.DictReader(f))  # materialize before the file closes
# Each row: {"filename": "doc1.pdf", "category": "...", "author": "...", ...}
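A self-contained illustration of what `csv.DictReader` yields, using an in-memory CSV in place of metadata.csv (the column names here are assumptions):

```python
import csv
import io

# In-memory stand-in for metadata.csv.
sample = "filename,category,author\ndoc1.pdf,report,Alice\n"
records = list(csv.DictReader(io.StringIO(sample)))
# Each row becomes a dict keyed by the header row.
```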

Step 4: Add the common key to the dictionary

The pipeline creates Document nodes with a path property. This property serves as the bridge between the two graphs. Enrich the dictionary with a key that matches exactly what the pipeline stores:

record["file_path"] = os.path.join(data_path, record["filename"])
# The same value passed to the pipeline becomes Document.path

The same value is then passed to the pipeline, which generates the lexical graph:

result = asyncio.run(
    kg_builder.run_async(file_path=record["file_path"])
)
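Since the whole bridge rests on exact string equality, computing the key in a single place prevents drift between the pipeline call and the Cypher parameters. A minimal sketch (the `bridge_key` helper is a hypothetical convenience, not part of neo4j-graphrag):

```python
import os

def bridge_key(data_path: str, filename: str) -> str:
    # Single source of truth for the join key that becomes Document.path.
    return os.path.join(data_path, filename)

record = {"filename": "doc1.pdf"}
record["file_path"] = bridge_key("data", record["filename"])
# Use record["file_path"] both for kg_builder.run_async(...) and as the
# $file_path Cypher parameter, so the two can never diverge.
```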

Step 5: Join the two graphs via Cypher

A query uses the common key to attach the domain graph to the lexical graph:

MATCH (d:Document {path: $file_path})
MERGE (e:DomainEntity {id: $entity_id})
SET e.category = $category,
    e.author = $author
MERGE (d)-[:BELONGS_TO]->(e)

The enriched dictionary is passed as parameters:

neo4j_driver.execute_query(cypher, parameters_=record)

Consequences

The pattern works because the dictionary key and Document.path contain identical values. This implicit key connects the lexical graph (entities extracted according to the specified schema) to the domain graph (business structure from structured sources). If these values diverge, the bridge fails silently — orphaned nodes accumulate undetected.
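A small demonstration of how easily the exact-match requirement can break (the paths are made up):

```python
import os

a = os.path.join("data", "doc1.pdf")
b = "data/./doc1.pdf"  # semantically the same file, different string

# The Cypher MATCH compares strings, so these two keys would not join:
assert a != b
# Normalizing both sides before use restores the match:
assert os.path.normpath(a) == os.path.normpath(b)
```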

Verification

To ensure the bridge holds, verify that Document nodes are properly attached:

// Orphan documents (broken bridge)
MATCH (d:Document)
WHERE NOT EXISTS { (d)-[:BELONGS_TO]->(:DomainEntity) }
RETURN d.path AS orphan

// Domain entities without documents (bridge never built)
MATCH (e:DomainEntity)
WHERE NOT EXISTS { (:Document)-[:BELONGS_TO]->(e) }
RETURN e.id AS missing
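The orphan check above can be illustrated with in-memory sets standing in for the database (paths are made up):

```python
# Document.path values the pipeline created:
document_paths = {"data/doc1.pdf", "data/doc2.pdf"}
# Documents that received a BELONGS_TO edge in the join step:
bridged = {"data/doc1.pdf"}

# Set difference mirrors the "NOT EXISTS" Cypher predicate.
orphans = document_paths - bridged
```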

Complete Reference

For a complete implementation example, see references/full_example.py.


Related Skills

Related by shared tags or category signals:

  - network-meta-analysis-appraisal
  - csv-analysis
  - knowledge-orchestrator
  - research