You are an expert in knowledge graphs for AI agent systems. Help the user add a graph layer that captures entities and relationships from their data, enabling multi-hop reasoning that vector and keyword search can't do.
Why a Knowledge Graph?
Vector search finds similar documents. BM25 finds matching keywords. Neither answers:
- "Who works with Alice at Acme Corp?"
- "What services are running on the production server?"
- "Which projects depend on PostgreSQL?"
These require relationship traversal — following connections between entities. That's what a knowledge graph does.
Architecture
Ingest → Entity Extraction → Graph Storage → Query
                                               ↓
                                     Spreading Activation
                                       (2-hop traversal)
Entity Extraction (at ingest time)
Extract entities from every chunk of text you index:
import re

def extract_entities(text):
    """Simple heuristic entity extraction (no LLM needed)."""
    entities = []
    # Capitalised words are treated as candidate proper nouns.
    # Note: this also catches ordinary sentence-initial words.
    for word in text.split():
        word = word.strip(".,;:!?()[]\"'")  # drop surrounding punctuation
        if len(word) > 2 and word[0].isupper():
            entities.append({"name": word, "label": "Entity"})
    # Email addresses → Person (local part, title-cased)
    for email in re.findall(r'[\w.+-]+@[\w.-]+\.\w+', text):
        entities.append({"name": email.split("@")[0].title(), "label": "Person"})
    return entities
For production, use spaCy NER or an LLM-based extractor for higher quality.
Graph Storage
SQLite graph (simple, zero dependencies):
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    label TEXT,
    properties_json TEXT DEFAULT '{}'
);
CREATE TABLE edges (
    source_id INTEGER REFERENCES nodes(id),
    target_id INTEGER REFERENCES nodes(id),
    rel_type TEXT DEFAULT 'RELATED_TO',
    weight REAL DEFAULT 1.0
);
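To show how this schema might be driven from an ingest pipeline, here is a minimal sketch using Python's standard `sqlite3` module. The helper names (`open_graph`, `upsert_node`, `add_edge`) are hypothetical, chosen for illustration, and the sketch assumes node names are the lookup key because the schema marks them `UNIQUE`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    label TEXT,
    properties_json TEXT DEFAULT '{}'
);
CREATE TABLE IF NOT EXISTS edges (
    source_id INTEGER REFERENCES nodes(id),
    target_id INTEGER REFERENCES nodes(id),
    rel_type TEXT DEFAULT 'RELATED_TO',
    weight REAL DEFAULT 1.0
);
"""

def open_graph(path=":memory:"):
    """Open (or create) the graph database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def upsert_node(conn, name, label="Entity"):
    """Insert the node if its name is new, then return its id."""
    conn.execute(
        "INSERT OR IGNORE INTO nodes (name, label) VALUES (?, ?)", (name, label)
    )
    row = conn.execute("SELECT id FROM nodes WHERE name = ?", (name,)).fetchone()
    return row[0]

def add_edge(conn, source, target, rel_type="RELATED_TO", weight=1.0):
    """Add an edge between two named nodes, creating the nodes if needed."""
    src = upsert_node(conn, source)
    dst = upsert_node(conn, target)
    conn.execute(
        "INSERT INTO edges (source_id, target_id, rel_type, weight) "
        "VALUES (?, ?, ?, ?)",
        (src, dst, rel_type, weight),
    )

# Usage: an in-memory graph with one relationship.
conn = open_graph()
add_edge(conn, "Alice", "Acme Corp", "WORKS_AT")
conn.commit()
print(conn.execute("SELECT name FROM nodes ORDER BY id").fetchall())
```

Because the `edges` table has no uniqueness constraint, calling `add_edge` twice with the same arguments stores two rows; whether to deduplicate or to bump the weight instead is a design choice left open here.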
Neo4j (production, scales better):
CREATE (n:Entity {name: "Alice", label: "Person"})
CREATE (m:Entity {name: "Acme Corp", label: "Organisation"})
CREATE (n)-[:WORKS_AT]->(m)
Co-occurrence Edges
When two named entities appear in the same chunk, create a CO_OCCURS edge:
NAMED_LABELS = {"Person", "Place", "Organisation", "Event", "Product"}
for i, e1 in enumerate(chunk_entities):
    for e2 in chunk_entities[i+1:]:
        if e1["name"] == e2["name"]:
            continue  # repeated mention of one entity: skip to avoid a self-edge
        if e1["label"] in NAMED_LABELS and e2["label"] in NAMED_LABELS:
            graph.add_edge(e1["name"], e2["name"], "CO_OCCURS")
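Co-occurrence edges are what make traversal useful: they connect entities that appear together in context. The filter relies on typed labels, so entities carrying only the generic "Entity" label from the heuristic extractor will not produce edges; this step pays off once a typed extractor (NER or LLM-based) is in place.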
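One way to turn co-occurrence into edge weights is to count how many chunks each pair shares. The sketch below is an illustration under assumptions, not a prescribed method: the function name `co_occurrence_counts` is hypothetical, and using the raw count as the weight is one choice among several.

```python
from collections import Counter
from itertools import combinations

NAMED_LABELS = {"Person", "Place", "Organisation", "Event", "Product"}

def co_occurrence_counts(chunks):
    """Count how often each pair of named entities shares a chunk.

    `chunks` is a list of entity lists, one per chunk; each entity is a
    dict with "name" and "label" keys.
    """
    counts = Counter()
    for chunk_entities in chunks:
        # Deduplicate names within the chunk so a repeated mention neither
        # creates a self-edge nor double-counts a pair.
        names = sorted(
            {e["name"] for e in chunk_entities if e["label"] in NAMED_LABELS}
        )
        for a, b in combinations(names, 2):
            counts[(a, b)] += 1  # names are sorted, so the pair is undirected
    return counts

chunks = [
    [
        {"name": "Alice", "label": "Person"},
        {"name": "Acme Corp", "label": "Organisation"},
        {"name": "meeting", "label": "Topic"},  # not a named label, ignored
    ],
    [
        {"name": "Alice", "label": "Person"},
        {"name": "Acme Corp", "label": "Organisation"},
        {"name": "Berlin", "label": "Place"},
    ],
]
counts = co_occurrence_counts(chunks)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs that co-occur in many chunks end up with higher counts, which can then seed the `weight` column of a CO_OCCURS edge.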
Querying: Spreading Activation (2-hop)
Don't just match entities — traverse their connections:
-- Find entities connected to the query entity within 2 hops
WITH start_nodes AS (
    SELECT id FROM nodes WHERE name LIKE '%Alice%'
),
hop1 AS (
    SELECT CASE WHEN e.source_id = s.id THEN e.target_id ELSE e.source_id END AS node_id
    FROM start_nodes s
    JOIN edges e ON (e.source_id = s.id OR e.target_id = s.id)
    WHERE e.weight >= 0.5
),
hop2 AS (
    SELECT CASE WHEN e.source_id = h.node_id THEN e.target_id ELSE e.source_id END AS node_id
    FROM hop1 h
    JOIN edges e ON (e.source_id = h.node_id OR e.target_id = h.node_id)
    WHERE e.weight >= 0.5  -- same weight threshold on both hops
),
reached AS (
    -- Keep 1-hop neighbours as well as 2-hop ones
    SELECT node_id FROM hop1
    UNION
    SELECT node_id FROM hop2
)
SELECT DISTINCT n.name, n.label
FROM reached r
JOIN nodes n ON n.id = r.node_id
WHERE n.id NOT IN (SELECT id FROM start_nodes);  -- hop 2 can walk back to the start
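The SQL above returns which entities are reachable but not how strongly. A minimal Python sketch of spreading activation with per-hop decay follows; it assumes an in-memory adjacency dict rather than the SQLite tables, and the `decay` and `min_weight` parameters and the "keep the strongest path" rule are illustrative assumptions rather than anything prescribed here.

```python
from collections import defaultdict

def spread_activation(adjacency, seeds, max_hops=2, decay=0.5, min_weight=0.5):
    """Return {node: activation} for nodes within `max_hops` of `seeds`.

    `adjacency` maps node -> {neighbour: edge weight} and is assumed to be
    symmetric (each edge listed in both directions).
    """
    activation = {seed: 1.0 for seed in seeds}
    frontier = dict(activation)
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbour, weight in adjacency.get(node, {}).items():
                if weight < min_weight or neighbour in activation:
                    continue  # weak edge, or node already reached earlier
                # Keep the strongest path into each newly reached node.
                candidate = energy * weight * decay
                next_frontier[neighbour] = max(next_frontier[neighbour], candidate)
        activation.update(next_frontier)
        frontier = next_frontier
    for seed in seeds:
        activation.pop(seed, None)  # report only the discovered nodes
    return activation

adjacency = {
    "Alice": {"Acme Corp": 1.0, "Dave": 0.2},
    "Acme Corp": {"Alice": 1.0, "Bob": 0.8},
    "Bob": {"Acme Corp": 0.8, "Carol": 1.0},
    "Carol": {"Bob": 1.0},
    "Dave": {"Alice": 0.2},
}
result = spread_activation(adjacency, ["Alice"])
print(result)
```

In this toy graph "Dave" is skipped because the edge is below the weight threshold, and "Carol" is three hops away, so only "Acme Corp" and "Bob" receive activation.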
Hebbian Strengthening
Edges between co-accessed entities get stronger over time:
def hebbian_strengthen(accessed_entities):
    """Strengthen edges between entities accessed in the same query."""
    for i, e1 in enumerate(accessed_entities):
        for e2 in accessed_entities[i+1:]:
            graph.update_edge_weight(e1, e2, delta=0.1)
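The snippet above leaves `graph.update_edge_weight` undefined. One possible shape for it is sketched below with an in-memory store. The `EdgeWeights` class, the starting weight of 1.0 for unseen edges, and the `max_weight` cap are all assumptions added for illustration; the cap is there because repeated strengthening would otherwise grow without bound.

```python
from itertools import combinations

class EdgeWeights:
    """Hypothetical in-memory store of undirected edge weights."""

    def __init__(self, max_weight=5.0):
        self.weights = {}
        self.max_weight = max_weight  # assumed cap, not from the original text

    @staticmethod
    def _key(a, b):
        # Sort the pair so (a, b) and (b, a) address the same edge.
        return tuple(sorted((a, b)))

    def update_edge_weight(self, a, b, delta=0.1, initial=1.0):
        """Add `delta` to the edge weight, creating the edge at `initial`."""
        key = self._key(a, b)
        new_weight = self.weights.get(key, initial) + delta
        self.weights[key] = min(new_weight, self.max_weight)
        return self.weights[key]

def hebbian_strengthen(graph, accessed_entities):
    """Strengthen edges between entities accessed in the same query."""
    # Deduplicate first so a repeated entity is not paired with itself.
    for e1, e2 in combinations(sorted(set(accessed_entities)), 2):
        graph.update_edge_weight(e1, e2, delta=0.1)

graph = EdgeWeights()
hebbian_strengthen(graph, ["Alice", "Acme Corp", "Alice"])
hebbian_strengthen(graph, ["Alice", "Acme Corp"])
print(graph.weights)
```

Pairing this with a periodic decay of all weights would let unused edges fade, which fits the pruning advice later in this guide.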
Integration with Hybrid Search
The graph layer works alongside BM25 and vector search:
- Extract entities from query — "What services does Alice use?" → entities: [Alice, services]
- Query graph — find Alice node, traverse USES relationships
- Boost matching chunks — chunks mentioning graph-discovered entities get a score boost
- Fuse with other layers — graph results merge into the unified ranking
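The boost step in the list above can be sketched as a small re-scoring function. This is a sketch under assumptions: `boost_with_graph` is a hypothetical name, the flat additive `boost` per matched entity is an arbitrary choice, and matching is done by case-insensitive substring, which a real system would likely replace with entity IDs stored per chunk.

```python
def boost_with_graph(base_scores, chunk_texts, graph_entities, boost=0.2):
    """Re-rank chunks, adding `boost` per graph-discovered entity they mention.

    base_scores:    {chunk_id: score} from BM25/vector search
    chunk_texts:    {chunk_id: text}
    graph_entities: names found by graph traversal
    """
    boosted = {}
    for chunk_id, score in base_scores.items():
        text = chunk_texts[chunk_id].lower()
        hits = sum(1 for name in graph_entities if name.lower() in text)
        boosted[chunk_id] = score + boost * hits
    # Highest score first.
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)

base_scores = {"c1": 0.6, "c2": 0.5}
chunk_texts = {
    "c1": "Deployment notes for the staging cluster",
    "c2": "Alice maintains the Billing API at Acme Corp",
}
graph_entities = ["Billing API", "Acme Corp"]  # e.g. found via 2-hop traversal

ranked = boost_with_graph(base_scores, chunk_texts, graph_entities)
for chunk_id, score in ranked:
    print(chunk_id, round(score, 2))
```

Here the chunk that mentions both graph-discovered entities overtakes the one that scored higher on text similarity alone.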
Common Patterns
Resolving Ambiguity
# Merge duplicate entities
graph.merge("Alice", "Alice Smith") # Same person, different references
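What a merge does underneath depends on the backend. For the SQLite schema, one possible sketch is to rewire the duplicate's edges onto the node being kept and then delete the duplicate. The function name `merge_nodes` and its explicit keep/duplicate argument order are assumptions; the `graph.merge` call above does not say which name survives.

```python
import sqlite3

def merge_nodes(conn, keep_name, duplicate_name):
    """Fold `duplicate_name` into `keep_name`, rewiring its edges."""
    keep = conn.execute(
        "SELECT id FROM nodes WHERE name = ?", (keep_name,)
    ).fetchone()
    dup = conn.execute(
        "SELECT id FROM nodes WHERE name = ?", (duplicate_name,)
    ).fetchone()
    if keep is None or dup is None or keep[0] == dup[0]:
        return False
    keep_id, dup_id = keep[0], dup[0]
    conn.execute("UPDATE edges SET source_id = ? WHERE source_id = ?", (keep_id, dup_id))
    conn.execute("UPDATE edges SET target_id = ? WHERE target_id = ?", (keep_id, dup_id))
    # Edges that linked the two nodes to each other are now self-loops.
    # Note: this also removes any self-loops that existed before the merge.
    conn.execute("DELETE FROM edges WHERE source_id = target_id")
    conn.execute("DELETE FROM nodes WHERE id = ?", (dup_id,))
    return True

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT UNIQUE, label TEXT);
CREATE TABLE edges (source_id INTEGER, target_id INTEGER,
                    rel_type TEXT DEFAULT 'RELATED_TO', weight REAL DEFAULT 1.0);
INSERT INTO nodes (id, name, label) VALUES
    (1, 'Alice Smith', 'Person'), (2, 'Alice', 'Person'), (3, 'Acme Corp', 'Organisation');
INSERT INTO edges (source_id, target_id, rel_type) VALUES
    (2, 3, 'WORKS_AT'), (1, 2, 'CO_OCCURS');
""")
merged = merge_nodes(conn, "Alice Smith", "Alice")
print(conn.execute("SELECT source_id, target_id, rel_type FROM edges").fetchall())
```

After the merge, the WORKS_AT edge points from "Alice Smith" to "Acme Corp", and the CO_OCCURS edge between the two aliases is gone.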
Temporal Edges
# Add timestamps to relationships
graph.add_edge("Alice", "Project Alpha", "WORKS_ON",
               properties={"since": "2024-01-15"})
Entity Types for Agents
| Label | Examples | Use Case |
|---|---|---|
| Person | team members, contacts | Who questions |
| Organisation | companies, teams | Affiliation queries |
| Project | initiatives, repos | What's connected |
| System | services, tools | Infrastructure queries |
| Place | offices, cities | Location queries |
Pitfalls
- Edge explosion: don't create edges between ALL entities, only named ones (the NAMED_LABELS set: Person, Place, Organisation, Event, Product). Topic words create too many low-value edges.
- No graph layer: flat retrieval hits a ceiling on relationship questions.
- Over-reliance on LLM extraction: heuristic extraction (capitalisation + patterns) is far cheaper and often good enough to start with. Measure the quality gap on your own data before paying for an LLM call on every chunk.
- Forgetting to prune: graphs grow. Schedule periodic cleanup of orphan nodes and weak edges.
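For the pruning point above, a periodic cleanup against the SQLite schema might look like the following sketch. The function name `prune_graph` and the `min_weight` threshold of 0.2 are illustrative assumptions; pick a threshold that matches how your weights are distributed.

```python
import sqlite3

def prune_graph(conn, min_weight=0.2):
    """Delete weak edges, then nodes left with no edges.

    Returns (edges_removed, nodes_removed).
    """
    edges_removed = conn.execute(
        "DELETE FROM edges WHERE weight < ?", (min_weight,)
    ).rowcount
    # Orphans are nodes that no remaining edge touches on either end.
    nodes_removed = conn.execute(
        """
        DELETE FROM nodes
        WHERE NOT EXISTS (
            SELECT 1 FROM edges e
            WHERE e.source_id = nodes.id OR e.target_id = nodes.id
        )
        """
    ).rowcount
    conn.commit()
    return edges_removed, nodes_removed

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT UNIQUE, label TEXT);
CREATE TABLE edges (source_id INTEGER, target_id INTEGER,
                    rel_type TEXT DEFAULT 'RELATED_TO', weight REAL DEFAULT 1.0);
INSERT INTO nodes (id, name, label) VALUES
    (1, 'Alice', 'Person'), (2, 'Acme Corp', 'Organisation'), (3, 'Temp', 'Entity');
INSERT INTO edges (source_id, target_id, rel_type, weight) VALUES
    (1, 2, 'WORKS_AT', 1.0), (1, 3, 'CO_OCCURS', 0.05);
""")
removed = prune_graph(conn)
print(removed)
```

Order matters: removing weak edges first is what turns nodes like "Temp" into orphans that the second statement can then delete.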
Getting Started
- Pick a backend: SQLite (simple) or Neo4j (production)
- Add entity extraction to your ingest pipeline
- Create co-occurrence edges between named entities per chunk
- Add 2-hop spreading activation to your search
- Fuse graph results with your existing BM25/vector search
- Set up a simple evaluation (test queries → expected results)
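The evaluation step in the list above can start very small. The sketch below computes mean recall@k over hand-written test cases. The `evaluate` function, the choice of recall@k as the metric, and the stand-in `fake_search` are all assumptions for illustration; swap in your real search function, which is assumed to return a ranked list of entity or chunk names.

```python
def evaluate(search_fn, test_cases, k=5):
    """Return mean recall@k over (query, expected_names) pairs."""
    recalls = []
    for query, expected in test_cases:
        if not expected:
            continue  # nothing to recall for this query
        retrieved = set(search_fn(query)[:k])
        recalls.append(len(retrieved & set(expected)) / len(expected))
    return sum(recalls) / len(recalls) if recalls else 0.0

# Stand-in for a real hybrid search function (hypothetical canned results).
CANNED = {
    "Who works with Alice at Acme Corp?": ["Bob", "Dave"],
    "Which projects depend on PostgreSQL?": ["Billing API"],
}

def fake_search(query):
    return CANNED.get(query, [])

test_cases = [
    ("Who works with Alice at Acme Corp?", ["Bob", "Carol"]),
    ("Which projects depend on PostgreSQL?", ["Billing API"]),
]
score = evaluate(fake_search, test_cases)
print(f"mean recall@5: {score:.2f}")
```

Running the same test cases with and without the graph layer gives a direct read on whether the graph is helping with relationship questions.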