HeteroMind
Unified heterogeneous knowledge QA system with automatic source detection and multi-stage reasoning.
Core Concept
Natural language queries are automatically routed to the appropriate knowledge source (SQL database, Knowledge Graph, or table files) without requiring users to specify the data source. A 4-layer detection architecture identifies the correct source; query generation then runs in multiple stages with self-revision and voting.
User Query → Source Detection (4 layers) → Query Generation → Self-Revision → Voting → Execution → Answer
When to Use
| Trigger | Action |
|---|---|
| "How many employees in X?" | NL2SQL engine |
| "Who is the founder of X?" | NL2SPARQL engine (KG) |
| "Which quarter had highest sales?" | TableQA engine |
| "Show average salary by department" | Auto-detect SQL |
| Queries with aggregations, filters, joins | Route to SQL |
| Entity relationship queries | Route to KG |
| Questions about CSV/Excel files | Route to TableQA |
| Multi-hop queries across sources | Decompose + fuse |
Architecture
4-Layer Source Detection
Layer 1 (15%): Rule-Based
- 20+ keywords per source type
- 7 regex patterns (aggregation, comparison, relation)
- Fast pre-filtering
Layer 2 (35%): LLM Semantic
- Intent classification
- Entity/predicate detection
- Multi-hop identification
Layer 3a (25%): SQL Schema Match
- Inverted index on tables/columns
- Automatic JOIN inference
- Confidence scoring
Layer 3b (25%): KG Entity Link
- Entity mention extraction
- SPARQL endpoint lookup
- Predicate pattern matching
Layer 3c (25%, +30% boost): Entity Verification
- Cross-source entity existence check
- 30% score boost for verified entities
Layer 4: Multi-Source Fusion
- Weighted aggregation of per-layer scores (see the sketch below)
- Execution plan generation
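The actual fusion logic lives in src/classifier/source_fusion.py; the sketch below is only a simplified, hypothetical illustration of how per-layer scores could be combined using the weights from the configuration section, with the verification boost applied to entity-verified sources (function and variable names are illustrative, not the module's API).

WEIGHTS = {
    "rule_based": 0.15,    # Layer 1
    "llm_based": 0.35,     # Layer 2
    "schema_based": 0.25,  # Layer 3a/3b
    "verification": 0.25,  # Layer 3c
}
VERIFICATION_BOOST = 0.3

def fuse_scores(layer_scores: dict[str, dict[str, float]], verified: set[str]) -> dict[str, float]:
    """layer_scores maps layer name -> {source: score in [0, 1]}."""
    fused: dict[str, float] = {}
    for layer, weight in WEIGHTS.items():
        for source, score in layer_scores.get(layer, {}).items():
            fused[source] = fused.get(source, 0.0) + weight * score
    for source in verified:  # 30% boost for sources whose entities passed Layer 3c
        if source in fused:
            fused[source] *= 1 + VERIFICATION_BOOST
    return fused

scores = fuse_scores(
    {
        "rule_based": {"sql": 0.8, "kg": 0.1},
        "llm_based": {"sql": 0.7, "kg": 0.2},
        "schema_based": {"sql": 0.9},
        "verification": {"sql": 1.0},
    },
    verified={"sql"},
)
print(max(scores, key=scores.get))  # -> sql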
Query Generation Pipeline
1. Schema/Entity Linking → Identify relevant tables/columns/entities
2. Parallel Generation → Generate 3 candidates concurrently (see the sketch after this list)
3. Multi-Round Revision → 2 rounds of self-review
4. Validation → Syntax and semantic checks
5. Voting → Select best candidate
6. Execution → Run query
7. Result Verification → Validate reasonableness
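The outline below illustrates steps 2 through 6 of this pipeline. It is a hypothetical sketch, not the engines' actual code; generate, revise, validate, and execute are stand-in callables for the internal LLM and execution helpers.

import asyncio
from collections import Counter
from typing import Awaitable, Callable

async def generate_and_vote(
    question: str,
    generate: Callable[[str], Awaitable[str]],
    revise: Callable[[str, str], Awaitable[str]],
    validate: Callable[[str], bool],
    execute: Callable[[str], object],
    num_candidates: int = 3,
    max_revisions: int = 2,
):
    # 2. Parallel generation of candidates
    candidates = await asyncio.gather(*(generate(question) for _ in range(num_candidates)))
    # 3. Multi-round self-revision
    for _ in range(max_revisions):
        candidates = await asyncio.gather(*(revise(q, question) for q in candidates))
    # 4. Drop candidates that fail syntax/semantic validation
    candidates = [q for q in candidates if validate(q)]
    # 5. Voting: execute every candidate, keep the one whose result agrees with the majority
    results = {q: execute(q) for q in candidates}
    majority, _ = Counter(str(r) for r in results.values()).most_common(1)[0]
    best = next(q for q, r in results.items() if str(r) == majority)
    # 6./7. The winning query and its result proceed to result verification
    return best, results[best]

Voting on execution results (rather than on query text) is one common self-consistency strategy; the engines may score candidates differently.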
Engines
NL2SQL Engine
from src.engines.nl2sql.multi_stage_engine import MultiStageNL2SQLEngine

engine = MultiStageNL2SQLEngine({
    "name": "sql_engine",
    "schema": schema,
    "llm_config": {
        "model": "deepseek-chat",
        "api_key": "sk-...",
    },
    "generation_config": {
        "num_candidates": 3,
        "max_revisions": 2,
        "parallel_generation": True,
    },
})

result = await engine.execute("How many employees in Engineering?", {})
Features:
- Schema linking (rule-based + LLM)
- Parallel SQL candidate generation
- Multi-round self-revision
- Voting mechanism
- Result verification
NL2SPARQL Engine
from src.engines.nl2sparql.multi_stage_engine import MultiStageNL2SPARQLEngine

engine = MultiStageNL2SPARQLEngine({
    "name": "sparql_engine",
    "endpoint_url": "https://dbpedia.org/sparql",
    "ontology": ontology,
    "llm_config": {"model": "gpt-4", "api_key": "sk-..."},
})

result = await engine.execute("Who founded Microsoft?", {})
Features:
- Entity linking to KG
- Ontology retrieval
- SPARQL generation with revision
- Multi-endpoint support
TableQA Engine
from src.engines.table_qa.multi_stage_engine import MultiStageTableQAEngine

engine = MultiStageTableQAEngine({
    "name": "table_engine",
    "table_path": "data/sales.csv",
    "llm_config": {"model": "deepseek-chat", "api_key": "sk-..."},
})

result = await engine.execute("Which quarter had highest sales?", {})
Features:
- Table schema analysis
- Query intent interpretation
- Pandas code generation
- Safe execution sandbox (illustrated below)
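The snippet below is a rough illustration of what a restricted execution namespace for generated pandas code can look like; it is not the engine's actual sandbox, and stripping builtins is not a hardened security boundary.

import pandas as pd

def run_generated_code(code: str, df: pd.DataFrame):
    # Execute generated code with no builtins and only pd/df in scope;
    # the code is expected to assign its answer to `result`.
    allowed_globals = {"__builtins__": {}, "pd": pd}
    local_vars = {"df": df}
    exec(code, allowed_globals, local_vars)
    return local_vars.get("result")

df = pd.read_csv("data/sales.csv")  # assumes the sample table from above
print(run_generated_code("result = df.groupby('quarter')['sales'].sum().idxmax()", df))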
Multi-LLM Support
Override model and API key at runtime:
# Initialize with default
engine = MultiStageNL2SQLEngine({
    "llm_config": {"model": "deepseek-chat", "api_key": "sk-deepseek-key"},
})

# Override per-call
result = await engine.execute(
    query="Complex query",
    context={},
    model="gpt-4-turbo",      # Override model
    api_key="sk-openai-key",  # Override API key
)
Supported Providers
| Provider | Models | Configuration |
|---|---|---|
| DeepSeek | deepseek-chat | base_url: https://api.deepseek.com/v1 |
| OpenAI | gpt-4, gpt-3.5-turbo | Default endpoint |
| Azure OpenAI | gpt-4 | base_url: https://{resource}.openai.azure.com |
| Local (Ollama) | llama2, mistral | base_url: http://localhost:11434/v1 |
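For example, pointing the NL2SQL engine at a local Ollama server only requires a base_url override. The model name and placeholder API key below are assumptions (Ollama's OpenAI-compatible endpoint typically ignores the key), not values shipped with the project:

# Local Ollama example: any OpenAI-compatible endpoint is configured the same way.
engine = MultiStageNL2SQLEngine({
    "name": "sql_engine_local",
    "schema": schema,
    "llm_config": {
        "model": "mistral",
        "api_key": "ollama",  # placeholder; local servers usually do not validate it
        "base_url": "http://localhost:11434/v1",
    },
})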
Configuration
LLM Configuration
llm_config:
  model: deepseek-chat
  api_key: sk-...
  base_url: https://api.deepseek.com/v1  # Optional
  temperature: 0.1
  max_tokens: 500
  timeout: 30
Generation Configuration
generation_config:
  num_candidates: 3           # SQL/SPARQL candidates to generate
  max_revisions: 2            # Self-revision rounds
  parallel_generation: true   # Concurrent candidate generation
  voting_enabled: true        # Multi-candidate voting
Source Detection Weights
weights:
  rule_based: 0.15         # Layer 1
  llm_based: 0.35          # Layer 2
  schema_based: 0.25       # Layer 3a/3b
  verification: 0.25       # Layer 3c
  verification_boost: 0.3  # 30% boost for verified entities
Workflows
Complete Query Flow
from src.orchestrator import HeteroMindOrchestrator

orchestrator = HeteroMindOrchestrator({
    "source_detection": {
        "layer2": {"api_key": "sk-...", "model": "gpt-4"},
        "layer3": {"schemas": [schema], "kg_endpoints": [...]},
    },
    "engines": {
        "sql": [{"name": "default", "enabled": True}],
        "sparql": [{"name": "default", "enabled": True}],
        "table_qa": [{"name": "default", "enabled": True}],
    },
})

response = await orchestrator.query("How many employees in Engineering?")
print(f"Answer: {response.answer}")
print(f"Source: {response.sources}")
print(f"Confidence: {response.confidence:.2f}")
Source Detection Only
from src.classifier import SourceDetectorOrchestrator

detector = SourceDetectorOrchestrator({
    "layer2": {"api_key": "sk-...", "model": "gpt-4"},
    "layer3": {"schemas": [schema]},
})

decision = await detector.detect("How many employees?")
print(f"Primary Source: {decision.primary_source.value}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Execution Plan: {decision.execution_plan}")
Test Results
| Engine | Tests | Passed | Accuracy | Avg Confidence | Avg Time |
|---|---|---|---|---|---|
| SQL (NL2SQL) | 3 | 3 | 100.0% | 0.60 | 22.5s |
| SPARQL (NL2SPARQL) | 2 | 2 | 100.0% | 0.20 | 36.3s |
| TableQA | 3 | 3 | 100.0% | 0.62 | 24.2s |
| Overall | 8 | 8 | 100.0% | 0.51 | 26.6s |
Environment Variables
Required (for LLM-based generation)
| Variable | Description | Example |
|---|---|---|
| DEEPSEEK_API_KEY | DeepSeek API key | sk-... |
| OPENAI_API_KEY | OpenAI API key | sk-... |
Optional (for specific features)
| Variable | Description | Example |
|---|---|---|
| MYSQL_CONNECTION_STRING | MySQL database connection | mysql://user:pass@host/db |
| CUSTOM_KG_ENDPOINT | Custom KG SPARQL endpoint | https://example.com/sparql |
| WORKSPACE | Base path for table file scanning | /path/to/workspace |
Setup
# Copy example env file
cp .env.example .env
# Edit with your credentials
nano .env
# Load environment
export $(cat .env | xargs)
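In code, these variables are typically read with os.environ when assembling llm_config, e.g. (a minimal sketch):

import os

llm_config = {
    "model": "deepseek-chat",
    "api_key": os.environ["DEEPSEEK_API_KEY"],  # raises KeyError early if the key is missing
    "base_url": "https://api.deepseek.com/v1",
}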
Installation
cd HeteroMind
pip install -r requirements.txt
Requirements
- Python 3.10+
- aiohttp, pandas, openpyxl
- OpenAI-compatible API key (optional)
Project Structure
HeteroMind/
├── src/
│ ├── classifier/ # 4-layer source detection
│ │ ├── rule_detector.py # Layer 1
│ │ ├── llm_detector.py # Layer 2
│ │ ├── sql_schema_matcher.py # Layer 3a
│ │ ├── kg_entity_linker.py # Layer 3b
│ │ ├── entity_verifier.py # Layer 3c
│ │ └── source_fusion.py # Layer 4
│ ├── engines/ # Query engines
│ │ ├── nl2sql/
│ │ ├── nl2sparql/
│ │ └── table_qa/
│ ├── decomposer/ # Task decomposition
│ ├── fusion/ # Result fusion
│ ├── generator/ # Answer generation
│ └── orchestrator.py # Main orchestrator
├── config/
│ └── source_detection.yaml
├── tests/
│ └── test_data/
├── comprehensive_tests.py
└── SKILL.md
Examples
SQL: Aggregation with Filter
Query: "How many employees are in the Engineering department?"
Generated SQL:
SELECT COUNT(*) FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE d.name = 'Engineering'
SPARQL: Entity Relationship
Query: "Who is the founder of Microsoft?"
Generated SPARQL:
SELECT ?founder WHERE {
    <http://dbpedia.org/resource/Microsoft>
        <http://dbpedia.org/ontology/founder> ?founder
}
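To sanity-check a generated query by hand, it can be run directly against the endpoint, for example with the SPARQLWrapper library (shown only for manual verification; it is not a HeteroMind dependency):

from SPARQLWrapper import SPARQLWrapper, JSON

# Run the generated query against the public DBpedia endpoint.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?founder WHERE {
        <http://dbpedia.org/resource/Microsoft>
            <http://dbpedia.org/ontology/founder> ?founder
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["founder"]["value"])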
TableQA: Aggregation
Query: "Which quarter had the highest sales in 2024?"
Generated Code:
result = df.groupby('quarter')['sales'].sum().idxmax()
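Here df is the DataFrame the engine loads from the table file. A self-contained version, assuming data/sales.csv has year, quarter, and sales columns and adding the 2024 filter implied by the question:

import pandas as pd

df = pd.read_csv("data/sales.csv")  # columns assumed: year, quarter, sales
result = df[df["year"] == 2024].groupby("quarter")["sales"].sum().idxmax()
print(result)  # quarter label with the highest total 2024 sales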
Skill Contract
Skills that use HeteroMind should declare:
heteromind:
  reads: [Database Schema, KG Ontology, Table Files]
  writes: [Generated SQL, SPARQL, Pandas Code]
  requires:
    - LLM API key (for generation stages)
    - Schema metadata (for source detection)
  postconditions:
    - Generated query passes validation
    - Result verified for reasonableness
Integration Patterns
With Agent Memory
Log query execution for audit:
from src.orchestrator import HeteroMindOrchestrator

orchestrator = HeteroMindOrchestrator(config)
response = await orchestrator.query(query)

# Log to agent memory
memory.record({
    "action": "knowledge_query",
    "query": query,
    "source": response.sources,
    "confidence": response.confidence,
    "answer": response.answer,
})
Multi-Source Fusion
For queries requiring multiple sources:
# Query automatically detects hybrid need
response = await orchestrator.query(
    "Show employees who published papers"
)
# Routes to: SQL (employees) + KG (papers) + Fusion
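Conceptually, the decomposer splits such a query per source and the fusion module merges the partial results on the shared entity. The snippet below is only an illustration of that merge with made-up data, not the fusion module's actual API:

import pandas as pd

# SQL side returns employees, KG side returns paper authors;
# fusion joins them on the shared entity (the person's name).
employees = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"], "department": ["Engineering", "Research"]})
papers = pd.DataFrame({"author": ["Alan Turing"], "paper": ["On Computable Numbers"]})
fused = employees.merge(papers, left_on="name", right_on="author", how="inner")
print(fused[["name", "department", "paper"]])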
References
- README.md — Full documentation and API reference
- USAGE.md — Detailed usage guide with multi-LLM examples
- config/source_detection.yaml — Detection configuration
- tests/test_data/ — Example schemas and test data
Version: 0.1.0
Last Updated: 2026-04-12
Test Coverage: 100.0% accuracy on 8 test cases