Semantic Search in DDC CWICR Database
Business Case
Problem Statement
Construction cost estimation requires finding relevant work items from large databases. Traditional keyword search fails when:
- Users describe work in natural language
- Terminology varies across regions and languages
- Similar work items have different naming conventions
Solution
The DDC CWICR database provides pre-computed embeddings (OpenAI text-embedding-3-large, 3072 dimensions), enabling semantic similarity search across 55,719 work items in 9 languages.
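To make the idea concrete, here is a minimal sketch of the ranking step a vector database performs. It assumes cosine similarity as the metric (the source does not state which distance the collection is configured with), and the three-dimensional vectors and item labels are toy values for illustration, not real CWICR embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, items):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for 3072-dimensional embeddings.
toy_items = {
    "brick wall masonry": [0.9, 0.1, 0.0],
    "concrete foundation": [0.1, 0.9, 0.1],
    "roof tiling": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # hypothetical embedding of "build a brick wall"

ranking = rank_by_similarity(toy_query, toy_items)
print(ranking[0][0])  # brick wall masonry
```

The query never has to share a keyword with the best match; only the direction of the two vectors matters.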
Business Value
- 90% faster work item lookup compared to manual search
- Multi-language support: Arabic, Chinese, German, English, Spanish, French, Hindi, Portuguese, Russian
- Higher accuracy by finding semantically similar items, not just keyword matches
Technical Implementation
Prerequisites
pip install qdrant-client openai pandas
Database Setup
# Download Qdrant snapshot
# Start Qdrant with Docker
docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
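Once the container is running with the snapshot data loaded, it can help to confirm that the collection exists and that its vector size matches the 3072-dimension embeddings before issuing queries. The following is a minimal sketch, not part of the DDC CWICR tooling: `check_collection` is a hypothetical helper, and it assumes the collection uses a single unnamed vector, so that `get_collection(...).config.params.vectors.size` is available.

```python
def check_collection(client, collection_name="ddc_cwicr_en", expected_dim=3072):
    """Return the collection's point count, or raise if the vector size is unexpected.

    `client` is expected to be a qdrant_client.QdrantClient, or any object
    exposing the same get_collection method.
    """
    info = client.get_collection(collection_name)
    vectors = info.config.params.vectors
    # Assumption: a single unnamed vector config; named vectors arrive as a dict.
    if isinstance(vectors, dict):
        raise ValueError("Collection uses named vectors; inspect them individually.")
    if vectors.size != expected_dim:
        raise ValueError(
            f"Expected {expected_dim}-dimensional vectors, found {vectors.size}"
        )
    return info.points_count


# Usage against a local instance (requires a running Qdrant server):
# from qdrant_client import QdrantClient
# print(check_collection(QdrantClient(host="localhost", port=6333)))
```

A dimension mismatch caught here is cheaper to diagnose than a failed search later, since query vectors from a different embedding model will not fit the collection.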
Python Implementation
import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
import openai


class CWICRSemanticSearch:
    def __init__(self, qdrant_host: str = "localhost", port: int = 6333):
        self.client = QdrantClient(host=qdrant_host, port=port)
        self.collection_name = "ddc_cwicr_en"
        self.embedding_model = "text-embedding-3-large"
        self.embedding_dim = 3072

    def get_embedding(self, text: str) -> list:
        """Generate embedding for search query."""
        # Uses the module-level client (openai>=1.0); reads OPENAI_API_KEY from the environment.
        response = openai.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding

    def search_work_items(self, query: str, limit: int = 10,
                          min_score: float = 0.7) -> pd.DataFrame:
        """Search for similar work items."""
        query_vector = self.get_embedding(query)
        # Note: recent qdrant-client releases deprecate `search` in favor of `query_points`.
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            limit=limit,
            score_threshold=min_score
        )
        items = []
        for result in results:
            # Copy the payload so the client's result object is not mutated.
            item = dict(result.payload or {})
            item['similarity_score'] = result.score
            items.append(item)
        return pd.DataFrame(items)

    def search_by_category(self, query: str, category: str,
                           limit: int = 10) -> pd.DataFrame:
        """Search within specific category."""
        query_vector = self.get_embedding(query)
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            # Typed filter: keep only points whose `category` payload equals `category`.
            query_filter=Filter(
                must=[FieldCondition(key="category", match=MatchValue(value=category))]
            ),
            limit=limit
        )
        return pd.DataFrame([{**(r.payload or {}), 'score': r.score} for r in results])

    def estimate_cost(self, work_items: pd.DataFrame,
                      quantities: dict) -> dict:
        """Calculate cost from matched work items."""
        total_cost = 0
        breakdown = []
        for _, item in work_items.iterrows():
            # Only items with a supplied quantity contribute to the total.
            if item['work_item_code'] in quantities:
                qty = quantities[item['work_item_code']]
                cost = qty * item.get('unit_price', 0)
                total_cost += cost
                breakdown.append({
                    'item': item['description'],
                    'quantity': qty,
                    'unit_price': item.get('unit_price', 0),
                    'total': cost
                })
        return {
            'total_cost': total_cost,
            'breakdown': breakdown,
            'currency': 'Regional default'
        }
Usage Examples
Basic Search
search = CWICRSemanticSearch()

# Natural language query
results = search.search_work_items("brick masonry wall construction")
print(results[['description', 'unit', 'unit_price', 'similarity_score']])
Cost Estimation
# Find work items for foundation work
foundation_items = search.search_work_items(
    "reinforced concrete foundation excavation and pouring",
    limit=20
)

# Estimate with quantities
quantities = {
    'CONC-001': 150,  # cubic meters
    'EXCV-002': 200,  # cubic meters
}
estimate = search.estimate_cost(foundation_items, quantities)
print(f"Estimated Cost: ${estimate['total_cost']:,.2f}")
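The arithmetic behind `estimate_cost` is a sum of quantity times unit price over the matched items. The sketch below reproduces it with plain dictionaries so it can be checked without a database. The unit prices (120.0 and 15.5) and the code `FORM-003` are made-up illustrative values, not CWICR data, and `roll_up_cost` is a hypothetical helper.

```python
def roll_up_cost(unit_prices, quantities):
    """Sum quantity * unit_price over codes present in both mappings."""
    breakdown = {
        code: qty * unit_prices[code]
        for code, qty in quantities.items()
        if code in unit_prices
    }
    # Codes with a quantity but no matched work item are reported, not silently dropped.
    unmatched = sorted(code for code in quantities if code not in unit_prices)
    return sum(breakdown.values()), breakdown, unmatched


# Hypothetical unit prices per cubic meter, for illustration only.
unit_prices = {"CONC-001": 120.0, "EXCV-002": 15.5}
quantities = {"CONC-001": 150, "EXCV-002": 200, "FORM-003": 80}

total, breakdown, unmatched = roll_up_cost(unit_prices, quantities)
print(total)      # 21100.0
print(unmatched)  # ['FORM-003']
```

Reporting unmatched codes matters because `estimate_cost` skips any quantity whose code is absent from the search results, which would otherwise understate the total without warning.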
Database Schema
| Field | Type | Description |
|-------|------|-------------|
| work_item_code | string | Unique identifier |
| description | string | Work item description |
| unit | string | Measurement unit |
| labor_norm | float | Labor hours per unit |
| material_cost | float | Material cost per unit |
| equipment_cost | float | Equipment cost per unit |
| unit_price | float | Total price per unit |
| category | string | Work category |
| embedding | vector[3072] | Pre-computed embedding |
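When payloads are passed around in application code, a typed record can catch missing or malformed fields early. The following sketch mirrors the schema with a dataclass. `WorkItem` and `from_payload` are illustrative names, the sample payload values are invented, and the embedding is left out on the assumption that it is stored as the point's vector rather than in the payload.

```python
from dataclasses import dataclass, fields

# Fields the schema lists as float; everything else is treated as a string.
NUMERIC_FIELDS = ("labor_norm", "material_cost", "equipment_cost", "unit_price")


@dataclass(frozen=True)
class WorkItem:
    work_item_code: str
    description: str
    unit: str
    labor_norm: float
    material_cost: float
    equipment_cost: float
    unit_price: float
    category: str

    @classmethod
    def from_payload(cls, payload):
        """Build a WorkItem from a payload dict, coercing numeric fields to float."""
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in payload]
        if missing:
            raise KeyError(f"Payload is missing fields: {missing}")
        values = {
            name: float(payload[name]) if name in NUMERIC_FIELDS else str(payload[name])
            for name in names
        }
        return cls(**values)


# Invented sample payload for illustration.
sample = {
    "work_item_code": "CONC-001",
    "description": "Reinforced concrete foundation",
    "unit": "m3",
    "labor_norm": "2.5",
    "material_cost": 80,
    "equipment_cost": 15,
    "unit_price": 120,
    "category": "Concrete",
}
item = WorkItem.from_payload(sample)
print(item.unit_price)  # 120.0
```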
Best Practices
- Use specific queries: "reinforced concrete slab 200mm" beats "concrete"
- Filter by category: narrow results to relevant work types
- Check similarity scores: scores below 0.7 may need manual verification
- Combine with QTO: use BIM quantities for automated estimation
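The score check can be automated as a simple triage step. This sketch splits result rows into those at or above a threshold and those flagged for manual review. The 0.7 default follows the guideline above, `triage_by_score` is a hypothetical helper, and the sample rows are invented.

```python
def triage_by_score(rows, threshold=0.7, score_key="similarity_score"):
    """Split rows into (accepted, needs_review) by similarity score.

    Rows without a score are treated as 0.0 and therefore sent to review.
    """
    accepted, needs_review = [], []
    for row in rows:
        if row.get(score_key, 0.0) >= threshold:
            accepted.append(row)
        else:
            needs_review.append(row)
    return accepted, needs_review


# Invented rows, shaped like DataFrame.to_dict("records") output.
rows = [
    {"description": "Brick wall, 250 mm", "similarity_score": 0.82},
    {"description": "Concrete block wall", "similarity_score": 0.64},
    {"description": "Stone cladding"},  # no score recorded
]
accepted, needs_review = triage_by_score(rows)
print(len(accepted), len(needs_review))  # 1 2
```

Because `search_work_items` already drops results below its `min_score` of 0.7, this step is mainly useful when that threshold is lowered, or with `search_by_category`, which applies no threshold and stores the score under `score` (pass `score_key="score"` in that case).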
Resources
- GitHub: OpenConstructionEstimate-DDC-CWICR
- Releases: v0.1.0 Database Downloads
- Qdrant Docs: https://qdrant.tech/documentation/