Pandas Construction Data Analysis
Overview
Based on DDC methodology (Chapter 2.3), this skill provides comprehensive Pandas operations for construction data processing. Pandas is the Swiss Army knife for data analysts - handling everything from simple data filtering to complex aggregations across millions of rows.
Book Reference: "Pandas DataFrame и LLM ChatGPT" / "Pandas DataFrame and LLM ChatGPT"
"Используя Pandas, вы можете управлять и анализировать наборы данных, намного превосходящие возможности Excel. В то время как Excel способен обрабатывать до 1 миллиона строк данных, Pandas может без труда работать с наборами данных, содержащими десятки миллионов строк." — DDC Book, Chapter 2.3
Quick Start
import pandas as pd
Read construction data
df = pd.read_excel("bim_export.xlsx")
Basic operations
print(df.head()) # First 5 rows print(df.info()) # Column types and memory print(df.describe()) # Statistics for numeric columns
Filter structural elements
structural = df[df['Category'] == 'Structural']
Calculate total volume
total_volume = df['Volume'].sum() print(f"Total volume: {total_volume:.2f} m³")
DataFrame Fundamentals
Creating DataFrames
import pandas as pd
From dictionary (construction elements)
elements = pd.DataFrame({ 'ElementId': ['E001', 'E002', 'E003', 'E004'], 'Category': ['Wall', 'Floor', 'Wall', 'Column'], 'Material': ['Concrete', 'Concrete', 'Brick', 'Steel'], 'Volume_m3': [45.5, 120.0, 32.0, 8.5], 'Level': ['Level 1', 'Level 1', 'Level 2', 'Level 1'] })
From CSV
df_csv = pd.read_csv("construction_data.csv")
From Excel
df_excel = pd.read_excel("project_data.xlsx", sheet_name="Elements")
From multiple Excel sheets
all_sheets = pd.read_excel("project.xlsx", sheet_name=None) # Dict of DataFrames
Data Types in Construction
Common data types for construction
df = pd.DataFrame({ 'element_id': pd.Series(['W001', 'W002'], dtype='string'), 'quantity': pd.Series([10, 20], dtype='int64'), 'volume': pd.Series([45.5, 32.0], dtype='float64'), 'is_structural': pd.Series([True, False], dtype='bool'), 'created_date': pd.to_datetime(['2024-01-15', '2024-01-16']), 'category': pd.Categorical(['Wall', 'Slab']) })
Check data types
print(df.dtypes)
Convert types
df['quantity'] = df['quantity'].astype('float64') df['volume'] = pd.to_numeric(df['volume'], errors='coerce')
Filtering and Selection
Basic Filtering
Single condition
walls = df[df['Category'] == 'Wall']
Multiple conditions (AND)
large_concrete = df[(df['Material'] == 'Concrete') & (df['Volume_m3'] > 50)]
Multiple conditions (OR)
walls_or_floors = df[(df['Category'] == 'Wall') | (df['Category'] == 'Floor')]
Using isin for multiple values
structural = df[df['Category'].isin(['Wall', 'Column', 'Beam', 'Foundation'])]
String contains
insulated = df[df['Description'].str.contains('insulated', case=False, na=False)]
Null value filtering
incomplete = df[df['Cost'].isna()] complete = df[df['Cost'].notna()]
Advanced Selection
Select columns
volumes = df[['ElementId', 'Category', 'Volume_m3']]
Query syntax (SQL-like)
result = df.query("Category == 'Wall' and Volume_m3 > 30")
Loc and iloc
specific_row = df.loc[0] # By label range_rows = df.iloc[0:10] # By position specific_cell = df.loc[0, 'Volume_m3'] # Row and column subset = df.loc[0:5, ['Category', 'Volume_m3']] # Range with columns
Grouping and Aggregation
GroupBy Operations
Basic groupby
by_category = df.groupby('Category')['Volume_m3'].sum()
Multiple aggregations
summary = df.groupby('Category').agg({ 'Volume_m3': ['sum', 'mean', 'count'], 'Cost': ['sum', 'mean'] })
Named aggregations (cleaner output)
summary = df.groupby('Category').agg( total_volume=('Volume_m3', 'sum'), avg_volume=('Volume_m3', 'mean'), element_count=('ElementId', 'count'), total_cost=('Cost', 'sum') ).reset_index()
Multiple grouping columns
by_level_cat = df.groupby(['Level', 'Category']).agg({ 'Volume_m3': 'sum', 'Cost': 'sum' }).reset_index()
Pivot Tables
Create pivot table
pivot = pd.pivot_table( df, values='Volume_m3', index='Level', columns='Category', aggfunc='sum', fill_value=0, margins=True, # Add totals margins_name='Total' )
Multiple values
pivot_detailed = pd.pivot_table( df, values=['Volume_m3', 'Cost'], index='Level', columns='Category', aggfunc={'Volume_m3': 'sum', 'Cost': 'mean'} )
Data Transformation
Adding Calculated Columns
Simple calculation
df['Cost_Total'] = df['Volume_m3'] * df['Unit_Price']
Conditional column
df['Size_Category'] = df['Volume_m3'].apply( lambda x: 'Large' if x > 50 else ('Medium' if x > 20 else 'Small') )
Using np.where for binary conditions
import numpy as np df['Is_Large'] = np.where(df['Volume_m3'] > 50, True, False)
Using cut for binning
df['Volume_Bin'] = pd.cut( df['Volume_m3'], bins=[0, 10, 50, 100, float('inf')], labels=['XS', 'S', 'M', 'L'] )
String Operations
Extract from strings
df['Level_Number'] = df['Level'].str.extract(r'(\d+)').astype(int)
Split and expand
df[['Building', 'Floor']] = df['Location'].str.split('-', expand=True)
Clean strings
df['Category'] = df['Category'].str.strip().str.lower().str.title()
Replace values
df['Material'] = df['Material'].str.replace('Reinforced Concrete', 'RC')
Date Operations
Parse dates
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
Extract components
df['Year'] = df['Start_Date'].dt.year df['Month'] = df['Start_Date'].dt.month df['Week'] = df['Start_Date'].dt.isocalendar().week df['DayOfWeek'] = df['Start_Date'].dt.day_name()
Calculate duration
df['Duration_Days'] = (df['End_Date'] - df['Start_Date']).dt.days
Filter by date range
recent = df[df['Start_Date'] >= '2024-01-01']
Merging and Joining
Merge DataFrames
Elements data
elements = pd.DataFrame({ 'ElementId': ['E001', 'E002', 'E003'], 'Category': ['Wall', 'Floor', 'Column'], 'Volume_m3': [45.5, 120.0, 8.5] })
Unit prices
prices = pd.DataFrame({ 'Category': ['Wall', 'Floor', 'Column', 'Beam'], 'Unit_Price': [150, 80, 450, 200] })
Inner join (only matching)
merged = elements.merge(prices, on='Category', how='inner')
Left join (keep all elements)
merged = elements.merge(prices, on='Category', how='left')
Join on different column names
result = df1.merge(df2, left_on='elem_id', right_on='ElementId')
Concatenating DataFrames
Vertical concatenation (stacking)
all_floors = pd.concat([floor1_df, floor2_df, floor3_df], ignore_index=True)
Horizontal concatenation
combined = pd.concat([quantities, costs, schedule], axis=1)
Append new rows
new_elements = pd.DataFrame({'ElementId': ['E004'], 'Category': ['Beam']}) df = pd.concat([df, new_elements], ignore_index=True)
Construction-Specific Analyses
Quantity Take-Off (QTO)
def generate_qto_report(df): """Generate Quantity Take-Off summary by category""" qto = df.groupby(['Category', 'Material']).agg( count=('ElementId', 'count'), total_volume=('Volume_m3', 'sum'), total_area=('Area_m2', 'sum'), avg_volume=('Volume_m3', 'mean') ).round(2)
# Add percentage column
qto['volume_pct'] = (qto['total_volume'] /
qto['total_volume'].sum() * 100).round(1)
return qto.sort_values('total_volume', ascending=False)
Usage
qto_report = generate_qto_report(df) qto_report.to_excel("qto_report.xlsx")
Cost Estimation
def calculate_project_cost(elements_df, prices_df, markup=0.15): """Calculate total project cost with markup""" # Merge with prices df = elements_df.merge(prices_df, on='Category', how='left')
# Calculate base cost
df['Base_Cost'] = df['Volume_m3'] * df['Unit_Price']
# Apply markup
df['Total_Cost'] = df['Base_Cost'] * (1 + markup)
# Summary by category
summary = df.groupby('Category').agg(
volume=('Volume_m3', 'sum'),
base_cost=('Base_Cost', 'sum'),
total_cost=('Total_Cost', 'sum')
).round(2)
return df, summary, summary['total_cost'].sum()
Usage
detailed, summary, total = calculate_project_cost(elements, prices) print(f"Project Total: ${total:,.2f}")
Material Summary
def material_summary(df): """Summarize materials across project""" summary = df.groupby('Material').agg({ 'Volume_m3': 'sum', 'Weight_kg': 'sum', 'ElementId': 'nunique' }).rename(columns={'ElementId': 'Element_Count'})
summary['Volume_Pct'] = (summary['Volume_m3'] /
summary['Volume_m3'].sum() * 100).round(1)
return summary.sort_values('Volume_m3', ascending=False)
Level-by-Level Analysis
def analyze_by_level(df): """Analyze construction quantities by building level""" level_summary = df.pivot_table( values=['Volume_m3', 'Cost'], index='Level', columns='Category', aggfunc='sum', fill_value=0 )
level_summary['Total_Volume'] = level_summary['Volume_m3'].sum(axis=1)
level_summary['Total_Cost'] = level_summary['Cost'].sum(axis=1)
return level_summary
Data Export
Export to Excel with Multiple Sheets
def export_to_excel_formatted(df, summary, filepath): """Export with multiple sheets""" with pd.ExcelWriter(filepath, engine='openpyxl') as writer: df.to_excel(writer, sheet_name='Details', index=False) summary.to_excel(writer, sheet_name='Summary')
pivot = pd.pivot_table(df, values='Volume_m3',
index='Level', columns='Category')
pivot.to_excel(writer, sheet_name='By_Level')
Usage
export_to_excel_formatted(elements, qto_summary, "project_report.xlsx")
Export to CSV
Basic export
df.to_csv("output.csv", index=False)
With encoding for special characters
df.to_csv("output.csv", index=False, encoding='utf-8-sig')
Specific columns
df[['ElementId', 'Category', 'Volume_m3']].to_csv("volumes.csv", index=False)
Performance Tips
Use categories for string columns with few unique values
df['Category'] = df['Category'].astype('category')
Read only needed columns
df = pd.read_csv("large_file.csv", usecols=['ElementId', 'Category', 'Volume'])
Use chunking for very large files
chunks = pd.read_csv("huge_file.csv", chunksize=100000) result = pd.concat([chunk[chunk['Category'] == 'Wall'] for chunk in chunks])
Check memory usage
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")
Quick Reference
Operation Code
Read Excel pd.read_excel("file.xlsx")
Read CSV pd.read_csv("file.csv")
Filter rows df[df['Column'] == 'Value']
Select columns df[['Col1', 'Col2']]
Group and sum df.groupby('Cat')['Vol'].sum()
Pivot table pd.pivot_table(df, values='Vol', index='Level')
Merge df1.merge(df2, on='key')
Add column df['New'] = df['A'] * df['B']
Export Excel df.to_excel("out.xlsx", index=False)
Resources
-
Book: "Data-Driven Construction" by Artem Boiko, Chapter 2.3
-
Website: https://datadrivenconstruction.io
-
Pandas Docs: https://pandas.pydata.org/docs/
Next Steps
-
See llm-data-automation for generating Pandas code with AI
-
See qto-report for specialized QTO calculations
-
See cost-estimation-resource for detailed cost calculations