# Multi Source Data Merger

## Overview

This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files, including conflict reports.
## Workflow

### Step 1: Analyze Requirements and Source Files

Before writing any code, thoroughly understand the task:

- Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
- Determine the merge key (e.g., `user_id`, `record_id`) that links records across sources
- Review field mapping requirements: source fields may have different names that map to common output fields
- Understand conflict resolution rules, typically based on source priority ordering
- Identify expected output formats and structure

**Important:** Do not attempt to read binary formats (Parquet, Excel, etc.) as text files; use appropriate libraries.
### Step 2: Set Up Environment

- Create a Python virtual environment using `uv` or `venv`
- Install the required dependencies based on the source formats:
  - `pandas`: core data manipulation
  - `pyarrow`: Parquet file support
  - `openpyxl`: Excel file support
  - `lxml`: XML parsing (if needed)
- Verify installations before proceeding

Example environment setup:

```bash
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
```
### Step 3: Write the Merge Script

Structure the script with clear separation of concerns:

- **Data reading functions**: one per format type
- **Field mapping function**: apply column renames
- **Data normalization**: handle date formats and type conversions
- **Merge logic**: combine records using the merge key
- **Conflict resolution**: apply priority rules
- **Output generation**: write merged data and conflict reports

Script quality practices:

- Validate syntax before execution: `python -m py_compile script.py`
- Use try-except blocks with informative error messages
- Document assumptions about data formats
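As a sketch, the pieces above might fit together as follows. The source file names, merge key, and priority order here are illustrative placeholders, and the "higher priority overwrites lower" merge strategy is one reasonable choice, not the only one:

```python
import pandas as pd

# Illustrative configuration -- adjust file names, key, and priority to the task.
SOURCES = {"source_a": "source_a.json", "source_b": "source_b.csv"}
PRIORITY = ["source_a", "source_b"]  # highest priority first
MERGE_KEY = "user_id"

def read_source(path):
    """Read one source into a DataFrame, dispatching on file extension."""
    if path.endswith(".json"):
        return pd.read_json(path)
    if path.endswith(".csv"):
        return pd.read_csv(path)
    if path.endswith(".parquet"):
        return pd.read_parquet(path)  # requires pyarrow
    raise ValueError(f"Unsupported format: {path}")

def merge_sources():
    """Merge all sources; higher-priority values overwrite lower ones."""
    records = {}
    for name in reversed(PRIORITY):  # process lowest priority first...
        df = read_source(SOURCES[name])
        for row in df.to_dict(orient="records"):
            key = str(row[MERGE_KEY])  # normalize key type across sources
            records.setdefault(key, {}).update(row)  # ...so higher priority wins
    return list(records.values())
```

Keeping each concern in its own function makes it straightforward to add a new source format or change the conflict policy without touching the rest of the script.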
### Step 4: Execute and Verify

Run a comprehensive verification process:

- Check that output files exist at the expected locations
- Validate that the merged data contains the expected values
- Verify the conflict report's structure and content
- Run any provided test suites
## Common Pitfalls

### Binary File Handling

- **Mistake:** Attempting to read Parquet/Excel files as text
- **Solution:** Always use pandas with the appropriate engine (`pyarrow` for Parquet, `openpyxl` for Excel)
### Syntax Errors in Scripts

- **Mistake:** Writing long scripts without validation, leading to indentation or syntax errors
- **Solution:** Run `python -m py_compile script.py` before execution

### Date Format Normalization

- **Mistake:** Assuming consistent date formats across sources
- **Solution:** Implement flexible date parsing that handles multiple formats:
  - ISO format: `2024-01-15`
  - US format: `01/15/2024`
  - European format: `15-01-2024`
  - Datetime: `2024-01-15T10:30:00`
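One way to implement this, covering exactly the four formats listed above (a sketch; the format list, and how US-versus-European ambiguity is resolved, should be adapted to the actual sources):

```python
from datetime import datetime

# Formats tried in order. When US and European dates share a separator the
# order decides, so validate against known sample values from each source.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # datetime: 2024-01-15T10:30:00
    "%Y-%m-%d",           # ISO: 2024-01-15
    "%m/%d/%Y",           # US: 01/15/2024
    "%d-%m-%Y",           # European: 15-01-2024
]

def normalize_date(value):
    """Parse a date string in any known format; return an ISO date or None."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value).strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # or collect unparseable values for a warning report
```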
### Incomplete Script Output

- **Mistake:** Writing very long scripts that may get truncated
- **Solution:** Break the script into modular functions and verify that the complete code is visible

### Environment Path Issues

- **Mistake:** Repeating `PATH` exports in every command
- **Solution:** Set `PATH` once in a setup step, or use absolute paths to executables
## Verification Strategies

### Output Validation Checklist

File existence check:

```python
import os

assert os.path.exists("output/merged_data.json")
assert os.path.exists("output/conflict_report.json")
```

Data completeness check:

```python
import json

with open("output/merged_data.json") as f:
    data = json.load(f)

# Verify expected record count
assert len(data) == expected_count
```

Conflict report validation:

```python
with open("output/conflict_report.json") as f:
    conflicts = json.load(f)

# Verify conflict structure has required fields
for conflict in conflicts:
    assert "field" in conflict
    assert "selected" in conflict
    assert "sources" in conflict
```

Sample value verification:

```python
# Spot-check specific merged records
record = next(r for r in data if r["user_id"] == "expected_id")
assert record["field_name"] == "expected_value"
```

### Consolidate Verification

Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
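A consolidated checker might look like this, folding the checklist above into one function (paths and required conflict keys follow the checklist; task-specific value checks can be appended):

```python
import json
import os

REQUIRED_CONFLICT_KEYS = ("field", "selected", "sources")

def verify(merged_path="output/merged_data.json",
           report_path="output/conflict_report.json"):
    """Run every output check in one pass; print failures, return pass/fail."""
    failures = []

    # 1. File existence
    for path in (merged_path, report_path):
        if not os.path.exists(path):
            failures.append(f"missing file: {path}")

    if not failures:
        with open(merged_path) as f:
            data = json.load(f)
        with open(report_path) as f:
            conflicts = json.load(f)

        # 2. Completeness
        if not data:
            failures.append("merged data is empty")

        # 3. Conflict report structure
        for i, conflict in enumerate(conflicts):
            for key in REQUIRED_CONFLICT_KEYS:
                if key not in conflict:
                    failures.append(f"conflict {i} missing key: {key}")

    for msg in failures:
        print(f"FAIL: {msg}")
    return not failures
```

Running one script instead of many commands keeps the pass/fail signal in a single place and avoids re-reading the output files repeatedly.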
## Edge Cases to Consider

- **Empty source files**: handle gracefully with appropriate warnings
- **Missing merge keys**: decide whether to skip the record or raise an error
- **Type mismatches**: convert consistently (e.g., `user_id` as string vs. integer)
- **Null/None values**: determine how they are handled in conflict resolution
- **Unicode/encoding**: specify the encoding when reading text-based formats
- **Records present in some sources but not others**: include partial records or require complete matches
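A sketch of merge-key normalization that addresses the missing-key and type-mismatch cases above (the helper name and the skip-versus-error decision are illustrative and task-dependent):

```python
def normalize_key(record, key="user_id"):
    """Return the merge key as a stripped string, or None if absent/empty."""
    value = record.get(key)
    if value is None or value == "":
        return None  # caller decides whether to skip or raise
    return str(value).strip()

# Example: one integer key, one string key, one missing key.
records = [
    {"user_id": 42, "name": "Ada"},       # integer key
    {"user_id": "42", "city": "London"},  # same key, but a string
    {"name": "orphan"},                   # missing key
]

keyed, skipped = {}, 0
for rec in records:
    k = normalize_key(rec)
    if k is None:
        skipped += 1  # or: raise ValueError("missing merge key")
        continue
    keyed.setdefault(k, {}).update(rec)  # 42 and "42" now merge correctly
```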
## Field Mapping Example

When sources have different field names for the same concept:

```python
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email",
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email",
    },
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
```
## Conflict Resolution Pattern

When the same field has different values across sources:

```python
def resolve_conflict(values_by_source, priority_order):
    """Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)
    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority",
        }
    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info
    return None, conflict_info
```