# Multi Source Data Merger

## Overview

This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files, including conflict reports.
## Workflow

### Step 1: Analyze Requirements and Source Files

Before writing any code, thoroughly understand the task:

- Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
- Determine the merge key (e.g., `user_id`, `record_id`) that links records across sources
- Review field mapping requirements: source fields may have different names that map to common output fields
- Understand conflict resolution rules, typically based on source priority ordering
- Identify expected output formats and structure

**Important:** Do not attempt to read binary formats (Parquet, Excel, etc.) as text files; use appropriate libraries.
### Step 2: Set Up Environment

- Create a Python virtual environment using `uv` or `venv`
- Install the required dependencies based on the source formats:
  - `pandas`: core data manipulation
  - `pyarrow`: Parquet file support
  - `openpyxl`: Excel file support
  - `lxml`: XML parsing (if needed)
- Verify installations before proceeding

Example environment setup:

```bash
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
```
### Step 3: Write the Merge Script

Structure the script with clear separation of concerns:

- **Data reading functions**: one per format type
- **Field mapping function**: apply column renames
- **Data normalization**: handle date formats and type conversions
- **Merge logic**: combine records using the merge key
- **Conflict resolution**: apply priority rules
- **Output generation**: write merged data and conflict reports

Script quality practices:

- Validate syntax before execution: `python -m py_compile script.py`
- Use try-except blocks with informative error messages
- Document assumptions about data formats
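As a sketch, the pieces above might fit together as follows. The source file names, merge key, and priority order here are illustrative placeholders, and the "higher priority overwrites lower" merge strategy is one reasonable choice, not the only one:

```python
import pandas as pd

# Illustrative configuration -- adjust file names, key, and priority to the task.
SOURCES = {"source_a": "source_a.json", "source_b": "source_b.csv"}
PRIORITY = ["source_a", "source_b"]  # highest priority first
MERGE_KEY = "user_id"

def read_source(path):
    """Read one source into a DataFrame, dispatching on file extension."""
    if path.endswith(".json"):
        return pd.read_json(path)
    if path.endswith(".csv"):
        return pd.read_csv(path)
    if path.endswith(".parquet"):
        return pd.read_parquet(path)  # requires pyarrow
    raise ValueError(f"Unsupported format: {path}")

def merge_sources():
    """Merge all sources; higher-priority values overwrite lower ones."""
    records = {}
    for name in reversed(PRIORITY):  # process lowest priority first...
        df = read_source(SOURCES[name])
        for row in df.to_dict(orient="records"):
            key = str(row[MERGE_KEY])  # normalize key type across sources
            records.setdefault(key, {}).update(row)  # ...so higher priority wins
    return list(records.values())
```

Keeping each concern in its own function makes it straightforward to add a new source format or change the conflict policy without touching the rest of the script.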
### Step 4: Execute and Verify

Run a comprehensive verification process:

- Check that output files exist at the expected locations
- Validate that the merged data contains the expected values
- Verify the conflict report's structure and content
- Run any provided test suites
## Common Pitfalls

### Binary File Handling

- **Mistake:** Attempting to read Parquet/Excel files as text
- **Solution:** Always use pandas with the appropriate engine (`pyarrow` for Parquet, `openpyxl` for Excel)
### Syntax Errors in Scripts

- **Mistake:** Writing long scripts without validation, leading to indentation or syntax errors
- **Solution:** Run `python -m py_compile script.py` before execution

### Date Format Normalization

- **Mistake:** Assuming consistent date formats across sources
- **Solution:** Implement flexible date parsing that handles multiple formats:
  - ISO format: `2024-01-15`
  - US format: `01/15/2024`
  - European format: `15-01-2024`
  - Datetime: `2024-01-15T10:30:00`
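One way to implement this, covering exactly the four formats listed above (a sketch; the format list, and how US-versus-European ambiguity is resolved, should be adapted to the actual sources):

```python
from datetime import datetime

# Formats tried in order. When US and European dates share a separator the
# order decides, so validate against known sample values from each source.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # datetime: 2024-01-15T10:30:00
    "%Y-%m-%d",           # ISO: 2024-01-15
    "%m/%d/%Y",           # US: 01/15/2024
    "%d-%m-%Y",           # European: 15-01-2024
]

def normalize_date(value):
    """Parse a date string in any known format; return an ISO date or None."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value).strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # or collect unparseable values for a warning report
```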
### Incomplete Script Output

- **Mistake:** Writing very long scripts that may get truncated
- **Solution:** Break the script into modular functions and verify that the complete code is visible

### Environment Path Issues

- **Mistake:** Repeating `PATH` exports in every command
- **Solution:** Set `PATH` once in a setup step, or use absolute paths to executables
## Verification Strategies

### Output Validation Checklist

File existence check:

```python
import os

assert os.path.exists("output/merged_data.json")
assert os.path.exists("output/conflict_report.json")
```

Data completeness check:

```python
import json

with open("output/merged_data.json") as f:
    data = json.load(f)

# Verify expected record count
assert len(data) == expected_count
```

Conflict report validation:

```python
with open("output/conflict_report.json") as f:
    conflicts = json.load(f)

# Verify conflict structure has required fields
for conflict in conflicts:
    assert "field" in conflict
    assert "selected" in conflict
    assert "sources" in conflict
```

Sample value verification:

```python
# Spot-check specific merged records
record = next(r for r in data if r["user_id"] == "expected_id")
assert record["field_name"] == "expected_value"
```

### Consolidate Verification

Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
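A consolidated checker might look like this, folding the checklist above into one function (paths and required conflict keys follow the checklist; task-specific value checks can be appended):

```python
import json
import os

REQUIRED_CONFLICT_KEYS = ("field", "selected", "sources")

def verify(merged_path="output/merged_data.json",
           report_path="output/conflict_report.json"):
    """Run every output check in one pass; print failures, return pass/fail."""
    failures = []

    # 1. File existence
    for path in (merged_path, report_path):
        if not os.path.exists(path):
            failures.append(f"missing file: {path}")

    if not failures:
        with open(merged_path) as f:
            data = json.load(f)
        with open(report_path) as f:
            conflicts = json.load(f)

        # 2. Completeness
        if not data:
            failures.append("merged data is empty")

        # 3. Conflict report structure
        for i, conflict in enumerate(conflicts):
            for key in REQUIRED_CONFLICT_KEYS:
                if key not in conflict:
                    failures.append(f"conflict {i} missing key: {key}")

    for msg in failures:
        print(f"FAIL: {msg}")
    return not failures
```

Running one script instead of many commands keeps the pass/fail signal in a single place and avoids re-reading the output files repeatedly.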
## Edge Cases to Consider

- **Empty source files**: handle gracefully with appropriate warnings
- **Missing merge keys**: decide whether to skip the record or raise an error
- **Type mismatches**: convert consistently (e.g., `user_id` as string vs. integer)
- **Null/None values**: determine how they are handled in conflict resolution
- **Unicode/encoding**: specify the encoding when reading text-based formats
- **Records present in some sources but not others**: include partial records or require complete matches
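A sketch of merge-key normalization that addresses the missing-key and type-mismatch cases above (the helper name and the skip-versus-error decision are illustrative and task-dependent):

```python
def normalize_key(record, key="user_id"):
    """Return the merge key as a stripped string, or None if absent/empty."""
    value = record.get(key)
    if value is None or value == "":
        return None  # caller decides whether to skip or raise
    return str(value).strip()

# Example: one integer key, one string key, one missing key.
records = [
    {"user_id": 42, "name": "Ada"},       # integer key
    {"user_id": "42", "city": "London"},  # same key, but a string
    {"name": "orphan"},                   # missing key
]

keyed, skipped = {}, 0
for rec in records:
    k = normalize_key(rec)
    if k is None:
        skipped += 1  # or: raise ValueError("missing merge key")
        continue
    keyed.setdefault(k, {}).update(rec)  # 42 and "42" now merge correctly
```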
## Field Mapping Example

When sources have different field names for the same concept:

```python
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email",
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email",
    },
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
```
## Conflict Resolution Pattern

When the same field has different values across sources:

```python
def resolve_conflict(values_by_source, priority_order):
    """Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)
    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority",
        }
    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info
    return None, conflict_info
```