# Data Python Skill

Version: 1.0 | Stack: Python (pandas, polars, pyspark)
Python makes it easy to write data processing code that works on sample data and fails on real data. `iterrows()` takes 30 seconds on 10K rows and 30 minutes on 10M. A DataFrame without explicit dtypes uses 8x the memory it needs. Chained indexing creates silent copies that lose your changes. These aren't edge cases — they're the default behavior of pandas when you write it like regular Python.
Vectorized operations, explicit schemas, and proper dtypes mean your code scales from prototype to production without rewriting.
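To make the dtype claim concrete, here is a minimal sketch comparing a default read against an explicit-dtype read; the file name and column names are hypothetical:

```python
import pandas as pd

# Default read: pandas infers int64/float64/object for every column.
df = pd.read_csv("events.csv")  # hypothetical file
print(df.memory_usage(deep=True).sum())

# Same file with explicit dtypes: typically a fraction of the footprint.
df = pd.read_csv(
    "events.csv",
    dtype={"user_id": "int32", "status": "category", "score": "float32"},
)
print(df.memory_usage(deep=True).sum())
```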
## Scope and Boundaries

This skill covers:

- DataFrame patterns (pandas, polars, pyspark)
- Data type handling and validation
- Memory-efficient processing
- Vectorized operations over loops
- Method chaining patterns
- Error handling for data pipelines
Defers to other skills:

- `code-quality`: general code structure, testing, naming
- `data-sql`: query patterns when using SQL interfaces
- `data-pipelines`: orchestration and ETL architecture

Use this skill when: writing Python code that processes data.
## Core Principles

- **Vectorize, Don't Loop** — Use DataFrame operations, not row iteration (see the sketch after this list).
- **Fail Fast on Bad Data** — Validate early, reject invalid data at boundaries.
- **Memory Awareness** — Know your data size, use appropriate dtypes.
- **Immutable Transforms** — Chain operations, don't mutate in place.
- **Explicit Schemas** — Define expected columns and types upfront.
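A minimal illustration of the first principle; the columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"quantity": [2, 5, 1], "price": [9.99, 3.50, 20.00]})

# Slow: a Python-level loop that touches one row at a time.
totals = []
for _, row in df.iterrows():
    totals.append(row["quantity"] * row["price"])

# Fast: one vectorized expression over whole columns.
df["total"] = df["quantity"] * df["price"]
```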
## Patterns

### Method Chaining

```python
# Good - readable pipeline
result = (
    df
    .query("status == 'active'")
    .assign(total=lambda x: x["quantity"] * x["price"])
    .groupby("category")
    .agg({"total": "sum"})
    .sort_values("total", ascending=False)
)

# Bad - intermediate variables obscure flow
filtered = df[df["status"] == "active"]
filtered["total"] = filtered["quantity"] * filtered["price"]
grouped = filtered.groupby("category")
result = grouped.agg({"total": "sum"})
result = result.sort_values("total", ascending=False)
```
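The same pipeline translates nearly one-to-one to polars. A rough sketch, assuming the same column names and a hypothetical input file:

```python
import polars as pl

result = (
    pl.scan_csv("orders.csv")  # hypothetical file; lazy scan, nothing loads yet
    .filter(pl.col("status") == "active")
    .with_columns((pl.col("quantity") * pl.col("price")).alias("total"))
    .group_by("category")
    .agg(pl.col("total").sum())
    .sort("total", descending=True)
    .collect()
)
```

Because the scan is lazy, polars can push the status filter down into the read, so inactive rows never materialize.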
### Schema Validation

```python
import pandas as pd

EXPECTED_COLUMNS = {"id", "name", "value", "timestamp"}
REQUIRED_COLUMNS = {"id", "value"}


def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Reject frames missing required columns; drop unexpected ones."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return df[list(EXPECTED_COLUMNS & set(df.columns))]
```
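Called at the boundary, immediately after the read (file name hypothetical):

```python
df = validate_schema(pd.read_csv("input.csv"))
```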
### Type Optimization

```python
import pandas as pd


def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric types to reduce memory."""
    df = df.copy()  # keep the transform immutable; don't mutate the caller's frame
    for col in df.select_dtypes(include=["int"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df
```
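To verify the effect, compare deep memory usage before and after; illustrative only:

```python
before = df.memory_usage(deep=True).sum()
df = optimize_dtypes(df)
after = df.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")
```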
## Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| `for row in df.iterrows()` | Slow, defeats vectorization | Use vectorized operations |
| `df["col"] = df.apply(...)` | Usually slower than vectorized | Use `np.where` or `df.assign` |
| Chained indexing `df[mask]["a"] = x` | `SettingWithCopyWarning`, writes silently lost | Use `df.loc[mask, "a"] = x` |
| Loading entire file to check schema | Wastes memory | Use `nrows=100` or chunking |
| Ignoring dtypes on read | Memory bloat | Specify `dtype=` parameter |
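The `apply` row's fix, sketched with hypothetical columns:

```python
import numpy as np

# Slow: apply calls a Python lambda once per row.
df["size"] = df.apply(lambda r: "big" if r["value"] > 100 else "small", axis=1)

# Fast: np.where evaluates the condition on the whole column at once.
df["size"] = np.where(df["value"] > 100, "big", "small")
```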
## Checklist

- No `iterrows()` or `itertuples()` for computation
- Explicit dtypes on file reads (sketched below)
- Schema validation at boundaries
- Method chaining for transforms
- Memory profiled for large datasets
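One way to satisfy the explicit-dtypes item, with a cheap schema peek first; the file and columns are hypothetical:

```python
import pandas as pd

# Peek at a small sample instead of loading the whole file.
sample = pd.read_csv("big.csv", nrows=100)
print(sample.dtypes)

# Then read in full with explicit dtypes and parsed dates.
df = pd.read_csv(
    "big.csv",
    dtype={"id": "int32", "value": "float32", "status": "category"},
    parse_dates=["timestamp"],
)
```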
## References

- `references/vectorization.md` — Vectorized operations and performance
- `references/memory-optimization.md` — Memory optimization techniques

## Assets

- `assets/pandas-cheatsheet.md` — Quick reference for pandas operations