Python Performance Profiling

When NOT to Use This Skill

Java/JVM profiling - Use the java-profiling skill for JFR and GC tuning
Node.js profiling - Use the nodejs-profiling skill for V8 profiler
NumPy/Pandas optimization - Use library-specific profiling tools and vectorization guides
Database query optimization - Use database-specific profiling tools
Web server performance - Use application-level profiling (Django Debug Toolbar, Flask-DebugToolbar)

Deep Knowledge: Use mcp__documentation__fetch_docs with technology: python for comprehensive profiling guides, optimization techniques, and best practices.

cProfile (CPU Profiling)

Command Line Usage

Profile entire script

python -m cProfile -o output.prof script.py

Sort by cumulative time

python -m cProfile -s cumtime script.py

Sort by total time in function

python -m cProfile -s tottime script.py

Analyze saved profile

python -m pstats output.prof

pstats Analysis

import pstats

Load and analyze profile

stats = pstats.Stats('output.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

Filter by module

stats.print_stats('mymodule')

Show callers

stats.print_callers('slow_function')

Show callees

stats.print_callees('main')

Programmatic Profiling

import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    profiler = cProfile.Profile()
    profiler.enable()

    result = func(*args, **kwargs)

    profiler.disable()

    # Analyze
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(10)
    print(stream.getvalue())

    return result

Context manager

from contextlib import contextmanager

@contextmanager
def profile_block(name='profile'):
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        profiler.dump_stats(f'{name}.prof')
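A context manager like this can also hand the report back to the caller instead of writing a file. The following is a minimal sketch of that variation; the names profiled, report, and build_squares are hypothetical and exist only for illustration.

```python
import cProfile
import pstats
from contextlib import contextmanager
from io import StringIO


@contextmanager
def profiled(top=10):
    """Profile the enclosed block and fill a dict with the text report."""
    report = {}
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield report
    finally:
        profiler.disable()
        stream = StringIO()
        stats = pstats.Stats(profiler, stream=stream)
        stats.strip_dirs().sort_stats('cumulative').print_stats(top)
        report['text'] = stream.getvalue()


def build_squares(n):
    # Hypothetical workload used only for illustration
    return [i * i for i in range(n)]


with profiled() as report:
    squares = build_squares(100_000)

print(report['text'])
```

The report is only available after the with block exits, because the statistics are collected in the finally clause.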

Memory Profiling

tracemalloc (Built-in)

import tracemalloc

Start tracking

tracemalloc.start()

Your code here

result = process_data()

Get snapshot

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in top_stats[:10]:
    print(stat)

Compare snapshots

snapshot1 = tracemalloc.take_snapshot()

... code ...

snapshot2 = tracemalloc.take_snapshot()

diff = snapshot2.compare_to(snapshot1, 'lineno')
for stat in diff[:10]:
    print(stat)

Stop tracking

tracemalloc.stop()
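Put together, the snapshot comparison might look like the sketch below. The allocate_rows function is a hypothetical workload, and the sizes reported will vary by Python version and platform.

```python
import tracemalloc


def allocate_rows(n):
    # Hypothetical workload: builds n small lists that stay alive
    return [[i] * 10 for i in range(n)]


tracemalloc.start()
before = tracemalloc.take_snapshot()

rows = allocate_rows(50_000)

after = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# compare_to() puts the largest changes first
diff = after.compare_to(before, 'lineno')
for stat in diff[:3]:
    print(stat)
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

Keeping rows alive until the second snapshot matters here: memory freed before the snapshot would not show up in the diff.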

memory_profiler (Line-by-line)

Install: pip install memory_profiler

from memory_profiler import profile

@profile
def my_function():
    a = [1] * 1_000_000
    b = [2] * 2_000_000
    del b
    return a

Command line usage

python -m memory_profiler script.py

Record memory usage over time, then plot it

mprof run script.py

mprof plot

objgraph (Object References)

Install: pip install objgraph

import objgraph

Most common types

objgraph.show_most_common_types(limit=20)

Growth since last call

objgraph.show_growth()

Find reference chain (memory leak detection)

objgraph.show_backrefs([leaked_object], filename='refs.png')

Line Profiler

Install: pip install line_profiler

Decorate functions to profile

# kernprof -l injects @profile into builtins, so no import is needed
@profile
def slow_function():
    total = 0
    for i in range(1000000):
        total += i
    return total

Run with: kernprof -l -v script.py

High-Resolution Timing

time Module

import time

Monotonic clock (best for measuring durations)

start = time.perf_counter()
result = do_work()
duration = time.perf_counter() - start
print(f"Duration: {duration:.4f}s")

Integer nanoseconds, avoids float precision loss (Python 3.7+)

start = time.perf_counter_ns()
result = do_work()
duration_ns = time.perf_counter_ns() - start
print(f"Duration: {duration_ns}ns")

timeit Module

import timeit

Time small code snippets

duration = timeit.timeit('sum(range(1000))', number=10000)
print(f"Average: {duration / 10000:.6f}s")

Compare implementations

setup = "data = list(range(10000))"
time1 = timeit.timeit('sum(data)', setup, number=1000)
time2 = timeit.timeit('sum(x for x in data)', setup, number=1000)
print(f"sum(): {time1:.4f}s, generator: {time2:.4f}s")
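Single timeit runs can be noisy. One common approach is timeit.repeat, taking the minimum of several runs. The sketch below applies it to two hypothetical functions, join_concat and plus_concat; the absolute numbers depend on the machine.

```python
import timeit


def join_concat(parts):
    return "".join(parts)


def plus_concat(parts):
    out = ""
    for p in parts:
        out += p
    return out


parts = ["x"] * 1_000  # hypothetical input

# repeat() returns one total per run; the minimum is the least noisy estimate
runs_join = timeit.repeat(lambda: join_concat(parts), repeat=5, number=200)
runs_plus = timeit.repeat(lambda: plus_concat(parts), repeat=5, number=200)

best_join = min(runs_join) / 200
best_plus = min(runs_plus) / 200
print(f"join: {best_join * 1e6:.1f} us/call, +=: {best_plus * 1e6:.1f} us/call")
```

Passing a callable instead of a string avoids the setup argument, at the cost of a small per-call overhead from the lambda.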

Common Bottleneck Patterns

List Operations

❌ Bad: Concatenating lists in loop

result = []
for item in items:
    result = result + [process(item)]  # O(n²)

✅ Good: Use append

result = []
for item in items:
    result.append(process(item))  # O(n)

✅ Better: List comprehension

result = [process(item) for item in items]

❌ Bad: Checking membership in list

if item in large_list:  # O(n)
    pass

✅ Good: Use set for membership

large_set = set(large_list)  # Build once, reuse for many lookups
if item in large_set:  # O(1) on average
    pass

String Operations

❌ Bad: String concatenation in loop

result = ""
for s in strings:
    result += s  # Creates new string each time

✅ Good: Use join

result = "".join(strings)

❌ Bad: Eager formatting of log messages in a loop

for item in items:
    log(f"Processing {item}")  # Formats even if the message is discarded

✅ Good: Lazy formatting

import logging

for item in items:
    logging.debug("Processing %s", item)  # Only formats if needed

Dictionary Operations

❌ Bad: Repeated key lookup

if key in d:
    value = d[key]
    process(value)

✅ Good: Use get for a single lookup (assumes None is never a stored value)

value = d.get(key)
if value is not None:
    process(value)

❌ Bad: Checking then setting

if key not in d:
    d[key] = []
d[key].append(value)

✅ Good: Use defaultdict

from collections import defaultdict

d = defaultdict(list)
d[key].append(value)
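As a small worked example of the two idioms, the sketch below groups words by first letter with both defaultdict and dict.setdefault. The word list is made-up sample data.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]  # sample data

# defaultdict creates the empty list on first access to a missing key
groups = defaultdict(list)
for word in words:
    groups[word[0]].append(word)

# setdefault gives the same result with a plain dict
plain = {}
for word in words:
    plain.setdefault(word[0], []).append(word)

print(dict(groups))
```

setdefault builds a new empty list on every call even when the key exists, so defaultdict is usually the cheaper choice in hot loops.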

Generator vs List

❌ Bad: Creating large intermediate lists

result = sum([x * 2 for x in range(10_000_000)]) # Uses memory

✅ Good: Use generator

result = sum(x * 2 for x in range(10_000_000)) # Lazy evaluation

Process large files

❌ Bad

data = open('large.csv').readlines()  # All in memory
for line in data:
    process(line)

✅ Good

with open('large.csv') as f:  # Stream line by line
    for line in f:
        process(line)
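The memory difference between a list comprehension and a generator expression can be checked with tracemalloc. This is a rough sketch: peak_bytes is a hypothetical helper, N is an arbitrary size chosen to run quickly, and the exact figures vary by interpreter version.

```python
import tracemalloc


def peak_bytes(fn):
    """Return the peak traced allocation while fn() runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


N = 200_000  # hypothetical size, small enough to run quickly

list_peak = peak_bytes(lambda: sum([x * 2 for x in range(N)]))
gen_peak = peak_bytes(lambda: sum(x * 2 for x in range(N)))

print(f"list comprehension peak: {list_peak / 1e6:.2f} MB")
print(f"generator expression peak: {gen_peak / 1e6:.2f} MB")
```

The generator version keeps only one element alive at a time, so its peak stays roughly constant as N grows.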

NumPy Optimization

import numpy as np

❌ Bad: Python loops over arrays

result = []
for i in range(len(arr)):
    result.append(arr[i] * 2)

✅ Good: Vectorized operations

result = arr * 2 # SIMD operations

❌ Bad: Creating many temporary arrays

result = (arr1 + arr2) * arr3 / arr4  # Allocates 3 arrays: 2 temporaries plus the result

✅ Good: In-place operations when possible

# Requires a float dtype: in-place true division fails on integer arrays
result = arr1.copy()
result += arr2
result *= arr3
result /= arr4

Use appropriate dtypes

arr = np.array(data, dtype=np.float32) # Half memory of float64

Async Optimization

import asyncio
import aiohttp

❌ Bad: Sequential async

async def fetch_all_sequential(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    return results

✅ Good: Concurrent async

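async def fetch_all_concurrent(urls):
    async def fetch_one(session, url):
        # Read the body inside the context manager so the connection is released
        async with session.get(url) as resp:
            return await resp.text()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, url) for url in urls))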

✅ Better: With concurrency limit

from asyncio import Semaphore

async def fetch_with_limit(urls, limit=10):
    semaphore = Semaphore(limit)

    async def fetch_one(session, url):
        async with semaphore:
            async with session.get(url) as resp:
                return await resp.text()

    # Share one session so connections are pooled and reused
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[fetch_one(session, url) for url in urls])
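To see the semaphore limit in action without a network, the sketch below replaces the HTTP call with asyncio.sleep and records how many jobs run at once. The names run_limited and job are hypothetical, and the sleep duration is arbitrary.

```python
import asyncio


async def run_limited(n_jobs, limit):
    """Run n_jobs simulated requests with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    max_in_flight = 0

    async def job(i):
        nonlocal in_flight, max_in_flight
        async with semaphore:
            in_flight += 1
            max_in_flight = max(max_in_flight, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network I/O
            in_flight -= 1
            return i * 2

    results = await asyncio.gather(*(job(i) for i in range(n_jobs)))
    return results, max_in_flight


results, max_in_flight = asyncio.run(run_limited(n_jobs=20, limit=5))
print(results[:5], max_in_flight)
```

asyncio.gather returns results in the order the awaitables were passed, regardless of completion order, so the output lines up with the input.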

Multiprocessing

from multiprocessing import Pool, cpu_count
from concurrent.futures import ProcessPoolExecutor

CPU-bound work

def cpu_intensive(x):
    return sum(i * i for i in range(x))

Using Pool

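# With the spawn start method (default on Windows and macOS), run this under
# if __name__ == "__main__": so worker processes can import the module safely
with Pool(cpu_count()) as pool:
    results = pool.map(cpu_intensive, range(100))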

Using ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(cpu_intensive, range(100)))

Shared memory (Python 3.8+)

from multiprocessing import shared_memory
import numpy as np

Create shared array

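# arr is an existing NumPy array to share
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]
# When finished: call shm.close() in every process and shm.unlink() once in the creator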
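The full lifecycle of a shared block (create, attach by name, close, unlink) can be sketched with the standard library alone. For brevity this example attaches from the same process; in practice the second handle would be opened in a worker process using the block's name. The payload is made-up sample data.

```python
from multiprocessing import shared_memory

payload = b"profiling"  # hypothetical data to share

# Creator: allocate a named block and copy the payload in
owner = shared_memory.SharedMemory(create=True, size=len(payload))
owner.buf[:len(payload)] = payload

# Consumer: attach by name (normally done in another process)
reader = shared_memory.SharedMemory(name=owner.name)
received = bytes(reader.buf[:len(payload)])

# Every handle must be closed; the creator unlinks exactly once
reader.close()
owner.close()
owner.unlink()

print(received)
```

The slice uses len(payload) rather than the block size because some platforms round the allocation up to a page boundary.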

Profiling Checklist

| Check | Tool | Command |
|-------|------|---------|
| CPU hotspots | cProfile | python -m cProfile script.py |
| Line-by-line | line_profiler | kernprof -l -v script.py |
| Memory usage | tracemalloc | tracemalloc.start() |
| Memory per line | memory_profiler | @profile decorator |
| Object references | objgraph | objgraph.show_growth() |
| Quick benchmarks | timeit | timeit.timeit() |

py-spy (Sampling Profiler)

Install: pip install py-spy

Record profile

py-spy record -o profile.svg -- python script.py

Top-like view of running process

py-spy top --pid <pid>

Dump current stack

py-spy dump --pid <pid>

Profile subprocesses

py-spy record --subprocesses -o profile.svg -- python script.py

Production Optimization

Use __slots__ for memory efficiency

class Point:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y

Use lru_cache for memoization

from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2
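Whether a cache is actually helping can be checked with cache_info(). In the sketch below, slow_square is a hypothetical stand-in for an expensive pure function.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def slow_square(x):
    # Hypothetical stand-in for an expensive pure function
    return x ** 2


values = [slow_square(n) for n in (3, 4, 3, 3, 4)]
info = slow_square.cache_info()
print(values, info)
```

A low hit count relative to misses suggests the arguments rarely repeat and the cache only adds overhead.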

Use dataclasses with slots (Python 3.10+)

from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float
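The saving from slots comes from dropping the per-instance __dict__. The sketch below compares a regular and a slotted class using sys.getsizeof. PlainPoint and SlottedPoint are hypothetical names, and the byte counts differ between Python versions.

```python
import sys


class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlottedPoint:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y


plain = PlainPoint(1.0, 2.0)
slotted = SlottedPoint(1.0, 2.0)

# A regular instance carries a per-instance __dict__; a slotted one does not
plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_size = sys.getsizeof(slotted)
print(f"plain: {plain_size} bytes, slotted: {slotted_size} bytes")

# Slots also reject attributes that were not declared
try:
    slotted.z = 3.0
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```

sys.getsizeof is a shallow measure, so this comparison covers only the instance and its attribute storage, not the float values themselves.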

Anti-Patterns

| Anti-Pattern | Why It's Wrong | Correct Approach |
|--------------|----------------|------------------|
| Using + to concatenate strings in loop | Up to O(n²) time complexity | Collect the parts and use ''.join() |
| List comprehension when generator suffices | Unnecessary memory allocation | Use generator expression for one-time iteration |
| range() when enumerate() needed | Manual index tracking, error-prone | Use enumerate() for index and value |
| Checking membership in list | O(n) lookup | Use set for O(1) membership testing |
| global variables everywhere | Hard to profile, side effects | Pass parameters, return values |
| Not using NumPy for numerical work | Can be orders of magnitude slower | Vectorize with NumPy for array operations |
| Premature optimization | Wasted effort, harder to maintain | Profile first, optimize bottlenecks |
| Using import * | Namespace pollution, slower imports | Import specific names |
| .append() in loop when size known | Multiple reallocations | Pre-allocate with list comprehension or [None] * size |
| Not using __slots__ for many instances | Higher memory usage | Use __slots__ for classes with many instances |

Quick Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| Slow loops over large data | Python loops are slow | Vectorize with NumPy, use list comprehensions |
| High memory usage | Creating large intermediate objects | Use generators, process in chunks |
| GIL contention | Multi-threading doesn't speed up CPU work | Use multiprocessing for CPU-bound tasks |
| Slow imports | Large modules with side effects | Lazy import, reduce module-level code |
| Memory leak | Objects not being garbage collected | Check for circular references, use weakref |
| RecursionError | Recursion too deep | Increase limit with sys.setrecursionlimit() or refactor to iteration |
| Slow dictionary operations | Hash collisions from a poorly distributed custom __hash__ | Give key classes a __hash__ that spreads values evenly |
| High CPU not explained by profiler output | Time inside C extensions is not shown | Use a sampling profiler like py-spy |
| Out of memory with large file | Loading entire file | Use with open() and iterate line by line |
| Slow JSON parsing | Large JSON file | Use streaming parser (ijson) or pandas |
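For the memory leak case, one way to avoid a parent/child reference cycle is to hold the back-reference weakly. The sketch below assumes CPython's reference counting; the Node class is hypothetical.

```python
import gc
import weakref


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        # Hold the parent weakly so parent <-> child is not a strong cycle
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None


root = Node("root")
leaf = Node("leaf", parent=root)
parent_name_before = leaf.parent.name

del root      # drop the only strong reference to the parent
gc.collect()  # not required on CPython here, but makes the intent explicit

print(parent_name_before, leaf.parent)
```

Because the child no longer keeps the parent alive, the parent is freed as soon as its last strong reference goes away, and the weak reference then resolves to None.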

Related Skills

FastAPI
Django
NumPy/Pandas