# Python Performance Optimization

A comprehensive guide to profiling, analyzing, and optimizing Python code for better performance, covering CPU profiling, memory optimization, and implementation best practices.
## When to Use This Skill

- Identifying performance bottlenecks in Python applications
- Reducing application latency and response times
- Optimizing CPU-intensive operations
- Reducing memory consumption and fixing memory leaks
- Improving database query performance
- Optimizing I/O operations
- Speeding up data processing pipelines
- Implementing high-performance algorithms
- Profiling production applications
## Core Concepts

### Profiling Types

- **CPU Profiling**: Identify time-consuming functions
- **Memory Profiling**: Track memory allocation and leaks
- **Line Profiling**: Profile at line-by-line granularity
- **Call Graph**: Visualize function call relationships

### Performance Metrics

- **Execution Time**: How long operations take
- **Memory Usage**: Peak and average memory consumption
- **CPU Utilization**: Processor usage patterns
- **I/O Wait**: Time spent waiting on I/O operations

### Optimization Strategies

- **Algorithmic**: Better algorithms and data structures
- **Implementation**: More efficient code patterns
- **Parallelization**: Multi-threading and multiprocessing
- **Caching**: Avoid redundant computation
- **Native Extensions**: C/Rust for critical paths
## Quick Start

### Basic Timing

```python
import time

def measure_time():
    """Simple timing measurement."""
    start = time.time()
    # Your code here
    result = sum(range(1000000))
    elapsed = time.time() - start
    print(f"Execution time: {elapsed:.4f} seconds")
    return result

# Better: use timeit for accurate measurements
import timeit

execution_time = timeit.timeit(
    "sum(range(1000000))",
    number=100,
)
print(f"Average time: {execution_time/100:.6f} seconds")
```
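Note that `time.time()` can be skewed by system clock adjustments; for timing code, the monotonic, high-resolution `time.perf_counter()` is the better default (it is also what `timeit` uses internally). A minimal sketch:

```python
import time

start = time.perf_counter()
result = sum(range(1000000))
elapsed = time.perf_counter() - start
print(f"Execution time: {elapsed:.6f} seconds")
```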
## Profiling Tools

### Pattern 1: cProfile - CPU Profiling

```python
import cProfile
import pstats
from pstats import SortKey

def slow_function():
    """Function to profile."""
    total = 0
    for i in range(1000000):
        total += i
    return total

def another_function():
    """Another function."""
    return [i**2 for i in range(100000)]

def main():
    """Main function to profile."""
    result1 = slow_function()
    result2 = another_function()
    return result1, result2

# Profile the code
if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()

    main()

    profiler.disable()

    # Print stats
    stats = pstats.Stats(profiler)
    stats.sort_stats(SortKey.CUMULATIVE)
    stats.print_stats(10)  # Top 10 functions

    # Save to file for later analysis
    stats.dump_stats("profile_output.prof")
```
Command-line profiling:

```bash
# Profile a script
python -m cProfile -o output.prof script.py

# View results interactively
python -m pstats output.prof
# In pstats:
#   sort cumtime
#   stats 10
```
### Pattern 2: line_profiler - Line-by-Line Profiling

Install: `pip install line-profiler`

```python
# Add the @profile decorator (kernprof injects it as a builtin)
@profile
def process_data(data):
    """Process data with line profiling."""
    result = []
    for item in data:
        processed = item * 2
        result.append(processed)
    return result
```

Run with:

```bash
kernprof -l -v script.py
```

Manual line profiling:

```python
from line_profiler import LineProfiler

def process_data(data):
    """Function to profile."""
    result = []
    for item in data:
        processed = item * 2
        result.append(processed)
    return result

if __name__ == "__main__":
    lp = LineProfiler()
    lp.add_function(process_data)

    data = list(range(100000))
    lp_wrapper = lp(process_data)
    lp_wrapper(data)

    lp.print_stats()
```
### Pattern 3: memory_profiler - Memory Usage

Install: `pip install memory-profiler`

```python
from memory_profiler import profile

@profile
def memory_intensive():
    """Function that uses lots of memory."""
    # Create large list
    big_list = [i for i in range(1000000)]

    # Create large dict
    big_dict = {i: i**2 for i in range(100000)}

    # Process data
    result = sum(big_list)
    return result

if __name__ == "__main__":
    memory_intensive()
```

Run with:

```bash
python -m memory_profiler script.py
```
### Pattern 4: py-spy - Production Profiling

Install: `pip install py-spy`

```bash
# Profile a running Python process
py-spy top --pid 12345

# Generate a flamegraph
py-spy record -o profile.svg --pid 12345

# Profile a script
py-spy record -o profile.svg -- python script.py

# Dump the current call stack
py-spy dump --pid 12345
```
## Optimization Patterns

### Pattern 5: List Comprehensions vs Loops

```python
import timeit

# Slow: traditional loop
def slow_squares(n):
    """Create list of squares using a loop."""
    result = []
    for i in range(n):
        result.append(i**2)
    return result

# Fast: list comprehension
def fast_squares(n):
    """Create list of squares using a comprehension."""
    return [i**2 for i in range(n)]

# Benchmark
n = 100000

slow_time = timeit.timeit(lambda: slow_squares(n), number=100)
fast_time = timeit.timeit(lambda: fast_squares(n), number=100)

print(f"Loop: {slow_time:.4f}s")
print(f"Comprehension: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x")

# map can also be fast, but mainly when the mapped function is a
# C builtin; with a Python lambda it is usually no faster than
# the comprehension
def faster_squares(n):
    """Use map (fastest when paired with a built-in function)."""
    return list(map(lambda x: x**2, range(n)))
```
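To see that caveat in numbers, here is a small benchmark sketch comparing `map` over the C builtin `str` against an equivalent comprehension; exact results vary by interpreter and workload:

```python
import timeit

n = 100000

# map over a C builtin avoids a Python-level call per element
builtin_map = timeit.timeit(lambda: list(map(str, range(n))), number=100)
# the comprehension dispatches each str() call from bytecode
comprehension = timeit.timeit(lambda: [str(i) for i in range(n)], number=100)

print(f"map(str, ...): {builtin_map:.4f}s")
print(f"[str(i) ...]: {comprehension:.4f}s")
```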
### Pattern 6: Generator Expressions for Memory

```python
import sys

def list_approach():
    """Memory-intensive list."""
    data = [i**2 for i in range(1000000)]
    return sum(data)

def generator_approach():
    """Memory-efficient generator."""
    data = (i**2 for i in range(1000000))
    return sum(data)

# Memory comparison
list_data = [i for i in range(1000000)]
gen_data = (i for i in range(1000000))

print(f"List size: {sys.getsizeof(list_data)} bytes")
print(f"Generator size: {sys.getsizeof(gen_data)} bytes")
# A generator object uses constant memory regardless of how many
# items it yields
```
### Pattern 7: String Concatenation

```python
import timeit

def slow_concat(items):
    """Slow string concatenation with +=."""
    result = ""
    for item in items:
        result += str(item)
    return result

def fast_concat(items):
    """Fast string concatenation with join."""
    return "".join(str(item) for item in items)

def faster_concat(items):
    """Often faster still: join over a list."""
    parts = [str(item) for item in items]
    return "".join(parts)

items = list(range(10000))

# Benchmark
slow = timeit.timeit(lambda: slow_concat(items), number=100)
fast = timeit.timeit(lambda: fast_concat(items), number=100)
faster = timeit.timeit(lambda: faster_concat(items), number=100)

print(f"Concatenation (+): {slow:.4f}s")
print(f"Join (generator): {fast:.4f}s")
print(f"Join (list): {faster:.4f}s")
```
### Pattern 8: Dictionary Lookups vs List Searches

```python
import timeit

# Create test data
size = 10000
items = list(range(size))
lookup_dict = {i: i for i in range(size)}

def list_search(items, target):
    """O(n) search in a list."""
    return target in items

def dict_search(lookup_dict, target):
    """O(1) average-case lookup in a dict."""
    return target in lookup_dict

target = size - 1  # Worst case for the list

# Benchmark
list_time = timeit.timeit(lambda: list_search(items, target), number=1000)
dict_time = timeit.timeit(lambda: dict_search(lookup_dict, target), number=1000)

print(f"List search: {list_time:.6f}s")
print(f"Dict search: {dict_time:.6f}s")
print(f"Speedup: {list_time/dict_time:.0f}x")
```
### Pattern 9: Local Variable Access

```python
import timeit

# Global variable (slower to access)
GLOBAL_VALUE = 100

def use_global():
    """Access a global variable in a loop."""
    total = 0
    for i in range(10000):
        total += GLOBAL_VALUE
    return total

def use_local():
    """Copy the value into a local first."""
    local_value = 100
    total = 0
    for i in range(10000):
        total += local_value
    return total

# Local access is faster
global_time = timeit.timeit(use_global, number=1000)
local_time = timeit.timeit(use_local, number=1000)

print(f"Global access: {global_time:.4f}s")
print(f"Local access: {local_time:.4f}s")
print(f"Speedup: {global_time/local_time:.2f}x")
```
### Pattern 10: Function Call Overhead

```python
import timeit

def calculate_inline():
    """Inline calculation."""
    total = 0
    for i in range(10000):
        total += i * 2 + 1
    return total

def helper_function(x):
    """Helper function."""
    return x * 2 + 1

def calculate_with_function():
    """Same calculation via function calls."""
    total = 0
    for i in range(10000):
        total += helper_function(i)
    return total

# Inline is faster: no per-iteration call overhead
inline_time = timeit.timeit(calculate_inline, number=1000)
function_time = timeit.timeit(calculate_with_function, number=1000)

print(f"Inline: {inline_time:.4f}s")
print(f"Function calls: {function_time:.4f}s")
```
## Advanced Optimization

### Pattern 11: NumPy for Numerical Operations

```python
import timeit

import numpy as np

def python_sum(n):
    """Sum using pure Python."""
    return sum(range(n))

def numpy_sum(n):
    """Sum using NumPy."""
    return np.arange(n).sum()

n = 1000000

python_time = timeit.timeit(lambda: python_sum(n), number=100)
numpy_time = timeit.timeit(lambda: numpy_sum(n), number=100)

print(f"Python: {python_time:.4f}s")
print(f"NumPy: {numpy_time:.4f}s")
print(f"Speedup: {python_time/numpy_time:.2f}x")

# Vectorized operations
def python_multiply():
    """Element-wise multiplication in Python."""
    a = list(range(100000))
    b = list(range(100000))
    return [x * y for x, y in zip(a, b)]

def numpy_multiply():
    """Vectorized multiplication in NumPy."""
    a = np.arange(100000)
    b = np.arange(100000)
    return a * b

py_time = timeit.timeit(python_multiply, number=100)
np_time = timeit.timeit(numpy_multiply, number=100)

print(f"\nPython multiply: {py_time:.4f}s")
print(f"NumPy multiply: {np_time:.4f}s")
print(f"Speedup: {py_time/np_time:.2f}x")
```
### Pattern 12: Caching with functools.lru_cache

```python
import timeit
from functools import lru_cache

def fibonacci_slow(n):
    """Recursive fibonacci without caching."""
    if n < 2:
        return n
    return fibonacci_slow(n - 1) + fibonacci_slow(n - 2)

@lru_cache(maxsize=None)
def fibonacci_fast(n):
    """Recursive fibonacci with caching."""
    if n < 2:
        return n
    return fibonacci_fast(n - 1) + fibonacci_fast(n - 2)

# Massive speedup for recursive algorithms
n = 30

slow_time = timeit.timeit(lambda: fibonacci_slow(n), number=1)
fast_time = timeit.timeit(lambda: fibonacci_fast(n), number=1000)

print(f"Without cache (1 run): {slow_time:.4f}s")
print(f"With cache (1000 runs): {fast_time:.4f}s")

# Cache info
print(f"Cache info: {fibonacci_fast.cache_info()}")
```
### Pattern 13: Using `__slots__` for Memory

```python
import sys

class RegularClass:
    """Regular class with a per-instance __dict__."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class SlottedClass:
    """Class with __slots__ for memory efficiency."""

    __slots__ = ['x', 'y', 'z']

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# Memory comparison
regular = RegularClass(1, 2, 3)
slotted = SlottedClass(1, 2, 3)

print(f"Regular class size: {sys.getsizeof(regular)} bytes")
print(f"Slotted class size: {sys.getsizeof(slotted)} bytes")

# The savings add up with many instances
regular_objects = [RegularClass(i, i + 1, i + 2) for i in range(10000)]
slotted_objects = [SlottedClass(i, i + 1, i + 2) for i in range(10000)]

print(f"\nMemory for 10000 regular objects: ~{sys.getsizeof(regular) * 10000} bytes")
print(f"Memory for 10000 slotted objects: ~{sys.getsizeof(slotted) * 10000} bytes")
```
### Pattern 14: Multiprocessing for CPU-Bound Tasks

```python
import multiprocessing as mp
import time

def cpu_intensive_task(n):
    """CPU-intensive calculation."""
    return sum(i**2 for i in range(n))

def sequential_processing():
    """Process tasks sequentially."""
    start = time.time()
    results = [cpu_intensive_task(1000000) for _ in range(4)]
    elapsed = time.time() - start
    return elapsed, results

def parallel_processing():
    """Process tasks in parallel."""
    start = time.time()
    with mp.Pool(processes=4) as pool:
        results = pool.map(cpu_intensive_task, [1000000] * 4)
    elapsed = time.time() - start
    return elapsed, results

if __name__ == "__main__":
    seq_time, seq_results = sequential_processing()
    par_time, par_results = parallel_processing()

    print(f"Sequential: {seq_time:.2f}s")
    print(f"Parallel: {par_time:.2f}s")
    print(f"Speedup: {seq_time/par_time:.2f}x")
```
### Pattern 15: Async I/O for I/O-Bound Tasks

```python
import asyncio
import time

import aiohttp
import requests

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

def synchronous_requests():
    """Synchronous HTTP requests."""
    start = time.time()
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(response.status_code)
    elapsed = time.time() - start
    return elapsed, results

async def async_fetch(session, url):
    """Async HTTP request."""
    async with session.get(url) as response:
        return response.status

async def asynchronous_requests():
    """Asynchronous HTTP requests."""
    start = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = [async_fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    elapsed = time.time() - start
    return elapsed, results

# Async is much faster for I/O-bound work
sync_time, sync_results = synchronous_requests()
async_time, async_results = asyncio.run(asynchronous_requests())

print(f"Synchronous: {sync_time:.2f}s")
print(f"Asynchronous: {async_time:.2f}s")
print(f"Speedup: {sync_time/async_time:.2f}x")
```
## Database Optimization

### Pattern 16: Batch Database Operations

```python
import sqlite3
import time

def create_db():
    """Create a test database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

def slow_inserts(conn, count):
    """Insert records one at a time."""
    start = time.time()
    cursor = conn.cursor()
    for i in range(count):
        cursor.execute("INSERT INTO users (name) VALUES (?)", (f"User {i}",))
        conn.commit()  # Commit each insert
    elapsed = time.time() - start
    return elapsed

def fast_inserts(conn, count):
    """Batch insert with a single commit."""
    start = time.time()
    cursor = conn.cursor()
    data = [(f"User {i}",) for i in range(count)]
    cursor.executemany("INSERT INTO users (name) VALUES (?)", data)
    conn.commit()  # Single commit
    elapsed = time.time() - start
    return elapsed

# Benchmark
conn1 = create_db()
slow_time = slow_inserts(conn1, 1000)

conn2 = create_db()
fast_time = fast_inserts(conn2, 1000)

print(f"Individual inserts: {slow_time:.4f}s")
print(f"Batch insert: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x")
```
### Pattern 17: Query Optimization

Use indexes for frequently queried columns:

```sql
-- Slow: no index
SELECT * FROM users WHERE email = 'user@example.com';

-- Fast: with an index
CREATE INDEX idx_users_email ON users(email);
SELECT * FROM users WHERE email = 'user@example.com';
```

Inspect the query plan to verify the index is actually used:

```python
import sqlite3

conn = sqlite3.connect("example.db")
cursor = conn.cursor()

# Analyze query performance
cursor.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("test@example.com",),
)
print(cursor.fetchall())
```

Select only the columns you need:

```sql
-- Slow
SELECT * FROM users;
-- Fast
SELECT id, name FROM users;
```
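The payoff from an index is easy to measure directly; a minimal sketch with an in-memory SQLite database (table name and row count are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(100000)],
)

def lookup():
    """Point query on the email column."""
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", ("user99999@example.com",)
    ).fetchone()

start = time.perf_counter()
lookup()
without_index = time.perf_counter() - start

conn.execute("CREATE INDEX idx_users_email ON users(email)")

start = time.perf_counter()
lookup()
with_index = time.perf_counter() - start

print(f"Without index: {without_index:.6f}s")
print(f"With index: {with_index:.6f}s")
```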
## Memory Optimization

### Pattern 18: Detecting Memory Leaks

```python
import gc
import tracemalloc

def memory_leak_example():
    """Example that leaks memory."""
    leaked_objects = []
    for i in range(100000):
        # Objects added but never removed
        leaked_objects.append([i] * 100)
    # In real code, this would be an unintended reference

def track_memory_usage():
    """Track memory allocations."""
    tracemalloc.start()

    # Take a snapshot before
    snapshot1 = tracemalloc.take_snapshot()

    # Run the code under suspicion
    memory_leak_example()

    # Take a snapshot after
    snapshot2 = tracemalloc.take_snapshot()

    # Compare
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')

    print("Top 10 memory allocations:")
    for stat in top_stats[:10]:
        print(stat)

    tracemalloc.stop()

# Monitor memory
track_memory_usage()

# Force garbage collection
gc.collect()
```
### Pattern 19: Iterators vs Lists

```python
def process_file_list(filename):
    """Load the entire file into memory."""
    with open(filename) as f:
        lines = f.readlines()  # Loads all lines at once
    return sum(1 for line in lines if line.strip())

def process_file_iterator(filename):
    """Process the file line by line."""
    with open(filename) as f:
        return sum(1 for line in f if line.strip())

# The iterator version uses constant memory;
# the list version loads the entire file into memory
```
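To verify the difference, a sketch that generates a throwaway file and measures both functions above with `tracemalloc`:

```python
import os
import tempfile
import tracemalloc

# Write a temporary test file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.writelines(f"line {i}\n" for i in range(100000))
    path = f.name

for func in (process_file_list, process_file_iterator):
    tracemalloc.start()
    func(path)
    peak = tracemalloc.get_traced_memory()[1]  # (current, peak)
    tracemalloc.stop()
    print(f"{func.__name__}: peak {peak} bytes")

os.remove(path)
```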
### Pattern 20: Weakref for Caches

```python
import weakref

class CachedResource:
    """Resource that can be garbage collected."""

    def __init__(self, data):
        self.data = data

# A regular cache holds strong references, preventing garbage collection
regular_cache = {}

def get_resource_regular(key):
    """Get a resource from the regular cache."""
    if key not in regular_cache:
        regular_cache[key] = CachedResource(f"Data for {key}")
    return regular_cache[key]

# A weak-reference cache allows garbage collection
weak_cache = weakref.WeakValueDictionary()

def get_resource_weak(key):
    """Get a resource from the weak cache."""
    resource = weak_cache.get(key)
    if resource is None:
        resource = CachedResource(f"Data for {key}")
        weak_cache[key] = resource
    return resource

# Once no strong references remain, cached objects can be GC'd
```
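A small demonstration of that difference (in CPython, reference counting clears the weak entry as soon as the last strong reference goes away; `gc.collect()` is belt-and-braces for other interpreters):

```python
import gc

resource = get_resource_weak("a")
print("a" in weak_cache)    # True: a strong reference still exists

del resource                # Drop the only strong reference
gc.collect()
print("a" in weak_cache)    # False: the entry was cleared

get_resource_regular("b")
print("b" in regular_cache)  # True: the dict keeps the object alive
```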
## Benchmarking Tools

### Custom Benchmark Decorator

```python
import time
from functools import wraps

def benchmark(func):
    """Decorator to benchmark function execution."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} seconds")
        return result
    return wrapper

@benchmark
def slow_function():
    """Function to benchmark."""
    time.sleep(0.5)
    return sum(range(1000000))

result = slow_function()
```
### Performance Testing with pytest-benchmark

Install: `pip install pytest-benchmark`

```python
def test_list_comprehension(benchmark):
    """Benchmark a list comprehension."""
    result = benchmark(lambda: [i**2 for i in range(10000)])
    assert len(result) == 10000

def test_map_function(benchmark):
    """Benchmark the map function."""
    result = benchmark(lambda: list(map(lambda x: x**2, range(10000))))
    assert len(result) == 10000
```

Run with: `pytest test_performance.py --benchmark-compare`
## Best Practices

- **Profile before optimizing** - Measure to find real bottlenecks
- **Focus on hot paths** - Optimize code that runs most frequently
- **Use appropriate data structures** - Dicts for lookups, sets for membership
- **Avoid premature optimization** - Clarity first, then optimize
- **Use built-in functions** - They are implemented in C (see the sketch after this list)
- **Cache expensive computations** - Use lru_cache
- **Batch I/O operations** - Reduce system calls
- **Use generators** for large datasets
- **Consider NumPy** for numerical operations
- **Profile production code** - Use py-spy for live systems
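On the built-in-functions point, a minimal sketch of the typical gap between a Python-level loop and the C-implemented `sum`:

```python
import timeit

def manual_sum(n):
    """Summation in a Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total

n = 1000000
loop_time = timeit.timeit(lambda: manual_sum(n), number=10)
builtin_time = timeit.timeit(lambda: sum(range(n)), number=10)

print(f"Python loop: {loop_time:.4f}s")
print(f"Built-in sum: {builtin_time:.4f}s")
```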
## Common Pitfalls

- Optimizing without profiling
- Using global variables unnecessarily
- Not using appropriate data structures
- Creating unnecessary copies of data
- Not using connection pooling for databases
- Ignoring algorithmic complexity
- Over-optimizing rare code paths
- Not considering memory usage
## Resources

- cProfile: Built-in CPU profiler
- memory_profiler: Memory usage profiling
- line_profiler: Line-by-line profiling
- py-spy: Sampling profiler for production
- NumPy: High-performance numerical computing
- Cython: Compiles Python to C (see the sketch after this list)
- PyPy: Alternative Python interpreter with a JIT
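As a taste of the native-extension route, a hypothetical Cython module might look like the following; the module name `fast_math` is illustrative, and Cython must be installed (`pip install cython`) before compiling:

```cython
# fast_math.pyx -- illustrative Cython source
# cpdef makes the function callable from both C and Python;
# cdef-typed variables compile down to plain C longs
cpdef long sum_squares(long n):
    cdef long total = 0
    cdef long i
    for i in range(n):
        total += i * i
    return total
```

Compile in place with `cythonize -i fast_math.pyx`, then `from fast_math import sum_squares` works like any other import, typically running much faster than the pure-Python equivalent.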
## Performance Checklist

- [ ] Profiled code to identify bottlenecks
- [ ] Used appropriate data structures
- [ ] Implemented caching where beneficial
- [ ] Optimized database queries
- [ ] Used generators for large datasets
- [ ] Considered multiprocessing for CPU-bound tasks
- [ ] Used async I/O for I/O-bound tasks
- [ ] Minimized function call overhead in hot loops
- [ ] Checked for memory leaks
- [ ] Benchmarked before and after optimization