# Python Performance Optimization

A comprehensive guide to profiling, analyzing, and optimizing Python code for better performance, covering CPU profiling, memory optimization, and implementation best practices.
## When to Use This Skill

- Identifying performance bottlenecks in Python applications
- Reducing application latency and response times
- Optimizing CPU-intensive operations
- Reducing memory consumption and fixing memory leaks
- Improving database query performance
- Optimizing I/O operations
- Speeding up data processing pipelines
- Implementing high-performance algorithms
- Profiling production applications
## Core Concepts

### Profiling Types

- **CPU Profiling**: Identify time-consuming functions
- **Memory Profiling**: Track memory allocation and leaks
- **Line Profiling**: Profile at line-by-line granularity
- **Call Graph**: Visualize function call relationships

### Performance Metrics

- **Execution Time**: How long operations take
- **Memory Usage**: Peak and average memory consumption
- **CPU Utilization**: Processor usage patterns
- **I/O Wait**: Time spent waiting on I/O operations

### Optimization Strategies

- **Algorithmic**: Better algorithms and data structures
- **Implementation**: More efficient code patterns
- **Parallelization**: Multi-threading and multiprocessing
- **Caching**: Avoid redundant computation
- **Native Extensions**: C/Rust for critical paths
## Quick Start

### Basic Timing

```python
import time

def measure_time():
    """Simple timing measurement."""
    start = time.time()
    # Your code here
    result = sum(range(1000000))
    elapsed = time.time() - start
    print(f"Execution time: {elapsed:.4f} seconds")
    return result

# Better: use timeit for accurate measurements
import timeit

execution_time = timeit.timeit(
    "sum(range(1000000))",
    number=100,
)
print(f"Average time: {execution_time/100:.6f} seconds")
```
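Note that `time.time()` can be skewed by system clock adjustments; for timing code, the monotonic, high-resolution `time.perf_counter()` is the better default (it is also what `timeit` uses internally). A minimal sketch:

```python
import time

start = time.perf_counter()
result = sum(range(1000000))
elapsed = time.perf_counter() - start
print(f"Execution time: {elapsed:.6f} seconds")
```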
## Profiling Tools

### Pattern 1: cProfile - CPU Profiling

```python
import cProfile
import pstats
from pstats import SortKey

def slow_function():
    """Function to profile."""
    total = 0
    for i in range(1000000):
        total += i
    return total

def another_function():
    """Another function."""
    return [i**2 for i in range(100000)]

def main():
    """Main function to profile."""
    result1 = slow_function()
    result2 = another_function()
    return result1, result2

# Profile the code
if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()

    main()

    profiler.disable()

    # Print stats
    stats = pstats.Stats(profiler)
    stats.sort_stats(SortKey.CUMULATIVE)
    stats.print_stats(10)  # Top 10 functions

    # Save to file for later analysis
    stats.dump_stats("profile_output.prof")
```
Command-line profiling:

```bash
# Profile a script
python -m cProfile -o output.prof script.py

# View results interactively
python -m pstats output.prof
# In pstats:
#   sort cumtime
#   stats 10
```
### Pattern 2: line_profiler - Line-by-Line Profiling

Install: `pip install line-profiler`

```python
# Add the @profile decorator (kernprof injects it as a builtin)
@profile
def process_data(data):
    """Process data with line profiling."""
    result = []
    for item in data:
        processed = item * 2
        result.append(processed)
    return result
```

Run with:

```bash
kernprof -l -v script.py
```

Manual line profiling:

```python
from line_profiler import LineProfiler

def process_data(data):
    """Function to profile."""
    result = []
    for item in data:
        processed = item * 2
        result.append(processed)
    return result

if __name__ == "__main__":
    lp = LineProfiler()
    lp.add_function(process_data)

    data = list(range(100000))
    lp_wrapper = lp(process_data)
    lp_wrapper(data)

    lp.print_stats()
```
### Pattern 3: memory_profiler - Memory Usage

Install: `pip install memory-profiler`

```python
from memory_profiler import profile

@profile
def memory_intensive():
    """Function that uses lots of memory."""
    # Create large list
    big_list = [i for i in range(1000000)]

    # Create large dict
    big_dict = {i: i**2 for i in range(100000)}

    # Process data
    result = sum(big_list)
    return result

if __name__ == "__main__":
    memory_intensive()
```

Run with:

```bash
python -m memory_profiler script.py
```
### Pattern 4: py-spy - Production Profiling

Install: `pip install py-spy`

```bash
# Profile a running Python process
py-spy top --pid 12345

# Generate a flamegraph
py-spy record -o profile.svg --pid 12345

# Profile a script
py-spy record -o profile.svg -- python script.py

# Dump the current call stack
py-spy dump --pid 12345
```
## Optimization Patterns

### Pattern 5: List Comprehensions vs Loops

```python
import timeit

# Slow: traditional loop
def slow_squares(n):
    """Create list of squares using a loop."""
    result = []
    for i in range(n):
        result.append(i**2)
    return result

# Fast: list comprehension
def fast_squares(n):
    """Create list of squares using a comprehension."""
    return [i**2 for i in range(n)]

# Benchmark
n = 100000

slow_time = timeit.timeit(lambda: slow_squares(n), number=100)
fast_time = timeit.timeit(lambda: fast_squares(n), number=100)

print(f"Loop: {slow_time:.4f}s")
print(f"Comprehension: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x")

# map can also be fast, but mainly when the mapped function is a
# C builtin; with a Python lambda it is usually no faster than
# the comprehension
def faster_squares(n):
    """Use map (fastest when paired with a built-in function)."""
    return list(map(lambda x: x**2, range(n)))
```
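To see that caveat in numbers, here is a small benchmark sketch comparing `map` over the C builtin `str` against an equivalent comprehension; exact results vary by interpreter and workload:

```python
import timeit

n = 100000

# map over a C builtin avoids a Python-level call per element
builtin_map = timeit.timeit(lambda: list(map(str, range(n))), number=100)
# the comprehension dispatches each str() call from bytecode
comprehension = timeit.timeit(lambda: [str(i) for i in range(n)], number=100)

print(f"map(str, ...): {builtin_map:.4f}s")
print(f"[str(i) ...]: {comprehension:.4f}s")
```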
### Pattern 6: Generator Expressions for Memory

```python
import sys

def list_approach():
    """Memory-intensive list."""
    data = [i**2 for i in range(1000000)]
    return sum(data)

def generator_approach():
    """Memory-efficient generator."""
    data = (i**2 for i in range(1000000))
    return sum(data)

# Memory comparison
list_data = [i for i in range(1000000)]
gen_data = (i for i in range(1000000))

print(f"List size: {sys.getsizeof(list_data)} bytes")
print(f"Generator size: {sys.getsizeof(gen_data)} bytes")
# A generator object uses constant memory regardless of how many
# items it yields
```
### Pattern 7: String Concatenation

```python
import timeit

def slow_concat(items):
    """Slow string concatenation with +=."""
    result = ""
    for item in items:
        result += str(item)
    return result

def fast_concat(items):
    """Fast string concatenation with join."""
    return "".join(str(item) for item in items)

def faster_concat(items):
    """Often faster still: join over a list."""
    parts = [str(item) for item in items]
    return "".join(parts)

items = list(range(10000))

# Benchmark
slow = timeit.timeit(lambda: slow_concat(items), number=100)
fast = timeit.timeit(lambda: fast_concat(items), number=100)
faster = timeit.timeit(lambda: faster_concat(items), number=100)

print(f"Concatenation (+): {slow:.4f}s")
print(f"Join (generator): {fast:.4f}s")
print(f"Join (list): {faster:.4f}s")
```
### Pattern 8: Dictionary Lookups vs List Searches

```python
import timeit

# Create test data
size = 10000
items = list(range(size))
lookup_dict = {i: i for i in range(size)}

def list_search(items, target):
    """O(n) search in a list."""
    return target in items

def dict_search(lookup_dict, target):
    """O(1) average-case lookup in a dict."""
    return target in lookup_dict

target = size - 1  # Worst case for the list

# Benchmark
list_time = timeit.timeit(lambda: list_search(items, target), number=1000)
dict_time = timeit.timeit(lambda: dict_search(lookup_dict, target), number=1000)

print(f"List search: {list_time:.6f}s")
print(f"Dict search: {dict_time:.6f}s")
print(f"Speedup: {list_time/dict_time:.0f}x")
```
### Pattern 9: Local Variable Access

```python
import timeit

# Global variable (slower to access)
GLOBAL_VALUE = 100

def use_global():
    """Access a global variable in a loop."""
    total = 0
    for i in range(10000):
        total += GLOBAL_VALUE
    return total

def use_local():
    """Copy the value into a local first."""
    local_value = 100
    total = 0
    for i in range(10000):
        total += local_value
    return total

# Local access is faster
global_time = timeit.timeit(use_global, number=1000)
local_time = timeit.timeit(use_local, number=1000)

print(f"Global access: {global_time:.4f}s")
print(f"Local access: {local_time:.4f}s")
print(f"Speedup: {global_time/local_time:.2f}x")
```
### Pattern 10: Function Call Overhead

```python
import timeit

def calculate_inline():
    """Inline calculation."""
    total = 0
    for i in range(10000):
        total += i * 2 + 1
    return total

def helper_function(x):
    """Helper function."""
    return x * 2 + 1

def calculate_with_function():
    """Same calculation via function calls."""
    total = 0
    for i in range(10000):
        total += helper_function(i)
    return total

# Inline is faster: no per-iteration call overhead
inline_time = timeit.timeit(calculate_inline, number=1000)
function_time = timeit.timeit(calculate_with_function, number=1000)

print(f"Inline: {inline_time:.4f}s")
print(f"Function calls: {function_time:.4f}s")
```
## Advanced Optimization

### Pattern 11: NumPy for Numerical Operations

```python
import timeit

import numpy as np

def python_sum(n):
    """Sum using pure Python."""
    return sum(range(n))

def numpy_sum(n):
    """Sum using NumPy."""
    return np.arange(n).sum()

n = 1000000

python_time = timeit.timeit(lambda: python_sum(n), number=100)
numpy_time = timeit.timeit(lambda: numpy_sum(n), number=100)

print(f"Python: {python_time:.4f}s")
print(f"NumPy: {numpy_time:.4f}s")
print(f"Speedup: {python_time/numpy_time:.2f}x")

# Vectorized operations
def python_multiply():
    """Element-wise multiplication in Python."""
    a = list(range(100000))
    b = list(range(100000))
    return [x * y for x, y in zip(a, b)]

def numpy_multiply():
    """Vectorized multiplication in NumPy."""
    a = np.arange(100000)
    b = np.arange(100000)
    return a * b

py_time = timeit.timeit(python_multiply, number=100)
np_time = timeit.timeit(numpy_multiply, number=100)

print(f"\nPython multiply: {py_time:.4f}s")
print(f"NumPy multiply: {np_time:.4f}s")
print(f"Speedup: {py_time/np_time:.2f}x")
```
### Pattern 12: Caching with functools.lru_cache

```python
import timeit
from functools import lru_cache

def fibonacci_slow(n):
    """Recursive fibonacci without caching."""
    if n < 2:
        return n
    return fibonacci_slow(n - 1) + fibonacci_slow(n - 2)

@lru_cache(maxsize=None)
def fibonacci_fast(n):
    """Recursive fibonacci with caching."""
    if n < 2:
        return n
    return fibonacci_fast(n - 1) + fibonacci_fast(n - 2)

# Massive speedup for recursive algorithms
n = 30

slow_time = timeit.timeit(lambda: fibonacci_slow(n), number=1)
fast_time = timeit.timeit(lambda: fibonacci_fast(n), number=1000)

print(f"Without cache (1 run): {slow_time:.4f}s")
print(f"With cache (1000 runs): {fast_time:.4f}s")

# Cache info
print(f"Cache info: {fibonacci_fast.cache_info()}")
```
### Pattern 13: Using `__slots__` for Memory

```python
import sys

class RegularClass:
    """Regular class with a per-instance __dict__."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class SlottedClass:
    """Class with __slots__ for memory efficiency."""

    __slots__ = ['x', 'y', 'z']

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# Memory comparison
regular = RegularClass(1, 2, 3)
slotted = SlottedClass(1, 2, 3)

print(f"Regular class size: {sys.getsizeof(regular)} bytes")
print(f"Slotted class size: {sys.getsizeof(slotted)} bytes")

# The savings add up with many instances
regular_objects = [RegularClass(i, i + 1, i + 2) for i in range(10000)]
slotted_objects = [SlottedClass(i, i + 1, i + 2) for i in range(10000)]

print(f"\nMemory for 10000 regular objects: ~{sys.getsizeof(regular) * 10000} bytes")
print(f"Memory for 10000 slotted objects: ~{sys.getsizeof(slotted) * 10000} bytes")
```
### Pattern 14: Multiprocessing for CPU-Bound Tasks

```python
import multiprocessing as mp
import time

def cpu_intensive_task(n):
    """CPU-intensive calculation."""
    return sum(i**2 for i in range(n))

def sequential_processing():
    """Process tasks sequentially."""
    start = time.time()
    results = [cpu_intensive_task(1000000) for _ in range(4)]
    elapsed = time.time() - start
    return elapsed, results

def parallel_processing():
    """Process tasks in parallel."""
    start = time.time()
    with mp.Pool(processes=4) as pool:
        results = pool.map(cpu_intensive_task, [1000000] * 4)
    elapsed = time.time() - start
    return elapsed, results

if __name__ == "__main__":
    seq_time, seq_results = sequential_processing()
    par_time, par_results = parallel_processing()

    print(f"Sequential: {seq_time:.2f}s")
    print(f"Parallel: {par_time:.2f}s")
    print(f"Speedup: {seq_time/par_time:.2f}x")
```
### Pattern 15: Async I/O for I/O-Bound Tasks

```python
import asyncio
import time

import aiohttp
import requests

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

def synchronous_requests():
    """Synchronous HTTP requests."""
    start = time.time()
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(response.status_code)
    elapsed = time.time() - start
    return elapsed, results

async def async_fetch(session, url):
    """Async HTTP request."""
    async with session.get(url) as response:
        return response.status

async def asynchronous_requests():
    """Asynchronous HTTP requests."""
    start = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = [async_fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    elapsed = time.time() - start
    return elapsed, results

# Async is much faster for I/O-bound work
sync_time, sync_results = synchronous_requests()
async_time, async_results = asyncio.run(asynchronous_requests())

print(f"Synchronous: {sync_time:.2f}s")
print(f"Asynchronous: {async_time:.2f}s")
print(f"Speedup: {sync_time/async_time:.2f}x")
```
## Database Optimization

### Pattern 16: Batch Database Operations

```python
import sqlite3
import time

def create_db():
    """Create a test database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

def slow_inserts(conn, count):
    """Insert records one at a time."""
    start = time.time()
    cursor = conn.cursor()
    for i in range(count):
        cursor.execute("INSERT INTO users (name) VALUES (?)", (f"User {i}",))
        conn.commit()  # Commit each insert
    elapsed = time.time() - start
    return elapsed

def fast_inserts(conn, count):
    """Batch insert with a single commit."""
    start = time.time()
    cursor = conn.cursor()
    data = [(f"User {i}",) for i in range(count)]
    cursor.executemany("INSERT INTO users (name) VALUES (?)", data)
    conn.commit()  # Single commit
    elapsed = time.time() - start
    return elapsed

# Benchmark
conn1 = create_db()
slow_time = slow_inserts(conn1, 1000)

conn2 = create_db()
fast_time = fast_inserts(conn2, 1000)

print(f"Individual inserts: {slow_time:.4f}s")
print(f"Batch insert: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x")
```
### Pattern 17: Query Optimization

Use indexes for frequently queried columns:

```sql
-- Slow: no index
SELECT * FROM users WHERE email = 'user@example.com';

-- Fast: with an index
CREATE INDEX idx_users_email ON users(email);
SELECT * FROM users WHERE email = 'user@example.com';
```

Inspect the query plan to verify the index is actually used:

```python
import sqlite3

conn = sqlite3.connect("example.db")
cursor = conn.cursor()

# Analyze query performance
cursor.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("test@example.com",),
)
print(cursor.fetchall())
```

Select only the columns you need:

```sql
-- Slow
SELECT * FROM users;
-- Fast
SELECT id, name FROM users;
```
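The payoff from an index is easy to measure directly; a minimal sketch with an in-memory SQLite database (table name and row count are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(100000)],
)

def lookup():
    """Point query on the email column."""
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", ("user99999@example.com",)
    ).fetchone()

start = time.perf_counter()
lookup()
without_index = time.perf_counter() - start

conn.execute("CREATE INDEX idx_users_email ON users(email)")

start = time.perf_counter()
lookup()
with_index = time.perf_counter() - start

print(f"Without index: {without_index:.6f}s")
print(f"With index: {with_index:.6f}s")
```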
## Memory Optimization

### Pattern 18: Detecting Memory Leaks

```python
import gc
import tracemalloc

def memory_leak_example():
    """Example that leaks memory."""
    leaked_objects = []
    for i in range(100000):
        # Objects added but never removed
        leaked_objects.append([i] * 100)
    # In real code, this would be an unintended reference

def track_memory_usage():
    """Track memory allocations."""
    tracemalloc.start()

    # Take a snapshot before
    snapshot1 = tracemalloc.take_snapshot()

    # Run the code under suspicion
    memory_leak_example()

    # Take a snapshot after
    snapshot2 = tracemalloc.take_snapshot()

    # Compare
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')

    print("Top 10 memory allocations:")
    for stat in top_stats[:10]:
        print(stat)

    tracemalloc.stop()

# Monitor memory
track_memory_usage()

# Force garbage collection
gc.collect()
```
### Pattern 19: Iterators vs Lists

```python
def process_file_list(filename):
    """Load the entire file into memory."""
    with open(filename) as f:
        lines = f.readlines()  # Loads all lines at once
    return sum(1 for line in lines if line.strip())

def process_file_iterator(filename):
    """Process the file line by line."""
    with open(filename) as f:
        return sum(1 for line in f if line.strip())

# The iterator version uses constant memory;
# the list version loads the entire file into memory
```
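To verify the difference, a sketch that generates a throwaway file and measures both functions above with `tracemalloc`:

```python
import os
import tempfile
import tracemalloc

# Write a temporary test file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.writelines(f"line {i}\n" for i in range(100000))
    path = f.name

for func in (process_file_list, process_file_iterator):
    tracemalloc.start()
    func(path)
    peak = tracemalloc.get_traced_memory()[1]  # (current, peak)
    tracemalloc.stop()
    print(f"{func.__name__}: peak {peak} bytes")

os.remove(path)
```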
### Pattern 20: Weakref for Caches

```python
import weakref

class CachedResource:
    """Resource that can be garbage collected."""

    def __init__(self, data):
        self.data = data

# A regular cache holds strong references, preventing garbage collection
regular_cache = {}

def get_resource_regular(key):
    """Get a resource from the regular cache."""
    if key not in regular_cache:
        regular_cache[key] = CachedResource(f"Data for {key}")
    return regular_cache[key]

# A weak-reference cache allows garbage collection
weak_cache = weakref.WeakValueDictionary()

def get_resource_weak(key):
    """Get a resource from the weak cache."""
    resource = weak_cache.get(key)
    if resource is None:
        resource = CachedResource(f"Data for {key}")
        weak_cache[key] = resource
    return resource

# Once no strong references remain, cached objects can be GC'd
```
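A small demonstration of that difference (in CPython, reference counting clears the weak entry as soon as the last strong reference goes away; `gc.collect()` is belt-and-braces for other interpreters):

```python
import gc

resource = get_resource_weak("a")
print("a" in weak_cache)    # True: a strong reference still exists

del resource                # Drop the only strong reference
gc.collect()
print("a" in weak_cache)    # False: the entry was cleared

get_resource_regular("b")
print("b" in regular_cache)  # True: the dict keeps the object alive
```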
## Benchmarking Tools

### Custom Benchmark Decorator

```python
import time
from functools import wraps

def benchmark(func):
    """Decorator to benchmark function execution."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} seconds")
        return result
    return wrapper

@benchmark
def slow_function():
    """Function to benchmark."""
    time.sleep(0.5)
    return sum(range(1000000))

result = slow_function()
```
### Performance Testing with pytest-benchmark

Install: `pip install pytest-benchmark`

```python
def test_list_comprehension(benchmark):
    """Benchmark a list comprehension."""
    result = benchmark(lambda: [i**2 for i in range(10000)])
    assert len(result) == 10000

def test_map_function(benchmark):
    """Benchmark the map function."""
    result = benchmark(lambda: list(map(lambda x: x**2, range(10000))))
    assert len(result) == 10000
```

Run with: `pytest test_performance.py --benchmark-compare`
## Best Practices

- **Profile before optimizing** - Measure to find real bottlenecks
- **Focus on hot paths** - Optimize code that runs most frequently
- **Use appropriate data structures** - Dicts for lookups, sets for membership
- **Avoid premature optimization** - Clarity first, then optimize
- **Use built-in functions** - They are implemented in C (see the sketch after this list)
- **Cache expensive computations** - Use lru_cache
- **Batch I/O operations** - Reduce system calls
- **Use generators** for large datasets
- **Consider NumPy** for numerical operations
- **Profile production code** - Use py-spy for live systems
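On the built-in-functions point, a minimal sketch of the typical gap between a Python-level loop and the C-implemented `sum`:

```python
import timeit

def manual_sum(n):
    """Summation in a Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total

n = 1000000
loop_time = timeit.timeit(lambda: manual_sum(n), number=10)
builtin_time = timeit.timeit(lambda: sum(range(n)), number=10)

print(f"Python loop: {loop_time:.4f}s")
print(f"Built-in sum: {builtin_time:.4f}s")
```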
## Common Pitfalls

- Optimizing without profiling
- Using global variables unnecessarily
- Not using appropriate data structures
- Creating unnecessary copies of data
- Not using connection pooling for databases
- Ignoring algorithmic complexity
- Over-optimizing rare code paths
- Not considering memory usage
## Resources

- cProfile: Built-in CPU profiler
- memory_profiler: Memory usage profiling
- line_profiler: Line-by-line profiling
- py-spy: Sampling profiler for production
- NumPy: High-performance numerical computing
- Cython: Compiles Python to C (see the sketch after this list)
- PyPy: Alternative Python interpreter with a JIT
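As a taste of the native-extension route, a hypothetical Cython module might look like the following; the module name `fast_math` is illustrative, and Cython must be installed (`pip install cython`) before compiling:

```cython
# fast_math.pyx -- illustrative Cython source
# cpdef makes the function callable from both C and Python;
# cdef-typed variables compile down to plain C longs
cpdef long sum_squares(long n):
    cdef long total = 0
    cdef long i
    for i in range(n):
        total += i * i
    return total
```

Compile in place with `cythonize -i fast_math.pyx`, then `from fast_math import sum_squares` works like any other import, typically running much faster than the pure-Python equivalent.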
## Performance Checklist

- [ ] Profiled code to identify bottlenecks
- [ ] Used appropriate data structures
- [ ] Implemented caching where beneficial
- [ ] Optimized database queries
- [ ] Used generators for large datasets
- [ ] Considered multiprocessing for CPU-bound tasks
- [ ] Used async I/O for I/O-bound tasks
- [ ] Minimized function call overhead in hot loops
- [ ] Checked for memory leaks
- [ ] Benchmarked before and after optimization