PyTorch Debugging Guide
This guide provides systematic approaches to debugging PyTorch models, from common tensor errors to complex training issues. Unless noted otherwise, snippets assume `import torch` and `import torch.nn as nn`.
Common Error Patterns
- CUDA Out of Memory (OOM)
Error Message:
```
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
```
Causes:
- Batch size too large for GPU memory
- Accumulating gradients without clearing
- Storing tensors on GPU unnecessarily
- Memory leaks from not detaching tensors
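The last cause is easy to reproduce: storing the loss tensor itself (rather than a detached Python number) keeps each iteration's autograd graph alive. A minimal CPU-only sketch of the leak and its fix:

```python
import torch

w = torch.randn(100, requires_grad=True)

# Leaky pattern: each stored loss tensor still references its graph,
# so intermediate buffers from every iteration stay in memory.
leaky_history = []
for _ in range(3):
    loss = (w ** 2).sum()
    leaky_history.append(loss)  # keeps the graph alive

# Fixed pattern: detach to a plain Python float before storing.
fixed_history = []
for _ in range(3):
    loss = (w ** 2).sum()
    fixed_history.append(loss.detach().item())  # graph can be freed
```

On the GPU, the leaky version shows up as `memory_allocated()` creeping upward across epochs.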
Solutions:
Check current memory usage:

```python
print(torch.cuda.memory_summary(device=None, abbreviated=False))
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```

Clear the cache:

```python
torch.cuda.empty_cache()
```

Reduce the batch size:

```python
batch_size = batch_size // 2
```

Use gradient checkpointing for large models:

```python
from torch.utils.checkpoint import checkpoint

output = checkpoint(self.heavy_layer, input)
```

Use mixed precision training:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Detach tensors when storing them for logging:

```python
logged_loss = loss.detach().cpu().item()
```

Use gradient accumulation instead of large batches:

```python
accumulation_steps = 4
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
- Tensor Size/Shape Mismatch
Error Message:
```
RuntimeError: size mismatch, m1: [32 x 512], m2: [256 x 10]
RuntimeError: The size of tensor a (64) must match the size of tensor b (32)
```
Causes:
- Incorrect layer dimensions
- Wrong tensor reshaping
- Mismatched batch sizes
- Incorrect input preprocessing
Solutions:
Debug by printing shapes at each layer:

```python
class DebugModel(nn.Module):
    def forward(self, x):
        print(f"Input shape: {x.shape}")
        x = self.layer1(x)
        print(f"After layer1: {x.shape}")
        x = self.layer2(x)
        print(f"After layer2: {x.shape}")
        return x
```

Add shape assertions as contracts:

```python
def forward(self, x):
    assert x.dim() == 4, f"Expected 4D input, got {x.dim()}D"
    assert x.shape[1] == 3, f"Expected 3 channels, got {x.shape[1]}"
    # ... rest of forward pass
```

Use einops for clearer reshaping:

```python
from einops import rearrange

x = rearrange(x, 'b c h w -> b (c h w)')
```

Calculate dimensions programmatically:

```python
def _get_conv_output_size(self, shape):
    with torch.no_grad():
        dummy = torch.zeros(1, *shape)
        output = self.conv_layers(dummy)
        return output.numel()
```
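One way to wire a helper like `_get_conv_output_size` into a model's `__init__` (the layer sizes here are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, input_shape=(3, 32, 32), num_classes=10):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(input_shape[0], 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Size the classifier from a dummy forward pass instead of
        # hand-computing the flattened dimension.
        flat_size = self._get_conv_output_size(input_shape)
        self.fc = nn.Linear(flat_size, num_classes)

    def _get_conv_output_size(self, shape):
        with torch.no_grad():
            dummy = torch.zeros(1, *shape)
            return self.conv_layers(dummy).numel()

    def forward(self, x):
        x = self.conv_layers(x)
        return self.fc(x.flatten(1))
```

This keeps the classifier's input size correct even when you change kernel sizes, strides, or input resolution.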
- NaN in Gradients/Loss
Error Message:
```
Loss is nan
RuntimeError: Function 'XXXBackward' returned nan values
```
Causes:
- Learning rate too high
- Numerical instability in operations
- Division by zero
- Log of zero or negative numbers
- Exploding gradients
Solutions:
Enable anomaly detection to find the source:

```python
torch.autograd.set_detect_anomaly(True)
```

Check for NaN in tensors:

```python
def check_nan(tensor, name="tensor"):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
        print(f"Shape: {tensor.shape}")
        print(f"NaN count: {torch.isnan(tensor).sum()}")
        raise ValueError(f"NaN in {name}")
```

Add epsilon for numerical stability:

```python
eps = 1e-8
log_probs = torch.log(probs + eps)
normalized = x / (x.norm() + eps)
```

Clip gradients:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```

Check gradients after backward:

```python
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")
            grad_norm = param.grad.norm()
            print(f"{name}: grad_norm = {grad_norm:.4f}")
```

Use numerically stable loss functions:

```python
# BAD: nn.CrossEntropyLoss on softmax output
# GOOD: nn.CrossEntropyLoss on logits (raw scores)
loss = nn.CrossEntropyLoss()(logits, targets)  # Not softmax(logits)
```
- Device Mismatch (CPU/GPU)
Error Message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```
Causes:
- Model on GPU, data on CPU (or vice versa)
- Loading a model saved on a different device
- Creating new tensors without specifying the device
- Mixing tensors from different GPUs
Solutions:
Always explicitly move to a device:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = inputs.to(device)
targets = targets.to(device)
```

Load a model with map_location:

```python
model.load_state_dict(torch.load('model.pt', map_location=device))
```

Create tensors on the correct device:

```python
new_tensor = torch.zeros(10, device=device)
new_tensor = torch.zeros_like(existing_tensor)  # Same device as existing
```

Check the device of all model parameters:

```python
def check_model_device(model):
    devices = {p.device for p in model.parameters()}
    print(f"Model parameters on devices: {devices}")
```

Debug device issues:

```python
print(f"Model device: {next(model.parameters()).device}")
print(f"Input device: {inputs.device}")
print(f"Target device: {targets.device}")
```
- Autograd Graph Issues
Error Message:
```
RuntimeError: Trying to backward through the graph a second time
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```
Causes:
- Calling backward() twice without retain_graph
- In-place operations on tensors requiring gradients
- Detaching tensors incorrectly
- Not enabling gradients on input tensors
Solutions:
For multiple backward passes:

```python
loss.backward(retain_graph=True)  # Use sparingly - memory intensive
```

Avoid in-place operations on tensors with gradients:

```python
# BAD:
x += 1
x[0] = 0
x.add_(1)

# GOOD:
x = x + 1
x = x.clone()
x[0] = 0
```
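The in-place error is easy to trigger with an operation whose backward pass reuses its own output, such as sigmoid. A minimal repro and fix:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward reads the saved output y
y += 1                 # in-place edit invalidates that saved tensor

try:
    y.sum().backward()
except RuntimeError as e:
    print(f"Caught: {e}")  # "... modified by an inplace operation"

# Fix: use an out-of-place op so the saved output stays intact
x2 = torch.ones(3, requires_grad=True)
y2 = torch.sigmoid(x2)
y2 = y2 + 1
y2.sum().backward()    # works
```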
Ensure requires_grad is set:

```python
input_tensor = torch.randn(10, requires_grad=True)
```

Clone before in-place modification:

```python
x_modified = x.clone()
x_modified[0] = 0
```

Check whether a tensor has a gradient function:

```python
print(f"requires_grad: {tensor.requires_grad}")
print(f"grad_fn: {tensor.grad_fn}")
print(f"is_leaf: {tensor.is_leaf}")
```

Properly detach for logging/storage:

```python
logged_value = tensor.detach().cpu().numpy()
```
- DataLoader Problems
Error Message:
```
RuntimeError: DataLoader worker (pid X) is killed
BrokenPipeError: [Errno 32] Broken pipe
RuntimeError: Cannot re-initialize CUDA in forked subprocess
```
Causes:
- Too many workers
- Memory issues in workers
- CUDA operations before the DataLoader fork
- Shared memory issues
Solutions:
Reduce the number of workers:

```python
dataloader = DataLoader(dataset, batch_size=32, num_workers=2)
```

Use spawn instead of fork for CUDA:

```python
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)
```

Pin memory for faster GPU transfer (at the cost of more host memory):

```python
dataloader = DataLoader(dataset, pin_memory=True)
```

Increase shared memory for Docker:

```yaml
# docker-compose.yml
shm_size: '2gb'
```

Debug DataLoader issues in a single process:

```python
dataloader = DataLoader(dataset, num_workers=0)  # Single process for debugging
```

Use persistent workers to avoid respawning:

```python
dataloader = DataLoader(dataset, num_workers=4, persistent_workers=True)
```

Write a custom collate function with error handling:

```python
def safe_collate(batch):
    try:
        return torch.utils.data.dataloader.default_collate(batch)
    except Exception as e:
        print(f"Collate error: {e}")
        print(f"Batch: {batch}")
        raise
```
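The custom collate function plugs into DataLoader through the `collate_fn` argument. A minimal sketch with a synthetic TensorDataset (sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataloader import default_collate

def safe_collate(batch):
    # Wrap the default collate so a malformed sample is printed before re-raising
    try:
        return default_collate(batch)
    except Exception as e:
        print(f"Collate error: {e}")
        print(f"Batch: {batch}")
        raise

dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))
loader = DataLoader(dataset, batch_size=4, num_workers=0, collate_fn=safe_collate)

inputs, targets = next(iter(loader))
```

When a batch does fail, the printed sample usually points straight at the offending index or transform.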
- Dtype Mismatch
Error Message:
```
RuntimeError: expected scalar type Float but found Double
RuntimeError: expected scalar type Long but found Int
```
Causes:
- Mixing float32 and float64
- Wrong dtype for loss functions
- NumPy default dtype conflicts
Solutions:
Explicitly set the dtype:

```python
tensor = torch.tensor(data, dtype=torch.float32)
tensor = tensor.float()  # Convert to float32
tensor = tensor.long()   # Convert to int64
```

Set the default dtype globally:

```python
torch.set_default_dtype(torch.float32)
```

CrossEntropyLoss expects Long targets:

```python
targets = targets.long()
```

BCELoss expects Float targets:

```python
targets = targets.float()
```

Convert NumPy arrays properly:

```python
numpy_array = np.array([1.0, 2.0])  # float64 by default
tensor = torch.from_numpy(numpy_array).float()  # Convert to float32
```
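The float64 pitfall is visible with any float32 layer; a minimal repro (the shapes are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)  # parameters are float32 by default
x64 = torch.from_numpy(np.array([[1.0, 2.0]]))  # float64 (NumPy default)

try:
    layer(x64)  # dtype mismatch between input and weights
except RuntimeError as e:
    print(f"Caught: {e}")

out = layer(x64.float())  # fix: convert the input to float32
```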
Debugging Tools
- Anomaly Detection
Enable for debugging (disable in production - slow):

```python
torch.autograd.set_detect_anomaly(True)
```

Or use it as a context manager:

```python
with torch.autograd.detect_anomaly():
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
```
- Python Debugger
Insert a breakpoint:

```python
breakpoint()                 # Python 3.7+
import pdb; pdb.set_trace()  # Older Python
```

In Jupyter:

```python
from IPython.core.debugger import set_trace
set_trace()
```

Common pdb commands:

- n - next line
- s - step into
- c - continue
- p variable - print variable
- pp variable - pretty print
- l - list source code
- q - quit
- Memory Profiling
GPU memory summary:

```python
print(torch.cuda.memory_summary())
```

Memory snapshots:

```python
torch.cuda.memory._record_memory_history()
# ... run code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```

Profile memory allocation:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(input)
print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
```
- TensorBoard Integration
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')

# Log scalars
writer.add_scalar('Loss/train', loss.item(), epoch)
writer.add_scalar('Accuracy/val', accuracy, epoch)

# Log histograms of weights and gradients
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    if param.grad is not None:
        writer.add_histogram(f'gradients/{name}', param.grad, epoch)

# Log the model graph
writer.add_graph(model, input_tensor)

# Log images
writer.add_images('predictions', predicted_images, epoch)

writer.close()
```

Launch with: tensorboard --logdir=runs
- PyTorch Profiler
```python
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(input)

# Print a summary
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export for the Chrome trace viewer (chrome://tracing)
prof.export_chrome_trace("trace.json")

# Export stack traces (e.g. for flame graphs)
prof.export_stacks("profiler_stacks.txt", "self_cuda_time_total")
```
The Four Phases of PyTorch Debugging
Phase 1: Reproduce and Isolate
Set seeds for reproducibility:

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```

Create a minimal reproducible example: isolate the problematic component and test with synthetic data first.

```python
dummy_input = torch.randn(2, 3, 224, 224, device=device)
dummy_target = torch.randint(0, 10, (2,), device=device)

# Run a single forward/backward pass
model.train()
output = model(dummy_input)
loss = criterion(output, dummy_target)
loss.backward()
```
Phase 2: Validate Data Pipeline
Check the dataset:

```python
print(f"Dataset size: {len(dataset)}")
sample = dataset[0]
print(f"Sample type: {type(sample)}")
print(f"Sample shapes: {[s.shape if hasattr(s, 'shape') else type(s) for s in sample]}")
```

Check the DataLoader output:

```python
batch = next(iter(dataloader))
for i, item in enumerate(batch):
    print(f"Batch item {i}: shape={item.shape}, dtype={item.dtype}, device={item.device}")
```

Validate data ranges:

```python
inputs, targets = batch
print(f"Input range: [{inputs.min():.4f}, {inputs.max():.4f}]")
print(f"Input mean: {inputs.mean():.4f}, std: {inputs.std():.4f}")
print(f"Target unique values: {targets.unique()}")

# Check for data issues
assert not torch.isnan(inputs).any(), "NaN in inputs"
assert not torch.isinf(inputs).any(), "Inf in inputs"
```
Phase 3: Validate Model Architecture
Test with tiny data to check for bugs:

```python
def overfit_single_batch(model, batch, epochs=100):
    """Model should be able to overfit a single batch."""
    model.train()
    inputs, targets = batch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}: Loss = {loss.item():.6f}")

    # Loss should be very low if the model can learn
    assert loss.item() < 0.1, "Model cannot overfit single batch!"
```
Check model parameters:

```python
def inspect_model(model):
    total_params = 0
    trainable_params = 0
    for name, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
        print(f"{name}: shape={param.shape}, requires_grad={param.requires_grad}")
    print(f"\nTotal params: {total_params:,}")
    print(f"Trainable params: {trainable_params:,}")
```
Verify forward-pass shape transformations:

```python
def trace_shapes(model, input_shape):
    """Trace shapes through the model using forward hooks."""
    hooks = []
    shapes = []

    def hook(module, input, output):
        shapes.append({
            'module': module.__class__.__name__,
            'input': [i.shape for i in input if hasattr(i, 'shape')],
            'output': output.shape if hasattr(output, 'shape') else type(output)
        })

    for layer in model.modules():
        hooks.append(layer.register_forward_hook(hook))

    dummy = torch.randn(1, *input_shape)
    model(dummy)

    for h in hooks:
        h.remove()

    for s in shapes:
        print(s)
```
Phase 4: Validate Training Loop
Comprehensive training loop debugging:

```python
def debug_training_step(model, batch, criterion, optimizer):
    model.train()
    inputs, targets = batch

    # Check inputs
    print(f"Input shape: {inputs.shape}, dtype: {inputs.dtype}")
    print(f"Target shape: {targets.shape}, dtype: {targets.dtype}")

    # Zero gradients
    optimizer.zero_grad()

    # Forward and backward pass with anomaly detection
    with torch.autograd.detect_anomaly():
        outputs = model(inputs)
        print(f"Output shape: {outputs.shape}")
        print(f"Output range: [{outputs.min():.4f}, {outputs.max():.4f}]")

        # Check for NaN in outputs
        if torch.isnan(outputs).any():
            print("WARNING: NaN in outputs!")

        # Compute loss
        loss = criterion(outputs, targets)
        print(f"Loss: {loss.item():.6f}")
        if torch.isnan(loss):
            print("WARNING: NaN loss!")
            return

        # Backward pass
        loss.backward()

    # Check gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"{name}: grad_norm = {grad_norm:.6f}")
            if torch.isnan(param.grad).any():
                print(f"  WARNING: NaN gradient in {name}!")

    # Optimizer step
    optimizer.step()
    return loss.item()
```
Quick Reference Commands
Environment and Device Checks
PyTorch version:

```python
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
```

GPU information:

```python
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```
Gradient Checking
Numerical gradient check:

```python
from torch.autograd import gradcheck

# For custom autograd functions
input = torch.randn(10, requires_grad=True, dtype=torch.double)
test = gradcheck(my_function, input, eps=1e-6, atol=1e-4)
print(f"Gradient check passed: {test}")

# For modules
def check_module_gradients(module, input_shape):
    module = module.double()
    input = torch.randn(*input_shape, requires_grad=True, dtype=torch.double)
    return gradcheck(module, input, eps=1e-6, atol=1e-4)
```
Memory Management
Clear GPU memory:

```python
torch.cuda.empty_cache()
```

Memory stats:

```python
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Reset peak stats
torch.cuda.reset_peak_memory_stats()
```

Find memory leaks:

```python
import gc
gc.collect()
torch.cuda.empty_cache()
```
Model Inspection
Count parameters:

```python
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params:,}, Trainable: {trainable_params:,}")
```

Model summary (requires torchsummary):

```python
from torchsummary import summary
summary(model, input_size=(3, 224, 224))
```

Export the model to ONNX for visualization:

```python
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```
Tensor Debugging
Comprehensive tensor info:

```python
def tensor_info(t, name="tensor"):
    print(f"{name}:")
    print(f"  shape: {t.shape}")
    print(f"  dtype: {t.dtype}")
    print(f"  device: {t.device}")
    print(f"  requires_grad: {t.requires_grad}")
    print(f"  is_leaf: {t.is_leaf}")
    print(f"  grad_fn: {t.grad_fn}")
    print(f"  min: {t.min().item():.6f}")
    print(f"  max: {t.max().item():.6f}")
    print(f"  mean: {t.mean().item():.6f}")
    print(f"  std: {t.std().item():.6f}")
    print(f"  has_nan: {torch.isnan(t).any().item()}")
    print(f"  has_inf: {torch.isinf(t).any().item()}")
```
Best Practices Summary
- Always use explicit device placement - Never assume tensors are on the right device
- Use loss functions on logits - CrossEntropyLoss expects raw scores, not softmax output
- Register modules properly - Use nn.ModuleList/ModuleDict for parameter detection
- Avoid in-place operations - They can break autograd graphs
- Set seeds for reproducibility - Makes debugging deterministic
- Start with small data - Test overfitting on a single batch first
- Use anomaly detection - Enable during development, disable in production
- Monitor gradients - Check for NaN, Inf, and exploding/vanishing gradients
- Profile memory - Use memory_summary() to identify leaks
- Print shapes liberally - Most bugs are shape mismatches
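"Register modules properly" deserves a concrete check: submodules kept in a plain Python list are invisible to `parameters()`, so the optimizer silently trains nothing. A minimal sketch (class names here are illustrative):

```python
import torch.nn as nn

class Hidden(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: PyTorch does not register these layers
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: layers become proper submodules
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))

print(len(list(Hidden().parameters())))      # 0 - optimizer would train nothing
print(len(list(Registered().parameters())))  # 6 - 3 weights + 3 biases
```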
Resources
- UvA Deep Learning Notebooks - Debugging Guide
- PyTorch Lightning Debugging Guide
- PyTorch Performance Tuning Guide
- TorchRL Common Errors and Solutions
- Machine Learning Mastery - Debugging PyTorch