PyTorch Debugging Guide
This guide provides systematic approaches to debugging PyTorch models, from common tensor errors to complex training issues. Unless noted otherwise, snippets assume `import torch` and `import torch.nn as nn`.
Common Error Patterns
- CUDA Out of Memory (OOM)
Error Message:
```
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
```
Causes:
- Batch size too large for GPU memory
- Accumulating gradients without clearing
- Storing tensors on GPU unnecessarily
- Memory leaks from not detaching tensors
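The last cause is easy to reproduce: storing the loss tensor itself (rather than a detached Python number) keeps each iteration's autograd graph alive. A minimal CPU-only sketch of the leak and its fix:

```python
import torch

w = torch.randn(100, requires_grad=True)

# Leaky pattern: each stored loss tensor still references its graph,
# so intermediate buffers from every iteration stay in memory.
leaky_history = []
for _ in range(3):
    loss = (w ** 2).sum()
    leaky_history.append(loss)  # keeps the graph alive

# Fixed pattern: detach to a plain Python float before storing.
fixed_history = []
for _ in range(3):
    loss = (w ** 2).sum()
    fixed_history.append(loss.detach().item())  # graph can be freed
```

On the GPU, the leaky version shows up as `memory_allocated()` creeping upward across epochs.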
Solutions:
Check current memory usage:

```python
print(torch.cuda.memory_summary(device=None, abbreviated=False))
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```

Clear the cache:

```python
torch.cuda.empty_cache()
```

Reduce the batch size:

```python
batch_size = batch_size // 2
```

Use gradient checkpointing for large models:

```python
from torch.utils.checkpoint import checkpoint

output = checkpoint(self.heavy_layer, input)
```

Use mixed precision training:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Detach tensors when storing them for logging:

```python
logged_loss = loss.detach().cpu().item()
```

Use gradient accumulation instead of large batches:

```python
accumulation_steps = 4
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
- Tensor Size/Shape Mismatch
Error Message:
```
RuntimeError: size mismatch, m1: [32 x 512], m2: [256 x 10]
RuntimeError: The size of tensor a (64) must match the size of tensor b (32)
```
Causes:
- Incorrect layer dimensions
- Wrong tensor reshaping
- Mismatched batch sizes
- Incorrect input preprocessing
Solutions:
Debug by printing shapes at each layer:

```python
class DebugModel(nn.Module):
    def forward(self, x):
        print(f"Input shape: {x.shape}")
        x = self.layer1(x)
        print(f"After layer1: {x.shape}")
        x = self.layer2(x)
        print(f"After layer2: {x.shape}")
        return x
```

Add shape assertions as contracts:

```python
def forward(self, x):
    assert x.dim() == 4, f"Expected 4D input, got {x.dim()}D"
    assert x.shape[1] == 3, f"Expected 3 channels, got {x.shape[1]}"
    # ... rest of forward pass
```

Use einops for clearer reshaping:

```python
from einops import rearrange

x = rearrange(x, 'b c h w -> b (c h w)')
```

Calculate dimensions programmatically:

```python
def _get_conv_output_size(self, shape):
    with torch.no_grad():
        dummy = torch.zeros(1, *shape)
        output = self.conv_layers(dummy)
        return output.numel()
```
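One way to wire a helper like `_get_conv_output_size` into a model's `__init__` (the layer sizes here are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, input_shape=(3, 32, 32), num_classes=10):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(input_shape[0], 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Size the classifier from a dummy forward pass instead of
        # hand-computing the flattened dimension.
        flat_size = self._get_conv_output_size(input_shape)
        self.fc = nn.Linear(flat_size, num_classes)

    def _get_conv_output_size(self, shape):
        with torch.no_grad():
            dummy = torch.zeros(1, *shape)
            return self.conv_layers(dummy).numel()

    def forward(self, x):
        x = self.conv_layers(x)
        return self.fc(x.flatten(1))
```

This keeps the classifier's input size correct even when you change kernel sizes, strides, or input resolution.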
- NaN in Gradients/Loss
Error Message:
```
Loss is nan
RuntimeError: Function 'XXXBackward' returned nan values
```
Causes:
- Learning rate too high
- Numerical instability in operations
- Division by zero
- Log of zero or negative numbers
- Exploding gradients
Solutions:
Enable anomaly detection to find the source:

```python
torch.autograd.set_detect_anomaly(True)
```

Check for NaN in tensors:

```python
def check_nan(tensor, name="tensor"):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
        print(f"Shape: {tensor.shape}")
        print(f"NaN count: {torch.isnan(tensor).sum()}")
        raise ValueError(f"NaN in {name}")
```

Add epsilon for numerical stability:

```python
eps = 1e-8
log_probs = torch.log(probs + eps)
normalized = x / (x.norm() + eps)
```

Clip gradients:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```

Check gradients after backward:

```python
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")
            grad_norm = param.grad.norm()
            print(f"{name}: grad_norm = {grad_norm:.4f}")
```

Use numerically stable loss functions:

```python
# BAD: nn.CrossEntropyLoss on softmax output
# GOOD: nn.CrossEntropyLoss on logits (raw scores)
loss = nn.CrossEntropyLoss()(logits, targets)  # Not softmax(logits)
```
- Device Mismatch (CPU/GPU)
Error Message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```
Causes:
- Model on GPU, data on CPU (or vice versa)
- Loading a model saved on a different device
- Creating new tensors without specifying the device
- Mixing tensors from different GPUs
Solutions:
Always explicitly move to a device:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = inputs.to(device)
targets = targets.to(device)
```

Load a model with map_location:

```python
model.load_state_dict(torch.load('model.pt', map_location=device))
```

Create tensors on the correct device:

```python
new_tensor = torch.zeros(10, device=device)
new_tensor = torch.zeros_like(existing_tensor)  # Same device as existing
```

Check the device of all model parameters:

```python
def check_model_device(model):
    devices = {p.device for p in model.parameters()}
    print(f"Model parameters on devices: {devices}")
```

Debug device issues:

```python
print(f"Model device: {next(model.parameters()).device}")
print(f"Input device: {inputs.device}")
print(f"Target device: {targets.device}")
```
- Autograd Graph Issues
Error Message:
```
RuntimeError: Trying to backward through the graph a second time
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```
Causes:
- Calling backward() twice without retain_graph
- In-place operations on tensors requiring gradients
- Detaching tensors incorrectly
- Not enabling gradients on input tensors
Solutions:
For multiple backward passes:

```python
loss.backward(retain_graph=True)  # Use sparingly - memory intensive
```

Avoid in-place operations on tensors with gradients:

```python
# BAD:
x += 1
x[0] = 0
x.add_(1)

# GOOD:
x = x + 1
x = x.clone()
x[0] = 0
```
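The in-place error is easy to trigger with an operation whose backward pass reuses its own output, such as sigmoid. A minimal repro and fix:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward reads the saved output y
y += 1                 # in-place edit invalidates that saved tensor

try:
    y.sum().backward()
except RuntimeError as e:
    print(f"Caught: {e}")  # "... modified by an inplace operation"

# Fix: use an out-of-place op so the saved output stays intact
x2 = torch.ones(3, requires_grad=True)
y2 = torch.sigmoid(x2)
y2 = y2 + 1
y2.sum().backward()    # works
```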
Ensure requires_grad is set:

```python
input_tensor = torch.randn(10, requires_grad=True)
```

Clone before in-place modification:

```python
x_modified = x.clone()
x_modified[0] = 0
```

Check whether a tensor has a gradient function:

```python
print(f"requires_grad: {tensor.requires_grad}")
print(f"grad_fn: {tensor.grad_fn}")
print(f"is_leaf: {tensor.is_leaf}")
```

Properly detach for logging/storage:

```python
logged_value = tensor.detach().cpu().numpy()
```
- DataLoader Problems
Error Message:
```
RuntimeError: DataLoader worker (pid X) is killed
BrokenPipeError: [Errno 32] Broken pipe
RuntimeError: Cannot re-initialize CUDA in forked subprocess
```
Causes:
- Too many workers
- Memory issues in workers
- CUDA operations before the DataLoader fork
- Shared memory issues
Solutions:
Reduce the number of workers:

```python
dataloader = DataLoader(dataset, batch_size=32, num_workers=2)
```

Use spawn instead of fork for CUDA:

```python
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)
```

Pin memory for faster GPU transfer (at the cost of more host memory):

```python
dataloader = DataLoader(dataset, pin_memory=True)
```

Increase shared memory for Docker:

```yaml
# docker-compose.yml
shm_size: '2gb'
```

Debug DataLoader issues in a single process:

```python
dataloader = DataLoader(dataset, num_workers=0)  # Single process for debugging
```

Use persistent workers to avoid respawning:

```python
dataloader = DataLoader(dataset, num_workers=4, persistent_workers=True)
```

Write a custom collate function with error handling:

```python
def safe_collate(batch):
    try:
        return torch.utils.data.dataloader.default_collate(batch)
    except Exception as e:
        print(f"Collate error: {e}")
        print(f"Batch: {batch}")
        raise
```
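The custom collate function plugs into DataLoader through the `collate_fn` argument. A minimal sketch with a synthetic TensorDataset (sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataloader import default_collate

def safe_collate(batch):
    # Wrap the default collate so a malformed sample is printed before re-raising
    try:
        return default_collate(batch)
    except Exception as e:
        print(f"Collate error: {e}")
        print(f"Batch: {batch}")
        raise

dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))
loader = DataLoader(dataset, batch_size=4, num_workers=0, collate_fn=safe_collate)

inputs, targets = next(iter(loader))
```

When a batch does fail, the printed sample usually points straight at the offending index or transform.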
- Dtype Mismatch
Error Message:
```
RuntimeError: expected scalar type Float but found Double
RuntimeError: expected scalar type Long but found Int
```
Causes:
- Mixing float32 and float64
- Wrong dtype for loss functions
- NumPy default dtype conflicts
Solutions:
Explicitly set the dtype:

```python
tensor = torch.tensor(data, dtype=torch.float32)
tensor = tensor.float()  # Convert to float32
tensor = tensor.long()   # Convert to int64
```

Set the default dtype globally:

```python
torch.set_default_dtype(torch.float32)
```

CrossEntropyLoss expects Long targets:

```python
targets = targets.long()
```

BCELoss expects Float targets:

```python
targets = targets.float()
```

Convert NumPy arrays properly:

```python
numpy_array = np.array([1.0, 2.0])  # float64 by default
tensor = torch.from_numpy(numpy_array).float()  # Convert to float32
```
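The float64 pitfall is visible with any float32 layer; a minimal repro (the shapes are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)  # parameters are float32 by default
x64 = torch.from_numpy(np.array([[1.0, 2.0]]))  # float64 (NumPy default)

try:
    layer(x64)  # dtype mismatch between input and weights
except RuntimeError as e:
    print(f"Caught: {e}")

out = layer(x64.float())  # fix: convert the input to float32
```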
Debugging Tools
- Anomaly Detection
Enable for debugging (disable in production - slow):

```python
torch.autograd.set_detect_anomaly(True)
```

Or use it as a context manager:

```python
with torch.autograd.detect_anomaly():
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
```
- Python Debugger
Insert a breakpoint:

```python
breakpoint()                 # Python 3.7+
import pdb; pdb.set_trace()  # Older Python
```

In Jupyter:

```python
from IPython.core.debugger import set_trace
set_trace()
```

Common pdb commands:

- n - next line
- s - step into
- c - continue
- p variable - print variable
- pp variable - pretty print
- l - list source code
- q - quit
- Memory Profiling
GPU memory summary:

```python
print(torch.cuda.memory_summary())
```

Memory snapshots:

```python
torch.cuda.memory._record_memory_history()
# ... run code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```

Profile memory allocation:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(input)
print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
```
- TensorBoard Integration
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')

# Log scalars
writer.add_scalar('Loss/train', loss.item(), epoch)
writer.add_scalar('Accuracy/val', accuracy, epoch)

# Log histograms of weights and gradients
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    if param.grad is not None:
        writer.add_histogram(f'gradients/{name}', param.grad, epoch)

# Log the model graph
writer.add_graph(model, input_tensor)

# Log images
writer.add_images('predictions', predicted_images, epoch)

writer.close()
```

Launch with: tensorboard --logdir=runs
- PyTorch Profiler
```python
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(input)

# Print a summary
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export for the Chrome trace viewer (chrome://tracing)
prof.export_chrome_trace("trace.json")

# Export stack traces (e.g. for flame graphs)
prof.export_stacks("profiler_stacks.txt", "self_cuda_time_total")
```
The Four Phases of PyTorch Debugging
Phase 1: Reproduce and Isolate
Set seeds for reproducibility:

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```

Create a minimal reproducible example: isolate the problematic component and test with synthetic data first.

```python
dummy_input = torch.randn(2, 3, 224, 224, device=device)
dummy_target = torch.randint(0, 10, (2,), device=device)

# Run a single forward/backward pass
model.train()
output = model(dummy_input)
loss = criterion(output, dummy_target)
loss.backward()
```
Phase 2: Validate Data Pipeline
Check the dataset:

```python
print(f"Dataset size: {len(dataset)}")
sample = dataset[0]
print(f"Sample type: {type(sample)}")
print(f"Sample shapes: {[s.shape if hasattr(s, 'shape') else type(s) for s in sample]}")
```

Check the DataLoader output:

```python
batch = next(iter(dataloader))
for i, item in enumerate(batch):
    print(f"Batch item {i}: shape={item.shape}, dtype={item.dtype}, device={item.device}")
```

Validate data ranges:

```python
inputs, targets = batch
print(f"Input range: [{inputs.min():.4f}, {inputs.max():.4f}]")
print(f"Input mean: {inputs.mean():.4f}, std: {inputs.std():.4f}")
print(f"Target unique values: {targets.unique()}")

# Check for data issues
assert not torch.isnan(inputs).any(), "NaN in inputs"
assert not torch.isinf(inputs).any(), "Inf in inputs"
```
Phase 3: Validate Model Architecture
Test with tiny data to check for bugs:

```python
def overfit_single_batch(model, batch, epochs=100):
    """Model should be able to overfit a single batch."""
    model.train()
    inputs, targets = batch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}: Loss = {loss.item():.6f}")

    # Loss should be very low if the model can learn
    assert loss.item() < 0.1, "Model cannot overfit single batch!"
```
Check model parameters:

```python
def inspect_model(model):
    total_params = 0
    trainable_params = 0
    for name, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
        print(f"{name}: shape={param.shape}, requires_grad={param.requires_grad}")
    print(f"\nTotal params: {total_params:,}")
    print(f"Trainable params: {trainable_params:,}")
```
Verify forward-pass shape transformations:

```python
def trace_shapes(model, input_shape):
    """Trace shapes through the model using forward hooks."""
    hooks = []
    shapes = []

    def hook(module, input, output):
        shapes.append({
            'module': module.__class__.__name__,
            'input': [i.shape for i in input if hasattr(i, 'shape')],
            'output': output.shape if hasattr(output, 'shape') else type(output)
        })

    for layer in model.modules():
        hooks.append(layer.register_forward_hook(hook))

    dummy = torch.randn(1, *input_shape)
    model(dummy)

    for h in hooks:
        h.remove()

    for s in shapes:
        print(s)
```
Phase 4: Validate Training Loop
Comprehensive training loop debugging:

```python
def debug_training_step(model, batch, criterion, optimizer):
    model.train()
    inputs, targets = batch

    # Check inputs
    print(f"Input shape: {inputs.shape}, dtype: {inputs.dtype}")
    print(f"Target shape: {targets.shape}, dtype: {targets.dtype}")

    # Zero gradients
    optimizer.zero_grad()

    # Forward and backward pass with anomaly detection
    with torch.autograd.detect_anomaly():
        outputs = model(inputs)
        print(f"Output shape: {outputs.shape}")
        print(f"Output range: [{outputs.min():.4f}, {outputs.max():.4f}]")

        # Check for NaN in outputs
        if torch.isnan(outputs).any():
            print("WARNING: NaN in outputs!")

        # Compute loss
        loss = criterion(outputs, targets)
        print(f"Loss: {loss.item():.6f}")
        if torch.isnan(loss):
            print("WARNING: NaN loss!")
            return

        # Backward pass
        loss.backward()

    # Check gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"{name}: grad_norm = {grad_norm:.6f}")
            if torch.isnan(param.grad).any():
                print(f"  WARNING: NaN gradient in {name}!")

    # Optimizer step
    optimizer.step()
    return loss.item()
```
Quick Reference Commands
Environment and Device Checks
PyTorch version:

```python
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
```

GPU information:

```python
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```
Gradient Checking
Numerical gradient check:

```python
from torch.autograd import gradcheck

# For custom autograd functions
input = torch.randn(10, requires_grad=True, dtype=torch.double)
test = gradcheck(my_function, input, eps=1e-6, atol=1e-4)
print(f"Gradient check passed: {test}")

# For modules
def check_module_gradients(module, input_shape):
    module = module.double()
    input = torch.randn(*input_shape, requires_grad=True, dtype=torch.double)
    return gradcheck(module, input, eps=1e-6, atol=1e-4)
```
Memory Management
Clear GPU memory:

```python
torch.cuda.empty_cache()
```

Memory stats:

```python
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Reset peak stats
torch.cuda.reset_peak_memory_stats()
```

Find memory leaks:

```python
import gc
gc.collect()
torch.cuda.empty_cache()
```
Model Inspection
Count parameters:

```python
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params:,}, Trainable: {trainable_params:,}")
```

Model summary (requires torchsummary):

```python
from torchsummary import summary
summary(model, input_size=(3, 224, 224))
```

Export the model to ONNX for visualization:

```python
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```
Tensor Debugging
Comprehensive tensor info:

```python
def tensor_info(t, name="tensor"):
    print(f"{name}:")
    print(f"  shape: {t.shape}")
    print(f"  dtype: {t.dtype}")
    print(f"  device: {t.device}")
    print(f"  requires_grad: {t.requires_grad}")
    print(f"  is_leaf: {t.is_leaf}")
    print(f"  grad_fn: {t.grad_fn}")
    print(f"  min: {t.min().item():.6f}")
    print(f"  max: {t.max().item():.6f}")
    print(f"  mean: {t.mean().item():.6f}")
    print(f"  std: {t.std().item():.6f}")
    print(f"  has_nan: {torch.isnan(t).any().item()}")
    print(f"  has_inf: {torch.isinf(t).any().item()}")
```
Best Practices Summary
- Always use explicit device placement - Never assume tensors are on the right device
- Use loss functions on logits - CrossEntropyLoss expects raw scores, not softmax output
- Register modules properly - Use nn.ModuleList/ModuleDict for parameter detection
- Avoid in-place operations - They can break autograd graphs
- Set seeds for reproducibility - Makes debugging deterministic
- Start with small data - Test overfitting on a single batch first
- Use anomaly detection - Enable during development, disable in production
- Monitor gradients - Check for NaN, Inf, and exploding/vanishing gradients
- Profile memory - Use memory_summary() to identify leaks
- Print shapes liberally - Most bugs are shape mismatches
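"Register modules properly" deserves a concrete check: submodules kept in a plain Python list are invisible to `parameters()`, so the optimizer silently trains nothing. A minimal sketch (class names here are illustrative):

```python
import torch.nn as nn

class Hidden(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: PyTorch does not register these layers
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: layers become proper submodules
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))

print(len(list(Hidden().parameters())))      # 0 - optimizer would train nothing
print(len(list(Registered().parameters())))  # 6 - 3 weights + 3 biases
```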
Resources
- UvA Deep Learning Notebooks - Debugging Guide
- PyTorch Lightning Debugging Guide
- PyTorch Performance Tuning Guide
- TorchRL Common Errors and Solutions
- Machine Learning Mastery - Debugging PyTorch