# TensorFlow Debugging Guide

This skill provides a systematic approach to debugging TensorFlow applications, covering common error patterns, debugging tools, and resolution strategies.

## Common Error Patterns
### Shape Mismatch Errors

**Symptoms:**

- `InvalidArgumentError: Incompatible shapes`
- `ValueError: Shapes (X,) and (Y,) are incompatible`
- Matrix multiplication failures
**Diagnostic Steps:**

```python
# Print shapes at key points
print(f"Input shape: {x.shape}")
print(f"Expected shape: {model.input_shape}")

# Use tf.debugging for assertions
tf.debugging.assert_shapes([
    (x, ('batch', 'features')),
    (y, ('batch', 'classes')),
])

# Enable eager execution for immediate shape inspection
tf.config.run_functions_eagerly(True)
```
**Common Causes:**

- Batch dimension mismatch (missing or extra dimension)
- Incorrect reshape operations
- Mismatched layer input/output dimensions
- Broadcasting issues with incompatible shapes
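Broadcasting bugs are especially sneaky because they often don't raise at all. A minimal sketch (shapes chosen purely for illustration):

```python
import tensorflow as tf

# A (32,) tensor minus a (32, 1) tensor silently broadcasts to (32, 32)
# instead of raising -- a classic source of wrong (but finite) losses.
y_true = tf.random.normal([32])      # shape (32,)
y_pred = tf.random.normal([32, 1])   # shape (32, 1)

diff = y_true - y_pred
print(diff.shape)                    # (32, 32) -- not what was intended

# Fix: align ranks explicitly before elementwise ops
diff_ok = y_true - tf.squeeze(y_pred, axis=-1)
print(diff_ok.shape)                 # (32,)
```

Because the broadcast succeeds, the bug typically shows up as a loss that is plausible-looking but wrong, which is why explicit shape assertions are worth adding.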
**Solutions:**

```python
# Expand dimensions if needed
x = tf.expand_dims(x, axis=0)  # Add batch dimension

# Reshape explicitly
x = tf.reshape(x, [-1, height, width, channels])

# Use tf.ensure_shape for runtime validation
x = tf.ensure_shape(x, [None, 224, 224, 3])
```
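Putting these pieces together, here is a small sketch of catching a missing batch dimension early and repairing it (the 224x224x3 image shape is just an example):

```python
import tensorflow as tf

def validate_batch(images):
    # Fails fast with a clear message instead of erroring deep inside a layer.
    tf.debugging.assert_shapes([(images, ('batch', 224, 224, 3))])
    return images

x = tf.random.normal([224, 224, 3])   # single image, batch dim missing
try:
    validate_batch(x)
except (ValueError, tf.errors.InvalidArgumentError):
    # The rank mismatch is caught at the assertion, so we can repair it here
    x = tf.expand_dims(x, axis=0)     # add the batch dimension

validate_batch(x)                     # now passes
print(x.shape)                        # (1, 224, 224, 3)
```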
### OOM (Out of Memory) Errors

**Symptoms:**

- `ResourceExhaustedError: OOM when allocating tensor`
- `CUDA_ERROR_OUT_OF_MEMORY`
- Training crashes after a few epochs
**Diagnostic Steps:**

```python
# Check GPU memory usage
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        details = tf.config.experimental.get_device_details(gpu)
        print(f"GPU: {gpu.name}, Details: {details}")

# Monitor tensor health during training (TensorBoard Debugger V2)
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000,
)
```
**Solutions:**

```python
# Enable memory growth (prevent TF from allocating all GPU memory)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Limit GPU memory
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]  # 4 GB
)

# Reduce batch size
BATCH_SIZE = 16  # Try smaller values

# Use gradient checkpointing for large models
# (recompute activations during the backward pass)

# Clear session between runs
tf.keras.backend.clear_session()

# Use mixed precision training
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```
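When a smaller batch size hurts convergence, gradient accumulation keeps the effective batch large while holding only a micro-batch in memory. A minimal sketch; the toy model, shapes, and step count are illustrative, not from the guide:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.build((None, 4))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()

# Accumulate gradients over 4 micro-batches of 8 samples each,
# then apply one update with an effective batch size of 32.
accum_steps = 4
accum = [tf.zeros_like(v) for v in model.trainable_variables]

for _ in range(accum_steps):
    x = tf.random.normal([8, 4])
    y = tf.random.normal([8, 2])
    with tf.GradientTape() as tape:
        # Divide so the accumulated gradient averages over micro-batches
        loss = loss_fn(y, model(x, training=True)) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accum = [a + g for a, g in zip(accum, grads)]

optimizer.apply_gradients(zip(accum, model.trainable_variables))
```

The trade-off is more optimizer-step latency for less peak activation memory; it pairs well with mixed precision.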
### NaN/Inf in Loss

**Symptoms:**

- Loss becomes `nan` or `inf` during training
- Model predictions are all NaN
- Gradient norm explodes
**Diagnostic Steps:**

```python
# Enable global numeric checking
tf.debugging.enable_check_numerics()

# Check a specific tensor for NaN/Inf
tf.debugging.check_numerics(tensor, "Tensor contains NaN or Inf")

# Use TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000,
)
```
**Common Causes:**

- Learning rate too high
- Exploding gradients
- Log of zero or negative numbers
- Division by zero
- Incorrect loss function for the data range
**Solutions:**

```python
# Reduce the learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Add gradient clipping (by norm or by value)
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# or
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)

# Use numerically stable operations
# Instead of tf.math.log(x):
tf.math.log(x + 1e-7)  # Add epsilon
# Instead of x / y:
tf.math.divide_no_nan(x, y)

# Add batch normalization
model.add(tf.keras.layers.BatchNormalization())

# Check data for NaN before training
assert not tf.reduce_any(tf.math.is_nan(train_data)).numpy()
```
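To make the "numerically stable operations" point concrete: computing cross-entropy from probabilities produces `inf` for confident predictions, while the fused logits-based op stays finite. A sketch with logit values chosen to force the failure:

```python
import tensorflow as tf

logits = tf.constant([[200.0, -200.0]])   # extreme but legal logits
labels = tf.constant([[0.0, 1.0]])

# Naive: softmax underflows to exactly 0, then log(0) = -inf
probs = tf.nn.softmax(logits)
naive = -tf.reduce_sum(labels * tf.math.log(probs), axis=-1)

# Stable: the fused op works in log-space internally
stable = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

print(naive.numpy())    # [inf]
print(stable.numpy())   # [400.] -- large but finite
```

In Keras, the equivalent fix is passing `from_logits=True` to the loss and dropping the final softmax activation from the model.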
### Gradient Issues

**Symptoms:**

- Vanishing gradients (weights not updating)
- Exploding gradients (loss becomes NaN)
- Training stalls, loss doesn't decrease
**Diagnostic Steps:**

```python
# Inspect gradients with GradientTape
with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)

gradients = tape.gradient(loss, model.trainable_variables)

for var, grad in zip(model.trainable_variables, gradients):
    if grad is not None:
        print(f"{var.name}: grad_norm={tf.norm(grad).numpy():.6f}")
    else:
        print(f"{var.name}: NO GRADIENT (disconnected)")

# Check for dead ReLUs: run a sub-model up to the layer of interest
sub_model = tf.keras.Model(model.input, model.layers[5].output)
activations = sub_model(x)
dead_fraction = tf.reduce_mean(tf.cast(activations <= 0, tf.float32))
```
**Solutions:**

```python
# For vanishing gradients:

# Use He initialization for ReLU networks
initializer = tf.keras.initializers.HeNormal()

# Use LeakyReLU instead of ReLU
model.add(tf.keras.layers.LeakyReLU(alpha=0.1))

# Add residual (skip) connections

# For exploding gradients:

# Apply gradient clipping
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

# Use proper weight initialization
initializer = tf.keras.initializers.GlorotUniform()
```
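The pieces above combine into a custom training step that logs the pre-clip global norm on every step, making exploding gradients visible before they become NaNs. A sketch with an illustrative toy model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu',
                          kernel_initializer=tf.keras.initializers.HeNormal()),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(x, y, clip_norm=5.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # global_norm is the norm *before* clipping -- log it to spot explosions
    grads, global_norm = tf.clip_by_global_norm(grads, clip_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, global_norm

x = tf.random.normal([32, 8])
y = tf.random.normal([32, 1])
loss, norm = train_step(x, y)
print(f"loss={float(loss):.4f}  pre-clip grad norm={float(norm):.4f}")
```

A steadily growing pre-clip norm is an early warning even while the loss still looks healthy.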
### GPU Not Detected

**Symptoms:**

- `tf.config.list_physical_devices('GPU')` returns an empty list
- Training runs on CPU (slow)
- CUDA errors on startup
**Diagnostic Steps:**

```python
# Check available devices
print("Physical devices:", tf.config.list_physical_devices())
print("GPU devices:", tf.config.list_physical_devices('GPU'))
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", bool(tf.config.list_physical_devices('GPU')))

# Check driver and CUDA versions via nvidia-smi
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)

# Verify the TensorFlow build
import tensorflow as tf
print(tf.__version__)
print(tf.sysconfig.get_build_info())
```
**Common Causes:**

- Wrong TensorFlow package (CPU-only build)
- CUDA/cuDNN version mismatch
- NVIDIA driver issues
- GPU not visible to the container (Docker)
**Solutions:**

```shell
# Install the correct TensorFlow GPU package
pip install tensorflow[and-cuda]   # TF 2.15+
# or, for older versions (the separate package was removed after TF 2.11):
pip install tensorflow-gpu

# For Docker, use the NVIDIA runtime
docker run --gpus all -it tensorflow/tensorflow:latest-gpu
```

Verify CUDA compatibility:

- TF 2.15: CUDA 12.x, cuDNN 8.9
- TF 2.14: CUDA 11.8, cuDNN 8.7
- TF 2.13: CUDA 11.8, cuDNN 8.6

```python
# Force GPU visibility (set before TensorFlow initializes)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Use the first GPU

# Verify the GPU is being used
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)
print(c.device)  # Should show GPU
```
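Beyond listing devices, you can ask TensorFlow to log where each op actually executes, which catches silent CPU fallback. A small sketch (in eager mode the flag can be toggled at any point):

```python
import tensorflow as tf

# Log the device of every op (output goes to the console/stderr)
tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])
c = tf.matmul(a, b)

print(c.device)   # ends in CPU:0 or GPU:0 depending on what is available
print(c.numpy())  # [[11.]]

tf.debugging.set_log_device_placement(False)
```

If ops you expect on the GPU are logged on the CPU, check for unsupported dtypes (e.g. float64 on some ops) or a CPU-only build.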
### SavedModel Loading Errors

**Symptoms:**

- `OSError: SavedModel file does not exist`
- `ValueError: Unknown layer` when loading
- Version compatibility errors
**Diagnostic Steps:**

```python
# Check SavedModel structure
import os
for root, dirs, files in os.walk('saved_model_dir'):
    for file in files:
        print(os.path.join(root, file))

# Verify model signatures
loaded = tf.saved_model.load('saved_model_dir')
print(list(loaded.signatures.keys()))
```
**Solutions:**

```python
# Save the model correctly
model.save('my_model')        # SavedModel format (TF <= 2.15 / Keras 2)
model.save('my_model.keras')  # Keras format (recommended in Keras 3 / TF 2.16+)

# Load with custom objects
custom_objects = {
    'CustomLayer': CustomLayer,
    'custom_loss': custom_loss,
}
model = tf.keras.models.load_model('my_model', custom_objects=custom_objects)

# For version mismatches, save weights only...
model.save_weights('model_weights.weights.h5')
# ...then rebuild the model architecture and load the weights
new_model.load_weights('model_weights.weights.h5')
```
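For recurring `Unknown layer` errors, registering the custom class once avoids threading `custom_objects` through every load call. A sketch using a hypothetical `Scale` layer (not part of the guide above):

```python
import tensorflow as tf

@tf.keras.utils.register_keras_serializable(package='Custom')
class Scale(tf.keras.layers.Layer):
    """Hypothetical custom layer that multiplies inputs by a constant."""
    def __init__(self, factor=2.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, inputs):
        return inputs * self.factor

    def get_config(self):
        # Required so the layer can be reconstructed on load
        return {**super().get_config(), 'factor': self.factor}

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), Scale(3.0)])
model.save('/tmp/scale_model.keras')

# No custom_objects needed -- the registry resolves 'Custom>Scale'
reloaded = tf.keras.models.load_model('/tmp/scale_model.keras')
print(reloaded(tf.ones((1, 4))).numpy())
```

Implementing `get_config` is the other half of the fix: registration only helps if the layer can be rebuilt from its config.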
### Data Pipeline Issues

**Symptoms:**

- `InvalidArgumentError` during training
- Slow training (input bottleneck)
- Memory leaks during data loading
**Diagnostic Steps:**

```python
# Profile the input pipeline
import tensorflow as tf

tf.profiler.experimental.start('/tmp/logdir')
# ... run training ...
tf.profiler.experimental.stop()

# Check the dataset element spec
print(dataset.element_spec)

# Iterate and inspect one batch
for batch in dataset.take(1):
    print(f"Batch shape: {batch[0].shape}")
    print(f"Dtype: {batch[0].dtype}")
```
**Solutions:**

```python
# Optimize the pipeline
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.cache()                     # Cache after expensive operations
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # Overlap data loading with training

# Use parallel processing
dataset = dataset.map(
    preprocess_fn,
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Handle variable-length sequences
dataset = dataset.padded_batch(32, padded_shapes=([None], []))
```
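To make the variable-length case concrete, here is a self-contained sketch with a toy generator; each batch is padded only to the longest sequence it contains:

```python
import tensorflow as tf

def gen():
    for seq in ([1, 2], [3, 4, 5], [6], [7, 8, 9, 10]):
        yield tf.constant(seq, dtype=tf.int32)

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)
dataset = dataset.padded_batch(2, padded_shapes=[None])

for batch in dataset:
    print(batch.shape)
# (2, 3): [1 2] is padded to the length of [3 4 5]
# (2, 4): [6] is padded to the length of [7 8 9 10]
```

Forgetting `padded_batch` (or a `[None]` in the element spec) for ragged inputs is a frequent cause of the `InvalidArgumentError` symptom above.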
## Debugging Tools

### tf.debugging Module

```python
# Shape assertions
tf.debugging.assert_shapes([
    (x, ('N', 'H', 'W', 'C')),
    (y, ('N', 'num_classes')),
])

# Value assertions
tf.debugging.assert_non_negative(x)
tf.debugging.assert_near(x, y, rtol=1e-5)
tf.debugging.assert_equal(x.shape, expected_shape)

# Numeric checking
tf.debugging.check_numerics(tensor, "check: tensor contains NaN/Inf")
tf.debugging.enable_check_numerics()  # Global check

# Type assertions
tf.debugging.assert_type(x, tf.float32)
```
### TensorBoard

```python
# Set up TensorBoard logging
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch='500,520',  # Profile batches 500-520
)

model.fit(
    x_train, y_train,
    epochs=5,
    callbacks=[tensorboard_callback],
)
```

Launch TensorBoard:

```shell
tensorboard --logdir logs/fit
```
### TensorBoard Debugger V2

```python
# Enable debug info dumping
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000,
)

# Run training...
model.fit(x_train, y_train, epochs=5)
```

View in TensorBoard:

```shell
tensorboard --logdir /tmp/tfdbg2_logdir
```
### Eager Execution Debugging

```python
# Force eager execution of tf.functions (eager is the default in TF 2.x)
tf.config.run_functions_eagerly(True)

# Debugging inside a @tf.function
@tf.function
def my_function(x):
    tf.print("Debug:", x)  # Works in graph mode
    # Use tf.debugging.assert_* for runtime checks
    tf.debugging.assert_positive(x)
    return x * 2

# To step through a @tf.function with a Python debugger, temporarily remove
# the decorator or call tf.config.run_functions_eagerly(True) first.
```
### tf.print() for Graph Mode

```python
@tf.function
def compute(x):
    # Regular print() won't run per-call in graph mode
    tf.print("Shape:", tf.shape(x))
    tf.print("Values:", x, summarize=-1)  # -1 prints all values
    tf.print("Stats - min:", tf.reduce_min(x),
             "max:", tf.reduce_max(x),
             "mean:", tf.reduce_mean(x))
    return x * 2
```
### Memory Profiler

```python
# Let GPU memory grow on demand so usage is observable
tf.config.experimental.set_memory_growth(gpu, True)

# Use the TensorFlow Profiler
with tf.profiler.experimental.Profile('/tmp/logdir'):
    model.fit(x_train, y_train, epochs=1)

# Check memory info
tf.config.experimental.get_memory_info('GPU:0')
# Returns: {'current': <bytes>, 'peak': <bytes>}
```
## The Four Phases of TensorFlow Debugging

### Phase 1: Reproduce and Isolate

Create a minimal reproduction:

```python
# Minimal test case
import tensorflow as tf

# Smallest possible model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(5,)),
])

# Synthetic data
x = tf.random.normal((32, 5))
y = tf.random.normal((32, 10))

model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=1)

# Enable eager execution for line-by-line debugging
tf.config.run_functions_eagerly(True)

# Add assertions at key points
def debug_forward_pass(model, x):
    for i, layer in enumerate(model.layers):
        x = layer(x)
        tf.debugging.check_numerics(x, f"Layer {i} output")
        print(f"Layer {i}: {x.shape}, "
              f"range=[{tf.reduce_min(x).numpy():.3f}, {tf.reduce_max(x).numpy():.3f}]")
    return x
```
### Phase 2: Analyze and Understand

```python
# Inspect tensor shapes throughout the pipeline
def trace_shapes(model, x):
    shapes = []
    for layer in model.layers:
        x = layer(x)
        shapes.append((layer.name, x.shape))
    return shapes

# Check gradient flow
def analyze_gradients(model, x, y, loss_fn):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    analysis = []
    for var, grad in zip(model.trainable_variables, grads):
        if grad is None:
            analysis.append((var.name, "NONE - disconnected"))
        else:
            norm = tf.norm(grad).numpy()
            analysis.append((var.name, f"norm={norm:.6f}"))
    return analysis

# Profile performance with tf.profiler
tf.profiler.experimental.start('/tmp/logdir')
model.fit(x, y, epochs=1)
tf.profiler.experimental.stop()
```
### Phase 3: Fix and Verify

Apply targeted fixes based on the diagnosis:

- Shape issues: add explicit reshapes and assertions
- NaN issues: add epsilon, reduce the learning rate, clip gradients
- OOM issues: reduce batch size, enable memory growth
- GPU issues: check CUDA compatibility, install the correct packages

Verify the fix doesn't break other functionality:

```python
# Run comprehensive tests
def test_model_components():
    # Test forward pass
    output = model(sample_input)
    assert output.shape == expected_shape

    # Test backward pass
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x), y)
    grads = tape.gradient(loss, model.trainable_variables)
    assert all(g is not None for g in grads)

    # Test save/load
    model.save('/tmp/test_model')
    loaded = tf.keras.models.load_model('/tmp/test_model')
    assert tf.reduce_all(model(x) == loaded(x))
```
### Phase 4: Prevent and Document

Add permanent assertions for critical invariants:

```python
class RobustModel(tf.keras.Model):
    def call(self, x, training=False):
        tf.debugging.assert_shapes([(x, ('batch', 'features'))])
        x = self.layer1(x)
        tf.debugging.check_numerics(x, "After layer1")
        return self.output_layer(x)
```

Set up monitoring callbacks:

```python
import math

class NanCallback(tf.keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        # logs values are Python floats, so use math.isnan here
        if logs and math.isnan(logs.get('loss', 0)):
            self.model.stop_training = True
            raise ValueError(f"NaN detected at batch {batch}")
```

Document the issue and solution:

```python
# BUGFIX: Shape mismatch in attention layer
# Issue: Input was (batch, seq, features) but attention expected
#        (batch, heads, seq, features)
# Solution: Added reshape before the attention layer
x = tf.reshape(x, [batch_size, num_heads, seq_len, -1])
```
## Quick Reference Commands

### Device and Configuration

```python
# List devices
tf.config.list_physical_devices()
tf.config.list_physical_devices('GPU')

# GPU memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Force CPU execution
with tf.device('/CPU:0'):
    result = model(x)

# Check if built with CUDA
tf.test.is_built_with_cuda()
```
### Debugging Assertions

```python
# Numeric checks
tf.debugging.check_numerics(tensor, message)
tf.debugging.enable_check_numerics()

# Shape checks
tf.debugging.assert_shapes([(tensor, shape_tuple)])
tf.ensure_shape(tensor, shape)

# Value checks
tf.debugging.assert_positive(tensor)
tf.debugging.assert_non_negative(tensor)
tf.debugging.assert_near(a, b, rtol=1e-5)
tf.debugging.assert_equal(a, b)
tf.debugging.assert_less(a, b)
tf.debugging.assert_greater(a, b)
```
### Profiling and Logging

```python
# TensorBoard logging
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1,
)

# Start/stop the profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... code ...
tf.profiler.experimental.stop()

# Debug info for TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2',
    tensor_debug_mode='FULL_HEALTH',
)
```
### Memory Management

```python
# Clear session
tf.keras.backend.clear_session()

# Get memory info
tf.config.experimental.get_memory_info('GPU:0')

# Mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```
### Gradient Debugging

```python
# Inspect gradients
with tf.GradientTape() as tape:
    loss = compute_loss()
gradients = tape.gradient(loss, model.trainable_variables)

# Clip gradients
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

# Check for None gradients (disconnected graph)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is None:
        print(f"Warning: {var.name} has no gradient")
```
Version Compatibility Reference
TensorFlow Python CUDA cuDNN
2.16.x 3.9-3.12 12.3 8.9
2.15.x 3.9-3.11 12.2 8.9
2.14.x 3.9-3.11 11.8 8.7
2.13.x 3.8-3.11 11.8 8.6
2.12.x 3.8-3.11 11.8 8.6
## Additional Resources

- TensorFlow Debugging Guide
- TensorBoard Debugger V2
- GPU Performance Analysis
- Profiler Guide