TensorFlow Debugging Guide
This skill provides a systematic approach to debugging TensorFlow applications, covering common error patterns, debugging tools, and resolution strategies.

Common Error Patterns

  1. Shape Mismatch Errors

Symptoms:

  • InvalidArgumentError: Incompatible shapes

  • ValueError: Shapes (X,) and (Y,) are incompatible

  • Matrix multiplication failures

Diagnostic Steps:

```python
# Print shapes at key points
print(f"Input shape: {x.shape}")
print(f"Expected shape: {model.input_shape}")

# Use tf.debugging for assertions
tf.debugging.assert_shapes([
    (x, ('batch', 'features')),
    (y, ('batch', 'classes')),
])

# Enable eager execution for immediate shape inspection
tf.config.run_functions_eagerly(True)
```

Common Causes:

  • Batch dimension mismatch (missing or extra dimension)

  • Incorrect reshape operations

  • Mismatched layer input/output dimensions

  • Broadcasting issues with incompatible shapes
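
The causes above can be reproduced in a few lines. A minimal sketch (hypothetical shapes) showing how eager execution surfaces a broadcasting mismatch at the exact line it occurs:

```python
import tensorflow as tf

a = tf.ones((2, 3))
b = tf.ones((4, 3))
try:
    c = a + b  # axis 0 mismatch: 2 vs 4 cannot broadcast
except tf.errors.InvalidArgumentError:
    print("Incompatible shapes:", a.shape, "vs", b.shape)
```

In graph mode the same mismatch may only surface deep inside a traced function, which is why enabling eager execution is the first diagnostic step.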

Solutions:

```python
# Expand dimensions if needed
x = tf.expand_dims(x, axis=0)  # Add batch dimension

# Reshape explicitly
x = tf.reshape(x, [-1, height, width, channels])

# Use tf.ensure_shape for runtime validation
x = tf.ensure_shape(x, [None, 224, 224, 3])
```

  2. OOM (Out of Memory) Errors

Symptoms:

  • ResourceExhaustedError: OOM when allocating tensor

  • CUDA_ERROR_OUT_OF_MEMORY

  • Training crashes after a few epochs

Diagnostic Steps:

```python
# Check GPU memory usage
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    details = tf.config.experimental.get_device_details(gpu)
    print(f"GPU: {gpu.name}, Details: {details}")

# Monitor memory during training with TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000,
)
```

Solutions:

```python
# Enable memory growth (prevent TF from allocating all GPU memory)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Limit GPU memory (here: 4 GB)
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
)

# Reduce batch size
BATCH_SIZE = 16  # Try smaller values

# Clear session between runs
tf.keras.backend.clear_session()

# Use mixed precision training
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```

For large models, use gradient checkpointing (recompute activations during the backward pass instead of storing them).
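
Gradient checkpointing has no single Keras switch; one sketch uses `tf.recompute_grad` (the block and sizes here are placeholders, not from the original guide):

```python
import tensorflow as tf

# Placeholder block whose intermediate activations we don't want to keep
block = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
])
block(tf.zeros((1, 32)))  # build variables before wrapping

# tf.recompute_grad re-runs the forward pass during backprop
# instead of storing activations: extra compute, lower peak memory
checkpointed = tf.recompute_grad(lambda t: block(t))

x = tf.random.normal((8, 32))
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(checkpointed(x))
grads = tape.gradient(loss, block.trainable_variables)
```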

  3. NaN/Inf in Loss

Symptoms:

  • Loss becomes nan or inf during training

  • Model predictions are all NaN

  • Gradient norm explodes

Diagnostic Steps:

```python
# Enable global numeric checking
tf.debugging.enable_check_numerics()

# Check a specific tensor for NaN/Inf
tf.debugging.check_numerics(tensor, "Tensor contains NaN or Inf")

# Use TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000,
)
```
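
A small self-contained sketch of what `enable_check_numerics` buys you: a bad op that would silently produce `-Inf` instead raises an error pointing at the offending operation (the function name here is illustrative):

```python
import tensorflow as tf

tf.debugging.enable_check_numerics()

@tf.function
def risky_log(x):
    return tf.math.log(x)  # log(0) produces -Inf

caught = False
try:
    risky_log(tf.constant([0.0, 1.0]))
except tf.errors.InvalidArgumentError:
    caught = True  # check_numerics flagged the Inf with op-level context
print("Caught bad numerics:", caught)

tf.debugging.disable_check_numerics()
```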

Common Causes:

  • Learning rate too high

  • Exploding gradients

  • Log of zero or negative numbers

  • Division by zero

  • Incorrect loss function for data range

Solutions:

```python
# Reduce learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Add gradient clipping (by norm or by value)
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)

# Use numerically stable operations
tf.math.log(x + 1e-7)        # instead of tf.math.log(x): add epsilon
tf.math.divide_no_nan(x, y)  # instead of x / y: returns 0 where y == 0

# Add batch normalization
model.add(tf.keras.layers.BatchNormalization())

# Check data for NaN before training
assert not tf.reduce_any(tf.math.is_nan(train_data)).numpy()
```
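
The "incorrect loss function for data range" cause deserves a concrete demonstration. With large logits, a hand-rolled softmax-then-log cross-entropy saturates to exactly 0 in float32 and produces NaN, while the fused `from_logits=True` path stays finite (the logit values here are chosen to force the failure):

```python
import tensorflow as tf

logits = tf.constant([[100.0, -100.0], [-100.0, 100.0]])
labels = tf.constant([[1.0, 0.0], [0.0, 1.0]])

# Unstable: softmax saturates to exactly 0.0, and 0 * log(0) = NaN
probs = tf.nn.softmax(logits)
unstable = -tf.reduce_sum(labels * tf.math.log(probs), axis=-1)

# Stable: fused softmax + cross-entropy via from_logits=True
stable = tf.keras.losses.categorical_crossentropy(
    labels, logits, from_logits=True)

print("unstable:", unstable.numpy())  # contains NaN
print("stable:  ", stable.numpy())    # finite
```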

  4. Gradient Issues

Symptoms:

  • Vanishing gradients (weights not updating)

  • Exploding gradients (loss becomes NaN)

  • Training stalls, loss doesn't decrease

Diagnostic Steps:

```python
# Inspect gradients with GradientTape
with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)

gradients = tape.gradient(loss, model.trainable_variables)

for var, grad in zip(model.trainable_variables, gradients):
    if grad is not None:
        print(f"{var.name}: grad_norm={tf.norm(grad).numpy():.6f}")
    else:
        print(f"{var.name}: NO GRADIENT (disconnected)")

# Check for dead ReLUs: run inputs through an intermediate model,
# then measure the fraction of activations stuck at or below zero
feature_extractor = tf.keras.Model(model.input, model.layers[5].output)
activations = feature_extractor(x)
dead_neurons = tf.reduce_mean(tf.cast(activations <= 0, tf.float32))
```

Solutions:

For vanishing gradients:

```python
# Use He initialization for ReLU networks
initializer = tf.keras.initializers.HeNormal()

# Use LeakyReLU instead of ReLU
model.add(tf.keras.layers.LeakyReLU(alpha=0.1))

# Also consider residual (skip) connections
```

For exploding gradients:

```python
# Apply gradient clipping
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

# Use proper weight initialization
initializer = tf.keras.initializers.GlorotUniform()
```
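
Putting the clipping pieces together, a minimal custom training step (model and data here are synthetic placeholders) that clips by global norm before applying gradients:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

x = tf.random.normal((16, 8))
y = tf.random.normal((16, 4))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Clip by global norm, then apply — the norm returned is the pre-clip value,
# useful for logging gradient explosions
clipped, global_norm = tf.clip_by_global_norm(grads, 5.0)
optimizer.apply_gradients(zip(clipped, model.trainable_variables))
print("pre-clip global norm:", float(global_norm))
```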

  5. GPU Not Detected

Symptoms:

  • tf.config.list_physical_devices('GPU') returns empty list

  • Training runs on CPU (slow)

  • CUDA errors on startup

Diagnostic Steps:

```python
# Check available devices
print("Physical devices:", tf.config.list_physical_devices())
print("GPU devices:", tf.config.list_physical_devices('GPU'))
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", bool(tf.config.list_physical_devices('GPU')))

# Check CUDA/cuDNN versions via nvidia-smi
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)

# Verify the TensorFlow build
import tensorflow as tf
print(tf.__version__)
print(tf.sysconfig.get_build_info())
```

Common Causes:

  • Wrong TensorFlow package (CPU-only version)

  • CUDA/cuDNN version mismatch

  • NVIDIA driver issues

  • GPU not visible to container (Docker)

Solutions:

```shell
# Install the correct TensorFlow GPU package
pip install tensorflow[and-cuda]   # TF 2.15+
pip install tensorflow-gpu         # Older versions

# For Docker, use the NVIDIA container runtime
docker run --gpus all -it tensorflow/tensorflow:latest-gpu
```

Verify CUDA compatibility:

  • TF 2.15: CUDA 12.x, cuDNN 8.9

  • TF 2.14: CUDA 11.8, cuDNN 8.7

  • TF 2.13: CUDA 11.8, cuDNN 8.6

```python
# Force GPU visibility (set before importing TensorFlow)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Use first GPU

# Verify the GPU is being used
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)
print(c.device)  # Should show GPU
```

  6. SavedModel Loading Errors

Symptoms:

  • OSError: SavedModel file does not exist

  • ValueError: Unknown layer when loading

  • Version compatibility errors

Diagnostic Steps:

```python
# Check the SavedModel structure
import os
for root, dirs, files in os.walk('saved_model_dir'):
    for file in files:
        print(os.path.join(root, file))

# Verify model signatures
loaded = tf.saved_model.load('saved_model_dir')
print(list(loaded.signatures.keys()))
```

Solutions:

```python
# Save the model correctly
model.save('my_model')        # SavedModel format (TF <= 2.15)
model.save('my_model.keras')  # Native Keras format (recommended in recent versions)

# Load with custom objects
custom_objects = {
    'CustomLayer': CustomLayer,
    'custom_loss': custom_loss,
}
model = tf.keras.models.load_model('my_model', custom_objects=custom_objects)

# For version mismatches, save weights only...
model.save_weights('model_weights.weights.h5')

# ...then rebuild the model architecture and load the weights
new_model.load_weights('model_weights.weights.h5')
```
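
An alternative to passing `custom_objects` every time: register the custom class so `load_model` can resolve it by name. A round-trip sketch (the `Scale` layer and `demo` package name are illustrative):

```python
import os
import tempfile
import tensorflow as tf

# Registering the class lets load_model find it without custom_objects
@tf.keras.utils.register_keras_serializable(package="demo")
class Scale(tf.keras.layers.Layer):
    def __init__(self, factor=2.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, inputs):
        return inputs * self.factor

    def get_config(self):
        # Include factor so the layer can be reconstructed on load
        return {**super().get_config(), "factor": self.factor}

model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), Scale(3.0)])
path = os.path.join(tempfile.mkdtemp(), "scale_model.keras")
model.save(path)
loaded = tf.keras.models.load_model(path)  # no custom_objects needed
```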

  7. Data Pipeline Issues

Symptoms:

  • InvalidArgumentError during training

  • Slow training (input bottleneck)

  • Memory leaks during data loading

Diagnostic Steps:

```python
import tensorflow as tf

# Profile the input pipeline
tf.profiler.experimental.start('/tmp/logdir')
# ... run training ...
tf.profiler.experimental.stop()

# Check the dataset element spec
print(dataset.element_spec)

# Iterate and inspect one batch
for batch in dataset.take(1):
    print(f"Batch shape: {batch[0].shape}")
    print(f"Dtype: {batch[0].dtype}")
```

Solutions:

```python
# Optimize the pipeline
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.cache()                     # Cache after expensive operations
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # Overlap data loading

# Use parallel processing
dataset = dataset.map(
    preprocess_fn,
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Handle variable-length sequences
dataset = dataset.padded_batch(32, padded_shapes=([None], []))
```
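
To check whether these optimizations help, a rough timing helper can compare seconds per batch before and after (the `benchmark` function and dataset here are illustrative, not part of the original guide):

```python
import time
import tensorflow as tf

def benchmark(dataset, num_batches=30):
    # Rough seconds-per-batch; if this dominates step time,
    # the input pipeline is the bottleneck
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    return (time.perf_counter() - start) / num_batches

ds = tf.data.Dataset.from_tensor_slices(tf.random.normal((1024, 32))).batch(32)
tuned = ds.cache().prefetch(tf.data.AUTOTUNE)
print(f"sec/batch: {benchmark(tuned):.6f}")
```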

Debugging Tools

tf.debugging Module

Shape assertions:

```python
tf.debugging.assert_shapes([
    (x, ('N', 'H', 'W', 'C')),
    (y, ('N', 'num_classes')),
])
```

Value assertions:

```python
tf.debugging.assert_non_negative(x)
tf.debugging.assert_near(x, y, rtol=1e-5)
tf.debugging.assert_equal(x.shape, expected_shape)
```

Numeric checking:

```python
tf.debugging.check_numerics(tensor, "check: tensor contains NaN/Inf")
tf.debugging.enable_check_numerics()  # Global check
```

Type assertions:

```python
tf.debugging.assert_type(x, tf.float32)
```

TensorBoard

Set up TensorBoard logging:

```python
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch='500,520',  # Profile batches 500-520
)

model.fit(
    x_train, y_train,
    epochs=5,
    callbacks=[tensorboard_callback],
)
```

Launch TensorBoard:

```shell
tensorboard --logdir logs/fit
```

TensorBoard Debugger V2

Enable debug info dumping:

```python
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000,
)

# Run training...
model.fit(x_train, y_train, epochs=5)
```

View in TensorBoard:

```shell
tensorboard --logdir /tmp/tfdbg2_logdir
```

Eager Execution Debugging

Eager execution is the default in TF 2.x, but code inside a @tf.function still runs in graph mode. Force eager execution everywhere for line-by-line debugging:

```python
tf.config.run_functions_eagerly(True)
```

Debug inside a @tf.function with graph-safe tools:

```python
@tf.function
def my_function(x):
    tf.print("Debug:", x)            # Works in graph mode
    tf.debugging.assert_positive(x)  # Runtime check
    return x * 2
```

To step through a function with ordinary breakpoints, temporarily remove the @tf.function decorator or call tf.config.run_functions_eagerly(True).

tf.print() for Graph Mode

```python
@tf.function
def compute(x):
    # Regular print() won't run per-call in graph mode; use tf.print
    tf.print("Shape:", tf.shape(x))
    tf.print("Values:", x, summarize=-1)  # -1 prints all values
    tf.print("Stats - min:", tf.reduce_min(x),
             "max:", tf.reduce_max(x),
             "mean:", tf.reduce_mean(x))
    return x * 2
```

Memory Profiler

```python
# Allow memory to grow on demand (per GPU)
tf.config.experimental.set_memory_growth(gpu, True)

# Use the TensorFlow Profiler around a training run
with tf.profiler.experimental.Profile('/tmp/logdir'):
    model.fit(x_train, y_train, epochs=1)

# Check memory info
info = tf.config.experimental.get_memory_info('GPU:0')
# Returns: {'current': bytes, 'peak': bytes}
```

The Four Phases of TensorFlow Debugging

Phase 1: Reproduce and Isolate

Create a minimal reproduction:

```python
import tensorflow as tf

# Smallest possible model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(5,)),
])

# Synthetic data
x = tf.random.normal((32, 5))
y = tf.random.normal((32, 10))

model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=1)
```

Enable eager execution for line-by-line debugging:

```python
tf.config.run_functions_eagerly(True)
```

Add assertions at key points:

```python
def debug_forward_pass(model, x):
    for i, layer in enumerate(model.layers):
        x = layer(x)
        tf.debugging.check_numerics(x, f"Layer {i} output")
        print(f"Layer {i}: {x.shape}, "
              f"range=[{tf.reduce_min(x):.3f}, {tf.reduce_max(x):.3f}]")
    return x
```

Phase 2: Analyze and Understand

Inspect tensor shapes throughout the pipeline:

```python
def trace_shapes(model, x):
    shapes = []
    for layer in model.layers:
        x = layer(x)
        shapes.append((layer.name, x.shape))
    return shapes
```

Check gradient flow:

```python
def analyze_gradients(model, x, y, loss_fn):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)

    grads = tape.gradient(loss, model.trainable_variables)

    analysis = []
    for var, grad in zip(model.trainable_variables, grads):
        if grad is None:
            analysis.append((var.name, "NONE - disconnected"))
        else:
            norm = tf.norm(grad).numpy()
            analysis.append((var.name, f"norm={norm:.6f}"))
    return analysis
```

Profile performance with tf.profiler:

```python
tf.profiler.experimental.start('/tmp/logdir')
model.fit(x, y, epochs=1)
tf.profiler.experimental.stop()
```

Phase 3: Fix and Verify

Apply targeted fixes based on the diagnosis:

  • Shape issues: Add explicit reshapes and assertions

  • NaN issues: Add epsilon, reduce learning rate, clip gradients

  • OOM issues: Reduce batch size, enable memory growth

  • GPU issues: Check CUDA compatibility, install correct packages

Verify the fix doesn't break other functionality by running comprehensive tests:

```python
def test_model_components():
    # Test forward pass
    output = model(sample_input)
    assert output.shape == expected_shape

    # Test backward pass
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x), y)
    grads = tape.gradient(loss, model.trainable_variables)
    assert all(g is not None for g in grads)

    # Test save/load
    model.save('/tmp/test_model')
    loaded = tf.keras.models.load_model('/tmp/test_model')
    assert tf.reduce_all(model(x) == loaded(x))
```

Phase 4: Prevent and Document

Add permanent assertions for critical invariants:

```python
class RobustModel(tf.keras.Model):
    def call(self, x, training=False):
        tf.debugging.assert_shapes([(x, ('batch', 'features'))])

        x = self.layer1(x)
        tf.debugging.check_numerics(x, "After layer1")

        return self.output_layer(x)
```

Set up monitoring callbacks:

```python
class NanCallback(tf.keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        if logs and tf.math.is_nan(logs.get('loss', 0)):
            self.model.stop_training = True
            raise ValueError(f"NaN detected at batch {batch}")
```

Document the issue and the solution:

```python
# BUGFIX: Shape mismatch in attention layer
# Issue: Input was (batch, seq, features) but attention expected
#        (batch, heads, seq, features)
# Solution: Added reshape before the attention layer
x = tf.reshape(x, [batch_size, num_heads, seq_len, -1])
```

Quick Reference Commands

Device and Configuration

List devices:

```python
tf.config.list_physical_devices()
tf.config.list_physical_devices('GPU')
```

GPU memory growth:

```python
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

Force CPU execution:

```python
with tf.device('/CPU:0'):
    result = model(x)
```

Check if built with CUDA:

```python
tf.test.is_built_with_cuda()
```

Debugging Assertions

Numeric checks:

```python
tf.debugging.check_numerics(tensor, message)
tf.debugging.enable_check_numerics()
```

Shape checks:

```python
tf.debugging.assert_shapes([(tensor, shape_tuple)])
tf.ensure_shape(tensor, shape)
```

Value checks:

```python
tf.debugging.assert_positive(tensor)
tf.debugging.assert_non_negative(tensor)
tf.debugging.assert_near(a, b, rtol=1e-5)
tf.debugging.assert_equal(a, b)
tf.debugging.assert_less(a, b)
tf.debugging.assert_greater(a, b)
```

Profiling and Logging

TensorBoard logging:

```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1,
)
```

Profiler:

```python
tf.profiler.experimental.start('/tmp/logdir')
# ... code ...
tf.profiler.experimental.stop()
```

Debug info for TensorBoard Debugger V2:

```python
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2',
    tensor_debug_mode='FULL_HEALTH',
)
```

Memory Management

Clear session:

```python
tf.keras.backend.clear_session()
```

Get memory info:

```python
tf.config.experimental.get_memory_info('GPU:0')
```

Mixed precision:

```python
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```

Gradient Debugging

Inspect gradients:

```python
with tf.GradientTape() as tape:
    loss = compute_loss()
gradients = tape.gradient(loss, model.trainable_variables)
```

Clip gradients:

```python
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
```

Check for None gradients (disconnected graph):

```python
for var, grad in zip(model.trainable_variables, gradients):
    if grad is None:
        print(f"Warning: {var.name} has no gradient")
```

Version Compatibility Reference

| TensorFlow | Python   | CUDA | cuDNN |
|------------|----------|------|-------|
| 2.16.x     | 3.9-3.12 | 12.3 | 8.9   |
| 2.15.x     | 3.9-3.11 | 12.2 | 8.9   |
| 2.14.x     | 3.9-3.11 | 11.8 | 8.7   |
| 2.13.x     | 3.8-3.11 | 11.8 | 8.6   |
| 2.12.x     | 3.8-3.11 | 11.8 | 8.6   |

Additional Resources

  • TensorFlow Debugging Guide

  • TensorBoard Debugger V2

  • GPU Performance Analysis

  • Profiler Guide
