dead-letter-queue-analyzer

Analyze dead letter queue (DLQ) messages to identify failure patterns, root causes, and remediation strategies. Supports AWS SQS, RabbitMQ, Kafka, Azure Service Bus, and generic message queues.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "dead-letter-queue-analyzer" with this command: npx skills add charlie-morrison/dead-letter-queue-analyzer

Dead Letter Queue Analyzer

Stop ignoring your dead letter queue. Analyze DLQ messages to find failure patterns, identify root causes, determine which messages are replayable, and generate remediation plans — turning your DLQ from a black hole into an actionable error stream.

Use when: "analyze DLQ", "dead letter queue growing", "why are messages failing", "replay failed messages", "DLQ backlog", "message processing failures", or when unprocessed messages accumulate.

Commands

1. analyze — Categorize DLQ Messages

Step 1: Read DLQ Messages

AWS SQS:

aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All | python3 -c "
import json, sys
msgs = json.load(sys.stdin).get('Messages', [])
for m in msgs:
    try:
        body = json.loads(m['Body'])
    except ValueError:
        body = m['Body']
    attrs = m.get('Attributes', {})
    print(f'ID: {m[\"MessageId\"]}')
    print(f'  Received count: {attrs.get(\"ApproximateReceiveCount\", \"?\")}')
    print(f'  First received: {attrs.get(\"ApproximateFirstReceiveTimestamp\", \"?\")}')
    print(f'  Body preview: {str(body)[:200]}')
    print()
"

# Count total DLQ depth
aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names ApproximateNumberOfMessages | python3 -c "
import json, sys
attrs = json.load(sys.stdin)['Attributes']
print(f'DLQ depth: {attrs[\"ApproximateNumberOfMessages\"]} messages')
"

RabbitMQ:

# List DLQ queues
rabbitmqctl list_queues name messages | grep -i "dead\|dlq\|error"

# Peek at messages
rabbitmqadmin get queue="dead_letter_queue" count=10 2>/dev/null

Kafka:

# Read from DLT (dead letter topic)
kafka-console-consumer --bootstrap-server "$KAFKA_BROKER" \
  --topic "$DLT_TOPIC" --from-beginning --max-messages 20 \
  --property print.headers=true --property print.timestamp=true

Step 2: Classify Failure Causes

Group DLQ messages by failure reason:

| Category | Signal | Replayable? | Action |
|----------|--------|-------------|--------|
| Schema error | Validation failure, missing field | After fix | Fix producer or consumer schema |
| Timeout | Processing exceeded deadline | Yes | Increase timeout or optimize processing |
| Dependency down | Connection refused, 503 | Yes | Wait for recovery, then replay |
| Poison message | Crash/exception on processing | No | Fix handler, then replay |
| Data integrity | FK violation, duplicate key | Maybe | Fix data, then replay |
| Permission | Auth error, access denied | After fix | Fix credentials, then replay |
| Deserialization | Invalid JSON/Protobuf/Avro | No | Discard or fix producer |
# Group messages by error pattern (dlq_messages: the list fetched in Step 1)
from collections import Counter
errors = Counter()
for msg in dlq_messages:
    # Extract error reason from message attributes or headers
    error = msg.get('error_reason', msg.get('x-death-reason', 'unknown'))
    errors[error] += 1

for error, count in errors.most_common(10):
    print(f'{count:>5}x  {error}')
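The classification table above can also be encoded as a small rule set so categorization runs unattended. This is a minimal sketch: the regex patterns, category names, and `classify` helper are illustrative assumptions, not part of the skill, and should be adapted to your consumers' actual error strings.

```python
import re

# Map error-message patterns to (category, replayable) per the table above.
# Patterns are illustrative; adapt to your consumers' actual error strings.
RULES = [
    (r"validation|missing field|schema",       ("schema_error",    "after fix")),
    (r"timeout|deadline exceeded",             ("timeout",         "yes")),
    (r"connection refused|503|unavailable",    ("dependency_down", "yes")),
    (r"foreign key|duplicate key",             ("data_integrity",  "maybe")),
    (r"unauthorized|access denied|403",        ("permission",      "after fix")),
    (r"invalid json|deserializ|protobuf|avro", ("deserialization", "no")),
    (r"nullpointer|exception|panic",           ("poison_message",  "no")),
]

def classify(error_reason: str) -> tuple:
    """Map a raw error string to (category, replayable)."""
    reason = error_reason.lower()
    for pattern, verdict in RULES:
        if re.search(pattern, reason):
            return verdict
    return ("unknown", "investigate")
```

Feeding each message's `error_reason` through `classify` before the `Counter` pass yields per-category counts directly.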

Step 3: Generate Report

# DLQ Analysis Report

## Summary
- Queue: orders-processing-dlq
- Total messages: 1,247
- Oldest message: 3 days ago
- Growth rate: ~400/day (increasing)

## Failure Categories
| Category | Count | % | Replayable | Root Cause |
|----------|-------|---|------------|------------|
| Timeout | 823 | 66% | ✅ | DB slow queries since Tuesday deploy |
| Schema error | 312 | 25% | ✅ (after fix) | New field `currency` not in consumer schema |
| Poison message | 67 | 5% | ❌ | NullPointer in price calculation |
| Permission | 45 | 4% | ✅ (after fix) | Expired service account token |

## Root Cause
Primary: DB slow queries causing processing timeouts (66% of failures)
- Started: Tuesday 14:30 UTC (correlates with deploy)
- Impact: 823 orders stuck in DLQ

## Remediation Plan
1. **Fix DB performance** — add missing index on orders.status (immediate)
2. **Replay timeout messages** (823) — safe, operations are idempotent
3. **Update consumer schema** to accept `currency` field (312 messages)
4. **Rotate service account token** (45 messages)
5. **Fix NullPointer** in OrderPriceCalculator.java:67 (67 messages — investigate first)
6. Set up DLQ depth alerting (threshold: 50 messages)

2. replay — Generate Replay Script

# SQS: move messages from DLQ back to main queue
aws sqs start-message-move-task \
  --source-arn "$DLQ_ARN" \
  --destination-arn "$MAIN_QUEUE_ARN" \
  --max-number-of-messages-per-second 10

# Or selective replay (only timeout errors)
# Read, filter, re-send
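The read-filter-re-send loop can be sketched with boto3. This is a hedged example, not the skill's implementation: the `error_reason` message attribute, the `is_replayable` helper, and the queue URLs are assumptions to adapt to how your producers tag failures.

```python
def is_replayable(message: dict, category: str = "timeout") -> bool:
    """True if the message's error_reason attribute matches the target category."""
    attrs = message.get("MessageAttributes") or {}
    reason = attrs.get("error_reason", {}).get("StringValue", "").lower()
    return category in reason

def replay_selected(sqs, dlq_url: str, main_url: str, category: str = "timeout") -> int:
    """Re-send matching DLQ messages to the main queue, then delete them.

    `sqs` is a boto3 SQS client, e.g. boto3.client("sqs"). Non-matching
    messages are left in the DLQ (they reappear after visibility timeout).
    """
    moved = 0
    while True:
        resp = sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10,
                                   MessageAttributeNames=["All"])
        batch = resp.get("Messages", [])
        if not batch:
            break
        for m in batch:
            if is_replayable(m, category):
                sqs.send_message(QueueUrl=main_url, MessageBody=m["Body"],
                                 MessageAttributes=m.get("MessageAttributes", {}))
                sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=m["ReceiptHandle"])
                moved += 1
    return moved
```

Only replay categories the analysis marked replayable, and only when the consumer is idempotent.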

3. monitor — Set Up DLQ Alerting

Generate CloudWatch alarm / Prometheus alert for DLQ depth:

  • Alert when DLQ depth > 0 (any message is a signal)
  • Alert when growth rate > N/hour (active problem)
  • Alert when oldest message > 24h (messages going stale)
  • Dashboard showing DLQ depth over time + categorization
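The three alert conditions above can be sketched as a plain evaluator, e.g. for a cron-based check where CloudWatch or Prometheus is not available. The thresholds and the `dlq_alerts` name are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def dlq_alerts(depth, depth_one_hour_ago, oldest_ts,
               now=None, growth_per_hour=100, max_age_hours=24):
    """Return the list of alert conditions the DLQ currently violates."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    if depth > 0:
        alerts.append("non_empty")   # any message is a signal
    if depth - depth_one_hour_ago > growth_per_hour:
        alerts.append("growing")     # active problem
    if oldest_ts is not None and now - oldest_ts > timedelta(hours=max_age_hours):
        alerts.append("stale")       # messages going stale
    return alerts
```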

4. prevent — Improve Message Handling

Recommend changes to prevent future DLQ accumulation:

  • Add retry with backoff before sending to DLQ
  • Add idempotency keys for safe replay
  • Add dead letter reason headers for faster triage
  • Add message TTL to prevent infinite accumulation
  • Add schema validation before publishing (catch at source)
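The first recommendation can be sketched as a retry wrapper with exponential backoff whose failure result tells the caller to dead-letter the message. Names and defaults here are illustrative, not the skill's implementation:

```python
import time

def process_with_retry(handler, message, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call handler(message); retry with exponential backoff on failure.

    Returns True on success. Returns False after max_attempts, at which
    point the caller should route the message to the DLQ with a reason header.
    """
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return False
```

The injectable `sleep` keeps the wrapper testable without real delays.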

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
