troubleshooting-guide

Troubleshooting Guide Creator

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "troubleshooting-guide" with this command: npx skills add dengineproblem/agents-monorepo/dengineproblem-agents-monorepo-troubleshooting-guide

Troubleshooting Guide Creator

Эксперт по созданию структурированных руководств по диагностике и устранению проблем.

Core Principles

Problem-Centric Structure

troubleshooting_principles:

  • principle: "Start with clear problem statements and symptoms" reason: "Users need to quickly identify if guide applies to their issue"

  • principle: "Use If-Then logic flows for decision trees" reason: "Systematic elimination of possible causes"

  • principle: "Organize solutions by likelihood and impact" reason: "Try simple/common fixes first, escalate to complex"

  • principle: "Follow logical diagnostic sequence (simple to complex)" reason: "Minimize time to resolution"

  • principle: "Include verification steps after each fix" reason: "Confirm the issue is actually resolved"

  • principle: "Provide rollback instructions" reason: "Allow safe recovery if fix causes new issues"

User Experience Focus

  • Пиши для целевого уровня аудитории

  • Используй консистентное форматирование

  • Указывай оценочное время для каждого шага

  • Включай скриншоты и примеры где возможно

Standard Guide Template

Troubleshooting: [Problem Title]

Last Updated: [Date] Applies To: [Product/Service/Version] Difficulty: Beginner | Intermediate | Advanced Time Estimate: X-Y minutes


Problem Statement

Symptoms

Users experiencing this issue will observe:

  • Symptom 1 (observable behavior)
  • Symptom 2 (error message or code)
  • Symptom 3 (system state)

Error Messages

[Exact error message or code]

Affected Components

  • Component A
  • Component B

Impact

  • Severity: Critical | High | Medium | Low
  • Affected Users: All users | Specific group | Single user
  • Business Impact: [Description]

Quick Checks (2-5 minutes)

Before diving into detailed troubleshooting, verify these common causes:

Check 1: [Most Common Cause]

Time: 30 seconds

# Command to verify
[diagnostic command]

Expected Output: [What you should see]
If this fails: Continue to Check 2

Check 2: [Second Most Common Cause]

Time: 1 minute

[Steps to verify]

Diagnostic Steps

Step 1: Gather Information

Collect the following before proceeding:

-  Error logs from [location]

-  System configuration from [location]

-  User actions that triggered the issue

# Commands to gather diagnostic info
[command 1]
[command 2]

Step 2: Identify the Root Cause

Use this decision tree to identify the cause:

Start
  │
  ├─ Is [condition A] true?
  │    ├─ YES → Go to Solution A
  │    └─ NO → Continue
  │
  ├─ Is [condition B] true?
  │    ├─ YES → Go to Solution B
  │    └─ NO → Continue
  │
  └─ None of the above → Escalate to Support

Solutions

Solution A: [Fix Name]

Difficulty: Easy
Time: 5 minutes
Risk: Low

Prerequisites

-  [Prerequisite 1]

-  [Prerequisite 2]

Steps

- 
Step 1 Title

[command]

Expected output: [description]

- 
Step 2 Title
[Instructions]

- 
Step 3 Title
[Instructions]

Verify Fix

[verification command]

Success Indicator: [What to look for]

Rollback (if needed)

[rollback command]

Solution B: [Fix Name]

Difficulty: Medium
Time: 15 minutes
Risk: Medium

[Same structure as Solution A]

Prevention

To prevent this issue from recurring:

- Monitoring: Set up alerts for [metric]

- Configuration: Ensure [setting] is properly configured

- Process: Follow [procedure] when making changes

- Training: Educate team on [best practice]

Escalation

If the above solutions don't resolve the issue:

When to Escalate

-  Issue persists after trying all solutions

-  Data loss or security concern identified

-  Multiple users affected simultaneously

Information to Provide

-  Time issue started

-  Steps already attempted

-  Diagnostic logs collected

-  Business impact assessment

Contact

- Support Team: [Contact info]

- Escalation Path: [Who to contact]

- SLA: [Expected response time]

Related Resources

- [Link to related guide]

- [Link to documentation]

- [Link to FAQ]

Revision History

Date
Author
Changes

[Date]
[Name]
Initial version

[Date]
[Name]
Added Solution C

---

## Diagnostic Patterns

### Layer-by-Layer Approach

```markdown
## Network Connectivity Troubleshooting

### Layer 1: Physical
- [ ] Check cable connections
- [ ] Verify link lights are active
- [ ] Test with known-good cable

### Layer 2: Data Link
- [ ] Verify MAC address is learned
- [ ] Check for VLAN misconfigurations
- [ ] Review spanning tree state

### Layer 3: Network
- [ ] Verify IP configuration
- [ ] Test ping to gateway
- [ ] Check routing table

### Layer 4: Transport
- [ ] Verify service is listening on correct port
- [ ] Check firewall rules
- [ ] Test with telnet/nc to port

### Layer 7: Application
- [ ] Check application logs
- [ ] Verify configuration files
- [ ] Test with curl/wget

Binary Elimination Method

## Identifying Faulty Component

Use binary search to isolate the issue:

### Step 1: Test Midpoint
Test the system at the midpoint of the data flow:

[Client] → [Load Balancer] → [App Server] → [Database]
↑
Test here first

**If working at midpoint:** Issue is between midpoint and client
**If failing at midpoint:** Issue is between midpoint and database

### Step 2: Narrow Down
Repeat the process, testing the midpoint of the remaining segment.

### Step 3: Isolate
Continue until you've identified the specific failing component.

Symptom-Based Decision Tree

## Application Not Responding

┌─ Can you reach the server at all?
│
├─ NO → Network/DNS Issue
│       └─ Go to: Network Troubleshooting Guide
│
└─ YES → Continue
│
├─ Does the service port respond?
│
├─ NO → Service Not Running
│       └─ Go to: Service Restart Procedure
│
└─ YES → Continue
│
├─ Are there errors in application logs?
│
├─ YES → Application Error
│       └─ Go to: Log Analysis Guide
│
└─ NO → Resource Exhaustion
└─ Go to: Performance Troubleshooting

Log Analysis Guide

Common Log Locations

linux_logs:
  system:
    - /var/log/syslog
    - /var/log/messages
    - journalctl -xe

  application:
    - /var/log/[app-name]/
    - ~/.pm2/logs/
    - docker logs [container]

  web_server:
    nginx:
      - /var/log/nginx/error.log
      - /var/log/nginx/access.log
    apache:
      - /var/log/apache2/error.log
      - /var/log/httpd/error_log

  database:
    postgresql:
      - /var/log/postgresql/
    mysql:
      - /var/log/mysql/error.log

Log Analysis Commands

# Find errors in last 100 lines
tail -100 /var/log/app.log | grep -i error

# Find errors with timestamp
grep -i error /var/log/app.log | tail -50

# Watch log in real-time
tail -f /var/log/app.log | grep --line-buffered -i error

# Count errors by type
grep -i error /var/log/app.log | sort | uniq -c | sort -rn | head -20

# Find entries around specific time
awk '/2024-01-15 14:3[0-5]/' /var/log/app.log

# Extract specific fields (JSON logs)
cat /var/log/app.json | jq 'select(.level == "error") | {time, message}'

# Search compressed logs
zgrep -i error /var/log/app.log.*.gz

Error Pattern Recognition

## Common Error Patterns

### Connection Errors

Pattern: "Connection refused" | "ECONNREFUSED" | "Connection timed out"
Cause: Service not running or firewall blocking
Fix: Check service status, verify port, check firewall rules

### Memory Errors

Pattern: "Out of memory" | "OOM" | "Cannot allocate memory"
Cause: Process exhausting available RAM
Fix: Increase memory, optimize application, add swap

### Disk Errors

Pattern: "No space left on device" | "ENOSPC" | "Disk full"
Cause: Filesystem at capacity
Fix: Clean old files, increase disk, enable log rotation

### Permission Errors

Pattern: "Permission denied" | "EACCES" | "Operation not permitted"
Cause: Insufficient file/directory permissions
Fix: Check ownership, verify permissions, check SELinux/AppArmor

### Database Errors

Pattern: "Too many connections" | "Connection pool exhausted"
Cause: Connection leak or undersized pool
Fix: Close unused connections, increase pool size, fix leaks

Specific Problem Templates

API Not Responding

# Troubleshooting: API Not Responding

## Quick Diagnosis Script

```bash
#!/bin/bash
# api-health-check.sh

API_URL="${1:-http://localhost:8080}"
TIMEOUT=5

echo "=== API Health Check ==="
echo "Target: $API_URL"
echo

# 1. DNS Resolution
echo "1. DNS Resolution..."
if host=$(dig +short $(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1) 2>/dev/null); then
    echo "   ✅ DNS resolves to: $host"
else
    echo "   ❌ DNS resolution failed"
fi

# 2. Port Connectivity
echo "2. Port Connectivity..."
PORT=$(echo $API_URL | grep -oP ':\K[0-9]+' || echo "80")
HOST=$(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1)
if nc -z -w $TIMEOUT $HOST $PORT 2>/dev/null; then
    echo "   ✅ Port $PORT is open"
else
    echo "   ❌ Port $PORT is not reachable"
fi

# 3. HTTP Response
echo "3. HTTP Response..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
    echo "   ✅ Health endpoint returns 200"
elif [ -n "$HTTP_CODE" ] && [ "$HTTP_CODE" != "000" ]; then
    echo "   ⚠️  Health endpoint returns $HTTP_CODE"
else
    echo "   ❌ No HTTP response"
fi

# 4. Response Time
echo "4. Response Time..."
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if (( $(echo "$RESPONSE_TIME < 1" | bc -l) )); then
    echo "   ✅ Response time: ${RESPONSE_TIME}s"
else
    echo "   ⚠️  Slow response: ${RESPONSE_TIME}s"
fi

echo
echo "=== Check Complete ==="

Decision Tree

API Not Responding
       │
       ├─ Can you ping the server?
       │   ├─ NO → Check network/DNS
       │   └─ YES ↓
       │
       ├─ Is the service running?
       │   ├─ NO → Start/restart service
       │   └─ YES ↓
       │
       ├─ Is the port listening?
       │   ├─ NO → Check service configuration
       │   └─ YES ↓
       │
       ├─ Does health check pass?
       │   ├─ NO → Check dependencies (DB, cache)
       │   └─ YES ↓
       │
       └─ Check application logs for errors

### Database Connection Issues

```markdown
# Troubleshooting: Database Connection Failed

## Symptoms
- Application shows "Connection refused" or "Connection timed out"
- Error: "FATAL: too many connections for role"
- Error: "FATAL: password authentication failed"

## Quick Checks

### 1. Verify Database is Running
```bash
# PostgreSQL
sudo systemctl status postgresql
pg_isready -h localhost -p 5432

# MySQL
sudo systemctl status mysql
mysqladmin -u root -p ping

2. Test Connection

# PostgreSQL
psql -h localhost -U username -d database -c "SELECT 1"

# MySQL
mysql -h localhost -u username -p -e "SELECT 1"

3. Check Connection Count

-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SELECT max_connections FROM pg_settings WHERE name = 'max_connections';

-- MySQL
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';

Solutions

Solution 1: Restart Connection Pool

# If using PgBouncer
sudo systemctl restart pgbouncer

# Application restart
sudo systemctl restart myapp

Solution 2: Clear Idle Connections

-- PostgreSQL: Kill idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes';

Solution 3: Increase Max Connections

-- PostgreSQL (requires restart)
ALTER SYSTEM SET max_connections = 200;

-- MySQL (can be done live)
SET GLOBAL max_connections = 200;

---

## Quality Assurance Checklist

### Pre-Publication Review

```markdown
## Troubleshooting Guide Quality Checklist

### Accuracy
- [ ] All commands tested and verified
- [ ] Output examples are accurate
- [ ] Links to resources are valid
- [ ] Version numbers are current

### Completeness
- [ ] All common causes covered
- [ ] Rollback instructions provided
- [ ] Escalation path defined
- [ ] Prevention tips included

### Usability
- [ ] Clear success/failure criteria
- [ ] Time estimates accurate
- [ ] Difficulty levels appropriate
- [ ] Tested by someone unfamiliar with issue

### Formatting
- [ ] Consistent heading structure
- [ ] Code blocks properly formatted
- [ ] Decision trees clear
- [ ] Screenshots/diagrams where helpful

### Maintenance
- [ ] Last updated date included
- [ ] Revision history maintained
- [ ] Owner/contact identified
- [ ] Review schedule established

Лучшие практики

- Начинай с симптомов — пользователь должен быстро понять, подходит ли гайд

- Простое решение первым — проверь очевидные причины до сложной диагностики

- Включай verification steps — как понять, что проблема решена

- Документируй rollback — возможность отката если fix не помог

- Указывай время — пользователь должен знать сколько займёт каждый шаг

- Тестируй на новичках — гайд должен работать для тех, кто не знает систему

- Обновляй регулярно — устаревший гайд хуже чем его отсутствие

- Включай escalation path — когда и к кому обращаться

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

social-media-marketing

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

video-marketing

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

frontend-design

No summary provided by upstream source.

Repository SourceNeeds Review