server-management

Server Management

Server management principles for production operations. Learn to THINK, not memorize commands.

Process Management Principles

Tool Selection

| Scenario | Tool |
| --- | --- |
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |

Process Management Goals

| Goal | What It Means |
| --- | --- |
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
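
To make the "restart on crash" goal concrete, here is a minimal sketch of the loop a process manager runs for you. It is an illustration only: in production this job belongs to PM2, systemd, or a container runtime. The function name `supervise` and its parameters are assumptions made for this example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, base_delay=0.1):
    """Run cmd and restart it after a non-zero exit, up to max_restarts times.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to recover from
        if attempt < max_restarts:
            # Exponential backoff so a crash loop does not hammer the host.
            time.sleep(base_delay * (2 ** attempt))
    return exit_codes
```

For example, `supervise([sys.executable, "-c", "import sys; sys.exit(1)"], max_restarts=2)` runs the failing child three times and returns `[1, 1, 1]`. The restart cap and the backoff matter as much as the restart itself: an uncapped loop hides a broken deploy instead of surfacing it.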

Monitoring Principles

What to Monitor

| Category | Key Metrics |
| --- | --- |
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
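
As a rough illustration of the "Resources" category, the sketch below collects disk usage and load average using only the Python standard library. The function name `resource_snapshot` and the returned keys are assumptions for this example. Memory is left out because the standard library has no portable call for it; a real setup would rely on a monitoring agent instead.

```python
import os
import shutil


def resource_snapshot(path="/"):
    """Return a small dict of host resource metrics using only the stdlib."""
    usage = shutil.disk_usage(path)
    used_percent = 100 * usage.used / usage.total if usage.total else 0.0
    snapshot = {
        "disk_used_percent": round(used_percent, 1),
        "cpu_count": os.cpu_count(),
    }
    try:
        # 1-minute load average; only available on Unix-like systems.
        snapshot["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # not supported on this platform
    return snapshot
```

Comparing `load_1m` against `cpu_count` gives a quick sense of CPU pressure: a load that stays well above the core count usually means work is queueing.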

Alert Severity Strategy

| Level | Response |
| --- | --- |
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
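
One way to apply the severity levels above is to map each metric to a level with two thresholds. The sketch below assumes that higher values are worse; the threshold numbers in the usage note are hypothetical and should come from your own baselines.

```python
# Responses taken from the severity levels above.
RESPONSES = {
    "critical": "Immediate action",
    "warning": "Investigate soon",
    "info": "Review daily",
}


def classify(value, warning_at, critical_at):
    """Map a metric value to an alert level; higher values are worse."""
    if value >= critical_at:
        return "critical"
    if value >= warning_at:
        return "warning"
    return "info"
```

With hypothetical disk thresholds of 80 and 90 percent, `classify(95, 80, 90)` returns `"critical"` and `classify(85, 80, 90)` returns `"warning"`. Keeping the thresholds as parameters makes it easy to tune them per metric without touching the routing logic.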

Monitoring Tool Selection

| Need | Options |
| --- | --- |
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |

Log Management Principles

Log Strategy

| Log Type | Purpose |
| --- | --- |
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |

Log Principles

- Rotate logs so they cannot fill the disk
- Use structured logging (JSON) so logs can be parsed by tools
- Use appropriate levels (error/warn/info/debug)
- Keep sensitive data out of logs
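
The principles above can be sketched with Python's standard `logging` module. This is one possible setup, not a prescribed one: the `JsonFormatter` class, the `context` field, the `SENSITIVE_KEYS` list, the file name, and the size limits are all assumptions chosen for the example.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Assumed list of field names that must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via logger.info(..., extra={"context": {...}}).
        for key, value in getattr(record, "context", {}).items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload)


def build_logger(path="app.log"):
    """Create a logger that rotates at about 10 MB and keeps 5 old files."""
    handler = RotatingFileHandler(path, maxBytes=10_000_000, backupCount=5)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger().info("login ok", extra={"context": {"user": "ana", "password": "hunter2"}})` writes a single JSON line in which `password` appears as `[REDACTED]`. Redacting by key name is only a safety net; the more reliable habit is to never pass secrets to the logger in the first place.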

Scaling Decisions

When to Scale

| Symptom | Solution |
| --- | --- |
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |

Scaling Strategy

| Type | When to Use |
| --- | --- |
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
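
To show what an auto-scaling decision can look like, here is a small sketch of a proportional rule: scale the instance count by the ratio of observed to target CPU, then clamp it. It is similar in spirit to the rule used by the Kubernetes Horizontal Pod Autoscaler, but the function name, parameters, and limits here are assumptions for illustration.

```python
import math


def desired_instances(current, observed_cpu, target_cpu, min_n=1, max_n=10):
    """Return the instance count that would bring average CPU near target_cpu."""
    if current < 1 or target_cpu <= 0:
        raise ValueError("current must be >= 1 and target_cpu must be > 0")
    wanted = math.ceil(current * observed_cpu / target_cpu)
    # Clamp so a metric glitch cannot scale to zero or to an unbounded fleet.
    return max(min_n, min(max_n, wanted))
```

With 4 instances at 90% CPU and a 60% target, `desired_instances(4, 90, 60)` returns `6`; at 30% CPU it returns `2`. This only addresses CPU-bound load, so slow responses should still be profiled before scaling.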

Health Check Principles

What Constitutes Healthy

| Check | Meaning |
| --- | --- |
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |

Health Check Implementation

- Simple: return 200 whenever the process can answer
- Deep: check all dependencies before reporting healthy
- Choose based on what the load balancer needs
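
The difference between a simple and a deep check can be sketched with Python's built-in `http.server`. The paths `/healthz` and `/readyz`, the `CHECKS` dictionary, and the placeholder check functions are assumptions for this example; real checks would ping the database and other dependencies.

```python
import json
from http.server import BaseHTTPRequestHandler


def deep_health(checks):
    """Run each named check and return (http_status, body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    status = "healthy" if healthy else "unhealthy"
    return (200 if healthy else 503), {"status": status, "checks": results}


# Hypothetical placeholder checks; replace with real dependency probes.
CHECKS = {"database": lambda: True, "disk": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":  # simple: the process can answer
            status, body = 200, {"status": "ok"}
        elif self.path == "/readyz":  # deep: dependencies are usable
            status, body = deep_health(CHECKS)
        else:
            status, body = 404, {"error": "not found"}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

To try it locally, serve the handler with `http.server.HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()`. Returning 503 from the deep check lets a load balancer stop routing traffic to the instance, while the simple check stays cheap enough to poll often.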

Security Principles

| Area | Principle |
| --- | --- |
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment variables, not files |
| Audit | Log access and changes |
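
For the "Secrets" principle, a minimal sketch of reading a secret from the environment and failing fast when it is missing is shown below. The helper name `require_env` and the variable name in the usage note are hypothetical.

```python
import os


def require_env(name):
    """Return the value of a required environment variable or fail fast."""
    value = os.environ.get(name)
    if not value:
        # Report the variable name only; never echo secret values.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Calling `require_env("DB_PASSWORD")` at startup makes a missing secret an immediate, obvious failure instead of a confusing error on the first database query.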

Troubleshooting Priority

When something's wrong:

1. Check if running (process status)
2. Check logs (error messages)
3. Check resources (disk, memory, CPU)
4. Check network (ports, DNS)
5. Check dependencies (database, APIs)
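
The ordered checks above can be scripted so that the first failing step is reported and the rest are skipped. The sketch below covers only the resource and network steps with standard-library calls. The helper names, the 90% disk limit, and the host and port in the usage note are assumptions.

```python
import shutil
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or DNS failure
        return False


def disk_ok(path="/", max_used_percent=90):
    """Return True if disk usage at path is below the given limit."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return 100 * usage.used / usage.total < max_used_percent


def first_failure(steps):
    """Run (name, check) pairs in order; return the first failing name."""
    for name, check in steps:
        if not check():
            return name
    return None  # every step passed
```

A run might look like `first_failure([("resources", disk_ok), ("network", lambda: port_open("127.0.0.1", 5432))])`, where the port is a hypothetical local database. Stopping at the first failure mirrors the priority order: there is little point testing dependencies if the disk is already full.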

Anti-Patterns

| ❌ Don't | ✅ Do |
| --- | --- |
| Run as root | Use a non-root user |
| Ignore logs | Review logs and set up log rotation |
| Skip monitoring | Monitor from day one |
| Restart manually | Configure auto-restart |
| Skip backups | Keep a regular backup schedule |

Remember: A well-managed server is boring. That's the goal.
