Skill: Documentation (ADR & Runbook)
Category: Documentation Version: 1.0.0 Used By: All agents, Phase 8
Overview
Create Architecture Decision Records (ADRs) and Runbooks for operational documentation.
Part 1: Architecture Decision Records (ADR)
When to Create ADR
-
Choosing between technologies
-
Significant architectural changes
-
New patterns or conventions
-
Deprecating existing approaches
ADR Template
ADR-[NUMBER]: [TITLE]
Status: [Proposed | Accepted | Deprecated | Superseded by ADR-XXX] Date: YYYY-MM-DD Deciders: [Names/Teams]
Context
[What is the issue? Why do we need to make a decision?]
Decision
[What is the change being proposed/decided?]
Options Considered
Option 1: [Name]
- Pros: [Benefits]
- Cons: [Drawbacks]
Option 2: [Name]
- Pros: [Benefits]
- Cons: [Drawbacks]
Option 3: [Name]
- Pros: [Benefits]
- Cons: [Drawbacks]
Consequences
Positive
- [Benefit 1]
- [Benefit 2]
Negative
- [Tradeoff 1]
- [Tradeoff 2]
Risks
- [Risk 1] - Mitigation: [How to handle]
References
- [Link to relevant docs/discussions]
ADR Example
ADR-001: Use PostgreSQL for Primary Database
Status: Accepted Date: 2025-01-15 Deciders: Backend Team, DevOps
Context
We need a relational database for our new application. The application requires ACID compliance, complex queries, and JSON support.
Decision
Use PostgreSQL 16 as the primary database.
Options Considered
Option 1: PostgreSQL
- Pros: ACID, JSON support, excellent performance, open source
- Cons: Requires more ops expertise than managed solutions
Option 2: MySQL
- Pros: Familiar, widely supported
- Cons: Weaker JSON support, licensing concerns
Option 3: MongoDB
- Pros: Flexible schema, easy scaling
- Cons: Not ideal for relational data, eventual consistency
Consequences
Positive
- Full ACID compliance
- Native JSON/JSONB support
- Strong ecosystem and tooling
Negative
- Team needs PostgreSQL training
- More complex backup strategy
ADR Naming Convention
docs/adr/ ├── ADR-001-database-selection.md ├── ADR-002-authentication-strategy.md ├── ADR-003-api-versioning.md └── README.md (index)
Part 2: Runbook
When to Create Runbook
-
New service deployment
-
Common operational tasks
-
Incident response procedures
-
On-call handoff documentation
Runbook Template
Runbook: [Service/Task Name]
Service: [Service name] Owner: [Team/Person] Last Updated: YYYY-MM-DD On-Call: [Rotation/Contact]
Overview
[Brief description of what this runbook covers]
Prerequisites
- Access to [system/tool]
- Credentials for [service]
- VPN connected (if applicable)
Common Operations
Start Service
# Command to start
systemctl start service-name
# Verify running
systemctl status service-name
Stop Service
# Graceful shutdown
systemctl stop service-name
# Force stop (if graceful fails)
systemctl kill service-name
Check Logs
# Recent logs
journalctl -u service-name -n 100
# Follow logs
journalctl -u service-name -f
# Search for errors
journalctl -u service-name | grep -i error
Health Check
# Endpoint check
curl -s http://localhost:8080/health | jq
# Expected response
# { "status": "healthy", "version": "1.0.0" }
Troubleshooting
Issue: Service Won't Start
Symptoms: Service fails to start, exits immediately
Diagnosis:
journalctl -u service-name -n 50
Common Causes:
- Missing environment variables → Check .env
file
- Port already in use → lsof -i :8080
- Database connection failed → Check DB connectivity
Resolution:
# Fix env vars
source /etc/service-name/env
# Restart
systemctl restart service-name
Issue: High Memory Usage
Symptoms: Memory > 80% threshold
Diagnosis:
# Check memory
free -h
ps aux --sort=-%mem | head -10
Resolution:
# Restart service (temporary)
systemctl restart service-name
# Scale if needed
kubectl scale deployment service-name --replicas=3
Alerts & Escalation
Alert
Severity
Action
Escalate After
Service Down
Critical
Restart, check logs
5 min
High CPU
Warning
Monitor, scale if needed
15 min
High Memory
Warning
Restart if > 90%
10 min
Error Rate > 5%
Critical
Check logs, rollback
5 min
Contacts
Role
Name
Contact
Primary On-Call
[Name]
[Slack/Phone]
Secondary
[Name]
[Slack/Phone]
Team Lead
[Name]
[Slack/Phone]
Related Documentation
- Service Architecture
- Deployment Guide
- Monitoring Dashboard
### Runbook Naming Convention
docs/runbooks/
├── api-service.md
├── database-maintenance.md
├── deployment-rollback.md
├── incident-response.md
└── README.md (index)
---
## Documentation Checklist
### ADR Checklist
- [ ] Clear problem statement
- [ ] Options evaluated objectively
- [ ] Decision clearly stated
- [ ] Consequences documented
- [ ] Numbered and indexed
### Runbook Checklist
- [ ] Prerequisites listed
- [ ] Commands are copy-paste ready
- [ ] Common issues documented
- [ ] Escalation path defined
- [ ] Contacts current
---
## Best Practices
### Do's
- Keep ADRs immutable (supersede, don't edit)
- Test runbook commands before documenting
- Include "why" not just "what"
- Review runbooks after incidents
- Index all documentation
### Don'ts
- Delete old ADRs (mark superseded)
- Write runbooks without testing
- Assume reader knows context
- Let docs become stale
- Skip the troubleshooting section
---
**Version:** 1.0.0 | **Last Updated:** 2025-11-28