IT Operations Expert
A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.
Core Principles
- Service Reliability First
  - Proactive Monitoring: Implement comprehensive observability before incidents occur
  - Incident Management: Structured response processes with clear escalation paths
  - SLA/SLO Management: Define and maintain service level objectives aligned with business needs
  - Continuous Improvement: Learn from incidents through blameless post-mortems
- Automation Over Manual Processes
  - Infrastructure as Code: Manage infrastructure configuration through version-controlled code
  - Runbook Automation: Convert manual procedures into automated workflows
  - Self-Healing Systems: Implement automated remediation for common issues
  - Configuration Management: Maintain consistency across environments
- ITIL Service Management
  - Service Strategy: Align IT services with business objectives
  - Service Design: Design resilient, scalable services
  - Service Transition: Manage changes with minimal disruption
  - Service Operation: Deliver and support services effectively
  - Continual Service Improvement: Iteratively enhance service quality
- Operational Excellence
  - Documentation: Maintain current runbooks, procedures, and architecture diagrams
  - Knowledge Management: Build searchable knowledge bases from incident resolutions
  - Capacity Planning: Forecast and provision resources proactively
  - Cost Optimization: Balance performance requirements with infrastructure costs
Core Workflow
Infrastructure Operations Workflow
1. MONITORING & OBSERVABILITY (see the error-budget sketch after this list)
   ├─ Define SLIs/SLOs/SLAs for critical services
   ├─ Implement metrics collection (infrastructure, application, business)
   ├─ Configure alerting with proper thresholds and escalation
   ├─ Build dashboards for different audiences (ops, devs, executives)
   └─ Establish on-call rotation and escalation procedures

2. INCIDENT MANAGEMENT
   ├─ Receive alert or user report
   ├─ Assess severity and impact (P1/P2/P3/P4)
   ├─ Engage appropriate responders
   ├─ Investigate and diagnose root cause
   ├─ Implement fix or workaround
   ├─ Communicate status to stakeholders
   ├─ Document resolution in knowledge base
   └─ Conduct post-incident review

3. CHANGE MANAGEMENT
   ├─ Submit change request with impact assessment
   ├─ Review and approve through CAB (Change Advisory Board)
   ├─ Schedule change window
   ├─ Execute change with rollback plan ready
   ├─ Validate success criteria
   ├─ Document actual vs planned results
   └─ Close change ticket

4. CAPACITY PLANNING
   ├─ Collect resource utilization trends
   ├─ Analyze growth patterns
   ├─ Forecast future requirements
   ├─ Plan procurement or provisioning
   ├─ Execute capacity additions
   └─ Monitor effectiveness

5. AUTOMATION & OPTIMIZATION
   ├─ Identify repetitive manual tasks
   ├─ Document current process
   ├─ Design automated solution
   ├─ Implement and test automation
   ├─ Deploy to production
   ├─ Measure time/cost savings
   └─ Iterate and improve
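To ground the SLI/SLO step in phase 1, here is a minimal sketch of an error-budget calculation for a 99.9% availability SLO. It assumes good/total event counts are already exported by a metrics backend; the function name and sample numbers are illustrative, not part of any particular tool.

```python
# Minimal sketch: availability SLI and remaining error budget for a 99.9% SLO.
# Assumes good/total event counts come from your metrics backend.

def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = blown)."""
    if total_events == 0:
        return 1.0                      # no traffic observed, budget untouched
    sli = good_events / total_events    # measured availability
    budget = 1.0 - slo_target           # allowed failure fraction (0.1%)
    spent = 1.0 - sli                   # observed failure fraction
    return (budget - spent) / budget

# Example: 10M requests this month, 7,500 failed -> 25% of the budget is left.
print(f"{error_budget_remaining(10_000_000 - 7_500, 10_000_000):.0%}")
```

Burn-rate alerting (paging when the budget is being consumed faster than planned) builds directly on this quantity.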
Decision Frameworks
Alert Configuration Decision Matrix
| Scenario | Alert Type | Threshold | Response Time | Escalation |
|---|---|---|---|---|
| Service completely down | Page | Immediate | < 5 min | Immediate to on-call |
| Service degraded | Page | 2-3 failures | < 15 min | After 15 min to on-call |
| High resource usage | Warning | > 80% sustained | < 1 hour | After 2 hours to team lead |
| Approaching capacity | Info | > 70% trend | < 24 hours | Weekly capacity review |
| Configuration drift | Ticket | Any deviation | < 7 days | Monthly review |
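The "sustained" threshold in the matrix deserves care: paging on a single sample invites false positives. A minimal sketch of a sustained-breach check, with an illustrative five-sample CPU window:

```python
# Sketch of the "sustained" condition from the matrix: only warn when every
# sample in the evaluation window exceeds the threshold, so brief spikes
# don't fire alerts. Window contents are illustrative.

def sustained_breach(samples: list[float], threshold: float = 80.0) -> bool:
    """True if all samples in the window are above the threshold."""
    return bool(samples) and all(s > threshold for s in samples)

cpu_window = [82.5, 85.1, 88.0, 84.3, 90.2]   # per-minute CPU %, last 5 min
if sustained_breach(cpu_window):
    print("WARNING: CPU > 80% sustained across the window")
```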
Incident Severity Classification
Priority 1 (Critical)
- Complete service outage affecting all users
- Data loss or security breach
- Financial impact > $10K/hour
- Response: Immediate, 24/7, all hands on deck
Priority 2 (High)
- Partial service outage affecting many users
- Significant performance degradation
- Financial impact $1K-$10K/hour
- Response: < 30 minutes during business hours
Priority 3 (Medium)
- Service degradation affecting some users
- Non-critical functionality impaired
- Workaround available
- Response: < 4 hours during business hours
Priority 4 (Low)
- Minor issues with minimal impact
- Cosmetic problems
- Enhancement requests
- Response: Next business day
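As a sketch, these rules translate into a simple decision function. The boolean inputs and dollar thresholds mirror the P1-P4 definitions above; the field names are illustrative rather than any tool's real schema.

```python
# Hypothetical triage helper mirroring the P1-P4 rules above.

def classify_priority(full_outage: bool, data_loss_or_breach: bool,
                      hourly_cost_usd: float, partial_outage: bool,
                      degraded_for_some: bool) -> str:
    if full_outage or data_loss_or_breach or hourly_cost_usd > 10_000:
        return "P1"   # immediate, 24/7, all hands
    if partial_outage or hourly_cost_usd >= 1_000:
        return "P2"   # < 30 min during business hours
    if degraded_for_some:
        return "P3"   # < 4 hours during business hours
    return "P4"       # next business day

# Partial outage costing ~$2.5K/hour -> P2.
print(classify_priority(False, False, 2_500, True, False))
```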
Change Management Risk Assessment
Risk Level = Impact × Likelihood × Complexity
Impact (1-5): 1 = Single user, 2 = Team, 3 = Department, 4 = Company-wide, 5 = Customer-facing
Likelihood of Issues (1-5): 1 = Routine, tested; 2 = Familiar, documented; 3 = Some uncertainty; 4 = New territory; 5 = Never done before
Complexity (1-5): 1 = Single component, 2 = Few components, 3 = Multiple systems, 4 = Cross-platform, 5 = Enterprise-wide
Risk Score Interpretation:
- 1-20: Standard change (pre-approved)
- 21-50: Normal change (CAB review)
- 51-75: High-risk change (extensive testing, senior approval)
- 76-125: Emergency change only (executive approval)
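The formula and score bands translate directly into code; a minimal sketch:

```python
# Direct translation of Risk Level = Impact × Likelihood × Complexity,
# using the 1-5 scores and score bands defined above.

def change_risk(impact: int, likelihood: int, complexity: int) -> tuple[int, str]:
    score = impact * likelihood * complexity        # ranges from 1 to 125
    if score <= 20:
        band = "Standard change (pre-approved)"
    elif score <= 50:
        band = "Normal change (CAB review)"
    elif score <= 75:
        band = "High-risk change (extensive testing, senior approval)"
    else:
        band = "Emergency change only (executive approval)"
    return score, band

# Department-wide impact (3), some uncertainty (3), few components (2) -> 18.
print(change_risk(3, 3, 2))   # (18, 'Standard change (pre-approved)')
```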
Monitoring Tool Selection
Requirement Prometheus + Grafana Datadog New Relic ELK Stack Splunk
Cost Free (self-hosted) $$$$ $$$$ Free-$$ $$$$$
Metrics Excellent Excellent Excellent Good Good
Logs Via Loki Excellent Excellent Excellent Excellent
Traces Via Tempo Excellent Excellent Limited Good
Learning Curve Steep Moderate Moderate Steep Steep
Cloud-Native Excellent Excellent Excellent Good Good
On-Premises Excellent Good Good Excellent Excellent
APM Via exporters Excellent Excellent Limited Good
Common Operational Challenges
Challenge 1: Alert Fatigue
Problem: Too many false positive alerts causing team burnout
Solution:
Alert Tuning Process:
- Measure baseline alert volume and false positive rate
- Categorize alerts by actionability:
  - Actionable + Urgent = Keep as page
  - Actionable + Not Urgent = Ticket
  - Not Actionable = Remove or convert to dashboard metric
- Implement alert aggregation (group similar alerts)
- Add context to alerts (runbook links, relevant metrics)
- Regular review meetings (weekly) to tune thresholds
- Track metrics (see the sketch after this list):
  - MTTA (Mean Time to Acknowledge): < 5 min target
  - False Positive Rate: < 20% target
  - Alert Volume per Week: trending down
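A minimal sketch of computing those three tracking metrics from a week of alert records. The tuple layout and sample data are illustrative; a real pipeline would pull these fields from the alerting tool's API.

```python
# Illustrative weekly tuning report: MTTA, false positive rate, volume.
from datetime import datetime, timedelta

# (fired_at, acknowledged_at, was_actionable) -- sample data for the sketch.
alerts = [
    (datetime(2024, 1, 8, 9, 0),  datetime(2024, 1, 8, 9, 3),  True),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 9), False),
    (datetime(2024, 1, 9, 2, 0),  datetime(2024, 1, 9, 2, 4),  True),
]

mtta = sum((ack - fired for fired, ack, _ in alerts), timedelta()) / len(alerts)
false_positive_rate = sum(not ok for *_, ok in alerts) / len(alerts)

print(f"MTTA: {mtta} (target < 5 min)")
print(f"False positive rate: {false_positive_rate:.0%} (target < 20%)")
print(f"Alert volume: {len(alerts)}/week (should trend down)")
```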
Challenge 2: Incident Documentation During Crisis
Problem: Teams skip documentation during high-pressure incidents
Solution:
- Assign a dedicated scribe role (not the incident commander)
- Use incident management tools (PagerDuty, Opsgenie) with automatic timelines
- Use template-based incident reports with required fields
- Schedule post-incident reviews automatically (within 48 hours)
- Gamify documentation (track and recognize thorough documentation)
Challenge 3: Knowledge Silos
Problem: Critical knowledge trapped in individual team members' heads
Solution:
Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have a runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion
Challenge 4: Balancing Stability vs Innovation
Problem: Operations team resists change to maintain stability
Solution:
- Implement change windows (planned maintenance periods)
- Use blue-green or canary deployments for lower risk
- Establish "innovation time" (Google's 20% time model)
- Create sandbox environments for experimentation
- Measure and reward both stability AND improvement metrics
- Include "toil reduction" as an OKR target
Key Metrics & KPIs
Service Reliability Metrics
- Availability
  - Formula: (Total Time - Downtime) / Total Time × 100
  - Target: 99.9% (43.8 min/month downtime)
  - Measurement: Per service, monthly
- MTTR (Mean Time to Recovery)
  - Formula: Sum of recovery times / Number of incidents
  - Target: < 30 minutes for P1, < 4 hours for P2
  - Measurement: Per severity level, monthly
- MTBF (Mean Time Between Failures)
  - Formula: Total operational time / Number of failures
  - Target: > 720 hours (30 days)
  - Measurement: Per service, quarterly
- MTTA (Mean Time to Acknowledge)
  - Formula: Sum of acknowledgment times / Number of alerts
  - Target: < 5 minutes for pages
  - Measurement: Per on-call engineer, weekly
- Change Success Rate
  - Formula: Successful changes / Total changes × 100
  - Target: > 95%
  - Measurement: Monthly
- Incident Recurrence Rate
  - Formula: Repeat incidents / Total incidents × 100
  - Target: < 10%
  - Measurement: Quarterly (same root cause within 90 days)
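As hedged sketches, the availability, MTTR, and change-success formulas above look like this in code; the inputs would come from an incident tracker and are illustrative here (units in minutes).

```python
# Sketches of the availability, MTTR, and change-success formulas above.

def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    return (total_minutes - downtime_minutes) / total_minutes * 100

def mttr_minutes(recovery_times_min: list[float]) -> float:
    return sum(recovery_times_min) / len(recovery_times_min)

def change_success_rate(successful: int, total: int) -> float:
    return successful / total * 100

# An average month is ~43,800 minutes, so 43.8 minutes down is exactly 99.9%.
print(f"{availability_pct(43_800, 43.8):.2f}%")     # 99.90%
print(f"{mttr_minutes([12, 45, 20]):.1f} min")      # 25.7 -> meets P1 target
print(f"{change_success_rate(58, 60):.1f}%")        # 96.7% -> above 95%
```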
Operational Efficiency Metrics
- Toil Percentage
  - Definition: Time spent on manual, repetitive tasks
  - Target: < 30% of team capacity
  - Measurement: Weekly time tracking
- Automation Coverage
  - Formula: Automated tasks / Total repetitive tasks × 100
  - Target: > 70%
  - Measurement: Quarterly audit
- On-Call Load
  - Formula: Alerts per on-call shift
  - Target: < 5 actionable alerts per shift
  - Measurement: Per engineer, weekly
- Runbook Coverage
  - Formula: Services with runbooks / Total services × 100
  - Target: 100%
  - Measurement: Monthly audit
- Knowledge Base Utilization
  - Formula: Incidents resolved via KB / Total incidents × 100
  - Target: > 40%
  - Measurement: Monthly
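The efficiency formulas follow the same pattern; a brief sketch with illustrative inputs:

```python
# Sketches of the toil and automation-coverage formulas above.

def toil_pct(toil_hours: float, total_hours: float) -> float:
    return toil_hours / total_hours * 100

def automation_coverage_pct(automated: int, repetitive_total: int) -> float:
    return automated / repetitive_total * 100

# A 5-person team logging 96 toil hours out of 400 -> 24%, under the 30% cap.
print(f"Toil: {toil_pct(96, 400):.0f}%")
# 32 of 40 identified repetitive tasks automated -> 80%, above the 70% target.
print(f"Automation coverage: {automation_coverage_pct(32, 40):.0f}%")
```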
Integration Points
With Development Teams
- Participate in design reviews for operational requirements
- Provide deployment automation and CI/CD pipeline support
- Share monitoring and logging requirements
- Collaborate on incident response and post-mortems
- Joint ownership of SLOs and error budgets
With Security Teams
- Implement security monitoring and alerting
- Manage access controls and authentication systems
- Coordinate vulnerability patching and remediation
- Conduct security incident response
- Maintain compliance with security policies
With Business Stakeholders
- Report on service availability and performance
- Communicate planned maintenance windows
- Provide capacity planning forecasts
- Translate technical metrics to business impact
- Participate in business continuity planning
Best Practices
- Blameless Post-Mortems
Post-Incident Review Template:
- Incident Summary (what happened, when, impact)
- Timeline of Events (detailed chronology)
- Root Cause Analysis (5 Whys or Fishbone)
- What Went Well (strengths during response)
- What Could Be Improved (opportunities)
- Action Items (with owners and due dates)
- Lessons Learned (shareable insights)
Rules:
- No blame or punishment
- Focus on systems and processes, not people
- Everyone can speak freely
- Action items must be tracked to completion
- Runbook Standards
Runbook Contents:
- Service Overview: Purpose, dependencies, architecture
- SLIs/SLOs/SLAs: Defined thresholds and targets
- Common Issues: Symptoms, causes, solutions
- Troubleshooting Steps: Step-by-step procedures
- Escalation Paths: Who to contact and when
- Useful Commands: Copy-paste ready commands
- Dashboard Links: Direct links to relevant dashboards
- Recent Changes: Link to change log
- Contact Information: Team, product owner, SMEs
Maintenance:
- Review quarterly or after major incidents
- Test procedures during low-traffic periods
- Update after every significant change
- Track usage metrics (page views, helpfulness ratings)
- On-Call Best Practices
On-Call Preparation:
- Laptop with VPN access
- Mobile device with notification apps
- Contact list (escalation paths)
- Access to all critical systems
- Runbooks bookmarked
- Backup on-call identified
During On-Call:
- Acknowledge alerts within 5 minutes
- Update incident status regularly
- Follow escalation procedures
- Document all actions in incident ticket
- Handoff clearly to next on-call
Post On-Call:
- Complete incident reports
- Submit toil reduction tickets
- Provide feedback on runbooks
- Update on-call documentation
- Change Management Discipline
Standard Change Process:
- Create change request (RFC)
- Document:
  - What: Specific changes being made
  - Why: Business justification
  - When: Proposed date/time
  - Who: Change implementer and approver
  - How: Step-by-step procedure
  - Risk: Assessment and mitigation
  - Rollback: Detailed rollback plan
  - Testing: Validation steps
- Submit for CAB review (7 days advance notice)
- Implement during approved window
- Validate success criteria
- Close change with actual results
- Post-implementation review if issues occurred
Emergency Change Process:
- Executive approval required
- Implement with heightened monitoring
- Full team notification
- Complete documentation within 24 hours
- Mandatory post-change review
Reference Files
For detailed technical guidance, see:
- reference/monitoring.md - Observability, metrics, alerting, and dashboard design
- reference/incident-management.md - Incident response, root cause analysis, post-mortems
- reference/infrastructure.md - Server management, network operations, capacity planning
- reference/automation.md - Scripting, configuration management, orchestration tools
- reference/backup-recovery.md - Backup strategies, disaster recovery, business continuity
Getting Started
- For New Infrastructure: Start with reference/infrastructure.md for setup guidance
- For Monitoring Setup: Review reference/monitoring.md for observability strategy
- For Incident Response: See reference/incident-management.md for procedures
- For Automation Projects: Check reference/automation.md for tooling recommendations
- For DR Planning: Consult reference/backup-recovery.md for recovery strategies