Here's the complete updated runbook with the new section for incident `incident-2024-demo` appended:

```markdown
# Service Restart Runbook

## Overview
Covers unexpected service restarts and crash loops.

## Detection
- Pod restart count > 3 in 10 minutes
- OOMKilled events in Kubernetes

## Diagnosis Steps
1. Check pod logs: kubectl logs --previous
2. Check memory/CPU limits
3. Look for OOMKilled in events
4. Review recent config changes

## Mitigation
1. Increase memory limits if OOMKilled
2. Fix memory leaks in code
3. Add liveness probe tuning

## Prevention Checklist
- [ ] Set resource limits on all containers
- [ ] Add memory leak detection in CI
- [ ] Configure proper liveness/readiness probes
- [ ] Set TTL on in-memory caches to prevent memory leaks

## Lessons from Incident INC-DEMO-001
- **What happened**: A service restart loop was detected on the auth-service pod, leading to service disruptions.
- **Root cause**: The root cause was identified as a memory leak in the JWT token cache.
- **Key lesson learned**: Always set a Time-To-Live (TTL) on in-memory caches to prevent memory leaks.
- **New checklist item(s) to prevent recurrence**:
  - [ ] Set TTL on in-memory caches to mitigate memory leaks.

## Lessons from Incident INC-DEMO-003
- **What happened**: A service restart loop was detected on the auth-service pod, causing service disruptions.
- **Root cause**: The root cause was identified as a memory leak in the JWT token cache not being cleared.
- **Key lesson learned**: Always set a Time-To-Live (TTL) on in-memory caches to prevent unbounded growth.
- **New checklist item(s) to prevent recurrence**:
  - [ ] Set TTL on in-memory caches to prevent unbounded growth.

## Lessons from Incident INC-DEMO-004
- **What happened**: A service restart loop was detected on the auth-service pod.
- **Root cause**: The root cause was identified as a memory leak in the JWT token cache not being cleared.
- **Key lesson learned**: Always set a Time-To-Live (TTL) on in-memory caches to prevent unbounded growth.
- **New checklist item(s) to prevent recurrence**:
  - [ ] Set TTL on in-memory caches to prevent unbounded growth.

## Lessons from Incident INC-2024-007
- **What happened**: A service restart loop was detected on the auth-service pod.
- **Root cause**: Memory leak in the JWT token cache caused the service restart loop.
- **Key lesson learned**: Always set TTL on in-memory caches to prevent memory leaks.
- **New checklist item(s) to prevent recurrence**:
  - [ ] Set TTL on in-memory caches to prevent memory leaks.

## Lessons from Incident incident-2024-demo
- **What happened**: The service returned 500 errors for 14 minutes starting at 02:34 UTC, with an error rate peaking at 85%. All `/work` endpoint requests failed.
- **Root cause**: A misconfigured environment variable caused the worker pool to exhaust all connections after a cold start.
- **Key lesson learned**: Alert thresholds should be set lower to detect issues earlier.
- **New checklist item(s) to prevent recurrence**:
  - [ ] Set alert thresholds to 5% to ensure early detection of issues.
```

This updated runbook now includes the lessons learned from the incident `incident-2024-demo`, ensuring that the information is comprehensive and actionable.
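Since "set a TTL on in-memory caches" recurs across several of the incidents in the runbook, here is a minimal sketch of what that could look like in practice. It is not the auth-service's actual cache: the `TTLCache` class, its method names, and the injectable `clock` parameter are all hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        # Maps key -> (expiry timestamp, cached value).
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Purge on every write so stale entries cannot accumulate unbounded.
        self._purge_expired()
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def _purge_expired(self) -> None:
        now = self._clock()
        expired = [k for k, (exp, _) in self._store.items() if now >= exp]
        for k in expired:
            del self._store[k]

    def __len__(self) -> int:
        return len(self._store)
```

With a cache like this, a token stored via `cache.set(token_id, claims)` is dropped once `ttl_seconds` have elapsed, either on the next `get` for that key or on the next `set` for any key. A production cache would likely also want a maximum size and thread safety, which this sketch leaves out.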