Iotum Incident Report: Redis-Related Incident - 11/13/2024
Sara Atteby
Summary of Root Cause
The Redis cluster used for websockets, in-memory caching, and background job queueing experienced increased latency and memory pressure. Front-end web instances spiked in CPU load while trying to reconnect, which triggered autoscaling to add more web instances and further increased the pressure on Redis.
As a result, websocket communication failed, which prevented participants from joining video conferences.
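The reconnect behaviour described above is what amplified the failure: every client retried against an already-slow Redis-backed websocket layer at once. Purely as an illustration of that dynamic (not the code iotum runs), the sketch below shows a client reconnect loop with capped exponential backoff and jitter, the usual way to keep retries from snowballing. The endpoint URL and the use of the `websockets` Python library are assumptions.

```python
import asyncio
import random

import websockets  # assumed client library; any websocket client follows the same pattern


async def consume(url: str = "wss://example.iotum.app/socket") -> None:
    """Reconnect with capped exponential backoff + jitter so a slow backend
    does not turn into a reconnect storm (hypothetical endpoint URL)."""
    delay = 1.0  # seconds before the first retry
    while True:
        try:
            async with websockets.connect(url) as ws:
                delay = 1.0  # healthy connection: reset the backoff
                async for message in ws:
                    print("received:", message)
        except (OSError, websockets.ConnectionClosed):
            # Jitter spreads clients out so they do not all retry at the same instant.
            await asyncio.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, 60.0)  # cap the backoff at one minute


if __name__ == "__main__":
    asyncio.run(consume())
```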
Action Plan to Prevent Future Service Incidents
Add monitoring for unusual increases in both Redis memory usage and web instance autoscaling group size (see the monitoring sketch after this list)
Enable logging on Redis to monitor for unusual requests and/or evidence of application issues (see the slow-log sketch after this list)
Audit Redis usage across the entire platform and make improvements so that memory pressure remains consistently at or below 80%
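As a rough illustration of the first item, the sketch below polls Redis memory usage and the web tier's autoscaling group size and flags both when they cross a threshold. The host name, group name, and thresholds are placeholders, it assumes the `redis` and `boto3` Python packages, and the real checks would live in whatever monitoring stack is already in place.

```python
import boto3  # assumed: the web tier autoscales on AWS
import redis  # redis-py client

REDIS_HOST = "redis.internal"      # placeholder host name
WEB_ASG_NAME = "frontend-web"      # placeholder autoscaling group name
MEMORY_ALERT_RATIO = 0.80          # alert at 80% of maxmemory
ASG_ALERT_SIZE = 20                # alert when the group grows past this


def check_redis_memory() -> float:
    """Return used_memory as a fraction of maxmemory (0.0 if no limit is set)."""
    info = redis.Redis(host=REDIS_HOST, port=6379).info("memory")
    maxmemory = info.get("maxmemory", 0)
    return info["used_memory"] / maxmemory if maxmemory else 0.0


def check_web_asg_size() -> int:
    """Return the current desired capacity of the front-end autoscaling group."""
    asg = boto3.client("autoscaling").describe_auto_scaling_groups(
        AutoScalingGroupNames=[WEB_ASG_NAME]
    )["AutoScalingGroups"][0]
    return asg["DesiredCapacity"]


if __name__ == "__main__":
    ratio = check_redis_memory()
    size = check_web_asg_size()
    if ratio >= MEMORY_ALERT_RATIO:
        print(f"ALERT: Redis memory at {ratio:.0%} of maxmemory")
    if size >= ASG_ALERT_SIZE:
        print(f"ALERT: web autoscaling group at {size} instances")
```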
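For the logging item, Redis ships a built-in slow log that can surface unusual or expensive requests without the overhead of MONITOR. The snippet below, again only a sketch with placeholder values, enables it and prints recent slow entries via redis-py; on a managed Redis the same settings may need to be applied through a parameter group instead of CONFIG SET.

```python
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder host

# Log any command that takes longer than 10 ms (value is in microseconds)
# and keep the most recent 256 entries in memory.
r.config_set("slowlog-log-slower-than", 10_000)
r.config_set("slowlog-max-len", 256)

# Periodically pull the slow log into application logs or a dashboard.
for entry in r.slowlog_get(25):
    print(entry["id"], entry["duration"], entry["command"])
```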
Timeline (UTC)
17:10 memory pressure begins
~17:30 first reports from users
17:32 escalated to DevOps by Product
17:45 Redis identified as root cause
17:53 Redis upsize begins
18:10 Redis upsize completes
18:25 all front-end web instances replaced
18:30 outage resolved
Method of Discovery
Reported by a handful of users; escalated by iotum’s Product team to DevOps
Scope of Impact
All user-facing web applications
Resolution
Increased the size of the Redis cluster and replaced the failed front-end web instances
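If the cluster is a managed one (for example AWS ElastiCache, which is an assumption here rather than something stated in this report), the upsize step can be performed in place along these lines; the replication group id and target node type are placeholders.

```python
import boto3

elasticache = boto3.client("elasticache")

# Assumption: the cluster is an ElastiCache replication group. Resizing the
# node type while serving traffic is applied immediately here.
elasticache.modify_replication_group(
    ReplicationGroupId="iotum-redis",
    CacheNodeType="cache.r6g.xlarge",
    ApplyImmediately=True,
)
```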