
Iotum Incident Report: Redis Related Incident - 11/13/2024

Summary of Root Cause

The Redis cluster used for websockets, in-memory caching, and background job queueing experienced increased latency and memory pressure. Front-end web instances raised their CPU load as they repeatedly attempted to reconnect, which triggered autoscaling to add more web instances and placed further pressure on Redis.


This resulted in websocket communication failing, which prevented participants from joining video conferences.
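
For reference, this kind of memory pressure can be confirmed directly from Redis. The snippet below is a minimal sketch using the redis-py client; the hostname, port, and the 80% threshold are illustrative assumptions rather than values from our environment, and in a cluster the check would be run per node.

    import redis

    # Illustrative endpoint; substitute the real node address.
    r = redis.Redis(host="redis.example.internal", port=6379)

    mem = r.info("memory")
    used = mem["used_memory"]
    maxmem = mem["maxmemory"]  # 0 means no maxmemory limit is configured

    if maxmem:
        pressure = used / maxmem
        print(f"Memory pressure: {pressure:.0%}")
        if pressure >= 0.80:  # assumed alerting threshold
            print("WARNING: memory pressure at or above 80%")
    else:
        print(f"Used memory: {used} bytes (no maxmemory limit set)")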



Action Plan to Prevent Future Service Incidents

  • Add monitoring for unusual increases in both Redis memory usage and web instance autoscaling group size (a check along these lines is sketched after this list)
  • Enable logging on Redis to monitor for unusual requests and/or evidence of application issues
  • Audit Redis usage across the entire platform and make improvements to ensure memory pressure remains consistently at or below 80%
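
One possible shape for the monitoring mentioned in the first item is sketched below. It assumes an AWS environment with boto3; the cluster ID, autoscaling group name, metric choice, and thresholds are illustrative assumptions rather than our actual configuration.

    from datetime import datetime, timedelta, timezone

    import boto3

    # Illustrative identifiers and thresholds; not our real configuration.
    CACHE_CLUSTER_ID = "redis-cluster-001"
    ASG_NAME = "web-frontend-asg"
    MEMORY_THRESHOLD = 80.0   # percent, matching the target above
    ASG_SIZE_THRESHOLD = 20   # assumed "unusual" instance count

    cloudwatch = boto3.client("cloudwatch")
    autoscaling = boto3.client("autoscaling")
    now = datetime.now(timezone.utc)

    # Average Redis memory usage over the last five minutes.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName="DatabaseMemoryUsagePercentage",
        Dimensions=[{"Name": "CacheClusterId", "Value": CACHE_CLUSTER_ID}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    if points and points[-1]["Average"] >= MEMORY_THRESHOLD:
        print("ALERT: Redis memory pressure above threshold")

    # Current size of the front-end web autoscaling group.
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    size = len(groups["AutoScalingGroups"][0]["Instances"])
    if size >= ASG_SIZE_THRESHOLD:
        print(f"ALERT: web autoscaling group at {size} instances")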



Timeline (UTC)

17:10 Memory pressure begins
~17:30 First reports from users
17:32 Escalated to DevOps by Product
17:45 Redis identified as root cause
17:53 Redis upsize begins
18:10 Redis upsize completes
18:25 All front-end web instances replaced
18:30 Outage resolved



Method of Discovery

Reported by a handful of users; escalated by iotum’s Product team to DevOps


Scope of Impact

All user-facing web applications


Resolution

Increased the size of the Redis cluster and replaced the failed front-end web instances
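
For context, if the Redis cluster is an AWS ElastiCache replication group, the upsize step can be expressed roughly as below; a minimal sketch assuming boto3, with the replication group ID and node type as illustrative placeholders.

    import boto3

    elasticache = boto3.client("elasticache")

    # Illustrative identifiers; substitute the real replication group and a larger node type.
    elasticache.modify_replication_group(
        ReplicationGroupId="redis-cluster",
        CacheNodeType="cache.r6g.xlarge",
        ApplyImmediately=True,
    )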
