
Iotum Incident Report: Redis Related Incident - 11/13/2024

Summary of Root Cause

The Redis cluster used for websockets, in-memory caching, and background job queueing experienced increased latency and memory pressure. Front-end web instances raised their CPU load as they repeatedly attempted to reconnect, which triggered autoscaling to add more web instances and placed further pressure on Redis.


This resulted in websocket communication failing, which prevented participants from joining video conferences.
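
For reference, this kind of memory pressure can be confirmed directly from Redis. The snippet below is a minimal sketch using the redis-py client; the hostname, port, and the 80% threshold are illustrative assumptions rather than values from our environment, and in a cluster the check would be run per node.

    import redis

    # Illustrative endpoint; substitute the real node address.
    r = redis.Redis(host="redis.example.internal", port=6379)

    mem = r.info("memory")
    used = mem["used_memory"]
    maxmem = mem["maxmemory"]  # 0 means no maxmemory limit is configured

    if maxmem:
        pressure = used / maxmem
        print(f"Memory pressure: {pressure:.0%}")
        if pressure >= 0.80:  # assumed alerting threshold
            print("WARNING: memory pressure at or above 80%")
    else:
        print(f"Used memory: {used} bytes (no maxmemory limit set)")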



Action Plan to Prevent Future Service Incidents

  • Add monitoring for unusual increases in both Redis memory usage and web instance autoscaling group size (a check along these lines is sketched after this list)
  • Enable logging on Redis to monitor for unusual requests and/or evidence of application issues
  • Audit Redis usage across the entire platform and make improvements to ensure memory pressure remains consistently at or below 80%
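
One possible shape for the monitoring mentioned in the first item is sketched below. It assumes an AWS environment with boto3; the cluster ID, autoscaling group name, metric choice, and thresholds are illustrative assumptions rather than our actual configuration.

    from datetime import datetime, timedelta, timezone

    import boto3

    # Illustrative identifiers and thresholds; not our real configuration.
    CACHE_CLUSTER_ID = "redis-cluster-001"
    ASG_NAME = "web-frontend-asg"
    MEMORY_THRESHOLD = 80.0   # percent, matching the target above
    ASG_SIZE_THRESHOLD = 20   # assumed "unusual" instance count

    cloudwatch = boto3.client("cloudwatch")
    autoscaling = boto3.client("autoscaling")
    now = datetime.now(timezone.utc)

    # Average Redis memory usage over the last five minutes.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName="DatabaseMemoryUsagePercentage",
        Dimensions=[{"Name": "CacheClusterId", "Value": CACHE_CLUSTER_ID}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    if points and points[-1]["Average"] >= MEMORY_THRESHOLD:
        print("ALERT: Redis memory pressure above threshold")

    # Current size of the front-end web autoscaling group.
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    size = len(groups["AutoScalingGroups"][0]["Instances"])
    if size >= ASG_SIZE_THRESHOLD:
        print(f"ALERT: web autoscaling group at {size} instances")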



Timeline (UTC)

17:10 Memory pressure begins
~17:30 First reports from users
17:32 Escalated to DevOps by Product
17:45 Redis identified as root cause
17:53 Redis upsize begins
18:10 Redis upsize completes
18:25 All front-end web instances replaced
18:30 Outage resolved



Method of Discovery

Reported by a handful of users; escalated by iotum’s Product team to DevOps


Scope of Impact

All user-facing web applications


Resolution

Increased the size of the Redis cluster and replaced the failed front-end web instances
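
For context, if the Redis cluster is an AWS ElastiCache replication group, the upsize step can be expressed roughly as below; a minimal sketch assuming boto3, with the replication group ID and node type as illustrative placeholders.

    import boto3

    elasticache = boto3.client("elasticache")

    # Illustrative identifiers; substitute the real replication group and a larger node type.
    elasticache.modify_replication_group(
        ReplicationGroupId="redis-cluster",
        CacheNodeType="cache.r6g.xlarge",
        ApplyImmediately=True,
    )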
