Affected time range (times are CST): January 6, 2022, 9:38 AM through 11:36 AM
Description: A surge of website login requests caused significant latency in processing those requests. The increased latency caused some login attempts to time out, and users experiencing timeouts would repeatedly try to log in again. These repeated attempts compounded the queue of pending requests. The resulting high volume of repeated requests created contention for database access, which ultimately overwhelmed our database and caused a service outage.
A bug in our website session handling was introduced into production at the end of 2019. This bug polluted our session record store with “blank” records that were not tied to any actual user session. However, these invalid records were periodically destroyed by a job that cleans up expired sessions at regular intervals. This routine cleanup kept the bug largely invisible in the system.
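A cleanup job of the kind described above can be sketched as follows. This is a minimal illustration, not the actual SCORM Cloud implementation; the `sessions` table and `expires_at` column are hypothetical names. The key point is that because even the “blank” records carried an expiry timestamp, each sweep removed them along with legitimately expired sessions.

```python
import sqlite3
import time

def cleanup_expired_sessions(conn, now=None):
    """Delete any session row whose expiry time has passed.

    Hypothetical sketch: a periodic job like this would also sweep away
    "blank" records (which still have an expiry), masking the bug.
    """
    now = int(time.time()) if now is None else now
    cur = conn.execute("DELETE FROM sessions WHERE expires_at <= ?", (now,))
    conn.commit()
    return cur.rowcount  # number of records removed this run

# Demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id TEXT, expires_at INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions (user_id, expires_at) VALUES (?, ?)",
    [("alice", 100), (None, 100), (None, 999)],  # two "blank" records
)
removed = cleanup_expired_sessions(conn, now=200)
print(removed)  # the two rows that expired at t=100
```

In production such a job would typically run on a scheduler (cron or similar) rather than being invoked inline as in this demo.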
On the morning of January 6th, 2022 there was a surge in the creation of these “blank” records. The growth in total session records began to strain a single suboptimal query that runs during the routine that builds a user session at login. Since the SCORM Cloud website is largely a management/admin portal, the number of active sessions is historically quite low. This meant that as long as the dataset stayed relatively small, the suboptimal query performed adequately.
At around 10:35 AM we experienced a surge of website login requests. Each of these requests took longer than normal due to the increase in login attempts, the larger dataset of session records, and the suboptimal query against that dataset. At the beginning of this surge we saw login requests successfully serviced in roughly 3-4 seconds. As more requests came in, this slowed to roughly 10 seconds. At this point we began to see multiple requests from the same client IP addresses, indicating that users were refreshing their browsers and attempting to log in again. As more users made these repeated requests, our response time on login attempts grew dramatically.
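The compounding effect of timeout-driven retries can be illustrated with a toy queue model. The numbers below are illustrative only, not measurements from the incident, and the doubling assumption is a deliberate simplification of user retry behavior:

```python
def simulate(arrivals_per_tick, capacity_per_tick, retry, ticks):
    """Toy discrete-time queue: each tick new logins arrive and the server
    drains at most `capacity_per_tick`. With retry=True, every request
    still waiting at the end of a tick is assumed to time out and be
    re-submitted, doubling its footprint in the queue."""
    queue = 0
    for _ in range(ticks):
        queue += arrivals_per_tick
        queue = max(0, queue - capacity_per_tick)
        if retry:
            queue *= 2  # impatient users refresh and try to log in again
    return queue

# Slightly over capacity: the backlog grows linearly without retries...
print(simulate(10, 8, retry=False, ticks=10))  # 20
# ...but explodes once timed-out users start re-submitting.
print(simulate(10, 8, retry=True, ticks=10))   # 4092
```

Even this crude model shows the dynamic in the report: a modest overload becomes a runaway backlog once each slow request spawns duplicates of itself.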
Eventually, nearly all of our available database connections were blocked waiting on these duplicate login requests to resolve. This caused application server health checks to fail, and those servers were marked for removal and replacement. The replacement servers struggled to establish an initial healthy state due to the existing database contention. The reduction in available healthy servers resulted in intermittent service disruption across the system. The SCORM Cloud website saw intermittent disruption for 73 minutes, while the SCORM Cloud API saw a 5-minute window of disruption.
We first isolated and terminated the hundreds of long-running session queries that were occupying the majority of the database threads. We then applied an index to our sessions database table, optimizing the existing query run at login.
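The effect of such an index can be demonstrated with SQLite's query planner. The schema and lookup column below are hypothetical stand-ins (the report does not show the production query), but the before/after query plans illustrate why the index resolves the contention: the lookup goes from a full table scan over every session record to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id TEXT, expires_at INTEGER)"
)

# Hypothetical stand-in for the suboptimal query run at login.
query = "SELECT id FROM sessions WHERE user_id = ?"

# Without an index the lookup is a full table scan: O(n) in session rows,
# so a store polluted with "blank" records slows every login.
before = conn.execute("EXPLAIN QUERY PLAN " + query, ("alice",)).fetchone()[-1]
print(before)  # e.g. "SCAN sessions"

# Adding an index turns the same query into an index search.
conn.execute("CREATE INDEX idx_sessions_user_id ON sessions (user_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query, ("alice",)).fetchone()[-1]
print(after)   # e.g. "SEARCH sessions USING ... INDEX idx_sessions_user_id"
```

The exact plan text varies by SQLite version, but the shift from `SCAN` to an index-backed `SEARCH` is the behavior an index on the filtered column buys.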
Below is a timeline of events occurring on January 06, 2022 (times in CST):