Slow Response Time
Incident Report for SCORM Cloud
Postmortem

SCORM Cloud Service Disruption Post-Mortem

Problem Description & Scope

Affected time range (times are CST): January 6, 2022, 9:38 AM through 11:36 AM

Description: A surge of website login requests resulted in significant latency in processing those requests.  The increased latency caused some login attempts to time out, and users who experienced timeouts would repeatedly attempt to log in again.  These repeated attempts compounded the queue of pending requests, and the resulting volume caused database contention that ultimately overwhelmed the database, causing a service outage.

Root Cause

A bug in our website session handling was introduced into production at the end of 2019.  This bug polluted our session record store with “blank” records that were not tied to anything.  However, a job that cleans up expired sessions destroyed these invalid records at regular intervals, and that regular cleanup kept the bug largely invisible in the system.
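
For illustration only, here is a minimal sketch of the kind of scheduled cleanup described above, assuming a relational sessions table with an expires_at column.  The schema, names, and scheduling are hypothetical and are not the actual SCORM Cloud implementation.

```python
import sqlite3
import time

# Hypothetical schema; the real SCORM Cloud session store is not public.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id TEXT, expires_at INTEGER)"
)

def cleanup_expired_sessions(db: sqlite3.Connection) -> int:
    """Delete session rows whose expiry has passed; return how many were removed."""
    now = int(time.time())
    cur = db.execute("DELETE FROM sessions WHERE expires_at <= ?", (now,))
    db.commit()
    return cur.rowcount

# Run at a regular interval (cron, an in-app scheduler, etc.).  Because expired
# rows, including the invalid "blank" records, were swept away on each pass,
# the underlying bug stayed largely invisible under normal load.
print(cleanup_expired_sessions(conn))
```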

On the morning of January 6th, 2022 there was a surge in the creation of these “blank” records.  The growth in total session records began to strain a single suboptimal query that runs during the routine that builds a user session at login.  Because the SCORM Cloud website is largely a management/admin portal, the number of active sessions is historically quite low, so as long as this dataset stayed relatively small the suboptimal query performed adequately.
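
To make the failure mode concrete, here is an illustrative (not actual) version of such a lookup.  With no index behind the filtered column, the database must scan the entire sessions table, so the cost of every login grows with the total number of rows, including the invalid ones.

```python
import sqlite3

# Illustrative only: the real query and schema are not shown in this report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id TEXT, expires_at INTEGER)"
)

# A login-time lookup that filters on a column with no supporting index.
lookup = "SELECT id, expires_at FROM sessions WHERE user_id = ? AND expires_at > ?"

# The query plan confirms a full table scan: cheap while the table is small,
# increasingly expensive as "blank" records inflate it.
for row in conn.execute("EXPLAIN QUERY PLAN " + lookup, ("some-user", 0)):
    print(row)  # e.g. (2, 0, 0, 'SCAN sessions')
```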

At around 9:35 AM we experienced a surge of website login requests.  Each of these requests took longer than normal due to the increased volume of login attempts, the larger dataset of session records, and the suboptimal query against that dataset.  At the beginning of the surge, login requests were successfully serviced in ~3-4 seconds.  As more requests came in, this slowed to around ~10 seconds.  At that point we began to see multiple requests arriving from the same client IP addresses, indicating that users were refreshing their browsers and attempting to log in again.  As more users made these repeated requests, our response time on login attempts grew dramatically.

Eventually, nearly all of our available database connections were blocked waiting on these duplicate login requests to resolve.  This caused application server health checks to fail, and those servers were marked for removal and replacement.  The replacement servers struggled to establish an initial healthy state due to the existing database contention.  The reduction in available healthy servers resulted in intermittent service disruption across the system.  The SCORM Cloud Website saw intermittent disruption for 73 minutes, while the SCORM Cloud API saw a 5-minute window of disruption.
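
As a rough sketch of why the contention cascaded into instance replacement (the actual health-check logic is internal and not described in this report), consider a health check that must borrow a pooled database connection within a short timeout.  Once every connection is blocked behind slow login queries, the check itself times out and the load balancer treats the instance as unhealthy.

```python
import queue
import sqlite3

# Hypothetical sketch; SCORM Cloud's real health check and connection pool
# are not public.  The pool here is simply a queue of database connections.
def health_check(pool: "queue.Queue[sqlite3.Connection]", timeout_seconds: float = 2.0) -> bool:
    """Return True only if a database connection can be borrowed and used in time."""
    try:
        conn = pool.get(timeout=timeout_seconds)  # blocks while the pool is exhausted
    except queue.Empty:
        return False  # every connection is tied up by slow queries -> report unhealthy
    try:
        conn.execute("SELECT 1")  # trivial liveness query
        return True
    finally:
        pool.put(conn)
```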

Corrective Action

We first isolated and terminated the hundreds of long-running session queries that were occupying the majority of the database threads.  We then applied an index to our sessions database table to optimize the existing query run at login.
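
For context on what those two steps can look like operationally, here is a hedged sketch assuming a MySQL-compatible database accessed through the PyMySQL driver.  The connection details, table name, and indexed column are illustrative; they are not taken from this report.

```python
import pymysql

# Hypothetical connection details and schema, for illustration only.
conn = pymysql.connect(host="db.internal.example", user="admin",
                       password="...", database="cloud")
with conn.cursor() as cur:
    # Step 1: find and terminate the long-running session lookups that were
    # holding most of the database connections.
    cur.execute(
        "SELECT id FROM information_schema.processlist "
        "WHERE command = 'Query' AND time > 60 AND info LIKE '%FROM sessions%'"
    )
    for (thread_id,) in cur.fetchall():
        cur.execute(f"KILL {thread_id}")

    # Step 2: add an index so the login-time lookup no longer scans the table.
    cur.execute("CREATE INDEX idx_sessions_user_id ON sessions (user_id)")
conn.commit()
```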

Preventative Action

  • In the short term, the added index has produced a dramatic decrease in login times (from ~3-4 seconds down to ~0.5 seconds), even with the larger dataset.
  • We will thoroughly investigate the scope and cause of the invalid session records and release a hotfix as soon as we can.  In the meantime, we are monitoring the size of the session store and checking regularly for invalid records.
  • We will explore implementing a captcha mechanism for successive login attempts from the same client IP address or user account, to mitigate the queuing of duplicate requests from events like browser refreshes.  A minimal sketch of one possible throttle follows this list.
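
Below is a minimal sketch of the kind of per-IP throttle that could sit in front of a captcha challenge or an early rejection.  The window size and attempt threshold are placeholders, not a committed design; a similar counter keyed on the user account would cover the account side of the same mitigation.

```python
import time
from collections import defaultdict, deque

# Placeholder thresholds, for illustration only.
WINDOW_SECONDS = 60
MAX_ATTEMPTS_PER_WINDOW = 5

_recent_attempts = defaultdict(deque)

def requires_captcha(client_ip: str) -> bool:
    """Record a login attempt and report whether this IP should now be challenged."""
    now = time.time()
    attempts = _recent_attempts[client_ip]
    while attempts and now - attempts[0] > WINDOW_SECONDS:
        attempts.popleft()  # drop attempts that have aged out of the sliding window
    attempts.append(now)
    return len(attempts) > MAX_ATTEMPTS_PER_WINDOW
```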

Timeline

Below is a timeline of events occurring on January 06, 2022 (times in CST):

  • 9:38 AM — Initial alarms were triggered for high database CPU usage.  Investigation into the cause began immediately.
  • 9:54 AM — Health checks began failing for the SCORM Cloud Website service, and we began seeing timeouts for some API requests.
  • 9:59 AM — Initial posting on our status page to alert customers to the issue.
  • 10:21 AM — Having identified the source of database contention, we began manually terminating the queries bottlenecking the system.
  • 11:10 AM — Applied the database index to optimize the existing login query.
  • 11:23 AM — System began to stabilize; we updated the status page and continued to monitor the results of the fixes.
  • 11:36 AM — System was fully stabilized and back to normal operation.  Status page was updated accordingly.

Posted Jan 11, 2022 - 21:18 UTC

Resolved
This incident has been resolved.
Posted Jan 06, 2022 - 17:36 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 06, 2022 - 17:23 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 06, 2022 - 16:39 UTC
Investigating
We're experiencing performance issues that are resulting in slow response times and are actively looking into the issue.
Posted Jan 06, 2022 - 15:59 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.