Elevated Errors

Incident Report for SCORM Cloud

Postmortem

Problem Description & Scope

Affected time range (times are CST): 2024-01-19 9:28 A.M. - 2024-01-19 10:57 A.M.

Description: A customer’s test script ran unchecked and consumed the majority of the available database connections, starving other customers of resources. We remedied the issue by disabling the customer’s application that was running the script, which freed those connections for the rest of Cloud’s customers.

Root Cause

We allow customers to create new applications for their realms through the SCORM Cloud API, which helps customers organize their courses and registrations in a manner that best suits their needs. In this particular instance, a customer was running a test script that created a very large number of applications at a high rate, which led to resource starvation for our other customers.

Our application did the correct thing and scaled up to handle the increased load, but once we reached the maximum number of boxes we allow, the script continued to hog resources and starve out anyone else until SCORM Cloud was facing a widespread outage.

Analysis

Finding the root cause took some digging, since nothing immediately pointed to it. There was no massive spike in CPU usage on the individual boxes or on the database, and no large number of database lock waits, which is one of the indicators we look for in these situations. The main clue was a steady increase in the number of database connections in use, which made us suspect that something was not causing database locking but was instead holding many connections for longer than usual.

We then looked at the database itself. The processlist showed a large number of queries against our application table from a specific realm. That realm had a massive number of applications (about 9x more than the next largest realm) and was creating them at a rapid pace. We killed some of these queries and saw our systems improve, so we determined that this was the root cause of the outage.
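The processlist check described above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used: it assumes the processlist rows have been exported as dictionaries with an `info` key holding the query text, and that the realm identifier appears literally in the query as `realm_id = '...'`. Both the row shape and the `realm_id` column name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: assumes the realm id appears literally in the query text.
REALM_PATTERN = re.compile(r"realm_id\s*=\s*'([^']+)'", re.IGNORECASE)


def queries_per_realm(rows, table="application"):
    """Count in-flight queries that mention `table`, grouped by realm id."""
    counts = Counter()
    for row in rows:
        query = row.get("info") or ""  # idle connections have no query text
        if table not in query.lower():
            continue
        match = REALM_PATTERN.search(query)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `queries_per_realm(rows).most_common(1)` on such an export would surface the realm responsible for the most in-flight queries against the table.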

Fixes Implemented

First, we wanted to lower the strain on our resources, so we began to systematically kill off the problem queries to give our servers time to recover. We then decided to disable this realm’s “app management application”, which is the application customers need to use to create new applications in their realm. This temporarily stopped the customer from creating new applications.

Simultaneously, we contacted the customer to inform them of our actions, and asked them if they could identify the source of these API calls. They noted it was a test script that had run awry, and subsequently terminated it.

Future Improvements

The main improvement that we can make going forward is to limit the number of applications a customer can create. The average number of applications per realm is 4.25, and only a handful of our customers ever reach triple digits, so limiting the number per account won’t affect the vast majority of our customers.
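A minimal sketch of such a limit is shown below, assuming the check runs before an application is created. The cap of 100, the function name `check_application_quota`, and the exception type are all hypothetical; this report does not state what limit will actually be chosen or how it will be enforced.

```python
class ApplicationLimitExceeded(Exception):
    """Raised when a realm has already reached its application cap."""


# Hypothetical default cap; the real value is not stated in this report.
DEFAULT_MAX_APPLICATIONS = 100


def check_application_quota(realm_id, current_count, limit=DEFAULT_MAX_APPLICATIONS):
    """Raise if creating one more application would exceed the realm's cap."""
    if current_count >= limit:
        raise ApplicationLimitExceeded(
            f"realm {realm_id} has {current_count} applications (limit {limit})"
        )
```

Taking `limit` as a parameter leaves room for raising the cap for the few customers who legitimately need more applications than the default allows.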

The next improvement we discussed is some sort of monitoring to detect anomalous behavior in API usage. The `createApplication` API call isn’t one that is called at a high rate, especially compared to API methods like `createRegistration`, so we could put some monitoring around API methods that alerts us whenever a particular method is an outlier, which could tip us off before problems occur.
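One possible shape for this kind of check is to compare each method's current call count against its own recent history, so that a normally quiet method like `createApplication` is flagged even when its absolute volume is far below that of `createRegistration`. The sketch below is an assumption about how such monitoring could work, not a description of what we have built; the function names, the per-interval counts, and the threshold of 3 standard deviations are all illustrative.

```python
from statistics import mean, pstdev


def is_rate_outlier(history, current, threshold=3.0, min_spread=1.0):
    """Return True if `current` is far above the method's own baseline.

    `history` is a list of past per-interval call counts for one API method.
    `min_spread` avoids dividing by a near-zero deviation for quiet methods.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    spread = max(pstdev(history), min_spread)
    return (current - baseline) / spread > threshold


def outlier_methods(history_by_method, current_by_method, threshold=3.0):
    """Return the sorted names of API methods whose current rate is an outlier."""
    return sorted(
        method
        for method, current in current_by_method.items()
        if is_rate_outlier(history_by_method.get(method, []), current, threshold)
    )
```

Because each method is judged against its own baseline, a busy method with normal variation stays quiet while a sudden burst on a rarely used method raises an alert.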

Incident Timeline

All times are in CST.

2024-01-19 9:28 AM - Our monitoring service first noted that boxes were failing health checks.
2024-01-19 9:49 AM - We noticed the anomalous behavior with the application table of the database.
2024-01-19 10:14 AM - We identified the customer that was calling the `createApplication` API method at an excessive rate, and noted how often they were calling it.
2024-01-19 10:15 AM - We began to kill the database queries against the application table for this particular customer.
2024-01-19 10:20 AM - We disabled the app management application for this customer, preventing them from creating more applications.
2024-01-19 10:41 AM - The issue was marked as “Monitoring” as Cloud’s metrics were recovering.
2024-01-19 10:57 AM - The issue was marked as “Resolved.”

Posted Jan 24, 2024 - 18:22 UTC

Resolved

This incident has been resolved.

Posted Jan 19, 2024 - 16:57 UTC

Monitoring

Service has been stabilized and we are monitoring status of the system.

Posted Jan 19, 2024 - 16:41 UTC

Investigating

We're experiencing an elevated level of errors and are currently looking into the issue.

Posted Jan 19, 2024 - 15:53 UTC

This incident affected: SCORM Cloud Website.