Elevated Errors

Incident Report for SCORM Cloud

Postmortem

Problem Description & Scope

Affected time range (times are CDT): 2024-07-29 9:53 A.M. - 2024-07-29 2:43 P.M.

Description: This incident seemed to involve a number of independent factors that converged to create a spiral of database resource contention, causing our application servers to thrash and request latencies to run higher than normal over the period. Ultimately the issue was resolved by removing an inefficient database trigger. However, that trigger had been in place for well over 2 years. The analysis below contains more information on the additional factors at play.

Root Cause

There does not seem to be a single cause of this issue, but rather several factors playing off one another. We believe the initiating event was an attempt to process a batch of jobs that convert SCORM registrations to xAPI data (the job itself is named ConvertScormToTinCan). A portion of this batch timed out while the queue was awaiting acknowledgement and then failed in the job itself, resulting in a duplicate job that would go on to encounter the same issue.

With this problem consuming more database resources than expected, more weight was put on an inefficient database trigger as traffic increased on Monday morning. Together, these two issues caused database resource starvation, which persisted even after the first problem was fixed. It was not until the database trigger was also removed that the resource contention resolved.

Analysis

Monday morning we began investigating application metrics indicating that request latencies were fluctuating. This preceded the rise in error rates and the general service degradation that happened later.

Problem 1

Investigation into the fluctuating request latencies revealed that the number of messages in our off-stack worker job queue had grown significantly over the weekend. This indicated that something was causing message duplication, since new messages would not normally be queued to this stack, which was not serving traffic. Additionally, the success rate for processing these jobs was very low (0-1%), suggesting that the source of duplication was potentially a job (or jobs) that would always fail. This growing queue triggered automatic scaling rules for our off-stack worker server pool, increasing the number of servers attempting to process the jobs. This ultimately reduced the number of database connections available to the live stack that was serving requests.

As our application server pool scaled up organically with the traffic of the day, we got the first indication that error rates were rising. We observed at this time that the writer instance of our database cluster was using the maximum number of connection threads it supported. Additionally, the CPU metric for this instance was consistently hovering around ~97%. This meant longer baseline query times for all connections.

Because of this, as we continued to scale through the morning, there was no guarantee that new servers could successfully establish a database connection pool. When they failed to do so, they could not report as healthy, so they never started serving any traffic. This put additional strain on the servers that had established connections, increasing request latency and in turn causing some of those servers to be marked as unhealthy. This was the beginning of a period of server thrashing.
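To make the connection starvation concrete, the arithmetic below is a purely illustrative sketch; none of the numbers reflect our actual configuration. Each server, worker or live, opens a fixed-size connection pool on boot, so every extra worker directly shrinks the budget left for the live stack.

```python
# Purely illustrative connection-budget arithmetic; none of these numbers
# reflect the real SCORM Cloud configuration.
MAX_WRITER_CONNECTIONS = 1000   # hypothetical connection thread limit on the writer
POOL_SIZE_PER_SERVER = 20       # hypothetical pool opened by each application server

def connections_left_for_live_stack(worker_servers: int, live_servers: int) -> int:
    """Connections remaining on the writer after every server opens its pool."""
    used = (worker_servers + live_servers) * POOL_SIZE_PER_SERVER
    return MAX_WRITER_CONNECTIONS - used

print(connections_left_for_live_stack(worker_servers=2, live_servers=30))   # 360: a quiet weekend
print(connections_left_for_live_stack(worker_servers=25, live_servers=40))  # -300: Monday morning;
# newly launched servers cannot establish their pools at all
```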

At this time we manually overrode the scaling rules of the off-stack worker server pool, pinning this group to a size of 1 and disabling scaling. This freed up database connections that could now be utilized by new servers coming online. The number of healthy servers did increase from this action, but the metrics for connections and CPU load on the database remained troublingly high.
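For reference, the manual override amounted to pinning the group and suspending its scaling activity. The sketch below is a rough approximation, assuming the pool is an AWS Auto Scaling group managed with boto3; the group name is hypothetical and this is not the exact procedure we ran.

```python
# Rough sketch of pinning a worker pool to one instance, assuming an AWS Auto
# Scaling group and boto3; the group name is hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")
GROUP = "off-stack-worker-pool"  # hypothetical group name

# Pin the group: with min = max = desired = 1, scaling policies can no longer
# grow the pool, and the extra workers holding database connections drain away.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=GROUP,
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)

# Belt and braces: stop alarm-driven scaling from firing at all.
autoscaling.suspend_processes(
    AutoScalingGroupName=GROUP,
    ScalingProcesses=["AlarmNotification"],
)
```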

Further investigation into the logs of the off-stack worker servers revealed that there were many jobs for ConvertScormToTinCan that had multiple long running database operations. It appeared these messages were taking so long to process that the queue would assume the server didn't respond, and re-queue the message as undelivered. At some point, the job would fail on the server, causing the server to re-queue the message as a retry. We deemed this to be the cause of the growing queue, and noted that we must retune the timeouts on the queue and worker servers to prevent this from happening in the future.

Hoping to stabilize service, we manually purged the off-stack worker queue of these messages. This halted the repeated and long running queries related to the ConvertScormToTinCan jobs. Around this time there was a significant reduction in the number of threads connected to the database. However, request latencies remained unstable, and database CPU performance was still suffering.

Problem 2

With the service degradation persisting, our investigations turned fully to the database logs to see what was keeping the CPU so busy. We found entries in the slow query log reporting slow inserts into HostedDispatchRegistration, indicating slow creation of dispatch registration records. Normally this would only be expected to affect dispatch functionality, and not the general service, unless there were so many of these tying up database processes that other traffic was impacted.

Interestingly, when inspecting the active process list of the database, we didn't see any of these queries running. Instead, we saw long waits on a select query against ScormRegistration, which is the main registration table in Rustici Engine (upon which SCORM Cloud is built). This would better explain the widespread issues being seen.

In an attempt to take pressure off the database, we wrote an automated process to kill any active query in the database process list that had been running long enough that the initiating request must already have timed out. We could see this was actively killing long-running processes, but metrics were still suffering.
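A minimal sketch of what such a watchdog can look like follows, assuming a MySQL-compatible database and PyMySQL; the host, credentials, threshold, and polling interval are assumptions of the sketch, not the exact process we deployed (on RDS or Aurora, the mysql.rds_kill_query procedure plays the role of KILL QUERY).

```python
# Sketch of a watchdog that kills queries which have outlived any plausible
# client timeout. Assumes a MySQL-compatible database and PyMySQL; host,
# credentials, threshold, and polling interval are illustrative only.
import time
import pymysql

MAX_QUERY_SECONDS = 60  # hypothetical: longer than any upstream request timeout

conn = pymysql.connect(host="db-writer.example.internal", user="ops",
                       password="...", autocommit=True)

while True:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, time, LEFT(info, 120)
              FROM information_schema.processlist
             WHERE command = 'Query' AND time > %s
            """,
            (MAX_QUERY_SECONDS,),
        )
        for thread_id, seconds, snippet in cur.fetchall():
            print(f"killing thread {thread_id} after {seconds}s: {snippet}")
            # KILL QUERY aborts the statement but leaves the connection open.
            cur.execute(f"KILL QUERY {int(thread_id)}")
    time.sleep(10)
```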

Eventually we tracked down the long-running query we had seen in the database process list. It was, in fact, a sub-query inside a database trigger that ran when a HostedDispatchRegistration was created. This correlated with what we had seen earlier in the slow query log. Interestingly, however, this sub-query targeted indexes in a way that should have made it nearly instant.

We surmised that the trigger was possibly causing a row lock on ScormRegistration, seemingly for the duration of the trigger running (rather than just the sub-query). This caused contention with any other process attempting to touch the locked row. This had a cascading effect which left a long line of processes waiting to interact with ScormRegistration.
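One way to check that kind of pile-up on a MySQL-compatible database is to read the lock-wait views directly and see which session is blocking which; the sketch below (again assuming PyMySQL, with a hypothetical host) illustrates the diagnosis rather than the exact steps we took.

```python
# Diagnostic sketch: list which statements are waiting on locks and which
# sessions hold them, via the MySQL sys schema. Host and credentials are
# hypothetical; this illustrates the check, not our exact procedure.
import pymysql

conn = pymysql.connect(host="db-writer.example.internal", user="ops",
                       password="...", autocommit=True)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT waiting_pid, waiting_query, blocking_pid, blocking_query
          FROM sys.innodb_lock_waits
        """
    )
    for waiting_pid, waiting_query, blocking_pid, blocking_query in cur.fetchall():
        # A long chain of rows pointing back at the same blocking session is
        # the cascading wait on ScormRegistration described above.
        print(f"session {waiting_pid} blocked by {blocking_pid}: "
              f"{waiting_query!r} waits on {blocking_query!r}")
```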

Ultimately we gauged that the feature enabled by this trigger (using SCORM Cloud as an LTI 1.3 Tool) had almost no recent utilization. We decided to drop the trigger with the intent of replacing the regressed functionality as a patch to the application code. This almost immediately eased the constraints on the database, and latencies began recovering.

Summary

We were very happy to have restored system stability, but a question lingered: this trigger had been in place for well over 2 years, so how had we never seen this level of issue with it before?

It would seem that the initial issues we saw with the work queue and the failing ConvertScormToTinCan jobs taxed the database just enough that latencies on the HostedDispatchRegistration operations increased to the point that those operations began contending with one another.

Fixes Implemented

As mentioned in the analysis, several measures taken during the service disruption improved individual metrics but ultimately had no measurable impact on the overarching issues. The fix that finally restored service was removing a database trigger that had been in place for over 2 years.

Future Improvements

Several actionable improvements were found throughout the investigation of these issues. First and foremost, we must replace the functionality that the trigger enabled with the same functionality in the application code, where we have more control over its performance and where we source the data.
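As a rough sketch of that direction, assuming hypothetical column names (the trigger body is not reproduced in this report) and the same PyMySQL style as the sketches above, the former trigger's bookkeeping would run inside the application's own transaction, where it can be profiled, indexed, and timed out like any other application query:

```python
# Sketch of moving the trigger's work into application code. All column names
# and the lti_context lookup are placeholders; the real trigger body is not
# reproduced in this report.
import pymysql

def create_hosted_dispatch_registration(conn, registration_id: str, dispatch_id: str) -> None:
    """Create a dispatch registration and perform the former trigger's
    bookkeeping inside one short, application-controlled transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO HostedDispatchRegistration (registration_id, dispatch_id) "
            "VALUES (%s, %s)",
            (registration_id, dispatch_id),
        )
        # Whatever the trigger derived from ScormRegistration now happens here,
        # as an explicit query the application owns and can monitor.
        cur.execute(
            "SELECT lti_context FROM ScormRegistration WHERE registration_id = %s",
            (registration_id,),
        )
        row = cur.fetchone()
        # ... apply the LTI 1.3 Tool bookkeeping using `row` ...
    conn.commit()  # row locks are held only for this short, well-understood window
```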

Also of note, we must retune the timeouts on the queue and the worker servers. The drift between the two allowed messages to be duplicated in a way that should not be possible. We will ensure that a worker acknowledges the queue before the queue times the message out.
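A minimal sketch of what that retuning can look like on the worker side, assuming an SQS-style queue with a visibility timeout and boto3 (queue URL, timeout values, and the process_job stub are illustrative): the worker extends the message's invisibility as a heartbeat while a long job runs, and only acknowledges once the job has genuinely finished, so slow successes are not re-delivered but real failures still are.

```python
# Sketch of a visibility-timeout heartbeat, assuming an SQS-style queue and
# boto3. Queue URL, timeout values, and process_job are illustrative only.
import threading
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.example.com/off-stack-worker-queue"  # hypothetical
VISIBILITY_SECONDS = 300  # must comfortably exceed the heartbeat interval


def process_job(body: str) -> None:
    """Placeholder for the real job handler (e.g. a ConvertScormToTinCan job)."""
    ...


def heartbeat(receipt_handle: str, stop: threading.Event) -> None:
    """Push the visibility timeout out while the job is still running."""
    while not stop.wait(timeout=VISIBILITY_SECONDS / 2):
        sqs.change_message_visibility(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=VISIBILITY_SECONDS,
        )


def handle(message: dict) -> None:
    stop = threading.Event()
    beat = threading.Thread(target=heartbeat, args=(message["ReceiptHandle"], stop))
    beat.start()
    try:
        process_job(message["Body"])
    finally:
        stop.set()
        beat.join()
    # Acknowledge (delete) only after the job really finished: a failure is
    # re-queued for retry, but a slow success is never duplicated.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```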

We also must investigate the ConvertScormToTinCan job failures and see if the issues were related to the underlying registration data. We've not seen a recurrence of this issue since the incident.

Timeline

All times are CDT

  • 2024-07-29 04:36 - Service-wide request latencies began fluctuating, but error rates remained at a normal, minimal level.
  • 2024-07-29 09:47 - We found that the number of messages in the off-stack worker job queue had grown significantly over the weekend.
  • 2024-07-29 09:53 - First indications that general service error rates are rising.
  • 2024-07-29 10:50 - Logs of the off-stack worker servers revealed that there were many jobs for ConvertScormToTinCan that had multiple long running database operations.
  • 2024-07-29 12:53 - Deployed an automated process to kill any active query in the database process list that had been running longer than a set timeout.
  • 2024-07-29 14:43 - The database trigger on HostedDispatchRegistration was deleted, and latencies began improving.
Posted Aug 02, 2024 - 21:19 UTC

Resolved

Request times and service health have returned to normal and are holding strong. We are continuing to monitor service health to ensure the fixes put in place continue to be effective.
Posted Jul 29, 2024 - 20:41 UTC

Update

Performance remains impacted for some customers as our servers recover from surges in traffic this morning. Service should continue to improve as impacted servers are replaced with healthy ones. We are continuing to monitor the situation.
Posted Jul 29, 2024 - 17:36 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 29, 2024 - 15:22 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 29, 2024 - 15:22 UTC

Investigating

We're experiencing an elevated level of errors and are currently looking into the issue.
Posted Jul 29, 2024 - 15:04 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.