Slow Response Times
Incident Report for SCORM Cloud
Postmortem

September 15 Incident Postmortem

Incident Description

On September 15, from 9:15 AM CDT until 9:44 AM CDT, SCORM Cloud experienced sporadic partial outages caused by elevated response latency. During this time, some requests failed due to timeouts on both the client and the server side.

Root Cause

On September 14 at 9:50 AM CDT, we deployed an update to the underlying operating system used on SCORM Cloud application servers. The update was deployed in the routine manner and passed our usual battery of quality assurance tests.

The updated operating system changed system behavior in two incidental ways. First, the base operating system used more disk space than the previous version. Second, the updated operating system produced a much greater volume of Linux kernel audit logs when serving application requests. In other words, we had less free disk space at baseline than before, and on top of that our disk space requirements increased dramatically.

This increase in disk usage, combined with a log rotation configuration that was insufficient for the new log volume, caused a slow disk space leak on application servers: the root partition went from 65% full to 90% full over the course of approximately 8 hours (depending on incoming request traffic).
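
For a sense of how little headroom that leak left, the figures above work out to roughly 3 percentage points of disk usage per hour. The following back-of-the-envelope calculation is illustrative only and uses just the numbers quoted in this report:

    # Back-of-the-envelope estimate of the disk space leak described above.
    # All figures come from this report; the variable names are illustrative.
    start_pct = 65.0      # root partition usage shortly after the OS update
    threshold_pct = 90.0  # usage at which the health check replaces a server
    hours_elapsed = 8.0   # approximate time to climb from 65% to 90%

    leak_rate = (threshold_pct - start_pct) / hours_elapsed   # ~3.1 points/hour
    hours_of_headroom = (100.0 - threshold_pct) / leak_rate   # ~3.2 hours to a full disk

    print(f"Leak rate: {leak_rate:.1f} points/hour")
    print(f"Headroom after crossing the threshold: {hours_of_headroom:.1f} hours")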

To mitigate the impact of full disks, our application's health check system automatically replaces any server whose disk usage exceeds 90%, since a server whose disk fills completely can no longer service requests. This health check system transparently handled the underlying disk space problem throughout the day on September 14 and through the early hours of September 15. However, the increase in traffic associated with US business mornings happened to coincide with many application servers crossing the 90% disk space threshold simultaneously. When that happened, our health check system diligently removed them from the traffic-serving rotation and began the process of spinning up new servers to replace them.
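
As a minimal sketch of the kind of check involved (this is not our actual health check implementation; the function name and output are made up for illustration), a disk usage probe built around the 90% threshold described above could look like this:

    import shutil

    # Minimal sketch of a disk-usage health check of the kind described above.
    # The 90% threshold matches this report; everything else is illustrative.
    DISK_USAGE_THRESHOLD = 0.90

    def root_disk_healthy(path: str = "/") -> bool:
        """Return False once root partition usage exceeds the replacement threshold."""
        usage = shutil.disk_usage(path)
        return (usage.used / usage.total) <= DISK_USAGE_THRESHOLD

    if __name__ == "__main__":
        # In practice, a load balancer or orchestrator polls a check like this and
        # pulls the server out of rotation (and replaces it) once it reports unhealthy.
        print("healthy" if root_disk_healthy() else "unhealthy: replace this server")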

As a result, right as morning traffic was ramping up, with more than 15 course launches per second, we lost a substantial portion of our request processing capacity. During this time, the remaining servers continued to process requests but at a degraded rate: high CPU usage caused request latency to skyrocket.

Analysis

The increase in audit log volume appears to be expected behavior of the updated operating system. We will continue to capture the audit log for intrusion detection purposes.

Our disk space health check has served us well in the past, and we are currently planning to continue its use. We may investigate a way to replace servers that fail this specific component of the health check more gracefully.

The disk space issue made it through our usual quality assurance tests because the leak was slow enough that it took many hours of production-level traffic to trigger any failures.

Our application is designed to survive the loss of a full data center (that is, half its servers) and still continue to function. In this case, the surge of morning traffic meant that more than half the running application servers were required to handle the volume of incoming requests. We will investigate raising the maximum number of application servers allowed in our auto-scaling configuration so that losing half of them remains survivable even during the morning rush.
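
As an illustrative sketch only (assuming, purely for example, an Auto Scaling group managed through the AWS API; the group name and sizes below are placeholders, not our real configuration), such a change might look like:

    import boto3

    # Illustrative only: the group name and sizes are placeholders, and the use of
    # an AWS Auto Scaling group here is an assumption made for the example.
    autoscaling = boto3.client("autoscaling")

    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="app-server-group",  # hypothetical group name
        MaxSize=40,          # raised ceiling so half the fleet can still carry peak load
        DesiredCapacity=20,  # typical peak-traffic fleet size (made up for illustration)
    )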

Fixes Implemented

We have since deployed an update that both increases the base disk size for application servers and configures more aggressive log rotation (so that the audit log takes up less disk space overall). We have also configured additional alarms to alert on this situation.
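
As a sketch of what one such alarm could look like (assuming, for illustration only, CloudWatch with the CloudWatch agent's disk_used_percent metric; the alarm name, threshold, and notification target below are placeholders):

    import boto3

    # Illustrative sketch of a disk-usage alarm. The monitoring stack, metric,
    # names, and threshold here are assumptions made for the example.
    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="app-server-root-disk-above-80-percent",  # hypothetical name
        Namespace="CWAgent",
        MetricName="disk_used_percent",
        Dimensions=[{"Name": "path", "Value": "/"}],
        Statistic="Average",
        Period=300,               # evaluate five-minute windows
        EvaluationPeriods=2,      # require two consecutive breaches before alerting
        Threshold=80.0,           # warn well before the 90% replacement threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
    )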

Future Improvements

We plan on implementing the following improvements in the future to mitigate or prevent issues of this kind from recurring:

  • Better alarms when multiple application servers are simultaneously marked unhealthy
  • More graceful replacement of servers when they cross the disk space threshold
  • Capacity headroom so the application can survive 50% server loss during peak traffic

Incident Timeline

All times are in CDT.

  • September 14, 9:47 AM: Operating system update is deployed
  • September 15, 8:33 AM: Low disk space issue is first identified by developers and investigation begins
  • 9:15 AM: Initial high API latency alarms fire
  • 9:19 AM: First downtime alarm fires
  • 9:30 AM: An incident notification is posted on the status page during incident triage
  • 9:35 AM: Disk space replacements are identified as the cause of the incident
  • 9:37 AM: An engineer manually scales out the application server stack to increase capacity
  • 9:44 AM: Additional servers come online, temporarily resolving the immediate incident
  • 10:03 AM: Temporary additional baseline disk space deployed to production to prevent incident recurrence
  • 12:19 PM: Incident marked as resolved on status page
  • September 17, 8:25 AM: Additional baseline disk space and log rotation rules deployed to production to permanently resolve the incident
Posted Sep 24, 2021 - 18:21 UTC

Resolved
This incident has been resolved.
Posted Sep 15, 2021 - 17:19 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 15, 2021 - 15:16 UTC
Investigating
We're experiencing performance issues that are resulting in slow response times and are actively looking into the issue.
Posted Sep 15, 2021 - 14:30 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.