On September 15, from 9:15 AM CDT until 9:44 AM CDT, SCORM Cloud experienced intermittent partial outages caused by elevated response latency. During this window, some requests timed out on both the client side and the server side.
On September 14 at 9:50 AM CDT, we deployed an update to the underlying operating system used on SCORM Cloud application servers. The update went through our routine deployment process and passed our usual battery of quality assurance tests.
The updated operating system changed system behavior in two ways. First, the base operating system consumed more disk space than the previous version. Second, the updated operating system produced a much greater volume of Linux kernel audit logs when serving application requests. In other words, we started with less free disk space than before, and our disk usage also grew much faster under load.
This increase in disk usage, combined with a log rotation configuration that could not keep pace, caused a slow disk space leak on application servers: the root partition went from 65% full to 90% full over roughly 8 hours (depending on incoming request traffic).
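As a back-of-the-envelope check on the figures above (a sketch based only on the numbers in this report, not on our monitoring data), the leak rate works out to roughly 3 percentage points of disk per hour:

```shell
# Leak rate implied by the numbers in this report:
# the root partition went from 65% to 90% full in about 8 hours.
start=65   # percent full at the start of the window
end=90     # percent full as servers approached the health check threshold
hours=8    # approximate duration of the climb

awk -v s="$start" -v e="$end" -v h="$hours" \
    'BEGIN { printf "%.1f percentage points per hour\n", (e - s) / h }'
# prints "3.1 percentage points per hour"
```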
To mitigate the impact of full disks on servers, our application's health check system automatically replaces any server that has a disk with greater than 90% usage, since once the disk is full that server can no longer service requests. This health check system was able to transparently handle the underlying disk space problem throughout the day on September 14 and through the early hours of September 15. However, the increase in traffic associated with US business mornings happened to coincide with many application servers crossing the 90% disk space threshold simultaneously. When that happened, our health check system diligently removed them from the traffic-serving rotation and began the process of spinning up new servers to replace them.
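As a hypothetical sketch of the check described above (the actual health check is internal to our infrastructure; the `disk_healthy` function and its wiring are illustrative), the 90% rule amounts to:

```shell
#!/bin/sh
# Hypothetical sketch of a disk-usage health check.
# A server is healthy while root-partition usage is at or below the
# threshold; above it, the server is pulled from rotation and replaced.

THRESHOLD=90  # percent, per the health check described above

disk_healthy() {
    # $1: current usage percentage of the root partition
    usage=$1
    [ "$usage" -le "$THRESHOLD" ]
}

# In production the usage figure would come from something like:
#   df --output=pcent / | tail -1 | tr -dc '0-9'
if disk_healthy 65; then echo "65%: healthy"; fi
if ! disk_healthy 91; then echo "91%: unhealthy, replace server"; fi
```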
As a result, right as morning traffic was ramping up, with more than 15 course launches per second, we lost a substantial portion of our request processing capacity. During this time, the remaining servers continued to process requests but at a degraded rate: high CPU usage caused request latency to skyrocket.
The increase in audit log volume appears to be expected. We will continue to capture the audit log for intrusion detection purposes.
Our disk space health check has served us well in the past, and we plan to continue using it. We may investigate a more graceful way to replace servers that fail this specific component of the health check.
The disk space issue made it through our usual quality assurance tests because it was a slow enough leak that it required many hours of production-level traffic to trigger any failures.
Our application is designed to survive the loss of a full data center — that is, half its servers — and still continue to function. In this case, the surge of morning traffic meant that more than half the running application servers were required to handle the volume of incoming requests. We will investigate raising the maximum number of application servers allowed in our auto-scaling configuration so that losing half of them still leaves enough capacity, even during the morning rush.
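The sizing invariant above reduces to simple arithmetic: the auto-scaling maximum must be at least twice the number of servers needed at peak. A sketch with illustrative numbers (not our actual fleet sizes):

```shell
#!/bin/sh
# Illustrative capacity arithmetic (example numbers, not our actual
# fleet sizes). The application must survive losing half its servers,
# so the auto-scaling maximum must be at least twice the number of
# servers required to handle peak traffic on their own.

survives_half_loss() {
    # $1: servers needed at peak; $2: configured auto-scaling maximum
    peak=$1
    max=$2
    [ $((max / 2)) -ge "$peak" ]
}

# During the incident, peak demand needed more than half the fleet:
if ! survives_half_loss 12 20; then
    echo "max=20 is too low for a peak of 12 servers"
fi
# Doubling the headroom restores the invariant:
if survives_half_loss 12 24; then
    echo "max=24 tolerates losing half the fleet at peak"
fi
```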
We have since deployed an update that both increases the base disk size for application servers and configures more aggressive log rotation, so that the audit log consumes less disk space overall. We have also added alarms that alert us when disk usage climbs abnormally.
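For illustration, audit log growth can be bounded natively by auditd rather than by an external log rotator; a configuration fragment along these lines (the values are hypothetical, not our actual settings) caps total on-disk audit logs at a fixed size:

```conf
# /etc/audit/auditd.conf (illustrative values only)
# Rotate each log file once it reaches 50 MB...
max_log_file = 50
max_log_file_action = rotate
# ...and keep at most 5 files, bounding total usage near 250 MB.
num_logs = 5
```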
We plan to implement the following improvements to mitigate or prevent issues of this kind from recurring:
All times are in CDT.