Unexpected service interruption
Incident Report for SCORM Cloud
Postmortem

February 28 Outage Postmortem

Root Cause

On February 28, 2017, SCORM Cloud experienced a prolonged, complete outage (from 11:41 CST until 16:59 CST). The root cause was a total outage of the S3 storage service operated by our hosting provider, Amazon Web Services (AWS). The S3 outage lasted from approximately 11:40 CST until 15:54 CST and affected only the N. Virginia AWS region (US-EAST-1), which is where SCORM Cloud is hosted.

SCORM Cloud uses Amazon S3 both for course content storage and as a general-purpose key-value store (for various website functions). When S3 experiences an outage, SCORM Cloud is effectively inoperable. While we have plans to mitigate the effect of S3 outages on the service -- see below -- key functionality (like launching courses) would still have been affected by an outage of this scope.
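For illustration, the key-value usage amounts to treating S3 objects as simple records through the standard AWS SDK for Java calls. This is a minimal sketch, not our actual code; the bucket and key names are hypothetical.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    // Minimal sketch of using S3 as a key-value store (hypothetical names).
    public class S3KeyValueSketch {
        private static final String BUCKET = "example-keyvalue-bucket";
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Write a small value by storing it as the body of an S3 object.
        public void put(String key, String value) {
            s3.putObject(BUCKET, key, value);
        }

        // Read the value back; during an S3 outage this call fails or stalls,
        // which is why S3 downtime makes the site effectively inoperable.
        public String get(String key) {
            return s3.getObjectAsString(BUCKET, key);
        }
    }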

Cascading Failures

SCORM Cloud uses the Amazon Java SDK to communicate with S3. We had not configured timeout values on the S3 client, so Amazon's defaults applied: 50 seconds to open a connection and 50 seconds to read from a socket. When S3 began experiencing capacity problems, these long default timeouts caused average request processing time on SCORM Cloud to skyrocket. Each in-flight web request held an Apache worker thread open while it waited on S3, and the growing backlog quickly exhausted the Apache worker thread pools on our web servers.
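For illustration only -- not our exact production change -- explicit timeouts can be set on the SDK's client configuration so that a degraded S3 fails fast instead of pinning a worker thread for the full 50-second defaults. The specific values below are placeholders:

    import com.amazonaws.ClientConfiguration;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    // Sketch of building an S3 client with explicit (much lower) timeouts.
    public class S3ClientFactory {
        public static AmazonS3 buildClient() {
            ClientConfiguration config = new ClientConfiguration()
                    .withConnectionTimeout(2_000)  // ms to open a connection (placeholder)
                    .withSocketTimeout(5_000)      // ms to wait on a socket read (placeholder)
                    .withMaxErrorRetry(2);         // bound retries so requests can't pile up

            return AmazonS3ClientBuilder.standard()
                    .withClientConfiguration(config)
                    .build();
        }
    }

With settings along these lines, a struggling S3 surfaces as quick, retriable errors rather than minute-long stalls that hold Apache worker threads open.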

SCORM Cloud uses AWS's Elastic Load Balancer (ELB) with Autoscaling to manage multiple web server instances. The load balancer periodically checks the health of each web server and eventually terminates any server that remains unhealthy, for whatever reason. For transient issues affecting particular instances, this architecture normally lets the service "self-heal" and prevents unhealthy servers from ever serving web requests.

In this case, because Apache was load-shedding by returning 503 Service Unavailable to these health check requests, the load balancer began marking the SCORM Cloud web servers as unhealthy. Within the first 30 minutes of the outage, all of the web servers had been terminated as unhealthy. At that point, the load balancer had no instances available to process requests, so it began returning 503 Service Unavailable immediately.

During this time, Amazon's Autoscaling service continually tried to launch more instances to replace the unhealthy instances. However, this process relies on S3 being available, so all the attempted launches failed.

Attempted Resolution

Around 13:00 CST, an engineer began preparations to stand up SCORM Cloud in the Oregon AWS region (US-WEST-2). However, standing up SCORM Cloud in a separate region is currently a complex, manual process, and several issues encountered along the way prevented us from completing the failover before AWS resolved the underlying problems.

The failover to US-WEST-2 is currently intended as a last-resort recovery option -- to be used in a case where Amazon does not have an estimated resolution time for US-EAST-1 (possibly because of a disaster in that region).

Resolution Timeline

At 15:54 CST, Amazon reported that S3 had fully recovered. However, as noted in Amazon's own postmortem of the incident, several systems dependent on S3 still had a backlog of work to process. In our case, SCORM Cloud had no web servers remaining to serve requests, so we were forced to wait for the Autoscaling service to recover.

At 16:42 CST, Amazon reported that the root cause of the Autoscaling service's interruption was being remediated. At 16:51 CST and 16:53 CST, two web servers launched successfully; they came online and began serving web requests correctly at 16:59 CST. Shortly thereafter, several more web server instances launched successfully, and we cautiously updated our status page to reflect the recovery.

Mitigation Going Forward

There are several improvements we plan to implement:

  • configure much lower timeout values for S3, so that S3 failures don't cause service overload;
  • improve our health check system so that service overload does not cause web servers to be marked unhealthy and terminated (see the sketch after this list);
  • offload S3 reverse proxying (for course content) to a separate service so that the Cloud website itself is not affected during S3 downtime;
  • provide a better 503 Service Unavailable error page that points our customers to our status page; and
  • greatly shorten the time it takes to fail over to a separate AWS region, so as to mitigate region-wide AWS problems.
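On the health check item above, one direction under consideration is a "shallow" check that reports only on the health of the instance itself, so that a dependency outage or load shedding does not get an otherwise-recoverable server terminated. The sketch below is illustrative (the endpoint and the readiness test are hypothetical, not our final design):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical health endpoint (e.g. mapped to /health) polled by the load balancer.
    public class ShallowHealthCheckServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Only report conditions that mean *this instance* is broken
            // (e.g. the application failed to start). Dependency problems
            // such as S3 downtime are deliberately excluded, so the load
            // balancer does not terminate servers that would recover on
            // their own once the dependency returns.
            boolean instanceHealthy = applicationStarted();

            resp.setStatus(instanceHealthy
                    ? HttpServletResponse.SC_OK
                    : HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            resp.getWriter().write(instanceHealthy ? "OK" : "UNHEALTHY");
        }

        // Placeholder for a local-only readiness check; illustrative only.
        private boolean applicationStarted() {
            return true;
        }
    }

For this to help during an overload like the one described above, the health check path would also need capacity reserved ahead of the saturated Apache worker pool; the sketch only changes what the check reports.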

Incident Timeline

Below is a timeline of all events in this incident. All times are in CST.

  • 11:40 - Initial warnings are sent to our operations team
  • 11:41 - Major monitoring alarms trigger, engineers are paged
  • 11:46 - Autoscaling service tries to add more web servers to replace "unhealthy" servers, but S3 outage causes launches to fail
  • 11:48 - Engineers respond, update status page
  • 11:55 - Engineers determine AWS to be cause of outage, wait for confirmation
  • 12:06 - AWS confirmation received, update status page
  • 13:00 - Engineer begins work on failing over to US-WEST-2
  • 15:13 - AWS reports S3 GET operations are succeeding
  • 15:54 - AWS reports S3 PUT operations are succeeding
  • 16:42 - AWS reports Autoscaling is beginning recovery
  • 16:51 - First successful web server launch occurs
  • 16:53 - Second successful web server launch occurs
  • 16:54 - Four more web servers begin launching
  • 16:59 - Web servers begin to serve traffic
  • 17:02 - Number of web servers passes minimum safe threshold, status page cautiously updated
  • 17:02 through 18:23 - Engineers wait for any recurring issues, then declare all-clear
Posted Mar 02, 2017 - 20:09 UTC

Resolved
All issues affecting SCORM Cloud appear to have been resolved, and the site is fully operational again. If you experience any issues or have any questions, don't hesitate to contact us at support@scorm.com. We apologize for the inconvenience.
Posted Mar 01, 2017 - 00:23 UTC
Update
SCORM Cloud is recovering from the hosting provider issues. We're still monitoring the situation to ensure nothing goes wrong, but we are cautiously optimistic.
Posted Feb 28, 2017 - 23:02 UTC
Update
Our hosting provider has resolved one of the core issues affecting SCORM Cloud. We continue to monitor their status page for updates on the remaining dependent service.
Posted Feb 28, 2017 - 22:38 UTC
Update
We're still waiting for more information from our hosting provider.
Posted Feb 28, 2017 - 20:54 UTC
Monitoring
We're still monitoring the underlying hosting provider issues causing this outage.
Posted Feb 28, 2017 - 19:27 UTC
Identified
Downtime appears to be caused by an issue with our hosting provider. We're monitoring the situation and will keep you updated.
Posted Feb 28, 2017 - 18:06 UTC
Investigating
Engineers are investigating an unexpected service interruption for SCORM Cloud.
Posted Feb 28, 2017 - 17:50 UTC