On February 28, 2017, SCORM Cloud experienced a prolonged, complete outage, from 11:41 CST until 16:59 CST. The root cause was a total outage of the S3 storage service operated by our hosting provider, Amazon Web Services (AWS). The S3 outage lasted from approximately 11:40 CST until 15:54 CST and affected only the N. Virginia AWS region (US-EAST-1), which is where SCORM Cloud is hosted.
SCORM Cloud uses Amazon S3 both for course content storage and as a general-purpose key-value store for various website functions. When S3 experiences an outage, SCORM Cloud is effectively inoperable. While we have plans to mitigate the effect of S3 outages on the service -- see below -- key functionality (like launching courses) would still have been affected in this case.
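To illustrate the two usage patterns, here is a minimal sketch using the AWS SDK for Java; the bucket and key names are hypothetical, not our actual configuration:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3UsageSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .build();

        // Course content storage: upload an imported course package.
        // (Bucket and key names here are purely illustrative.)
        s3.putObject("example-course-content", "courses/12345/package.zip",
                new File("package.zip"));

        // General-purpose key-value store: read a small value by key.
        String value = s3.getObjectAsString("example-kv-store", "settings/some-key");
        System.out.println(value);
    }
}
```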
SCORM Cloud uses the Amazon Java SDK for communication with S3. We had not configured timeout values in the S3 client, which means the default timeouts provided by Amazon were used: 50 seconds to open a connection and 50 seconds to read from a socket. When S3 began experiencing capacity problems, these default timeouts caused average request processing time on SCORM Cloud to skyrocket. As requests backed up, web requests held Apache web server worker threads open while waiting for S3, which quickly exhausted the Apache worker thread pools on our web servers.
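Explicit timeouts would have allowed requests to fail fast instead of piling up. Here is a minimal sketch with the AWS SDK for Java; the specific values are illustrative, not the settings we ultimately chose:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class TunedS3Client {
    public static AmazonS3 build() {
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(2_000) // ms to open a connection (vs. the 50s default)
                .withSocketTimeout(5_000)     // ms to wait on a socket read (vs. the 50s default)
                .withMaxErrorRetry(2);        // bound the total time spent retrying

        return AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```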
SCORM Cloud uses AWS's Elastic Load Balancer (ELB) with Autoscaling to manage multiple web server instances. The load balancer is configured to periodically check the health of the web servers and to eventually terminate any server that is unhealthy, for whatever reason. For transient issues affecting particular instances, this architecture usually permits "self-healing," and it prevents unhealthy servers from ever serving web requests.
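For reference, a classic ELB health check can be configured through the same Java SDK. This is a hedged sketch: the health endpoint, thresholds, and load balancer name below are hypothetical, not our production values:

```java
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancing;
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClientBuilder;
import com.amazonaws.services.elasticloadbalancing.model.ConfigureHealthCheckRequest;
import com.amazonaws.services.elasticloadbalancing.model.HealthCheck;

public class HealthCheckSketch {
    public static void main(String[] args) {
        AmazonElasticLoadBalancing elb =
                AmazonElasticLoadBalancingClientBuilder.defaultClient();

        HealthCheck healthCheck = new HealthCheck()
                .withTarget("HTTP:80/health") // hypothetical health endpoint
                .withInterval(30)             // seconds between checks
                .withTimeout(5)               // seconds before a check counts as failed
                .withUnhealthyThreshold(2)    // consecutive failures before marking unhealthy
                .withHealthyThreshold(3);     // consecutive successes before marking healthy

        elb.configureHealthCheck(new ConfigureHealthCheckRequest()
                .withLoadBalancerName("example-web-elb") // hypothetical name
                .withHealthCheck(healthCheck));
    }
}
```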
In this case, since Apache was load-shedding by returning 503 Service Unavailable to these health check requests, the load balancer began marking the SCORM Cloud web servers as unhealthy. Within the first 30 minutes of the outage, all of the web servers had been terminated as unhealthy. At that point, the load balancer had no available instances to route requests to, so it began immediately returning 503 Service Unavailable itself.
During this time, Amazon's Autoscaling service continually tried to launch more instances to replace the unhealthy instances. However, this process relies on S3 being available, so all the attempted launches failed.
Around 13:00 CST, an engineer began preparations to stand up SCORM Cloud in the Oregon AWS region (US-WEST-2). However, standing up SCORM Cloud in a separate region is currently a complex, manual process, and several issues encountered along the way prevented us from completing the failover to US-WEST-2 before AWS resolved the underlying issues.
The failover to US-WEST-2 is currently intended to be a catastrophic recovery option -- to be used in a case where Amazon does not have an estimated resolution time for US-EAST-1 (possibly because of a disaster in that region).
At 15:54 CST, Amazon reported that S3 had fully recovered. However, as mentioned in their postmortem above, several systems dependent on S3 had a backlog of work to process. In this case, SCORM Cloud did not have any web servers remaining to serve requests, so we were forced to wait for the Autoscaling service to recover.
At 16:42 CST, Amazon reported that the root cause of the Autoscaling service's interruption was being remediated. At 16:51 CST and 16:53 CST, two web servers launched successfully. These servers came online and began serving web requests correctly at 16:59 CST. Shortly thereafter, several more web instances launched successfully, so we cautiously updated our status page to reflect the recovery.
There are several improvements we plan to implement:
Below is a timeline of all events in this incident. All times are in CST.