First, we deeply apologize for the unexpected service interruption on May 29, 2020. We know that SCORM Cloud is a foundational pillar of many customer applications, and we understand the importance of its reliability.
The outage began at 04:21 PM UTC and lasted until 05:01 PM UTC, a total of 40 minutes. Below, we detail the root cause, the corrective actions we took, and the preventive measures we will put in place in response to this outage.
SCORM Cloud runs using a handful of internal services that support the main web application. One of these internal services manages runtime configuration and secrets provisioning. For example, this service is responsible for the automatic daily rotation of backend database credentials, providing credentials for third-party services (like our credit card processor), and so forth.
Because of the sensitivity of this service, all communications between the primary application servers and the security service are secured with mutual TLS: the client and the server each present a certificate, and each side verifies the other's. The root cause of the outage was an expired client certificate. Once the certificate expired, our security service stopped accepting connections from the application servers. The application servers could no longer acquire the secrets necessary to run the web application, resulting in a total outage.
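To illustrate the failure mode, here is a minimal sketch of how the remaining lifetime of a certificate could be checked ahead of time. It is not a description of SCORM Cloud's actual tooling: the function names, the 30-day warning window, and the example date are assumptions made for illustration. The only library call it relies on is Python's standard `ssl.cert_time_to_seconds`, which parses the `notAfter` timestamp format used in certificates.

```python
import ssl
from datetime import datetime, timedelta, timezone


def time_until_expiry(not_after: str, now: datetime) -> timedelta:
    """Return how long a certificate remains valid.

    `not_after` uses the certificate timestamp format,
    e.g. "Jun  1 00:00:00 2030 GMT". A negative result means
    the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return expires - now


def needs_renewal(
    not_after: str,
    now: datetime,
    warn_window: timedelta = timedelta(days=30),  # hypothetical threshold
) -> bool:
    """True when the certificate expires within the warning window."""
    return time_until_expiry(not_after, now) <= warn_window
```

A scheduled job could call `needs_renewal` with the current UTC time and raise an alert well before the certificate lapses.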
Below is a timeline of incident events. All times are in Coordinated Universal Time (UTC).
Client certificates for the security service are currently generated by hand, stored in our security service, and installed automatically onto application server machine images during our automated build process. We will change this process so that client certificates are provisioned dynamically at build time. Because we deploy changes to SCORM Cloud frequently, each new build will carry a freshly issued certificate, so certificates will be replaced long before they can expire. This change should resolve this class of failure going forward.
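The reasoning behind build-time provisioning can be sketched as a simple check: a certificate issued when an image is built stays valid for a fixed lifetime, so an image is safe as long as it is replaced before that lifetime runs out. The names, the lifetime, and the one-day safety margin below are hypothetical and chosen only to make the example concrete.

```python
from datetime import datetime, timedelta, timezone


def validity_window(build_time: datetime, lifetime: timedelta):
    """Return (not_before, not_after) for a certificate issued during a build."""
    return build_time, build_time + lifetime


def image_safe_to_run(
    build_time: datetime,
    lifetime: timedelta,
    now: datetime,
    margin: timedelta = timedelta(days=1),  # hypothetical safety margin
) -> bool:
    """True while the image's certificate is valid with some margin to spare."""
    _, not_after = validity_window(build_time, lifetime)
    return now + margin < not_after
```

Under this model, frequent deployments keep every running image well inside its certificate's validity window, and an image that is never rebuilt eventually fails the check.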
During the outage, we also ran into some confusing error messages that slowed our debugging efforts. As a result, we will clarify the relevant error messages and alert on them, so that this sort of failure with the security service is easier to identify in the future.
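One way to make such failures easier to spot is to map low-level error text onto a clear, actionable alert. The sketch below assumes hypothetical log patterns and alert wording; the actual messages produced by the security service are not given here, so the rules are placeholders.

```python
import re
from typing import Optional

# Hypothetical rules: (pattern to look for in a log line, alert to raise).
ALERT_RULES = [
    (
        re.compile(r"certificate (has )?expired", re.IGNORECASE),
        "Client certificate for the security service has expired",
    ),
    (
        re.compile(r"handshake failure|bad certificate", re.IGNORECASE),
        "TLS handshake with the security service failed",
    ),
]


def classify_error(message: str) -> Optional[str]:
    """Return the alert text for the first matching rule, or None."""
    for pattern, alert in ALERT_RULES:
        if pattern.search(message):
            return alert
    return None
```

Rules are checked in order, so the more specific expiry pattern is listed first and takes precedence over the generic handshake failure.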