Elevated Errors
Incident Report for SCORM Cloud
Postmortem

May 29 Outage

First, we deeply apologize for the unexpected service interruption on May 29, 2020. We know that SCORM Cloud is a foundational pillar of many customer applications, and we understand the importance of its reliability.

The outage began at 04:21 PM UTC and lasted until 05:01 PM UTC. Below we detail the root cause, the corrective actions we took, and the preventive measures we will put in place in response to this outage.

Root Cause

SCORM Cloud relies on a handful of internal services that support the main web application. One of these, our internal security service, manages runtime configuration and secrets provisioning. For example, this service handles the automatic daily rotation of backend database credentials, provides credentials for third-party services (like our credit card processor), and so forth.

Because of the sensitivity of this service, all communication between the primary application servers and the security service is secured with mutual (bidirectional) TLS, using one certificate on the client and one on the server. The root cause of the outage was an expired client certificate. Once that certificate expired, the security service no longer accepted connections from the application servers, so they could not acquire the secrets necessary to run the web application, resulting in a total outage.
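
As a rough illustration only (the hostname, endpoint, and file paths below are hypothetical placeholders, not SCORM Cloud's actual internals), this is the shape of the mutual-TLS secrets fetch described above, and why an expired client certificate stops it cold: the security service rejects the TLS handshake before any secret is returned.

```python
# Minimal sketch (not SCORM Cloud's actual client code): an application server
# fetching a secret from the internal security service over mutual TLS.
import requests

SECURITY_SERVICE = "https://security-service.internal"  # assumed internal hostname

def fetch_secret(name: str) -> str:
    response = requests.get(
        f"{SECURITY_SERVICE}/v1/secrets/{name}",
        # Client certificate and key presented to the server (the "client" half
        # of the mutual TLS pair). If this certificate has expired, the server
        # rejects the handshake and this call raises requests.exceptions.SSLError
        # before any secret is returned, which is the failure mode that took the
        # application servers down.
        cert=("/etc/app/client.crt", "/etc/app/client.key"),
        # CA bundle used to verify the security service's own (server) certificate.
        verify="/etc/app/internal-ca.pem",
        timeout=5,
    )
    response.raise_for_status()
    return response.json()["value"]

# e.g. the web application asking for its database password at startup:
# db_password = fetch_secret("database/password")
```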

Incident Timeline

Below is a timeline of incident events. All times are in Coordinated Universal Time (UTC).

  • 04:21 PM: monitoring detects an outage and pages engineers
  • 04:22 PM: engineers acknowledge page
  • 04:24 PM: virtual meeting room is created to discuss the issue
  • 04:27 PM: status page is updated with incident information
  • 04:30 PM: first proximate cause discovered (application server failing because of security service issue)
  • 04:44 PM: TLS certificate issue discovered
  • 04:49 PM: root cause identified: client TLS certificate expired
  • 04:52 PM: engineers re-issue the client certificate following our existing documented procedure
  • 04:58 PM: rollout of the new certificate to already-running servers begins, and engineers monitor automatic server scaling to catch new servers coming online
  • 05:01 PM: service stability is deemed restored
  • 07:00 PM: a new certificate is baked into the baseline application server machine image, fully resolving the incident

Steps Going Forward

Client certificates for the security service are currently generated by hand, stored in our security service, and installed automatically onto application server machine images during our automated build process. We will change this process so that client certificates are provisioned dynamically at build time. Because we deploy changes to SCORM Cloud frequently, every new machine image will carry a freshly issued certificate that is nowhere near expiry, which should eliminate this class of failure going forward.
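
To make the idea concrete, here is a rough sketch of what build-time provisioning could look like. Everything in it (the internal CA endpoint, the file locations, the libraries used) is an assumption for illustration, not a description of SCORM Cloud's actual build tooling.

```python
# Hypothetical build-step sketch: provision a fresh client certificate while
# baking the application server machine image, instead of copying in a
# hand-generated one. Requires the third-party "cryptography" and "requests"
# packages; the CA endpoint and file paths are placeholders.
import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# 1. Generate a new private key for this image build.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# 2. Build a certificate signing request identifying the application server role.
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "app-server")]))
    .sign(key, hashes.SHA256())
)

# 3. Ask the internal CA (hypothetical endpoint) to sign a short-lived client
#    certificate for this build.
signed = requests.post(
    "https://internal-ca.example/v1/sign-client-cert",
    data=csr.public_bytes(serialization.Encoding.PEM),
    timeout=10,
)
signed.raise_for_status()

# 4. Write the key and certificate into the image being built; every deploy
#    then ships with a certificate issued minutes earlier.
with open("/etc/app/client.key", "wb") as f:
    f.write(key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption(),
    ))
with open("/etc/app/client.crt", "wb") as f:
    f.write(signed.content)
```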

During the outage, we also ran into some confusing error messages that slowed our debugging efforts. As a result, we will clarify those error messages and add alerting on them, so that this sort of failure in the security service is easier to identify in the future.
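
One likely piece of that alerting, sketched below with an assumed file path and threshold rather than our actual monitoring configuration, is a scheduled check that warns well before a client certificate expires, instead of only after handshakes start failing.

```python
# Hypothetical scheduled check: alert while there is still time to rotate the
# certificate. The path and 14-day threshold are assumptions for illustration.
# Requires the third-party "cryptography" package.
from datetime import datetime, timedelta
from cryptography import x509

ALERT_WINDOW = timedelta(days=14)

def time_until_expiry(cert_path: str) -> timedelta:
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    # not_valid_after is a naive UTC datetime in the cryptography package.
    return cert.not_valid_after - datetime.utcnow()

remaining = time_until_expiry("/etc/app/client.crt")
if remaining <= ALERT_WINDOW:
    # In production this would page on-call through the existing monitoring system.
    print(f"ALERT: client certificate expires in {remaining.days} days")
```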

Posted May 29, 2020 - 21:02 UTC

Resolved
This incident has been resolved. We will share further details as a postmortem attached to this incident soon.
Posted May 29, 2020 - 19:08 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 29, 2020 - 17:08 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 29, 2020 - 17:04 UTC
Investigating
We're experiencing an elevated level of errors and are currently looking into the issue.
Posted May 29, 2020 - 16:29 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.