Service Interruption
Incident Report for SCORM Cloud
Postmortem

At 1:15 AM UTC on March 13th, we updated a value in our production Consul server to remove a MIME-type blacklist entry. The change itself was expected and approved, but the manual update process introduced an error. That error remained latent until our database credentials rotated on their regular schedule. Once our monitoring systems detected the problem, our SREs responded. A timeline of the response is detailed below (all times are in UTC):

  • 10:55 AM, automated systems paged the on-call SRE
  • 10:56 AM, the SRE acknowledged the page
  • 10:58 AM, SCORM Cloud went offline (returning a 404 for all requests)
  • 11:08 AM, after the initial investigation failed to identify the cause, the on-call SRE paged the backup SRE for assistance
  • 11:16 AM, the backup SRE acknowledged the page
  • 11:20 AM, the backup SRE began investigation
  • 11:21 AM, SREs updated the status page
  • 11:23 AM, SREs identified the root cause
  • 11:26 AM, SREs fixed the invalid Consul entry and performed a rolling restart of the unhealthy instances
  • 11:30 AM, monitoring reported that the service was back online
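
For context, dynamic configuration entries like this one live in Consul's key/value store and are changed through a manual update process. The Go sketch below is illustrative only (the key name config/mime-blacklist, the JSON shape, and the validation rule are assumptions for illustration, not our actual schema or tooling), but it shows the kind of pre-write check that rejects a malformed value at write time instead of letting it surface hours later:

```go
// Illustrative sketch only: the key name, JSON shape, and validation step
// are assumptions, not SCORM Cloud's actual configuration schema.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// mimeBlacklist mirrors the assumed JSON shape of the stored value.
type mimeBlacklist struct {
	Blocked []string `json:"blocked"`
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// rawValue stands in for the operator-supplied replacement value.
	rawValue := []byte(`{"blocked":["application/x-msdownload"]}`)

	// Validate before writing: a hand-edited value that fails to parse
	// is rejected here, rather than surfacing later when a consumer
	// re-reads the key.
	var parsed mimeBlacklist
	if err := json.Unmarshal(rawValue, &parsed); err != nil {
		log.Fatalf("refusing to write unparseable config: %v", err)
	}

	pair := &api.KVPair{Key: "config/mime-blacklist", Value: rawValue}
	if _, err := client.KV().Put(pair, nil); err != nil {
		log.Fatalf("consul write failed: %v", err)
	}
	fmt.Println("config/mime-blacklist updated")
}
```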

We have implemented new procedures for all future updates to our Consul server. We have also identified two improvements to our dynamic configuration system: it will become more resilient to invalid values, and it will notify us of errors immediately rather than leaving them latent. A sketch of both improvements follows.
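
To make those two improvements concrete, here is a minimal Go sketch of the reader side. The names (loadBlacklist, the alert placeholder, and the key path) are ours for illustration, not our production code: on a reload, an unparseable value triggers an immediate page, and the service keeps running on its last-known-good configuration rather than going offline.

```go
// Illustrative sketch, not production code: keep the last-known-good
// value on parse failure (resilience), and page immediately instead of
// waiting for the error to surface on its own (notification).
package main

import (
	"encoding/json"
	"log"

	"github.com/hashicorp/consul/api"
)

type mimeBlacklist struct {
	Blocked []string `json:"blocked"`
}

// loadBlacklist returns the parsed config, falling back to lastGood and
// firing an alert whenever the stored value cannot be read or parsed.
func loadBlacklist(kv *api.KV, lastGood mimeBlacklist) mimeBlacklist {
	pair, _, err := kv.Get("config/mime-blacklist", nil)
	if err != nil || pair == nil {
		alert("consul read failed", err)
		return lastGood // keep serving with the previous value
	}
	var next mimeBlacklist
	if err := json.Unmarshal(pair.Value, &next); err != nil {
		alert("invalid config value", err) // notify immediately
		return lastGood
	}
	return next
}

// alert is a placeholder for whatever paging integration is in use.
func alert(msg string, err error) {
	log.Printf("ALERT: %s: %v", msg, err)
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	current := loadBlacklist(client.KV(), mimeBlacklist{})
	log.Printf("blocked MIME types: %v", current.Blocked)
}
```

Falling back to the last-known-good value is the resilience half; paging on the failed parse is the immediate-notification half.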

Posted Mar 16, 2020 - 21:09 UTC

Resolved
This incident has been resolved. We will write a postmortem with more information soon and post it on our status page.
Posted Mar 13, 2020 - 11:50 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 13, 2020 - 11:34 UTC
Investigating
We are currently investigating an unexpected service interruption.
Posted Mar 13, 2020 - 11:21 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.