Service Interruption
Incident Report for SCORM Cloud
Postmortem

At 1:15 AM UTC on March 13th, we updated a value in our production Consul server to remove a MIME-type blacklist entry. The change itself was expected and approved, but the manual update process introduced an error. That error remained latent until our database credentials rotated on their regular schedule. Once our monitoring systems detected the problem, our SREs responded. A timeline of the response is detailed below (all times are in UTC):

  • 10:55 AM, automated systems paged the on-call SRE
  • 10:56 AM, the SRE acknowledged the page
  • 10:58 AM, SCORM Cloud went offline (returning a 404 for all requests)
  • 11:08 AM, after the initial investigation failed to identify the cause, the on-call SRE paged the backup SRE for assistance
  • 11:16 AM, the backup SRE acknowledged the page
  • 11:20 AM, the backup SRE began investigation
  • 11:21 AM, SREs updated the status page
  • 11:23 AM, SREs identified the root cause
  • 11:26 AM, SREs fixed the invalid Consul entry and performed a rolling restart of the unhealthy instances
  • 11:30 AM, monitoring reported that the service was back online
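
For context, dynamic configuration entries like this one live in Consul's key/value store and are changed through a manual update process. The Go sketch below is illustrative only (the key name config/mime-blacklist, the JSON shape, and the validation rule are assumptions for illustration, not our actual schema or tooling), but it shows the kind of pre-write check that rejects a malformed value at write time instead of letting it surface hours later:

```go
// Illustrative sketch only: the key name, JSON shape, and validation step
// are assumptions, not SCORM Cloud's actual configuration schema.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// mimeBlacklist mirrors the assumed JSON shape of the stored value.
type mimeBlacklist struct {
	Blocked []string `json:"blocked"`
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// rawValue stands in for the operator-supplied replacement value.
	rawValue := []byte(`{"blocked":["application/x-msdownload"]}`)

	// Validate before writing: a hand-edited value that fails to parse
	// is rejected here, rather than surfacing later when a consumer
	// re-reads the key.
	var parsed mimeBlacklist
	if err := json.Unmarshal(rawValue, &parsed); err != nil {
		log.Fatalf("refusing to write unparseable config: %v", err)
	}

	pair := &api.KVPair{Key: "config/mime-blacklist", Value: rawValue}
	if _, err := client.KV().Put(pair, nil); err != nil {
		log.Fatalf("consul write failed: %v", err)
	}
	fmt.Println("config/mime-blacklist updated")
}
```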

We have implemented new procedures for all future updates to our Consul server. We have also identified two improvements to our dynamic configuration system: it will become more resilient to invalid values, and it will notify us of errors immediately rather than leaving them latent. A sketch of both improvements follows.
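
To make those two improvements concrete, here is a minimal Go sketch of the reader side. The names (loadBlacklist, the alert placeholder, and the key path) are ours for illustration, not our production code: on a reload, an unparseable value triggers an immediate page, and the service keeps running on its last-known-good configuration rather than going offline.

```go
// Illustrative sketch, not production code: keep the last-known-good
// value on parse failure (resilience), and page immediately instead of
// waiting for the error to surface on its own (notification).
package main

import (
	"encoding/json"
	"log"

	"github.com/hashicorp/consul/api"
)

type mimeBlacklist struct {
	Blocked []string `json:"blocked"`
}

// loadBlacklist returns the parsed config, falling back to lastGood and
// firing an alert whenever the stored value cannot be read or parsed.
func loadBlacklist(kv *api.KV, lastGood mimeBlacklist) mimeBlacklist {
	pair, _, err := kv.Get("config/mime-blacklist", nil)
	if err != nil || pair == nil {
		alert("consul read failed", err)
		return lastGood // keep serving with the previous value
	}
	var next mimeBlacklist
	if err := json.Unmarshal(pair.Value, &next); err != nil {
		alert("invalid config value", err) // notify immediately
		return lastGood
	}
	return next
}

// alert is a placeholder for whatever paging integration is in use.
func alert(msg string, err error) {
	log.Printf("ALERT: %s: %v", msg, err)
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	current := loadBlacklist(client.KV(), mimeBlacklist{})
	log.Printf("blocked MIME types: %v", current.Blocked)
}
```

Falling back to the last-known-good value is the resilience half; paging on the failed parse is the immediate-notification half.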

Posted Mar 16, 2020 - 21:09 UTC

Resolved
This incident has been resolved. We will write a postmortem with more information soon and post it on our status page.
Posted Mar 13, 2020 - 11:50 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 13, 2020 - 11:34 UTC
Investigating
We are currently investigating an unexpected service interruption.
Posted Mar 13, 2020 - 11:21 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.