Affected time range (times are CDT): 2022-07-20 9:40 AM to 2022-07-20 11:19 AM
Description: SCORM Cloud experienced a failure in a regularly scheduled automated rotation of database credentials. This resulted in our application servers losing the ability to communicate with the database, and caused major disruption to the service.
We use Hashicorp’s Consul and Vault to provide our application servers with database credentials, which are rotated at regular intervals. When new credentials are issued, each application server soft-evicts its database connections and refreshes its credentials. After a grace period, the old credentials are invalidated.
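For illustration, here is a minimal sketch of the refresh step as an application server might perform it, assuming Vault’s database secrets engine is mounted at the default "database" path and a hypothetical role named "app"; the actual rotation in our environment is driven by Consul/Vault tooling rather than hand-written code like this.

```python
import os

import hvac
import psycopg2  # assumes a PostgreSQL-backed service for the example

# Connect to Vault; address and token handling are simplified for the sketch.
vault = hvac.Client(url="https://vault.internal:8200", token=os.environ["VAULT_TOKEN"])

def refresh_db_connection():
    # Request fresh, short-lived credentials from Vault's database secrets engine.
    # "app" is a hypothetical role name; the mount point defaults to "database".
    creds = vault.secrets.database.generate_credentials(name="app")
    username = creds["data"]["username"]
    password = creds["data"]["password"]

    # Open a new connection with the fresh credentials; in practice the connection
    # pool soft-evicts the old connections once in-flight work completes.
    return psycopg2.connect(
        host="db.internal", dbname="appdb", user=username, password=password
    )
```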
At around 4:00 PM on July 19th, the leader node of our Consul cluster gave out, meaning that the application servers could not get the updated credentials. So when the old credentials were invalidated, the application servers were still presenting them, and their calls to the database were rejected as unauthorized, causing the outage.
While investigating both the root cause and the fix, we uncovered a few areas where we could improve our Consul cluster and the monitoring around it.
Since we run multiple nodes for Consul and Vault, those nodes sit behind an AWS load balancer, much like our application servers do. Unlike our application servers, however, the Consul nodes are not in an autoscaling group, so if one of the nodes gives out, we have no way to automatically detect the failure and replace that node.
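As a sketch of what automated detection could look like, the snippet below polls the load balancer’s target health with boto3; the target group ARN is a placeholder, and the longer-term fix is to put the nodes in an autoscaling group rather than to poll by hand.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; the real target group for the Consul/Vault nodes differs.
CONSUL_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/consul/abc123"

def unhealthy_consul_nodes():
    """Return the instance IDs the load balancer currently reports as unhealthy."""
    response = elbv2.describe_target_health(TargetGroupArn=CONSUL_TARGET_GROUP_ARN)
    return [
        target["Target"]["Id"]
        for target in response["TargetHealthDescriptions"]
        if target["TargetHealth"]["State"] != "healthy"
    ]
```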
Additionally, we found metrics and logs that would have warned us of a potential outage before it happened. First, we noticed logs showing failed connections to Consul and Vault that began before the outage. Around the same time, we also saw a sustained increase in CPU usage on the Consul cluster boxes. Alerts on either or both of these would have let us know something was wrong before the database credentials were invalidated and the outage occurred.
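As one example of how those failed-connection log lines could be turned into an alertable signal, here is a sketch using a CloudWatch Logs metric filter; the log group name and filter pattern are hypothetical and depend on how the application logs are actually shipped.

```python
import boto3

logs = boto3.client("logs")

# Hypothetical log group and pattern; the real values depend on our log pipeline.
logs.put_metric_filter(
    logGroupName="/app-servers/application",
    filterName="consul-vault-connection-failures",
    filterPattern='"connection refused" "consul"',
    metricTransformations=[
        {
            "metricName": "ConsulVaultConnectionFailures",
            "metricNamespace": "Custom/Infrastructure",
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)
```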
The immediate fix for the problem was to stop the affected box in our Consul cluster. With the instance entirely stopped, Consul was easily able to elect a new leader and return to normal operations.
We then restarted the box and monitored it while it rejoined the cluster. Once it was back up and had successfully rejoined the Consul cluster, we were back to normal operating capacity.
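For reference, a rough sketch of that remediation using boto3 and Consul’s HTTP API; the instance ID and Consul address are placeholders, and in practice the stop, leader check, and restart were done manually while watching our monitoring.

```python
import time

import boto3
import requests

ec2 = boto3.client("ec2")

# Placeholder ID for the affected Consul node.
AFFECTED_INSTANCE = "i-0123456789abcdef0"

# Stop the affected node; with it fully stopped, the remaining nodes can elect a leader.
ec2.stop_instances(InstanceIds=[AFFECTED_INSTANCE])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[AFFECTED_INSTANCE])

# Confirm the rest of the cluster has elected a leader before restarting the node.
# /v1/status/leader returns the leader's address, or an empty string if there is none.
leader = requests.get("http://consul.internal:8500/v1/status/leader", timeout=5).json()
assert leader, "cluster has no leader yet"

# Restart the node and give it time to rejoin before declaring recovery.
ec2.start_instances(InstanceIds=[AFFECTED_INSTANCE])
ec2.get_waiter("instance_running").wait(InstanceIds=[AFFECTED_INSTANCE])
time.sleep(60)  # crude settling delay; we actually watched it rejoin via monitoring
```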
We then added an alert for sustained high CPU usage on the Consul cluster boxes. Looking back at what happened, sustained CPU was one of the signals that would have alerted us to the issue before the outage occurred, so we now have an alert that should let us address this kind of failure before it affects any customers.
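A minimal sketch of such an alarm with boto3 is shown below; the instance ID, threshold, evaluation window, and SNS topic are all hypothetical and would be tuned to the cluster’s normal baseline, with one alarm per Consul node.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical values; one alarm is created per Consul node.
cloudwatch.put_metric_alarm(
    AlarmName="consul-node-sustained-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,               # 5-minute datapoints
    EvaluationPeriods=6,      # sustained for 30 minutes before alarming
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```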
Apart from the increased monitoring and alerts that we’ve already implemented, the next improvement we can make is to add additional health checks to the Consul cluster boxes.
Currently, we rely only on AWS’s default health checks for EC2 instances. Those are usually sufficient, but there are cases where a box is marked healthy even though it is unresponsive or otherwise not working. Adding our own health checks will allow us to detect these unresponsive boxes even when the default checks report healthy. That way, we can automate the process of cycling the box instead of relying on a manual response to an alert.
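One possible shape for such a check, assuming it runs on each Consul/Vault node and uses the services’ standard local HTTP endpoints, is sketched below; the exit code could feed the load balancer’s health check or, once the nodes are in an autoscaling group, be used to mark the instance unhealthy so it gets cycled automatically.

```python
import sys

import requests

def consul_healthy() -> bool:
    """Check that the local Consul agent responds and the cluster has a leader."""
    try:
        leader = requests.get("http://127.0.0.1:8500/v1/status/leader", timeout=2).json()
        return bool(leader)  # an empty string means no leader is elected
    except requests.RequestException:
        return False

def vault_healthy() -> bool:
    """Check the local Vault node; 200 = active, 429 = healthy standby."""
    try:
        status = requests.get("http://127.0.0.1:8200/v1/sys/health", timeout=2).status_code
        return status in (200, 429)
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Non-zero exit signals that this node should be replaced.
    sys.exit(0 if consul_healthy() and vault_healthy() else 1)
```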