Affected time range (times are CDT): 2022-07-20 9:40 AM to 2022-07-20 11:19 AM
Description: SCORM Cloud experienced a failure in a regularly scheduled automated rotation of database credentials. This resulted in our application servers losing the ability to communicate with the database, and caused major disruption to the service.
We use Hashicorp’s Consul and Vault to provide our application servers with database credentials, which are rotated at regular intervals. When new credentials are issued, each application server soft-evicts its database connections and refreshes its credentials. After a grace period, the old credentials are invalidated.
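For illustration, here is a minimal sketch of the refresh step as an application server might perform it, assuming Vault’s database secrets engine is mounted at the default "database" path and a hypothetical role named "app"; the actual rotation in our environment is driven by Consul/Vault tooling rather than hand-written code like this.

```python
import os

import hvac
import psycopg2  # assumes a PostgreSQL-backed service for the example

# Connect to Vault; address and token handling are simplified for the sketch.
vault = hvac.Client(url="https://vault.internal:8200", token=os.environ["VAULT_TOKEN"])

def refresh_db_connection():
    # Request fresh, short-lived credentials from Vault's database secrets engine.
    # "app" is a hypothetical role name; the mount point defaults to "database".
    creds = vault.secrets.database.generate_credentials(name="app")
    username = creds["data"]["username"]
    password = creds["data"]["password"]

    # Open a new connection with the fresh credentials; in practice the connection
    # pool soft-evicts the old connections once in-flight work completes.
    return psycopg2.connect(
        host="db.internal", dbname="appdb", user=username, password=password
    )
```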
At around 4:00 PM on July 19th, the leader node of our Consul cluster gave out, meaning that the application servers could not get the updated credentials. So when the old credentials were invalidated, the application servers were still presenting them, and their calls to the database were rejected as unauthorized, causing the outage.
While investigating both the root cause and the fix, we uncovered a few areas where we could improve our Consul cluster and the monitoring around it.
Since we run multiple nodes for Consul and Vault, those nodes sit behind an AWS load balancer, much like our application servers do. Unlike our application servers, however, the Consul nodes are not in an autoscaling group, so if one of the nodes gives out, we have no way to automatically detect the failure and replace that node.
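As a sketch of what automated detection could look like, the snippet below polls the load balancer’s target health with boto3; the target group ARN is a placeholder, and the longer-term fix is to put the nodes in an autoscaling group rather than to poll by hand.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; the real target group for the Consul/Vault nodes differs.
CONSUL_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/consul/abc123"

def unhealthy_consul_nodes():
    """Return the instance IDs the load balancer currently reports as unhealthy."""
    response = elbv2.describe_target_health(TargetGroupArn=CONSUL_TARGET_GROUP_ARN)
    return [
        target["Target"]["Id"]
        for target in response["TargetHealthDescriptions"]
        if target["TargetHealth"]["State"] != "healthy"
    ]
```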
Additionally, we found metrics and logs that would have warned us of a potential outage before it happened. First, we noticed logs showing failed connections to Consul and Vault that began before the outage. Around the same time, we also saw a sustained increase in CPU usage on the Consul cluster boxes. Alerts on either or both of these would have let us know something was wrong before the database credentials were invalidated and the outage occurred.
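As one example of how those failed-connection log lines could be turned into an alertable signal, here is a sketch using a CloudWatch Logs metric filter; the log group name and filter pattern are hypothetical and depend on how the application logs are actually shipped.

```python
import boto3

logs = boto3.client("logs")

# Hypothetical log group and pattern; the real values depend on our log pipeline.
logs.put_metric_filter(
    logGroupName="/app-servers/application",
    filterName="consul-vault-connection-failures",
    filterPattern='"connection refused" "consul"',
    metricTransformations=[
        {
            "metricName": "ConsulVaultConnectionFailures",
            "metricNamespace": "Custom/Infrastructure",
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)
```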
The immediate fix for the problem was to stop the affected box in our Consul cluster. With the instance entirely stopped, Consul was easily able to elect a new leader and return to normal operations.
We then restarted the box and monitored it while it rejoined the cluster. Once it was back up and had successfully rejoined the Consul cluster, we were back to normal operating capacity.
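For reference, a rough sketch of that remediation using boto3 and Consul’s HTTP API; the instance ID and Consul address are placeholders, and in practice the stop, leader check, and restart were done manually while watching our monitoring.

```python
import time

import boto3
import requests

ec2 = boto3.client("ec2")

# Placeholder ID for the affected Consul node.
AFFECTED_INSTANCE = "i-0123456789abcdef0"

# Stop the affected node; with it fully stopped, the remaining nodes can elect a leader.
ec2.stop_instances(InstanceIds=[AFFECTED_INSTANCE])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[AFFECTED_INSTANCE])

# Confirm the rest of the cluster has elected a leader before restarting the node.
# /v1/status/leader returns the leader's address, or an empty string if there is none.
leader = requests.get("http://consul.internal:8500/v1/status/leader", timeout=5).json()
assert leader, "cluster has no leader yet"

# Restart the node and give it time to rejoin before declaring recovery.
ec2.start_instances(InstanceIds=[AFFECTED_INSTANCE])
ec2.get_waiter("instance_running").wait(InstanceIds=[AFFECTED_INSTANCE])
time.sleep(60)  # crude settling delay; we actually watched it rejoin via monitoring
```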
We then added an alert for sustained high CPU usage on the Consul cluster boxes. Looking back at what happened, sustained CPU was one of the signals that would have alerted us to the issue before the outage occurred, so we now have an alert that should let us address this kind of failure before it affects any customers.
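A minimal sketch of such an alarm with boto3 is shown below; the instance ID, threshold, evaluation window, and SNS topic are all hypothetical and would be tuned to the cluster’s normal baseline, with one alarm per Consul node.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical values; one alarm is created per Consul node.
cloudwatch.put_metric_alarm(
    AlarmName="consul-node-sustained-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,               # 5-minute datapoints
    EvaluationPeriods=6,      # sustained for 30 minutes before alarming
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```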
Apart from the increased monitoring and alerts that we’ve already implemented, the next improvement we can make is to add additional health checks to the Consul cluster boxes.
Currently, we rely only on AWS’s default health checks for EC2 instances. Those are usually sufficient, but there are cases where a box is marked healthy even though it is unresponsive or otherwise not working. Adding our own health checks will allow us to detect these unresponsive boxes even when the default checks report healthy. That way, we can automate the process of cycling the box instead of relying on a manual response to an alert.
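One possible shape for such a check, assuming it runs on each Consul/Vault node and uses the services’ standard local HTTP endpoints, is sketched below; the exit code could feed the load balancer’s health check or, once the nodes are in an autoscaling group, be used to mark the instance unhealthy so it gets cycled automatically.

```python
import sys

import requests

def consul_healthy() -> bool:
    """Check that the local Consul agent responds and the cluster has a leader."""
    try:
        leader = requests.get("http://127.0.0.1:8500/v1/status/leader", timeout=2).json()
        return bool(leader)  # an empty string means no leader is elected
    except requests.RequestException:
        return False

def vault_healthy() -> bool:
    """Check the local Vault node; 200 = active, 429 = healthy standby."""
    try:
        status = requests.get("http://127.0.0.1:8200/v1/sys/health", timeout=2).status_code
        return status in (200, 429)
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Non-zero exit signals that this node should be replaced.
    sys.exit(0 if consul_healthy() and vault_healthy() else 1)
```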