Elevated Errors
Incident Report for SCORM Cloud
Postmortem

Problem Description & Scope

Affected time range (times are CDT): 2023-04-24 1:29 PM to 2023-04-24 2:14 PM

Description: The SCORM Cloud database experienced a deadlock caused by an internal tool connected to the database’s reader instance. This deadlock effectively stopped all database traffic, freezing the application. Stopping the internal tool allowed SCORM Cloud to reestablish its connections to the database.

Root Cause

In order to maximize database availability, SCORM Cloud uses an AWS Aurora cluster with both a reader and a writer instance. That way, we can query the reader instance when extracting large batches of data without interrupting the throughput of insert, update, or delete queries to the writer instance. Although there are multiple instances that we can connect to, all instances in an Aurora cluster share the same storage volume on the back end, per the AWS docs. As a result, a long-running query on a reader instance can cause serious performance problems on the writer instance in the same cluster.
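
For illustration, here is a minimal sketch of how a tool might split its traffic across the two endpoints, sending bulk reads to the reader and writes through the cluster (writer) endpoint. The endpoint names, credentials, and table below are placeholders, not our actual configuration:

    import pymysql

    # Cluster (writer) endpoint: handles insert, update, and delete traffic.
    WRITER_ENDPOINT = "example-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"
    # Reader endpoint: routes connections to the cluster's reader instance(s).
    READER_ENDPOINT = "example-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com"

    def connect(host):
        return pymysql.connect(host=host, user="app", password="...", database="clouddb")

    # Large batch extractions go to the reader so they don't compete with writer
    # throughput -- although, as this incident showed, the shared storage layer
    # means the two instances are not fully isolated from each other.
    reader_conn = connect(READER_ENDPOINT)
    with reader_conn.cursor() as cur:
        cur.execute("SELECT id, data FROM registrations WHERE updated > %s", ("2023-04-01",))
        rows = cur.fetchall()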

At around 1:30 PM on April 24th, an internal application extracting large volumes of data was connected to the reader instance, and a query it executed caused performance problems across the whole cluster. This effectively stopped all traffic between the SCORM Cloud application and the database and rendered almost all of SCORM Cloud’s operations non-responsive.

Analysis

At first glance, we didn’t understand how a query running on the Aurora reader instance could affect performance on the writer. We assumed that the reader instance was a true “read replica” of the writer instance, meaning it was a clone of the database kept in sync by some form of replication, such as binlog replication.

We first verified that the internal application was connecting to the reader instance by ensuring that the Aurora reader endpoint resolved to the correct instance endpoint. We suspected that a DNS caching issue had caused the internal tool to connect to the writer via an out-of-date record. However, after confirming that the tool was indeed connected to the correct instance, we were left wondering how a query on the reader could slow down our writer.
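
As a rough illustration of that check: the reader endpoint is a DNS name that should resolve to the address of a reader instance, not the writer. One way to compare them (the endpoint names here are placeholders):

    import socket

    # Placeholder endpoints -- substitute the cluster's real endpoints.
    READER_ENDPOINT = "example-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com"
    INSTANCE_ENDPOINTS = {
        "writer": "example-instance-1.abc123.us-east-1.rds.amazonaws.com",
        "reader": "example-instance-2.abc123.us-east-1.rds.amazonaws.com",
    }

    # If the reader endpoint resolves to the writer's address, a stale DNS
    # record could be sending "reader" traffic to the writer instance.
    reader_ip = socket.gethostbyname(READER_ENDPOINT)
    for role, endpoint in INSTANCE_ENDPOINTS.items():
        if socket.gethostbyname(endpoint) == reader_ip:
            print(f"Reader endpoint currently resolves to the {role} instance")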

We then came across a blog post from Plaid describing how they had run into similar issues with Aurora. It takes a deep dive into how MySQL handles transactions, persistence, and replication, and how those pieces are implemented in Aurora MySQL.

In short, even though the reader and writer use separate compute nodes for query execution, they share a persistence layer whose architecture mirrors plain MySQL’s (i.e., everything to do with MVCC). The conclusion Plaid came to, which lines up with what we saw, is this:

because Aurora MySQL primary and replica instances share a storage layer, they share a set of undo logs. This means that, for a REPEATABLE READ isolation level, the storage instance must maintain undo logs at least as far back as could be required to satisfy transactional guarantees for the primary or any read replica instance. Long-running replica transactions can negatively impact writer performance in Aurora MySQL.

This was confirmed by looking at the “RollbackSegmentHistoryListLength” metric for our cluster in CloudWatch. We noticed it started to climb when the internal tool ran its queries, and it then tapered off once we stopped the tool.
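
For reference, this is roughly how that metric can be pulled from CloudWatch with boto3; the instance identifier below is a placeholder, and the dimension may need adjusting for a given cluster:

    from datetime import datetime, timedelta, timezone
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Fetch the last hour of undo history list length for the writer instance.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="RollbackSegmentHistoryListLength",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-writer-instance"}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=["Maximum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Maximum"])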

Fixes Implemented

The blog post above outlines some ways to mitigate the impact that a query on the reader has on the writer. To start, we followed the advice in the Aurora documentation and made sure that the internal tool uses the “READ COMMITTED” transaction isolation level. The drawbacks mentioned in the documentation are negligible given how the internal tool works, so we can keep it at that isolation level whenever it runs.
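
A minimal sketch of what that looks like from the tool’s side, assuming a PyMySQL connection (the connection details and table are placeholders):

    import pymysql

    conn = pymysql.connect(
        host="example-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com",
        user="internal_tool",
        password="...",
        database="clouddb",
    )

    with conn.cursor() as cur:
        # Under READ COMMITTED, each statement gets a fresh read view, so a long
        # extraction doesn't pin the cluster's shared undo history the way a
        # single long-running REPEATABLE READ transaction would.
        cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")
        cur.execute("SELECT id, data FROM registrations WHERE updated > %s", ("2023-04-01",))
        for row in cur:
            ...  # placeholder for the tool's extraction logic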

We also added a CloudWatch alarm on the “RollbackSegmentHistoryListLength” metric to alert us when that value rises significantly. This should warn us when a query has been running too long on a reader instance, giving us enough time to find and stop the offending process before it impacts the performance of the SCORM Cloud application.
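
A sketch of that alarm as it might be created with boto3 (the alarm name, instance identifier, threshold, and SNS topic are all placeholders rather than our production values):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the undo history list stays elevated for several minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="aurora-rollback-segment-history-list-length",
        Namespace="AWS/RDS",
        MetricName="RollbackSegmentHistoryListLength",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-writer-instance"}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=5,
        Threshold=100000,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-alerts"],
    )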

Future Improvements

The main improvement we can make in the future is to ensure that any process that runs a large or potentially long-running query against the SCORM Cloud reader instance uses the “READ COMMITTED” isolation level where possible.

If that’s not possible due to data constraints, or if the query is simply too large, we can run the query against a separate database instance, such as a binlog replica. Because that instance sits outside the Aurora cluster’s shared storage, the performance of the query won’t have any effect on the regular SCORM Cloud Aurora cluster.

Incident Timeline

All times are in CDT.

  • 2023-04-24 1:29 PM - Our monitoring service first notes that application nodes have larger-than-normal numbers of database connection pool waits.
  • 2023-04-24 1:32 PM - We start to notice spikes in CPU usage across most of the application nodes as they are all stuck waiting to connect to the database.
  • 2023-04-24 1:36 PM - Our monitoring service notifies us that the SCORM Cloud website and API have become non-responsive.
  • 2023-04-24 1:37 PM - The large spikes in CPU usage, combined with errors from the database connections, cause application servers to fail health checks, leading the auto-scaling group to replace them.
  • 2023-04-24 1:39 PM - We update the SCORM Cloud status page to reflect the outage.
  • 2023-04-24 1:44 PM - We identify the query that is causing the issue and correlate it with the internal application that is running it.
  • 2023-04-24 1:48 PM - The internal application is stopped, allowing the database to start recovering.
  • 2023-04-24 1:52 PM - Our monitoring service begins reporting that the issues are resolving.
  • 2023-04-24 2:00 PM - The application is marked as operational and the incident status is moved to “Monitoring.”
  • 2023-04-24 2:14 PM - The issue is marked as “Resolved.”
Posted May 05, 2023 - 20:32 UTC

Resolved
This incident has been resolved.
Posted Apr 24, 2023 - 19:14 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 24, 2023 - 19:00 UTC
Investigating
We're experiencing an elevated level of errors and are currently looking into the issue.
Posted Apr 24, 2023 - 18:39 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.