Slow Response Time
Incident Report for SCORM Cloud
Postmortem

On Oct 11th, 2019 at 13:35 UTC, our internal monitoring system alerted us to latency in SCORM Cloud API request processing. The alerts quickly cleared and then resurfaced several times over the next 20 minutes. At 13:55 UTC it became evident that normal scaling procedures were not adequate to address the problem, and at 14:09 UTC we posted notification of the latency on our status page. Investigation revealed that the deletion of expired authorization tokens that takes place when an xAPI course is launched was saturating our database connection pools with long-running operations. Indexes for the corresponding tables were identified and applied at 15:00 UTC to remedy the problem, and notification that the latency was resolved was sent out at 15:10 UTC. That resolution, however, proved premature: the backlog of xAPI launches that had been blocked during the incident then began to launch all at once.
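
To illustrate the first fix, the sketch below shows how an index on a token-expiration column lets an expiry cleanup delete avoid a full table scan. The table and column names are hypothetical, and this is a minimal Python/SQLite sketch of the idea, not the schema or change that was actually deployed.

    import sqlite3
    import time

    # Hypothetical token table; all names here are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE auth_token (
            id         INTEGER PRIMARY KEY,
            app_id     TEXT NOT NULL,
            token      TEXT NOT NULL,
            expires_at INTEGER NOT NULL  -- epoch seconds
        )
    """)

    # Without an index on expires_at, the cleanup delete below scans the whole
    # table and holds its database connection for the duration of the scan,
    # which is how long-running deletes can saturate a bounded connection pool.
    conn.execute("CREATE INDEX idx_auth_token_expires_at ON auth_token (expires_at)")

    # The kind of cleanup that runs when a course is launched (illustrative):
    conn.execute("DELETE FROM auth_token WHERE expires_at < ?", (int(time.time()),))
    conn.commit()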

At 15:49 UTC, a second set of alerts from our internal monitoring system indicated increased latency and systemic application failure. After initial investigation, the previous incident was reopened, and notification was posted on our status page at 16:18 UTC. We identified a single customer's traffic as the source of the problem. After notifying the customer, we rate limited their xAPI requests at 16:49 UTC in order to stabilize the application servers. Additional actions were required to remove application servers that had become unstable due to saturated database connection pools, and the service returned to normal at 17:08 UTC.
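
The report does not describe the exact rate limiting mechanism used. Below is a minimal fixed-window limiter sketch in Python, keyed by a customer application id, with purely illustrative limits; it is an assumption about the general shape of such a safeguard, not the code SCORM Cloud deployed.

    import time
    from collections import defaultdict
    from typing import Optional

    # Illustrative thresholds; the actual limits used were not published.
    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 1000

    # app_id -> [window_start, request_count]
    _counters = defaultdict(lambda: [0.0, 0])

    def allow_request(app_id: str, now: Optional[float] = None) -> bool:
        """Return True if this request fits within the current rate window."""
        now = time.time() if now is None else now
        window_start, count = _counters[app_id]
        if now - window_start >= WINDOW_SECONDS:
            _counters[app_id] = [now, 1]  # start a new window
            return True
        if count < MAX_REQUESTS_PER_WINDOW:
            _counters[app_id][1] = count + 1
            return True
        return False  # caller would typically respond with HTTP 429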

After stabilizing SCORM Cloud’s service, we were able to determine that writes to a specific table used as part of xAPI storage were the bottleneck causing database connection pool saturation. Several changes have been tested to improve throughput for the table in question, and we have settled on one that improves overall database throughput by fifty percent. This will allow us to support expected xAPI growth, and we will continue to implement additional improvements.
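
For a rough sense of why slow writes to one table can exhaust a bounded connection pool, and what a fifty percent throughput improvement buys: sustainable throughput is limited to roughly the pool size divided by the time each request holds a connection. The numbers below are illustrative only, not measurements from this incident.

    # Back-of-the-envelope capacity estimate (illustrative numbers only).
    pool_size = 50                # connections available per application server
    write_time_s = 0.25           # seconds each xAPI write holds a connection

    max_writes_per_sec = pool_size / write_time_s       # 200.0 before queuing begins
    improved_writes_per_sec = max_writes_per_sec * 1.5   # 300.0 after a 50% gain

    print(max_writes_per_sec, improved_writes_per_sec)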

Posted Oct 29, 2019 - 19:51 UTC

Resolved
Performance for SCORM Cloud has stabilized. While we have not yet implemented a long-term solution for this problem, safeguards have been put in place that allow us to prevent the same traffic patterns from causing systemic failure. These safeguards will be re-enabled if we see the same traffic patterns before a long-term solution is deployed.
Posted Oct 11, 2019 - 20:34 UTC
Monitoring
A temporary fix has been put in place. We will continue to monitor this problem until a permanent solution is identified.
Posted Oct 11, 2019 - 17:20 UTC
Identified
The source of the current performance problems has been identified and we are preparing a temporary solution.
Posted Oct 11, 2019 - 16:49 UTC
Investigating
We're experiencing performance issues that are resulting in slow response times and are actively investigating.
Posted Oct 11, 2019 - 16:18 UTC
This incident affected: SCORM Cloud Website and SCORM Cloud API.