On October 11, 2019 at 13:35 UTC, our internal monitoring system alerted us to latency in SCORM Cloud API request processing. The alerts cleared quickly and then resurfaced several times over the next 20 minutes. At 13:55 UTC it became evident that normal scaling procedures were not adequate to address the problem, and at 14:09 UTC we posted notification of the latency on our status page. Investigation revealed that the deletion of expired authorization tokens, which takes place when an xAPI course is launched, was saturating our database connection pools with long-running operations. We identified indexes for the corresponding tables and applied them at 15:00 UTC to remedy the problem, and we sent notification that the latency was resolved at 15:10 UTC. The resolution proved premature, however, once all of the outstanding xAPI launches that had previously been blocked began to launch successfully.
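To illustrate why an index helps with this kind of cleanup, here is a minimal sketch. It is not the actual SCORM Cloud schema or database engine: SQLite stands in for the production database, and the table `auth_tokens`, the column `expires_at`, and the index name are hypothetical. It compares the query plan for the expired-token delete before and after the index is added.

```python
import sqlite3


def delete_plan(conn):
    """Return SQLite's query plan for the expired-token cleanup as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN DELETE FROM auth_tokens WHERE expires_at < ?",
        (1000,),
    ).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")

# Hypothetical token table; expires_at is a plain integer timestamp here.
conn.execute(
    "CREATE TABLE auth_tokens ("
    "id INTEGER PRIMARY KEY, token TEXT NOT NULL, expires_at INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO auth_tokens (token, expires_at) VALUES (?, ?)",
    [(f"tok-{i}", i) for i in range(5000)],
)

# Without an index, the delete has to examine every row in the table.
plan_before = delete_plan(conn)

# With an index on the expiry column, only the expired range is visited.
conn.execute("CREATE INDEX idx_auth_tokens_expires_at ON auth_tokens (expires_at)")
plan_after = delete_plan(conn)

deleted = conn.execute(
    "DELETE FROM auth_tokens WHERE expires_at < ?", (1000,)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM auth_tokens").fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("deleted:", deleted, "remaining:", remaining)
```

A shorter delete holds its database connection for less time, which is what relieves pressure on a connection pool when many launches trigger the cleanup at once.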
At 15:49 UTC, a second set of alerts from our internal monitoring system indicated increased latency and systemic application failure. After an initial investigation, we reopened the previous incident and posted notification on our status page at 16:18 UTC. We identified a single customer’s traffic as the source of the problem. After notifying the customer, at 16:49 UTC we took steps to rate limit their xAPI requests in order to stabilize the application servers. Additional actions were required to remove application servers that had become unstable due to saturated database connection pools, and the service returned to normal at 17:08 UTC.
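This report does not describe how the rate limit was implemented. As one common approach, a per-customer token bucket can cap a single tenant's request rate without affecting others. The sketch below is illustrative only: the class names, the `rate` and `capacity` parameters, and the customer identifiers are all hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerCustomerLimiter:
    """Keep one independent bucket per customer identifier."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, customer_id):
        bucket = self._buckets.get(customer_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[customer_id] = bucket
        return bucket.allow()
```

A request handler would call `limiter.allow(customer_id)` before doing any database work and reject the request (for example with an HTTP 429 response) when it returns `False`, so that throttled requests never occupy a database connection.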
After stabilizing SCORM Cloud’s service, we determined that writes to a specific table used in xAPI storage were the bottleneck causing the database connection pool saturation. We have tested several changes to improve throughput for that table and have settled on one that improves overall database throughput by fifty percent. This will allow us to support expected xAPI growth, and we will continue to implement additional improvements.