A high volume of slow queries put a heavy load on MySQL. As a result, 3scale API and UI requests were slow to respond, and some requests were dropped because the web server did not have enough workers to process them all.
To mitigate the issue, we delayed the processing of background jobs:
Service deletion is delayed until 27/03/2019 13:00 UTC
Synchronization of application credentials and gateway configuration was back to normal on 26/03/2019 17:37 UTC
Webhooks were back to normal on 26/03/2019 17:37 UTC
Timeline:
March 26, 2019 12:41:11 UTC - 3scale API errors
March 26, 2019 12:42:11 UTC - 3scale APIs back to 100% traffic after 1 minute
March 26, 2019 12:43:11 UTC - 3scale API errors
March 26, 2019 13:00:11 UTC - 3scale APIs back to 100% traffic after 17 minutes
March 26, 2019 13:10:11 UTC - 3scale API errors
March 26, 2019 13:25:11 UTC - 3scale APIs back to 100% traffic after 15 minutes
March 26, 2019 13:27:11 UTC - 3scale API errors
March 26, 2019 13:28:11 UTC - 3scale APIs back to 100% traffic after 1 minute
March 26, 2019 13:30:11 UTC - 3scale API errors
March 26, 2019 13:45:11 UTC - 3scale APIs back to 100% traffic after 15 minutes
March 26, 2019 13:49:11 UTC - 3scale API errors
March 26, 2019 14:05:11 UTC - 3scale APIs back to 100% traffic after 16 minutes
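For reference, the error windows in the timeline above can be added up with a short script. This is only a convenience sketch: the timestamps are copied from the timeline, and the variable and function names are our own.

```python
from datetime import datetime

# (errors start, traffic restored) pairs copied from the timeline above.
# All times are UTC on March 26, 2019.
WINDOWS = [
    ("12:41:11", "12:42:11"),
    ("12:43:11", "13:00:11"),
    ("13:10:11", "13:25:11"),
    ("13:27:11", "13:28:11"),
    ("13:30:11", "13:45:11"),
    ("13:49:11", "14:05:11"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Length of each error window, in minutes.
durations = [minutes_between(start, end) for start, end in WINDOWS]

# Minutes with API errors, summed over all windows.
total_error_minutes = sum(durations)

# Minutes from the first error to the final recovery.
incident_span_minutes = minutes_between(WINDOWS[0][0], WINDOWS[-1][1])

print(durations, total_error_minutes, incident_span_minutes)
```

In total, the APIs returned errors for 65 of the 84 minutes between the first error and the final recovery.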
Root cause:
Some months ago, we enabled a feature that automatically suspends inactive free tenants, schedules them for deletion, and eventually deletes them permanently. Tenants qualify when they satisfy certain conditions, including how long they have been in a given state. As a result, thousands of old inactive tenants, along with all their associations, came due for destruction yesterday. While deleting them, the deletion process repeatedly ran a query that lacked a proper index, so the database performed many table scans, more than it could handle.
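The effect of a missing index can be illustrated with a small, self-contained sketch. It is not the actual query or schema involved in the incident: it uses an in-memory SQLite database as a stand-in for MySQL, and the `associations` table, the `tenant_id` column, and the index name are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# Hypothetical table of per-tenant associations.
conn.execute(
    "CREATE TABLE associations ("
    "id INTEGER PRIMARY KEY, tenant_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO associations (tenant_id, payload) VALUES (?, ?)",
    [(i % 1000, "x") for i in range(10000)],
)


def query_plan(connection: sqlite3.Connection) -> str:
    """Return the planner's description of the per-tenant lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM associations WHERE tenant_id = ?",
        (42,),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Without an index, the lookup has to scan the whole table.
plan_without_index = query_plan(conn)

# With an index on tenant_id, the same lookup becomes an index search.
conn.execute("CREATE INDEX idx_associations_tenant_id ON associations (tenant_id)")
plan_with_index = query_plan(conn)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The first plan reports a scan of the table and the second a search using the new index. When the same unindexed lookup is repeated once per tenant for thousands of tenants, each run reads the full table, which is the kind of load described above.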
Preventive Actions:
We have temporarily disabled the automated execution of the background job that deletes tenants that were scheduled for deletion 15 or more days ago. It will be re-enabled once the root cause is fixed.
We are optimizing the queries to avoid the problems caused by destroying so many tenants at once.
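As a general illustration of limiting how many tenants are destroyed at once, the sketch below processes deletions in small batches with a pause between them. This is an assumed approach shown for clarity, not a description of the fix being applied; `destroy_in_batches`, its parameters, and the `destroy` callback are hypothetical names.

```python
import time
from typing import Callable, Iterable


def destroy_in_batches(
    tenant_ids: Iterable[int],
    destroy: Callable[[int], None],
    batch_size: int = 50,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Destroy tenants in small batches, pausing between batches.

    Returns the number of batches processed. `destroy` is a hypothetical
    callback that deletes one tenant and its associations.
    """
    ids = list(tenant_ids)
    batches = 0
    for start in range(0, len(ids), batch_size):
        if batches:
            # Give the database time to recover before the next batch.
            sleep(pause_seconds)
        for tenant_id in ids[start:start + batch_size]:
            destroy(tenant_id)
        batches += 1
    return batches


# Demo with stand-ins: record the calls instead of touching a database or sleeping.
destroyed = []
pauses = []
batch_count = destroy_in_batches(
    range(120),
    destroyed.append,
    batch_size=50,
    pause_seconds=0.0,
    sleep=pauses.append,
)
print(batch_count, len(destroyed), len(pauses))
```

In the demo, 120 tenants are handled in three batches with two pauses in between. In a real job, the batch size and pause would need tuning against the database's capacity.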