We are currently experiencing a service disruption in our 3scale APIs (Account Management API, Analytics API, Billing API)
Incident Report for Red Hat 3scale
Postmortem

A high volume of slow queries put a heavy load on MySQL. As a result, 3scale API and UI requests responded slowly, and some requests were dropped because the web server did not have enough workers to process that many requests.

To mitigate the issue, we delayed the processing of background jobs:

Service deletion is delayed until 27/03/2019 13:00 UTC

Synchronization of application credentials and gateway configuration was back to normal on 26/03/2019 17:37 UTC

Webhooks were back to normal on 26/03/2019 17:37 UTC

Timeline:

March 26, 2019 12:41:11 UTC - 3scale API errors

March 26, 2019 12:42:11 UTC - 3scale APIs back to 100% traffic after 1 minute

March 26, 2019 12:43:11 UTC - 3scale API errors

March 26, 2019 13:00:11 UTC - 3scale APIs back to 100% traffic after 17 minutes

March 26, 2019 13:10:11 UTC - 3scale API errors

March 26, 2019 13:25:11 UTC - 3scale APIs back to 100% traffic after 15 minutes

March 26, 2019 13:27:11 UTC - 3scale API errors

March 26, 2019 13:28:11 UTC - 3scale APIs back to 100% traffic after 1 minute

March 26, 2019 13:30:11 UTC - 3scale API errors

March 26, 2019 13:45:11 UTC - 3scale APIs back to 100% traffic after 15 minutes

March 26, 2019 13:49:11 UTC - 3scale API errors

March 26, 2019 14:05:11 UTC - 3scale APIs back to 100% traffic after 16 minutes
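
The durations in the timeline can be cross-checked with a short script. This is only a sketch: the timestamps are copied from the entries above, and the variable names are illustrative.

```python
from datetime import datetime

# (errors start, traffic back to 100%) pairs, copied from the timeline, all UTC.
windows = [
    ("2019-03-26T12:41:11", "2019-03-26T12:42:11"),
    ("2019-03-26T12:43:11", "2019-03-26T13:00:11"),
    ("2019-03-26T13:10:11", "2019-03-26T13:25:11"),
    ("2019-03-26T13:27:11", "2019-03-26T13:28:11"),
    ("2019-03-26T13:30:11", "2019-03-26T13:45:11"),
    ("2019-03-26T13:49:11", "2019-03-26T14:05:11"),
]

# Length of each error window in whole minutes.
minutes = [
    int((datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() // 60)
    for start, end in windows
]
total_minutes = sum(minutes)

print(minutes)        # [1, 17, 15, 1, 15, 16]
print(total_minutes)  # 65
```

The six error windows add up to 65 minutes, spread over the 84 minutes between the first error and the final recovery.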

Root cause:

Some months ago we enabled a feature that automatically suspends inactive free tenants, schedules them for deletion, and eventually deletes them permanently. It applies to tenants that meet certain conditions, among them how long they have been in a given state. As a result, thousands of old inactive tenants became due for destruction on the day of the incident, along with all their associations. While deleting them, the database kept repeating a query that lacked a proper index and therefore performed many table scans, which the database was not able to handle.
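
The effect of a missing index can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library rather than MySQL, and the table and column names (`tenants`, `associations`, `tenant_id`) are hypothetical, since the report does not name the affected tables or the query involved.

```python
import sqlite3

# Hypothetical schema: the real 3scale tables are not named in the report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute(
    "CREATE TABLE associations (id INTEGER PRIMARY KEY, tenant_id INTEGER, payload TEXT)"
)

def plan(sql):
    """Return the 'detail' strings SQLite reports for a statement's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# The kind of lookup that runs once per tenant being destroyed.
query = "SELECT id FROM associations WHERE tenant_id = 42"

plan_without_index = plan(query)   # expected to report a scan of the whole table
conn.execute("CREATE INDEX idx_associations_tenant_id ON associations (tenant_id)")
plan_with_index = plan(query)      # expected to report a search using the new index

print(plan_without_index)
print(plan_with_index)
```

Without the index the plan scans the whole table for every tenant, so the cost grows with both the number of tenants being deleted and the size of the table. MySQL exposes similar information through its own `EXPLAIN` statement.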

Preventive Actions:

We have temporarily disabled the automated execution of the background job that deletes tenants scheduled for deletion 15 days ago or more. It will be enabled again once the root cause is fixed.

We are optimizing the queries so that destroying many tenants at once no longer overloads the database.
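
The report does not describe how the queries are being optimized. One common way to limit the impact of bulk deletions, shown here purely as an assumption and not as the 3scale fix, is to delete in small batches with a short pause between them. The sketch uses SQLite and a hypothetical `tenants` table; `delete_in_batches`, `batch_size` and `pause_seconds` are illustrative names.

```python
import sqlite3
import time

def delete_in_batches(conn, tenant_ids, batch_size=100, pause_seconds=0.0):
    """Delete tenants a few at a time, committing after each batch."""
    deleted = 0
    for start in range(0, len(tenant_ids), batch_size):
        batch = tenant_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        with conn:  # commits the batch, or rolls it back on error
            cur = conn.execute(
                f"DELETE FROM tenants WHERE id IN ({placeholders})", batch
            )
            deleted += cur.rowcount
        time.sleep(pause_seconds)  # leave room for other queries between batches
    return deleted

# Hypothetical data: 250 tenants scheduled for deletion and 50 active ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO tenants (id, state) VALUES (?, ?)",
    [(i, "scheduled_for_deletion" if i <= 250 else "active") for i in range(1, 301)],
)
conn.commit()

due = [row[0] for row in conn.execute(
    "SELECT id FROM tenants WHERE state = 'scheduled_for_deletion'"
)]
deleted = delete_in_batches(conn, due, batch_size=100)
remaining = conn.execute("SELECT COUNT(*) FROM tenants").fetchone()[0]
print(deleted, remaining)
```

Small batches keep each transaction short, so other requests are not starved while a large backlog of tenants is being removed.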

Posted Mar 28, 2019 - 09:11 CET

Resolved
This incident has been resolved.
Posted Mar 26, 2019 - 15:05 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Mar 26, 2019 - 13:42 CET
This incident affected: Account Management API, Analytics API, Billing API.