We are currently experiencing an elevated level of API errors in our Service Management API
Incident Report for Red Hat 3scale
Postmortem

On Jan. 29, 2019, after a change to our database sharding and balancing layer caused increased CPU usage on our instances, we rolled back the configuration. The rollback should not have caused any downtime, but a failure in our procedure meant the old instances were deleted before the configuration had been deployed to the new ones, so requests returned errors until the new instances were up.

Timeline:

Jan. 29, 2019 09:16 UTC - Old configuration instances deleted.

Jan. 29, 2019 09:17 UTC - All traffic returning errors.

Jan. 29, 2019 09:22 UTC - Configuration deployed to the new servers.

Jan. 29, 2019 09:23 UTC - Traffic back to normal.

SLA impact:

Jan. 29, 2019 09:16 UTC - Jan. 29, 2019 09:23 UTC - Backend not accepting requests; 5XX response codes returned.

Jan. 29, 2019 09:24 UTC - Back to 100% traffic.
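As a rough calculation, assuming availability is measured over the calendar month: the roughly seven to eight minutes of failed requests correspond to about 8 / (31 × 24 × 60) ≈ 0.018% of the month, i.e. on the order of 99.98% monthly availability for the affected API.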

Root Cause

The incident was caused by a configuration change: we deployed the new instances and deleted the old ones before all of the new instances were running correctly.

Preventative Actions

Review the instance deployment procedure and automate the process; see the sketch below.
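As a sketch of what such automation could enforce (this is not our actual tooling; the health-check URLs, instance identifiers, and provisioning steps are placeholders), the procedure below deploys the configuration to the new instances and refuses to delete the old ones until every new instance passes its health check:

```python
# Minimal sketch of a rollout that gates the destructive step on health checks.
# All names and endpoints here are hypothetical placeholders.
import time
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the instance's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def wait_until_healthy(urls: list[str], deadline_s: float = 300.0) -> bool:
    """Poll every new instance until all are healthy or the deadline expires."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if all(is_healthy(u) for u in urls):
            return True
        time.sleep(5)
    return False


def roll_out(new_health_urls: list[str], old_instance_ids: list[str]) -> None:
    # 1. Deploy the configuration to the new instances (placeholder step).
    for url in new_health_urls:
        print(f"deploying configuration behind {url}")

    # 2. Never delete the old instances until the new ones are serving correctly.
    if not wait_until_healthy(new_health_urls):
        raise RuntimeError("new instances never became healthy; keeping old instances")

    # 3. Only now tear down the old instances (placeholder step).
    for instance_id in old_instance_ids:
        print(f"deleting old instance {instance_id}")
```

The key property is that the deletion step is gated on the new instances passing health checks rather than on a fixed order of manual actions, which is the guarantee our manual procedure failed to provide.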

Posted Jan 29, 2019 - 17:54 CET

Resolved
This incident has been resolved.
Posted Jan 29, 2019 - 12:55 CET
Update
We are continuing to monitor for any further issues.
Posted Jan 29, 2019 - 12:55 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 29, 2019 - 10:57 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 29, 2019 - 10:50 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Jan 29, 2019 - 10:43 CET
This incident affected: Service Management API.