We are currently experiencing an elevated level of API errors in our Service Management API
Incident Report for Red Hat 3scale
Postmortem

On Jan. 29, 2019, we rolled back a configuration change to our database sharding and balancing layer after observing increased CPU usage on our instances. The rollback should not have caused any downtime, but a failure in our procedure meant that the old instances were deleted before the configuration had been deployed to the new ones, so requests returned errors until the new instances were up.

Timeline:

Jan. 29, 2019 09:16 UTC - Old configuration instances deleted.

Jan. 29, 2019 09:17 UTC - All traffic returning errors.

Jan. 29, 2019 09:22 UTC - Configuration deployed to the new instances.

Jan. 29, 2019 09:23 UTC - Traffic back to normal.

SLA impact:

Jan. 29, 2019 09:16 UTC - Jan. 29, 2019 09:23 UTC - Backend not accepting requests; 5XX response codes returned.

Jan. 29, 2019 09:24 UTC - Traffic back to 100%.

Root Cause

The incident was caused by a configuration change: we deployed the new instances and deleted the old ones before all of the new instances were running correctly.

Preventative Actions

Review the instance deployment procedure and automate the process, so that old instances are only removed once the new ones are confirmed healthy (see the sketch below).
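
As an illustration only, the following is a minimal sketch of what such an automated, health-gated rollout could look like. The helpers deploy_config, is_healthy, and delete_instance are hypothetical placeholders standing in for the real provisioning calls, which are not described in this report.

# Minimal sketch of a health-gated rollout, assuming hypothetical helpers
# (deploy_config, is_healthy, delete_instance) that would wrap the real
# provisioning APIs used for the sharding and balancing layer.
import time
from typing import Callable, Iterable


def rollout(new_instances: Iterable[str],
            old_instances: Iterable[str],
            deploy_config: Callable[[str], None],
            is_healthy: Callable[[str], bool],
            delete_instance: Callable[[str], None],
            timeout_s: int = 300,
            poll_s: int = 5) -> None:
    """Deploy configuration to the new instances, then delete the old ones
    only after every new instance reports healthy."""
    new_instances = list(new_instances)

    # Step 1: push the configuration to the replacement instances.
    for instance in new_instances:
        deploy_config(instance)

    # Step 2: wait until all new instances pass their health check.
    deadline = time.monotonic() + timeout_s
    pending = set(new_instances)
    while pending:
        if time.monotonic() > deadline:
            # Abort instead of removing old capacity while the new
            # instances are still unable to serve traffic.
            raise RuntimeError(f"instances never became healthy: {sorted(pending)}")
        pending = {i for i in pending if not is_healthy(i)}
        if pending:
            time.sleep(poll_s)

    # Step 3: only now is it safe to remove the old instances.
    for instance in old_instances:
        delete_instance(instance)

Keeping the deletion step behind the health check is the point of the sketch: had this ordering been enforced during the Jan. 29 rollback, the old instances would have kept serving traffic until the new ones were ready.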

Posted Jan 29, 2019 - 17:54 CET

Resolved
This incident has been resolved.
Posted Jan 29, 2019 - 12:55 CET
Update
We are continuing to monitor for any further issues.
Posted Jan 29, 2019 - 12:55 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 29, 2019 - 10:57 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 29, 2019 - 10:50 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Jan 29, 2019 - 10:43 CET
This incident affected: Service Management API.