Experienced elevated level of API errors in our Service Management API

Incident Report for Red Hat 3scale

Postmortem

On Sept. 26th of 2018 we had a network partition on our AWS infrastructure. This triggered a failover of our data layer infrastructure.

Timeline:

Sept. 26, 2018 05:12 UTC - network partition on one of our shards, failover starts.
Sept. 26, 2018 05:23 UTC - slaves synchronized and traffic back to normal.

SLA impact:

Sept. 26, 2018 05:12 UTC - Sept. 26, 2018 05:23 UTC- backend not accepting requests - 5XX response codes returned.
Sept. 26, 2018 05:24 UTC - back to 100% traffic.

Root Cause

The initial trigger was a network partition failure in the AWS infrastructure which led to failover of one of the server instances in our database system.

Preventative Actions

Review failover process to shorten failover time.

Posted Sep 26, 2018 - 12:58 CEST

Resolved

The incident has been resolved.

Summary:
* Incident started at: 05:12 UTC
* Incident resolved at: 05:23 UTC

Posted Sep 26, 2018 - 07:44 CEST

Monitoring

During the period from 05:12 to 05:23 UTC we have had an elevated number of errors on our Service Management API due to a failover on our platform, the failover has been automaticly done and resolved.

Posted Sep 26, 2018 - 07:40 CEST

This incident affected: Service Management API.