Elevated level of API errors in our Service Management API
Incident Report for Red Hat 3scale
Postmortem

On Sept. 26th, 2018, a network partition occurred in our AWS infrastructure. This triggered a failover of our data layer infrastructure.

Timeline:

  • Sept. 26, 2018 05:12 UTC - network partition on one of our shards, failover starts.
  • Sept. 26, 2018 05:23 UTC - slaves synchronized and traffic back to normal.

SLA impact:

  • Sept. 26, 2018 05:12 UTC - Sept. 26, 2018 05:23 UTC - backend not accepting requests - 5XX response codes returned.
  • Sept. 26, 2018 05:24 UTC - back to 100% traffic.

Root Cause

  • The initial trigger was a network partition failure in the AWS infrastructure, which led to a failover of one of the server instances in our database system (see the sketch below).
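
The following is only an illustrative sketch of the failover pattern involved, not our actual tooling: the hostnames, port, thresholds, and the promote() helper are hypothetical. It shows the general idea of an automated monitor declaring a partitioned primary down after several consecutive failed health checks and then promoting a replica.

    import socket
    import time

    # All values below are illustrative placeholders, not our production settings.
    PRIMARY = ("db-primary.internal.example", 6379)        # hypothetical primary node
    REPLICAS = [("db-replica-1.internal.example", 6379)]   # hypothetical replica pool
    CHECK_INTERVAL = 1.0    # seconds between health checks
    FAILURE_THRESHOLD = 5   # consecutive failures before declaring the primary down

    def is_reachable(host, port, timeout=1.0):
        """Return True if a TCP connection to the node succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def promote(replica):
        # Placeholder: a real promotion would reconfigure replication and
        # repoint application traffic at the new primary.
        print("promoting %s:%d to primary" % replica)

    def monitor_and_failover():
        """Declare the primary down after repeated failures, then promote a replica."""
        failures = 0
        while True:
            if is_reachable(*PRIMARY):
                failures = 0
            else:
                failures += 1
                if failures >= FAILURE_THRESHOLD:
                    promote(REPLICAS[0])
                    return
            time.sleep(CHECK_INTERVAL)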

Preventative Actions

  • Review the failover process to shorten failover time (see the sketch below).
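
As a rough way to reason about where an 11-minute window can go, the sketch below splits a failover into detection, promotion, replica resynchronization, and client reconnection. Every number is an illustrative placeholder, not a measurement from this incident; the point is simply to make each contribution explicit so the review can target the largest one.

    # Illustrative breakdown of a failover window; all numbers are placeholders.
    detection_s = 30      # time to declare the partitioned primary down
    promotion_s = 10      # promoting a replica to primary
    resync_s = 600        # remaining replicas resynchronizing from the new primary
    reconnect_s = 20      # applications re-resolving and reconnecting

    total_s = detection_s + promotion_s + resync_s + reconnect_s
    print("worst-case failover window: ~%d min %d s" % (total_s // 60, total_s % 60))
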
Posted Sep 26, 2018 - 12:58 CEST

Resolved
The incident has been resolved.

Summary:
* Incident started at: 05:12 UTC
* Incident resolved at: 05:23 UTC
Posted Sep 26, 2018 - 07:44 CEST
Monitoring
During the period from 05:12 to 05:23 UTC we had an elevated number of errors on our Service Management API due to a failover on our platform. The failover was performed automatically and the issue is now resolved.
Posted Sep 26, 2018 - 07:40 CEST
This incident affected: Service Management API.