We are currently experiencing a Small service disruption in our Service Management API

Incident Report for Red Hat 3scale

Postmortem

On April 26th of 2018 we had a network partition on our AWS infrastructure. This triggered a failover of our data layer infrastructure. This layer is composed of different shards and normally the failover occurs with less downtime. But in this case, during the next 9 minutes we had more network partitions, that caused all the slaves in one shard to fail and required synchronizing data from zero, during the synchronization task some incoming requests to our infrastructure were unable to be processed.

Timeline:

Apr 26, 2018 07:24 UTC - network partition on all our shards, masters failover start.
Apr 26, 2018 07:33 UTC - network partition again, slaves desynchronized and resynchronizing again.
Apr 26, 2018 07:40 UTC - slaves synchronized.

SLA impact: * Apr 26, 2018 07:24 UTC - Apr 26, 2018 07:33 UTC- backend not accepting requests - 5XX response codes returned.

Apr 26, 2018 07:33 UTC - Apr 26, 2018 07:41 UTC- traffic at 85%
Apr 26, 2018 07:41 UTC - back to 100% traffic.

Posted Apr 26, 2018 - 14:57 CEST

Resolved

The issue has been resolved and fixed

Posted Apr 26, 2018 - 10:08 CEST

Identified

The issue has been identified and a fix is being implemented.

Posted Apr 26, 2018 - 09:52 CEST

Investigating

Our operations team is working to identify the root cause and implement a solution.

Posted Apr 26, 2018 - 09:49 CEST