We are currently experiencing a service disruption in our Service Management API

Incident Report for Red Hat 3scale

Postmortem

On October 9, 2019, a network partition occurred in our AWS infrastructure. It triggered a failover on one shard of our data layer at 07:59 UTC, which completed without issues. The data layer is composed of several shards, and a failover normally occurs without any downtime. In this case, however, 15 minutes after the failover (at 08:14 UTC), a memory issue caused the new shard master to start synchronizing again from disk to memory. For the next 7 minutes the Service Management API was down: the backend did not accept requests and returned 5XX response codes because of the data layer issue.

Timeline:

Oct 9, 2019 07:59 UTC - network partition on one shard, master failover.
Oct 9, 2019 08:14 UTC - 15 minutes after the failover (the new shard master had been working correctly until then), the new shard master started to synchronize again from disk to memory.
Oct 9, 2019 08:21 UTC - new shard master fully synchronized again (in memory).

SLA:

Oct 9, 2019 08:14 UTC - Oct 9, 2019 08:21 UTC - backend not accepting requests; 5XX response codes returned because of the data layer issue.
Oct 9, 2019 08:21 UTC - back to 100% traffic.
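
The durations quoted in this report can be re-derived from the timeline. The following is a minimal sketch using only the timestamps stated above; the variable and function names are illustrative assumptions, not part of any 3scale tooling, and the fixed UTC+2 offset for CEST is assumed from the status update times.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the timeline in this report (all UTC).
failover = datetime(2019, 10, 9, 7, 59, tzinfo=timezone.utc)
outage_start = datetime(2019, 10, 9, 8, 14, tzinfo=timezone.utc)
outage_end = datetime(2019, 10, 9, 8, 21, tzinfo=timezone.utc)

# Assumption: CEST is modeled as a fixed UTC+2 offset.
CEST = timezone(timedelta(hours=2), "CEST")


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole minutes elapsed between two aware datetimes."""
    return int((end - start).total_seconds() // 60)


# Time the new shard master ran normally before the resynchronization.
healthy_minutes = minutes_between(failover, outage_start)
# Length of the window in which the backend returned 5XX responses.
downtime_minutes = minutes_between(outage_start, outage_end)

print(f"Healthy after failover: {healthy_minutes} min")
print(f"Downtime: {downtime_minutes} min")
print(f"Outage start in CEST: {outage_start.astimezone(CEST):%H:%M}")
print(f"Outage end in CEST: {outage_end.astimezone(CEST):%H:%M}")
```

The CEST values line up with the "Investigating" and "Resolved" update times posted for this incident.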

Root cause:

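The initial trigger was a network partition in the AWS infrastructure, which led to the failover of one of the server instances in our database system. The downtime itself began 15 minutes later when, due to a memory issue, the new shard master synchronized again from disk to memory.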

Prevention actions:

Review failover process to shorten failover time.

Posted Oct 09, 2019 - 18:39 CEST

Resolved

This incident has been resolved.

Posted Oct 09, 2019 - 10:21 CEST

Investigating

Our operations team is working to identify the root cause and implement a solution.

Posted Oct 09, 2019 - 10:14 CEST

This incident affected: Service Management API.