Outage on Service Management API
Incident Report for Red Hat 3scale
Postmortem

On January 2nd, 2018, AWS forced reboots of numerous servers to patch the kernel side-channel attack vulnerabilities (CVE-2017-5753, CVE-2017-5715, CVE-2017-5754). This triggered a failover of the data layer infrastructure on our backend. Normally such a failover happens with zero downtime, with one of the two slaves being promoted to master. In this case, however, the second slave was also lost, which meant that both slaves had to re-sync their data at the same time. Our data layer cluster only allows writes on the master node when it has more than one slave attached, so while both slaves were re-syncing our code layer could not write to the data layer, which caused the outage. To restore service as soon as possible we manually set an exception that allowed writes to the master. We will review options to apply this exception automatically in a future scenario like this.
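
For illustration only: the min-slaves-to-write directive referenced in the timeline below is a Redis replication-safety setting, and the manual exception amounted to turning it off. A minimal sketch of checking and relaxing it, assuming a Redis data layer, the redis-py client, and a placeholder hostname (not our real topology):

    # Hedged sketch: assumes the data layer is Redis and uses redis-py;
    # the hostname below is a placeholder, not our real infrastructure.
    import redis

    master = redis.Redis(host="backend-redis-master.example.internal", port=6379)

    # With min-slaves-to-write > 0, the master rejects writes whenever fewer
    # than that many synchronized slaves are attached -- the condition that
    # blocked our code layer during the incident.
    print(master.config_get("min-slaves-to-write"))

    # Emergency override of the kind applied manually at 17:01 UTC: allow
    # writes even with no synchronized slaves so traffic can resume while
    # the slaves re-sync.
    master.config_set("min-slaves-to-write", 0)

Relaxing this setting trades replication-safety guarantees for availability, which is why it was applied only as a manual exception during the incident.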

Timeline:

Jan 2, 2018 16:52 UTC - failover start

Jan 2, 2018 17:01 UTC - we set min-slaves-to-write to 0 on the data layer master

Jan 2, 2018 17:06 UTC - first slave synchronized

Jan 2, 2018 17:09 UTC - second slave synchronized (see the sync check sketched below)
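
As a hedged illustration of how the 17:06 and 17:09 entries can be confirmed, assuming a Redis data layer and the redis-py client (placeholder hostname): the master's replication info lists each slave's state, and a slave that has finished re-syncing reports state "online".

    # Hedged sketch: checks slave sync state via the master's replication info.
    # Assumes a Redis data layer and redis-py; the hostname is a placeholder.
    import redis

    master = redis.Redis(host="backend-redis-master.example.internal", port=6379)

    info = master.info("replication")
    print("connected slaves:", info["connected_slaves"])

    # Each attached slave appears as slave0, slave1, ... with a state field;
    # "online" means the initial re-sync has completed.
    for key, value in info.items():
        if key.startswith("slave") and isinstance(value, dict):
            print(key, value.get("state"), "lag:", value.get("lag"))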

SLA impact:

Jan 2, 2018 16:52 UTC - Jan 2, 2018 17:02 UTC - backend not accepting requests, returning 5XX response codes

Jan 2, 2018 17:02 UTC - Jan 2, 2018 17:05 UTC - traffic recovered to 75%

Jan 2, 2018 17:05 UTC - back to 100% traffic

Posted Jan 05, 2018 - 17:00 CET

Resolved
The issue on our Service Management API has been resolved. Errors occurred from 16:51 UTC to 17:03 UTC.
Posted Jan 02, 2018 - 18:15 CET
Identified
We detected an elevated number of errors on our Service Management API from 16:51 UTC to 17:03 UTC.
Posted Jan 02, 2018 - 18:15 CET