On January 2, 2018, AWS forced numerous servers to reboot due to the kernel side-channel attack vulnerabilities (CVE-2017-5753, CVE-2017-5715, CVE-2017-5754). This triggered a failover of our backend data layer infrastructure. Normally a failover completes with zero downtime, with one of the two slaves being promoted to master. In this case, however, the second slave was also lost, so both slaves had to re-sync data at the same time. Our data layer cluster only allows writes on the master node when it has more than one slave, so our code layer could not write to the data layer, which caused the outage. To restore service as quickly as possible, we set a manual exception that allows writes to the master in this situation. We will review options for applying this exception automatically if a similar scenario occurs in the future.
Jan 2, 2018 16:52 UTC - failover started
Jan 2, 2018 17:01 UTC - we set min-slaves-to-write to 0
Jan 2, 2018 17:06 UTC - first slave synchronized
Jan 2, 2018 17:09 UTC - second slave synchronized
Jan 2, 2018 16:52 UTC - Jan 2, 2018 17:02 UTC - backend not accepting requests (5XX response codes)
Jan 2, 2018 17:02 UTC - Jan 2, 2018 17:05 UTC - traffic at 75%
Jan 2, 2018 17:05 UTC - back to 100% traffic
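The write rule behind this outage, and the kind of automatic exception we plan to review, can be sketched roughly as follows. This is an illustrative model only, not our production code: the `Replica` class, both function names, and the threshold value of 1 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    in_sync: bool  # False while the replica is still re-syncing from the master


def master_accepts_writes(replicas, min_slaves_to_write):
    """Return True when enough replicas are in sync for the master to accept writes."""
    in_sync_count = sum(1 for r in replicas if r.in_sync)
    return in_sync_count >= min_slaves_to_write


def relaxed_threshold(replicas, configured_min):
    """Hypothetical policy: drop the threshold to 0 only while every replica is re-syncing.

    This mirrors the manual change made during the incident; it is not an
    existing feature of the data layer.
    """
    if replicas and not any(r.in_sync for r in replicas):
        return 0
    return configured_min


CONFIGURED_MIN = 1  # illustrative value, the real setting is not stated here

# Situation during the incident: both slaves re-syncing at the same time.
replicas = [Replica("slave-1", in_sync=False), Replica("slave-2", in_sync=False)]

blocked = master_accepts_writes(replicas, CONFIGURED_MIN)
unblocked = master_accepts_writes(replicas, relaxed_threshold(replicas, CONFIGURED_MIN))
```

With both slaves out of sync, `blocked` is False and `unblocked` is True. Once either slave finishes synchronizing, `relaxed_threshold` returns the configured value again, so the normal safety rule comes back without manual intervention. Relaxing the threshold trades durability for availability, which is why we want to review this option carefully before automating it.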