On April 26th of 2018 we had a network partition on our AWS infrastructure. This triggered a failover of our data layer infrastructure. This layer is composed of different shards and normally the failover occurs with less downtime. But in this case, during the next 9 minutes we had more network partitions, that caused all the slaves in one shard to fail and required synchronizing data from zero, during the synchronization task some incoming requests to our infrastructure were unable to be processed.
Timeline:
Apr 26, 2018 07:24 UTC - network partition on all our shards, masters failover start.
Apr 26, 2018 07:33 UTC - network partition again, slaves desynchronized and resynchronizing again.
Apr 26, 2018 07:40 UTC - slaves synchronized.
SLA impact: * Apr 26, 2018 07:24 UTC - Apr 26, 2018 07:33 UTC- backend not accepting requests - 5XX response codes returned.
Apr 26, 2018 07:33 UTC - Apr 26, 2018 07:41 UTC- traffic at 85%
Apr 26, 2018 07:41 UTC - back to 100% traffic.