On Sept. 26, 2018, a network partition in our AWS infrastructure triggered a failover of our data layer.
Timeline:
Sept. 26, 2018 05:12 UTC - network partition on one of our shards, failover starts.
Sept. 26, 2018 05:23 UTC - slaves synchronized and traffic back to normal.
SLA impact:
Sept. 26, 2018 05:12 UTC - Sept. 26, 2018 05:23 UTC - backend not accepting requests; 5XX response codes returned.
Sept. 26, 2018 05:24 UTC - back to 100% traffic.
Root Cause
The initial trigger was a network partition in the AWS infrastructure, which led to a failover of one of the server instances in our database system.
Preventative Actions
Review the failover process to shorten failover time.
Posted Sep 26, 2018 - 12:58 CEST
Resolved
The incident has been resolved.
Summary:
* Incident started at: 05:12 UTC
* Incident resolved at: 05:23 UTC
Posted Sep 26, 2018 - 07:44 CEST
Monitoring
Between 05:12 and 05:23 UTC we saw an elevated number of errors on our Service Management API due to a failover on our platform. The failover was performed automatically and has been resolved.