On October 9, 2019, we had a network partition on our AWS infrastructure. This triggered a failover on one shard of our data layer at 7:59 UTC, and the failover itself completed without issues. The data layer is composed of several shards, and a failover normally occurs without any downtime. In this case, however, 15 minutes after the failover (at 8:14 UTC), a memory issue caused the new shard master to start synchronizing again from disk to memory. For the next 7 minutes the Service Management API was down: because of the data layer issue, the backend was not accepting requests and returned 5XX response codes.
The initial trigger was a network partition in the AWS infrastructure, which led to a failover of one of the server instances in our database system.
Review the failover process to shorten failover time.