On Apr. 16th of 2019, after a change on our networking, some services associated to the Admin Portal UI failed. We received an alert just after this situation, and we proceed to review and detach this failing instances from the load balancer. Due to that, only half of the servers were affected, we had 50% of traffic responding with 5xx codes. After some time, another server failed, and we did the same procedure.
Apr. 16, 2019 18:19 UTC - Change of networking and some servers failing
Apr. 16, 2019 18:20 UTC - Ops team receive an alert and start to investigate
Apr. 16, 2019 18:21 UTC - Detection of the error and the associated servers
Apr. 16, 2019 18:23 UTC - Detach of failing servers
Apr. 16, 2019 18:23 UTC - Traffic back to normal
Apr. 16, 2019 19:01 UTC - One more server affected
Apr. 16, 2019 19:02 UTC - Traffic back to normal
Apr. 16, 2019 18:19 UTC - Apr. 16, 2019 18:23 UTC - 50% of traffic responding with 5xx HTTP codes
Apr. 16, 2019 18:23 UTC - back to 100% traffic
Apr. 16, 2019 19:01 UTC - Apr. 16, 2019 19:02 UTC - 25% of traffic responding with 5xx HTTP codes
Apr. 16, 2019 19:02 UTC - back to 100% traffic
The cause of the incident was caused by a change on the network, after this change some servers with the change started to fail.
Review of code to make our platform more resilient to network failures and change