We are currently experiencing a service disruption in our Admin Portal UI

Incident Report for Red Hat 3scale

Postmortem

On Apr. 16, 2019, after a change to our networking, some services associated with the Admin Portal UI failed. We received an alert shortly afterwards and proceeded to review the failing instances and detach them from the load balancer. Because only half of the servers were affected, 50% of traffic responded with 5xx codes. Some time later, another server failed, and we followed the same procedure.

Timeline:

Apr. 16, 2019 18:19 UTC - Networking change applied; some servers start failing

Apr. 16, 2019 18:20 UTC - Ops team receives an alert and starts to investigate

Apr. 16, 2019 18:21 UTC - Error and the affected servers identified

Apr. 16, 2019 18:23 UTC - Failing servers detached from the load balancer

Apr. 16, 2019 18:23 UTC - Traffic back to normal

Apr. 16, 2019 19:01 UTC - One more server affected

Apr. 16, 2019 19:02 UTC - Traffic back to normal

SLA impact:

Apr. 16, 2019 18:19 UTC - Apr. 16, 2019 18:23 UTC - 50% of traffic responding with 5xx HTTP codes

Apr. 16, 2019 18:23 UTC - back to 100% traffic

Apr. 16, 2019 19:01 UTC - Apr. 16, 2019 19:02 UTC - 25% of traffic responding with 5xx HTTP codes

Apr. 16, 2019 19:02 UTC - back to 100% traffic
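
As a rough illustration of the figures above, the two degraded windows can be combined into a single traffic-weighted number. This is only a sketch: it assumes traffic was spread evenly across each window, and "full-outage-equivalent minutes" is a hypothetical metric used here for illustration, not the definition used in our SLA.

```python
from datetime import datetime

# Windows and 5xx fractions copied from the SLA impact list above (times in UTC).
WINDOWS = [
    ("2019-04-16 18:19", "2019-04-16 18:23", 0.50),
    ("2019-04-16 19:01", "2019-04-16 19:02", 0.25),
]

FMT = "%Y-%m-%d %H:%M"


def window_minutes(start: str, end: str) -> float:
    """Length of one degraded window in minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


def full_outage_equivalent_minutes(windows) -> float:
    """Weight each window by its 5xx fraction (assumes an even traffic rate)."""
    return sum(window_minutes(s, e) * fraction for s, e, fraction in windows)


total_degraded = sum(window_minutes(s, e) for s, e, _ in WINDOWS)
equivalent = full_outage_equivalent_minutes(WINDOWS)

print(f"Degraded minutes: {total_degraded}")
print(f"Full-outage-equivalent minutes: {equivalent}")
```

Under these assumptions, the incident covers 5 degraded minutes, equivalent to 2.25 minutes of complete unavailability.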

Root Cause

The incident was caused by a change to the network. After the change was applied, some of the servers that received it started to fail.

Preventative Actions

Review of code to make our platform more resilient to network failures and changes

Posted Apr 17, 2019 - 18:13 CEST

Resolved

This incident has been resolved.

Posted Apr 16, 2019 - 19:01 CEST

Investigating

Our operations team is working to identify the root cause and implement a solution.

Posted Apr 16, 2019 - 19:01 CEST

This incident affected: Admin Portal UI.