Partial service disruption in our Admin Portal UI and 3scale APIs (Account Management API, Analytics API, Billing API)
Incident Report for Red Hat 3scale
Postmortem

Summary:

On March 10, 2020, a new deployment of the user interface application, combined with high memory consumption on some frontend servers, caused memory exhaustion. During that timeframe, our monitoring system detected an increase in 502 errors, and the engineering team took action.

Timeline:

  • Mar 10, 11:40 CET
    An increase in 502 errors and in memory consumption on some frontend servers raises alerts in our monitoring system.
  • Mar 10, 11:42 CET
    The issue is detected during the deployment of a new release, and the process is stopped. The affected servers are restored, and the investigation starts.
  • Mar 10, 12:14 CET
    A new deployment is executed on a reduced number of servers.
  • Mar 10, 12:29 CET
    A new deployment is executed on the rest of the servers.
  • Mar 10, 12:30 CET
    The service is stable and no more 502 response codes are detected.

Root cause:

Abnormal memory consumption affecting some frontend servers during a deployment.

Preventive Actions:

  • Increased the resilience of the frontend layer
  • Improved the deployment process to avoid service disruptions during new releases
Posted Mar 13, 2020 - 13:49 CET

Resolved
This incident has been resolved.
Posted Mar 10, 2020 - 12:30 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Mar 10, 2020 - 11:40 CET
This incident affected: Admin Portal UI.