Service disruption affecting all APIs
Incident Report for Red Hat 3scale
Postmortem

On May 24th of 2023 all the 3scale SaaS APIs suffered a major outage for 26 minutes.

Timeline:

  • May 24, 2023 09:03 UTC: rollout of a new version of the control plane that the 3scale SaaS uses to distribute its ingress layer configuration is completed.
  • May 24, 2023 09:04 UTC: alerts start firing for all 3scale SaaS APIs. Statistics show all requests to all APIs failing. Given the widespread outage the rollback procedure is started some minutes later to bring the control plane back to its previous version.
  • May 24, 2023 09:30 UTC - the rollback procedure is completed and service is recovered for all APIs.

Root Cause:

The root cause has been identified as being related to the rollout strategy used to deploy the new versions of the control plane.

Preventative Actions:

Mitigations include modifying this strategy appropriately and adding some control mechanisms to make the process more robust and avoid this from happening in the future.

Posted May 24, 2023 - 16:37 CEST

Resolved
This incident has been resolved.
Posted May 24, 2023 - 11:29 CEST
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted May 24, 2023 - 11:07 CEST
This incident affected: Service Management API.