Service Management API - Service disruption
Incident Report for Red Hat 3scale
Postmortem

On December 22nd, 2021, a global AWS incident that degraded the EC2 service and network connectivity in one Availability Zone of the AWS North Virginia Region caused 1 hour and 26 minutes of downtime of the Service Management API.

The EC2 unavailability affected a high percentage of our backend storage AWS EC2 instances (both the proxy and the storage instances). Our operations team tried to mitigate the issue by adding new EC2 instances in the healthy Availability Zones, but the new storage proxy instances were unable to pull the application container images because of a concurrent outage of the quay.io container registry (itself caused by the same AWS incident). Remediation actions were then taken at the proxy layer to restore the service; a hypothetical sketch of the kind of step involved follows below.
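
For illustration only, the sketch below shows the kind of mitigation step described above: launching replacement storage-proxy EC2 instances into the Availability Zones that remained healthy. Every identifier in it (region, AMI, subnets, instance type, tags, function name) is a hypothetical placeholder, not 3scale's actual configuration or tooling.

```python
# Hypothetical illustration only: all identifiers below (AMI, subnets,
# instance type, tags) are placeholders, not 3scale's actual setup.
import boto3

REGION = "us-east-1"                         # AWS North Virginia
PROXY_AMI = "ami-0123456789abcdef0"          # placeholder image for the storage proxy
INSTANCE_TYPE = "m5.large"                   # placeholder instance type
HEALTHY_AZ_SUBNETS = {                       # subnets in the unaffected Availability Zones
    "us-east-1b": "subnet-0b0b0b0b0b0b0b0b0",
    "us-east-1c": "subnet-0c0c0c0c0c0c0c0c0",
}

ec2 = boto3.client("ec2", region_name=REGION)

def launch_replacement_proxies(count_per_az: int = 1) -> list[str]:
    """Launch new storage-proxy instances spread across the healthy AZs."""
    launched = []
    for az, subnet_id in HEALTHY_AZ_SUBNETS.items():
        response = ec2.run_instances(
            ImageId=PROXY_AMI,
            InstanceType=INSTANCE_TYPE,
            MinCount=count_per_az,
            MaxCount=count_per_az,
            SubnetId=subnet_id,              # the subnet pins the instance to its AZ
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "backend-storage-proxy"},
                         {"Key": "availability-zone", "Value": az}],
            }],
        )
        launched.extend(i["InstanceId"] for i in response["Instances"])
    return launched
```

A step like this still depends on the new instances being able to pull their container images at boot, which is why the quay.io outage blocked the first mitigation attempt.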

Timeline:

  • Dec 22, 2021 12:12 UTC - High rate of HTTP 5XX errors begins (100% of traffic affected), caused by the AWS outage.
  • Dec 22, 2021 13:14 UTC - Traffic back to 100% OK.
  • Dec 22, 2021 13:19 UTC - High rate of HTTP 5XX errors begins again (100% of traffic affected), caused by the quay.io container registry outage.
  • Dec 22, 2021 13:49 UTC - Traffic back to 100% OK.

Root cause:

Concurrent outages of both AWS EC2 (https://status.aws.amazon.com) and the quay.io container registry (https://status.quay.io/incidents/rm3k6b5nby8m).

Posted Dec 22, 2021 - 18:49 CET

Resolved
This incident has been resolved.

A postmortem will be published soon.
Posted Dec 22, 2021 - 14:49 CET
Update
The service disruption is related to an ongoing AWS outage. We are working on a mitigation.
Posted Dec 22, 2021 - 14:06 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Dec 22, 2021 - 13:12 CET
This incident affected: Service Management API.