On December 22, 2021, an outage affecting the AWS EC2 service and network connectivity in one Availability Zone of the AWS North Virginia Region caused 1 hour and 26 minutes of downtime for the Service Management API.
The EC2 unavailability affected a high percentage of the AWS EC2 instances backing our storage layer (both proxy and storage instances). Our operations team tried to mitigate the issue by adding new EC2 instances in the healthy Availability Zones, but the new storage proxy instances were unable to pull the application container images because of a concurrent outage of the quay.io container registry (caused by the same AWS incident). Remediation actions were taken at the proxy layer to restore the service.
Timeline:
Dec 21, 2021 12:12 UTC - High rate of HTTP 5XX errors begins (100% of traffic), caused by the AWS outage.
Dec 21, 2021 13:14 UTC - Error rate returns to normal; 100% of traffic served successfully.
Dec 21, 2021 13:19 UTC - High rate of HTTP 5XX errors begins again (100% of traffic), caused by the quay.io container registry outage.