On December 22, 2021, an outage affecting the AWS EC2 service and network connectivity in one Availability Zone of the AWS North Virginia Region caused 1 hour and 26 minutes of downtime for the Service Management API.
The EC2 unavailability affected a high percentage of the AWS EC2 instances backing our storage layer (both proxy and storage instances). Our operations team tried to mitigate the issue by adding new EC2 instances in the healthy Availability Zones, but the new storage proxy instances were unable to pull the application container images because of a concurrent outage of the quay.io container registry (caused by the same AWS incident). Remediation actions were taken at the proxy layer to restore the service.
Timeline:
Dec 21, 2021 12:12 UTC - High rate of HTTP 5XX errors begins (100% of traffic), caused by the AWS outage.
Dec 21, 2021 13:14 UTC - Error rate returns to normal; 100% of traffic served successfully.
Dec 21, 2021 13:19 UTC - High rate of HTTP 5XX errors begins again (100% of traffic), caused by the quay.io container registry outage.