Service Management API - Service disruption
Incident Report for Red Hat 3scale
Postmortem

On November 8th of 2021 a very high rate of 500 HTTP responses was detected in the Service Management API frontend affecting 90% of the traffic. The root cause was tracked down to an intermediate proxy layer between the frontend instances and the sharded storage layer. Remediation actions were undertaken in this proxy layer to restore the service.

Timeline:

  • Nov 8, 2021 09:55 UTC - high rate of HTTP 5XX errors starts (90% of traffic).

  • Nov 8, 2021 10:33 UTC - back to 100% traffic OK.

Root cause:

  • Unexpected error on an intermediate proxy layer between Service Management API fronted instances and sharded storage layer.
Posted Nov 08, 2021 - 16:21 CET

Resolved
This incident has been resolved.

Several incidents were opened and closed during the service disruption due to the fact that our status page is updated automatically based on HTTP probe results in order to give users quick feedback on the status of the service. As a 10% of the requests were still being correctly processed, some probes were successful during the incident, causing the status page incidents to be resolved.

A postmortem will be published soon.
Posted Nov 08, 2021 - 11:33 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Nov 08, 2021 - 10:55 CET
This incident affected: Service Management API.