We are currently experiencing a service disruption in our Service Management API

Incident Report for Red Hat 3scale

Postmortem

On November 7, 2023, the Service Management API suffered a partial outage lasting 35 minutes that affected about 25% of API requests.

Timeline:

November 7, 2023 11:04 UTC: A new version of the Service Management API application is rolled out.
November 7, 2023 11:06 UTC: Alerts start firing for the internal API. Statistics show rapid growth of the events queue, caused by a fix to the events functionality, which eventually overloads the internal API, increasing request latency and causing some timeouts.
November 7, 2023 11:33 UTC: The rollback procedure is executed to stop queuing new events.
November 7, 2023 11:41 UTC: A cleanup of the events causing the issue is executed and the service is restored.

Root Cause:

The release included a fix for the internal events fetching mechanism, which pulls data through an internal API on the component that serves the Service Management API. Because of data that had accumulated in the events queue, the fix overloaded that component. This increased latency and caused a partial outage of the Service Management API.

Preventative Actions:

Mitigations include developing a different mechanism that does not affect the performance of the API, and improving testing to cover this scenario in the staging and development environments.
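
This report does not describe the replacement mechanism. As a purely illustrative sketch of one way to keep a backlog from overloading the component that processes it, the example below drains a queue in bounded batches, so a large accumulated backlog is spread over several runs instead of being processed at once. The function `drain_in_batches`, its parameters, and the in-memory `deque` are hypothetical and are not taken from the 3scale codebase.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_in_batches(
    queue: Deque[str],
    handle_batch: Callable[[List[str]], None],
    batch_size: int = 100,
    max_batches: int = 10,
) -> int:
    """Process at most batch_size * max_batches events per call.

    Returns the number of events processed. Anything left stays queued
    for the next call, which caps the work done in a single run.
    (Hypothetical helper; names and defaults are assumptions.)
    """
    processed = 0
    for _ in range(max_batches):
        if not queue:
            break
        # Take up to batch_size events from the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        handle_batch(batch)
        processed += len(batch)
    return processed
```

With a backlog of 250 events, `batch_size=100`, and `max_batches=2`, a first call would process 200 events and leave 50 for the next call. A staging test for the scenario in this incident could pre-fill the queue with a large backlog and check that each run stays within that cap.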

Posted Nov 07, 2023 - 17:43 CET

Resolved

This incident has been resolved. A postmortem will be available shortly.

Posted Nov 07, 2023 - 12:41 CET

Investigating

Our operations team is working to identify the root cause and implement a solution.

Posted Nov 07, 2023 - 12:06 CET

This incident affected: Service Management API.