Service Management API - Degraded service
Incident Report for Red Hat 3scale
Postmortem

On 15th July 2021, the 3scale SRE team updated the 3scale SaaS Service Management API ingress layer to new, higher-performance ingress proxies based on Envoy Proxy.

This change unintentionally removed support for the HTTP 1.0 protocol, which is disabled by default in the new ingress software stack, returning HTTP 426 errors to clients still using HTTP 1.0.

The timeline of the impact on clients using HTTP 1.0 was:

  • Between 08:19 UTC and 10:38 UTC, incoming Service Management API traffic was served by both the old and the new platform, depending on each client's DNS resolution. During this time, between 0% and 4% of the total traffic was affected. The previous ingress stack was still running in parallel and continued to serve HTTP 1.0 requests without any problem for customers whose DNS caches still pointed at its IPs. HTTP 1.0 requests reaching the new stack received an "HTTP 426 Upgrade Required" response.
  • Between 10:38 UTC and 14:43 UTC, all production traffic was being processed by the new platform. During this time, about 4% of the incoming traffic was affected.
  • At 14:43 UTC, the 3scale SRE team enabled support for HTTP 1.0 in the new ingress stack, and clients using HTTP 1.0 were able to use the Service Management API again.
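
For reference, the behaviour can be reproduced (and the fix verified) from the client side with a bare HTTP 1.0 request against the ingress. The following is a minimal, illustrative sketch only; the hostname and path are placeholders, not the real Service Management API endpoint. Before the fix, the new stack answered such a request with a 426 Upgrade Required status line; after 14:43 UTC it returned the normal Service Management API response.

    import socket
    import ssl

    # Placeholders: substitute the real Service Management API hostname and a valid path.
    HOST = "service-management.example.net"
    PORT = 443

    # A bare HTTP 1.0 request. HTTP 1.0 does not require a Host header, but we send
    # one so the ingress can route the request to the right backend.
    request = (
        "GET /check HTTP/1.0\r\n"
        f"Host: {HOST}\r\n"
        "\r\n"
    ).encode()

    context = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=10) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
            tls_sock.sendall(request)
            response = tls_sock.recv(4096).decode(errors="replace")

    # Before the fix the status line reported "426 Upgrade Required";
    # after the fix it is the normal Service Management API response.
    print(response.splitlines()[0])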

The new stack had undergone functional and load testing in our Staging cluster, but only with synthetic traffic generated by us. Because we were unaware that the new stack disabled HTTP 1.0 by default, this testing did not simulate traffic from clients still using HTTP 1.0.

The error-reporting metrics from the Envoy proxy (and hence our monitoring, dashboards and alerting) do not break out 426 errors from the overall 4XX count. This prevents us from tracking and alerting on this type of error, which is under our control, while avoiding spurious alerts from other 4XX errors caused by client software.
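
For illustration, the only signal we can currently query is the grouped 4XX rate. The sketch below shows the kind of Prometheus query involved; the Prometheus URL is a placeholder, and the metric name assumes Envoy's standard statistics as exported to Prometheus, where all 4xx responses are folded into a single response-code-class series, so a 426 generated by the ingress cannot be separated from, for example, a 404 caused by a misbehaving client.

    import requests

    # Placeholder Prometheus endpoint of the ingress monitoring stack.
    PROMETHEUS_URL = "http://prometheus.example.net:9090"

    # Assumed Envoy stat name as exported to Prometheus: every 4xx response is
    # grouped under one response-code-class label, so a 426 generated by the
    # ingress is indistinguishable from, say, a 404 caused by a client.
    QUERY = 'sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m]))'

    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    for sample in resp.json()["data"]["result"]:
        _ts, value = sample["value"]
        # One aggregated 4xx rate; there is no per-status-code breakdown.
        print(f"4xx rate: {float(value):.3f} req/s")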

Actions being taken to avoid a recurrence are:

  • Analyze traffic characteristics prior to a migration and validate that the new platform can handle all cases. We can do this by analyzing Elasticsearch/Kibana logs and Prometheus metrics.
  • We are investigating replaying a percentage of production traffic against the staging environment to help detect issues earlier (a replay sketch follows this list).
  • Improve the level of detail in our traffic monitoring, dashboards and alerts by:

    • Extracting this information from the logs of the ingress layer, since this level of detail is not provided by the currently available Prometheus metrics (a log-parsing sketch follows this list):

      • The current backend application Prometheus metrics / Grafana dashboards did not report the HTTP 426 errors: those requests never reached the backend-listener application, as HTTP 1.0 was disabled at the ingress layer.
      • The current Envoy cluster Prometheus metrics / Grafana dashboards did not report the HTTP 426 errors either, as those responses were generated at the ingress itself and never forwarded to an upstream cluster.
      • The available Envoy listener Prometheus metrics did not directly report HTTP 426 errors, but a slight increase in the rate of 4XX responses (a metric that groups all 4xx status codes) could be seen.
    • Submitting an issue to the Envoy Proxy project (and possibly a PR implementing it, depending on the reception of the issue) suggesting more fine-grained Prometheus metrics. This would allow us to monitor, and possibly alert on, the errors that are under our control while avoiding spurious alerts from other 4XX errors caused by client software.
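
As an example of extracting this information from the ingress logs, the exact status code is present in the access log even though the Prometheus metrics only expose the 4XX class. The sketch below is a minimal illustration that assumes a JSON-lines access log with a response_code field; the file name and field name depend on our actual Envoy access-log configuration.

    import json
    from collections import Counter

    # Assumption: the ingress access log is written as JSON lines and carries the
    # response status in a "response_code" field; the file name and field name
    # depend on our actual Envoy access-log configuration.
    ACCESS_LOG = "ingress-access.log"

    status_counts = Counter()
    with open(ACCESS_LOG) as log_file:
        for line in log_file:
            entry = json.loads(line)
            status_counts[entry["response_code"]] += 1

    # Unlike the grouped 4XX Prometheus metric, the logs keep the exact code,
    # so 426 responses show up as their own bucket.
    for status, count in sorted(status_counts.items()):
        print(f"{status}: {count}")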
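
The traffic-replay idea mentioned above could take a shape similar to the following minimal sketch, which samples a configurable percentage of requests from an exported access log and replays only idempotent reads against the staging ingress. The hostname, file name and log field names are placeholders, and a real implementation would also need to preserve request headers, pacing and authentication.

    import json
    import random
    import requests

    # Placeholders / assumptions: the staging hostname, file name and log field
    # names ("method", "path") depend on our actual setup.
    STAGING_BASE_URL = "https://staging.example.net"
    ACCESS_LOG = "ingress-access.log"
    SAMPLE_RATE = 0.01  # replay roughly 1% of the logged requests

    with open(ACCESS_LOG) as log_file:
        for line in log_file:
            if random.random() >= SAMPLE_RATE:
                continue
            entry = json.loads(line)

            # Only replay idempotent reads; requests that mutate state (for
            # example usage reports) should not be replayed blindly.
            if entry.get("method") != "GET":
                continue

            response = requests.get(STAGING_BASE_URL + entry["path"], timeout=5)
            print(response.status_code, entry["path"])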

Posted Jul 16, 2021 - 17:17 CEST

Resolved
This incident has been resolved; details can be found in the Postmortem.
Posted Jul 15, 2021 - 10:00 CEST