On 15th July 2021 the 3scale SRE team updated the ingress layer of the 3scale SaaS Service Management API, replacing the existing ingress software with new, higher-performance ingress proxies based on Envoy Proxy.
This change unintentionally removed support for the HTTP 1.0 protocol, which is disabled by default in the new ingress software stack. As a result, clients still using HTTP 1.0 received HTTP 426 (Upgrade Required) errors.
The timeline of the service impact for clients using HTTP 1.0 was:
The new stack had been functionally tested and load tested in our Staging cluster, but only with synthetic traffic that we generated ourselves. Because we were unaware that the new stack disables HTTP 1.0 by default, that traffic did not simulate clients still using HTTP 1.0.
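A staging check that speaks HTTP 1.0 explicitly would exercise this code path. The following is a minimal sketch in Python, not the tooling we actually use: the function names are hypothetical, it assumes a plain TCP (non-TLS) endpoint, and it writes the request by hand so that the protocol version on the wire is exactly HTTP/1.0.

```python
import socket


def build_http10_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request.

    The Host header is optional in HTTP/1.0 but is sent so that the
    proxy can route the request.
    """
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_status_code(response: bytes) -> int:
    """Extract the numeric status code from the first line of a raw HTTP response."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return int(parts[1])


def probe_http10(host: str, port: int = 80, path: str = "/", timeout: float = 5.0) -> int:
    """Send one HTTP/1.0 request over plain TCP and return the response status code."""
    buffer = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_http10_request(host, path))
        # Read only until the status line is complete (or the peer closes).
        while b"\r\n" not in buffer:
            data = sock.recv(4096)
            if not data:
                break
            buffer += data
    return parse_status_code(buffer)
```

Called against a staging ingress (for example `probe_http10("staging.example.com")`, where the hostname is a placeholder), a return value of 426 would indicate that the proxy is rejecting HTTP 1.0 clients.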
The error metrics reported by the Envoy proxy (and hence our monitoring, dashboards and alerting) do not break out 426 errors from 4XX errors overall. This prevented us from tracking and alerting on this type of error, which is under our control, without also triggering spurious alerts from other 4XX errors that are caused by client software.
Actions being taken to avoid a recurrence are:
Improve the level of detail in our traffic monitoring, dashboards and alerts by:
Extracting this information from the logs of the ingress layer, since this level of detail is not provided by the currently available Prometheus metrics.
Submitting an issue to the Envoy Proxy project (and possibly a PR implementing it, depending on how the issue is received) requesting more fine-grained Prometheus metrics. These would let us monitor, and possibly alert on, the errors that are under our control, while avoiding spurious alerts from other 4XX errors that are caused by client software.
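The log-based extraction mentioned above could be sketched roughly as follows. This is an illustrative Python example rather than our production pipeline: the function names are hypothetical, and it assumes an access log format in which the quoted request line is immediately followed by the three-digit response code. A different log format would need a different regular expression.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumption: each access log line contains the quoted request line
# followed by whitespace and the three-digit response code.
STATUS_RE = re.compile(r'"[^"]*"\s+(?P<status>\d{3})\b')


def count_status_codes(lines: Iterable[str]) -> Counter:
    """Count responses per exact status code, skipping lines that do not match."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group("status"))] += 1
    return counts


def count_4xx_by_code(lines: Iterable[str]) -> Dict[int, int]:
    """Return only the 4XX codes, broken out individually and sorted by code."""
    counts = count_status_codes(lines)
    return {code: n for code, n in sorted(counts.items()) if 400 <= code <= 499}
```

With the 4XX codes broken out individually, an alert can be set on 426 alone, without being triggered by the other 4XX codes that originate from client software.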