Delays in our data processing infrastructure.
Incident Report for Red Hat 3scale
Postmortem

Timeline

  • Failover occurred on March 7, 2018 - 04:13 UTC
  • Status Page first updated on March 7, 2018 - 04:30 UTC
  • Incident resolved on March 7, 2018 - 06:28 UTC
  • Status Page changed to monitoring phase on March 7, 2018 - 06:35 UTC
  • Incident marked as resolved on Status Page on March 7, 2018 - 06:58 UTC
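
The timeline above is given in UTC, while the status updates further down are stamped in CET. The following is a minimal sketch of the conversion and of the impact duration, assuming CET is a fixed UTC+1 offset (no daylight saving time applies in early March). The helper name utc_time is illustrative only.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: CET modelled as a fixed UTC+1 offset, valid for early March.
CET = timezone(timedelta(hours=1), "CET")


def utc_time(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on the day of the incident (March 7, 2018)."""
    return datetime(2018, 3, 7, hour, minute, tzinfo=UTC)


failover = utc_time(4, 13)    # failover occurred
resolved = utc_time(6, 28)    # incident resolved
monitoring = utc_time(6, 35)  # status page moved to monitoring

# Length of the impact window, in whole minutes.
impact_minutes = int((resolved - failover).total_seconds() // 60)

# The same monitoring timestamp as it appears in the CET status updates.
monitoring_cet = monitoring.astimezone(CET).strftime("%H:%M %Z")

print(impact_minutes)   # 135
print(monitoring_cet)   # 07:35 CET
```

Under these assumptions the impact window is 2 hours 15 minutes, and 06:35 UTC corresponds to the 07:35 CET "Monitoring" update.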

Service Impact

  • During the incident (04:13 - 06:28 UTC), approximately 70% of authorization and reporting requests failed

Root Cause

  • The initial trigger was a network partition in the AWS infrastructure, which led to the failover of one of the server instances in the job queueing system.
  • However, clients of the job queue continued to send requests to machines that could not process them, because they could not recognise the new configuration after failover.
  • The root cause of this was a bug in the job queueing service. A fix for this bug existed but had not been deployed to all servers.
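
As an illustration of the failure mode described above, here is a minimal sketch of a queue client that re-discovers the current primary when a node rejects a job, instead of continuing to send to a stale node. It reflects one reading of the description and is not the actual fix or the job queueing service's real API: FailoverAwareClient, NodeUnavailable, discover, send, and the node names are all hypothetical.

```python
from typing import Callable, List, Tuple


class NodeUnavailable(Exception):
    """Raised when a node cannot accept jobs (for example, a demoted primary)."""


class FailoverAwareClient:
    """Hypothetical client that refreshes its view of the primary on failure."""

    def __init__(self, discover_primary: Callable[[], str],
                 send: Callable[[str, str], None], max_attempts: int = 3):
        self._discover_primary = discover_primary
        self._send = send
        self._max_attempts = max_attempts
        self._primary = discover_primary()  # cached until a send fails

    def enqueue(self, job: str) -> str:
        """Send a job and return the node that accepted it."""
        for _ in range(self._max_attempts):
            try:
                self._send(self._primary, job)
                return self._primary
            except NodeUnavailable:
                # The cached node may be stale after a failover: look it up again.
                self._primary = self._discover_primary()
        raise NodeUnavailable(f"no usable primary after {self._max_attempts} attempts")


# Simulated topology (assumption: a single primary that changes on failover).
topology = {"primary": "queue-a"}
accepted: List[Tuple[str, str]] = []


def discover() -> str:
    return topology["primary"]


def send(node: str, job: str) -> None:
    if node != topology["primary"]:
        raise NodeUnavailable(node)
    accepted.append((node, job))


client = FailoverAwareClient(discover, send)
client.enqueue("job-1")
topology["primary"] = "queue-b"  # simulated failover
client.enqueue("job-2")          # first attempt fails, retry reaches queue-b
```

In this sketch the second job is accepted by queue-b after one failed attempt. A client that kept its cached node indefinitely would keep failing, which is the behaviour the bullets above describe.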

Preventative Actions

  • Troubleshooting and escalation procedures have been reviewed to reduce the time to resolution
  • Deployment processes have been reviewed to ensure that all servers are using the same latest versions of tested configurations
  • The process to update the status page will be reviewed to improve the timeliness of updates while continuing to avoid false alerts.
Posted Mar 07, 2018 - 08:27 CET

Resolved
The incident has been resolved.

Summary:
* Incident started at: 05:30 CET
* Incident resolved at: 07:35 CET
Posted Mar 07, 2018 - 07:58 CET
Monitoring
A fix has been implemented and our team is monitoring the results.
Posted Mar 07, 2018 - 07:35 CET
Identified
The issue has been identified, and our team is working on a fix.
Posted Mar 07, 2018 - 07:31 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.
Posted Mar 07, 2018 - 05:30 CET