Update of the Status Page done at March 7, 2018 - 04:30 UTC
Incident Resolved on March 7, 2018 - 06:28 UTC
Changed to monitoring phase in Status Page at March 7, 2018 - 06:35 UTC
Incident Resolved on Status Page at March 7, 2018 - 06:58 UTC
Service Impact
During the time of the incident (04:13 - 06:28) approximately 70% of authorization&reporting requests failed
Root Cause
The initial trigger was a network partition failure in the AWS infrastructure which led to failover of one of the server instances in the job queuing system.
However clients of the job queue continued to send requests to machines which could not process the requests, because they could not recognise the new configuration after failover.
The root cause of this was a bug in job queueing service. This had been fixed but had not been deployed to all servers.
Preventative Actions
Troubleshooting and escalation procedures have been reviewed to reduce the time to resolution
Deployment processes have been reviewed to ensure that all servers are using the same latest versions of tested configurations
The process to update the status page will be reviewed to improve the timeliness of updates while continuing to avoid false alerts.
Posted Mar 07, 2018 - 08:27 CET
Resolved
The incident has been resolved.
Summary:
* Incident started at: 05:30 CET
* Incident resolved at: 07:35 CET
Posted Mar 07, 2018 - 07:58 CET
Monitoring
A fix has been implemented and our team is monitoring the results.
Posted Mar 07, 2018 - 07:35 CET
Identified
The issue has been identified, and our team is working on a fix.
Posted Mar 07, 2018 - 07:31 CET
Investigating
Our operations team is working to identify the root cause and implement a solution.