Timeline

Service Impact

During the time of the incident (04:13 - 06:28) approximately 70% of authorization&reporting requests failed

The initial trigger was a network partition failure in the AWS infrastructure which led to failover of one of the server instances in the job queuing system.
However clients of the job queue continued to send requests to machines which could not process the requests, because they could not recognise the new configuration after failover.
The root cause of this was a bug in job queueing service. This had been fixed but had not been deployed to all servers.

Troubleshooting and escalation procedures have been reviewed to reduce the time to resolution
Deployment processes have been reviewed to ensure that all servers are using the same latest versions of tested configurations
The process to update the status page will be reviewed to improve the timeliness of updates while continuing to avoid false alerts.

Posted Mar 07, 2018 - 08:27 CET

Resolved

The incident has been resolved.

Summary:
* Incident started at: 05:30 CET
* Incident resolved at: 07:35 CET

Posted Mar 07, 2018 - 07:58 CET

Monitoring

A fix has been implemented and our team is monitoring the results.

Posted Mar 07, 2018 - 07:35 CET

Identified

The issue has been identified, and our team is working on a fix.

Posted Mar 07, 2018 - 07:31 CET

Investigating

Our operations team is working to identify the root cause and implement a solution.

Posted Mar 07, 2018 - 05:30 CET