On Jan. 29th of 2019, after a change on our database sharding and balancing layer, due to an increase on the CPU of our instances we did a rollback on the configuration, doing the rollback shouldn’t have downtime but a failure on our procedure provoqued that we had the old instances deleted before than configuration was deployed on the new ones, having errors until the new ones were deployed.
Jan. 29, 2019 09:16 UTC - deleting of old configuration instances.
Jan. 29, 2019 09:17 UTC - All traffic with errors
Jan. 29, 2019 09:22 UTC - deployed configuration into new servers.
Jan. 29, 2019 09:23 UTC - Traffic back to normal.
Jan. 29, 2019 09:16 UTC - Jan. 29, 2019 09:23 UTC- backend not accepting requests - 5XX response codes returned.
Jan. 29, 2019 09:24 UTC - back to 100% traffic.
The cause of the incident was caused because a change on the configuration, we did a deploy of the new instances and deleted the old ones, before than having all the new instances running correctly.
Review instance deploy and automate processes.