December 2021

What we learned from AWS's us-east-1 outage

In case you missed it, for several hours on December 7, 2021, AWS's us-east-1 region had an outage that affected multiple AWS APIs and took down various websites across the internet. According to our own monitoring at OnlineOrNot, the outage started at 2021-12-07 15:32 UTC and began a sustained recovery at 2021-12-07 22:48 UTC (with minor signs of life for a few minutes around 2021-12-07 20:08 UTC). Had we relied solely on AWS to update their status page before reacting, we would have been waiting a while.

Dealing with Noisy Error Monitoring

Say you've been tasked with monitoring an application, so you set up some alerts to let you know when errors are coming in. The minutes roll by, the errors start coming... and they don't stop coming... Oh my, there seem to be quite a few errors coming through. Alerting on each error isn't going to help, so it's better to report on changes in the error rate instead, right? Not quite.