
OnlineOrNot

On moving over a million uptime checks per week onto fly.io

The other day, a friend told me about fly.io's nice developer experience (DX). For my day job, I work on improving wrangler2's DX, so naturally it had me curious. I went from "I'll just play around with it, maybe give it a toy workload" to "holy shit, what if I quickly rewrite my business's AWS Lambda + SQS stack to fit entirely within their free tier" in about 90 minutes. It wasn't that simple in the end, but I did manage to migrate most of my active workload from AWS Lambda to fly.io.

How to monitor your uptime with OnlineOrNot

Jumping into monitoring software for the first time can be pretty overwhelming. If you're not in an exploring mood, it's easy to get lost, not entirely sure what all these knobs and buttons do. To ease that feeling, I thought it might be useful to let folks know how I use OnlineOrNot to monitor OnlineOrNot (as part of running OnlineOrNot day to day).

Communicating to Users During Incidents

Imagine you're having a regular day at work: you open your browser to double-check something for a client in that web app your team built for them, when suddenly you're staring at an error page. You hit refresh a few times, just to be sure. Nope. Still down. What happens next depends on how well your team has planned for incidents like this (some folks call it unplanned downtime).

Improving your team's on-call experience

Your engineers probably dislike going on-call for your services. Some might even dread it. It doesn't have to be this way. With a few changes to how your team runs on-call and handles recurring alerts, you might find your team starting to enjoy it (as unimaginable as that sounds). I wrote this article as a follow-up to Getting over on-call anxiety.

Getting over on-call anxiety

You've joined a company, or worked there a little while, and you've just now realised that you'll have to do on-call. You feel like you don't know much about how everything fits together; how are you supposed to fix it at 2am when you get paged? So you're a little nervous. Understandable. Here are a few tips to help you feel less nervous.

What we learned from AWS's us-east-1 outage

In case you missed it, for several hours on December 7, 2021, AWS's us-east-1 region had an outage impacting multiple AWS APIs, taking out various websites across the internet. According to our own monitoring at OnlineOrNot, the outage started at 2021-12-07 15:32 UTC and only properly began to recover at 2021-12-07 22:48 UTC (with brief signs of life for a few minutes around 2021-12-07 20:08 UTC). Had we relied solely on AWS to update their status page before reacting, we would have been waiting a while.

Dealing with Noisy Error Monitoring

Say you've been tasked with monitoring an application, so you set up some alerts to let you know when errors are coming in. The minutes roll by, the errors start coming... ...and they don't stop coming... Oh my, there seem to be quite a few errors coming through. Alerting on each error isn't going to help, so better to report on changes in the error rate instead, right? Not quite.
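
To make the trade-off concrete, here's a minimal sketch (not OnlineOrNot's implementation) of the naive error-rate alert the excerpt pushes back on: count errors against total requests over a fixed window and page when a threshold is crossed. The names and the 5% threshold are illustrative assumptions.

```typescript
interface WindowStats {
  requests: number; // total requests seen in the window
  errors: number;   // errors seen in the same window
}

// Hypothetical threshold: page when more than 5% of requests error.
const ERROR_RATE_THRESHOLD = 0.05;

function shouldAlert(window: WindowStats): boolean {
  if (window.requests === 0) return false;
  const errorRate = window.errors / window.requests;
  return errorRate > ERROR_RATE_THRESHOLD;
}

// shouldAlert({ requests: 1000, errors: 80 }) -> true
// The excerpt's "Not quite" hints that a fixed rate threshold alone
// still isn't enough (e.g. it behaves badly in low-traffic windows).
```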

Scaling AWS Lambda and Postgres to thousands of uptime checks

When you're building a serverless web app, it can be pretty easy to forget about the database. You build a backend, send some data to a frontend, write some tests, and it'll scale to infinity with no effort, right? Not quite. Especially not with Postgres. As the number of users of your frontend increases, your app will open more and more database connections until the database is unable to accept any more. It gets worse.
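
As a rough illustration of one common mitigation for that connection exhaustion (not necessarily the article's exact fix), the sketch below creates the Postgres pool once per Lambda container, outside the handler, and caps it at a single connection, so N warm containers hold at most N connections. The DATABASE_URL variable, uptime_checks table, and event shape are assumptions for the example.

```typescript
import { Pool } from "pg";

// Created once per container and reused across warm invocations,
// instead of opening a fresh connection on every request.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 1,                    // one connection per Lambda container
  idleTimeoutMillis: 10_000, // release idle connections quickly
});

export async function handler(event: { checkId: string }) {
  const { rows } = await pool.query(
    "SELECT url FROM uptime_checks WHERE id = $1",
    [event.checkId]
  );
  return rows[0];
}
```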