Latest Posts

Ways to avoid losing your domain

Imagine you're sitting in your office, and you start noticing emails coming in asking if you'd like to buy your domain. "Huh, that's weird, I already own that domain," you think to yourself. A few more emails come in, and they're getting past the spam filter, so you decide to double-check your domain manager. Doubt starts creeping into your mind. You start panicking, frantically scroll down to where the domain should be, and... it's gone.

Our lessons from the latest AWS us-east-1 outage

In case you missed it, AWS experienced an outage or "elevated error rates" on their AWS Lambda APIs in the us-east-1 region between 18:52 UTC and 20:15 UTC on June 13, 2023. If this sounds familiar, it's because it's almost a replay of what happened on December 7, 2021, although that outage was significantly more severe and took longer to restore.

No, the average cost of downtime is not $5600 per minute

A fairly common claim among website uptime monitoring services is that downtime costs $5600 per minute. Chances are, you'll have one of two reactions to this claim: The reality of what downtime costs your business lies somewhere in between. As a company that runs 3.6 million uptime checks per week, we have a bit of insight into the cost of downtime, so if you're curious, read on.

On writing better error messages

You're browsing your favorite website, clicking around, when suddenly you're rudely interrupted by a white screen, proclaiming: (I don't mean to pick on Varnish cache here, it's just a screenshot I had handy.) As a developer, my eyes scan error messages like these for numbers - in this case, the "503" - indicating that the error isn't my fault, and I can move on with my life.

Monitoring our monitoring

Last Saturday, our API went down. There wasn't even a funny error message or slightly slower responses: it just completely vanished off the internet for 18 minutes. I'm not normally one to point fingers at my hosting provider when things go wrong (since ultimately, I chose to use them, so it's my problem to fix), but when fly.io publicly posts on their forums about their reliability issues, I may as well link to them.

What I learned running a SaaS for a second year

Two years ago, OnlineOrNot started as a little toy app I built in an afternoon to see what it's like using the Next.js framework. All it did was check whether a URL is down from around the world. I gave myself a week to turn that toy into a SaaS people could pay for. It looked like this when it went live: It wasn't ready for real users, but that didn't matter. I had something out there that people could sign up for, and they could tell me what they were expecting and how OnlineOrNot fell short of their expectations.

Saving your team from alert fatigue

It's a story as old as the web itself: someone on your team gets excited to install a new tool. The tool promises to finally give you a clear view into the problems your users have with your product. Your team agrees to give it a go. The errors start coming... ...and they don't stop coming... Soon enough, most of your team has either created an email filter to manage all the alerts, or has unsubscribed themselves entirely. Just like all the other tools. Welcome to alert fatigue.