
OnlineOrNot

Our lessons from the latest AWS us-east-1 outage

In case you missed it, AWS experienced an outage (or "elevated error rates") on their AWS Lambda APIs in the us-east-1 region between 18:52 UTC and 20:15 UTC on June 13, 2023. If this sounds familiar, it's because it's almost a replay of what happened on December 7, 2021, although that outage was significantly more severe and took longer to recover from.

No, the average cost of downtime is not $5600 per minute

A fairly common claim among website uptime monitoring services is that downtime costs $5600 per minute. Chances are, you'll have one of two reactions to this claim: The reality of what downtime costs your business lies somewhere in between. As a company that runs 3.6 million uptime checks per week, we have a bit of insight into the cost of downtime, so if you're curious, read on.

On writing better error messages

You're browsing your favorite website, clicking around, when suddenly you're rudely interrupted by a white screen, proclaiming: (I don't mean to pick on Varnish cache here, it's just a screenshot I had handy.) As a developer, my eyes scan error messages like these for numbers, in this case the "503", indicating that the error isn't my fault, and I can move on with my life.

Monitoring our monitoring

Last Saturday, our API went down. There wasn't even a funny error message or slightly slower responses; it just completely vanished off the internet for 18 minutes. I'm not normally one to point fingers at my hosting provider when things go wrong (since ultimately, I chose to use them, so it's my problem to fix), but when fly.io publicly posts on their forums about their reliability issues, I may as well link to them.

What I learned running a SaaS for a second year

Two years ago, OnlineOrNot started as a little toy app for checking whether a URL is down from around the world, which I built in an afternoon to see what it's like using the Next.js framework. I gave myself a week to turn that toy into a SaaS people could pay for. It looked like this when it went live: It wasn't ready for real users, but that didn't matter. I had something out there that people could sign up for, and they could tell me what they were expecting and how OnlineOrNot fell short of their expectations.

Saving your team from alert fatigue

It's a story as old as the web itself: someone on your team gets excited to install a new tool. The tool promises to finally give you a clear view into the problems your users have with your product. Your team agrees to give it a go. The errors start coming... ...and they don't stop coming... Soon enough, most of your team has either created an email filter to manage all the alerts, or has unsubscribed themselves entirely. Just like all the other tools. Welcome to alert fatigue.

The unreasonable effectiveness of shipping every day

It's fairly common for folks in tech to dream of quitting their day job and working on their side projects. I find that when you ask them how their projects are going, they tend to have 2-3 projects running at the same time, and none of them are actually available for potential users to try out. The question they seem to ask me most is "you seem to complete your projects, how do you stay motivated?" My secret? It's a habit. I ship something every day.