Operations | Monitoring | ITSM | DevOps | Cloud

The balancing act of reliability and availability

As consumers, we expect the products and software we buy to work 100% of the time. Unfortunately, that’s impossible. Even the most reliable products and services experience some disruption in service. Crashes, bugs, timeouts. There are a ton of contributing factors, so it's impossible to distill disruptions down to a single cause. That said, technology is becoming more and more sophisticated, and so is the infrastructure that supports it.

The connection between incident management and problem management

Sometimes, two concepts overlap so much that it’s hard to view them in isolation. Today, incident management and problem management fit this description to a tee. This wasn’t always the case. For a long time, these two ITIL concepts were seen as distinct—with specialized roles overseeing each. Incident management existed in one corner and problem management in the other. Then came the DevOps movement and the lines suddenly became blurred. So where do they stand today?

Practical guidance for getting started as a site reliability engineer

At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move. With only some high-level knowledge of what the company and its systems looked like prior to this point, it’s fair to say that I didn’t have much certainty in what exactly I’d be working on or how I’d deliver it.

July 2023 newsletter: Changelog-The Deluxe Edition

🎵 Gotta give the people, give the people what they want! 🎵 You've been asking. And we've been listening. Over the past few weeks, we've been shipping frequently requested features to help you bring your incident management to the next level. It may be the dog days of summer, but let's ignore that, yeah? Just take a look at this recent changelog. Note that this is the biggest one we've ever published.

incident.io: A scalable incident management solution built for enterprises

For enterprise businesses, a lot is riding on the efficiency of their incident response. These organizations have large customer bases, complex products, and many incidents. They also have loads of incident responders across various roles, making it difficult to coordinate internally.

Why you need an internal status page

When we launched incident.io Status Pages a few months ago, we stressed the importance of communicating clearly with your customers about ongoing issues. To help with this, we spent a lot of time carefully designing a status page that’s easy to understand for everyone - whether they come from a technical background, work in a different area, or just want to get on with their day.

We used GPT-4 during a hackathon-here's what we learned

We recently ran our first hackathon in quite some time. Over two days, our team collaborated in groups on various topics. By the end of it, we had 12 demos to share with the rest of the team. These ranged from improvements in debugging HTTP request responses to the delightful “automatic swag sharer.” Within our groups, a number of us tried integrating with OpenAI’s GPT to see what smarts we could bring to our product.