Introducing: incident.io for Microsoft Teams

There’s a major outage. Support tickets are mounting. Everybody from engineering to legal is scrambling for information. You have more Teams notifications clamouring for attention than you do minutes to address them, and it’s hard to know where to begin. What comes next is a balancing act: mitigating the impact, updating colleagues, managing action items, and updating a status page that will be seen by millions.

Building On-call: Continually testing with smoke tests

With the release of On-call, our system’s reliability had to be solid from the outset. Our customers have high expectations of a paging product—and internally, we would not be comfortable with releasing something that we weren’t sure would perform under pressure. While our earlier product, Response, was the core of a customer’s incident response process after an incident was detected, we’re now the first notification an engineer gets when something’s wrong.

Redefining incident management: the power and pitfalls of AI

Like it or not, AI is having a monumental impact on our lives. Most of the products we engage with today have AI features and functionality, aimed at assisting or completely replacing the actions normally taken by humans. When it comes to incidents, we’re firm believers in accelerating human actions, and believe the risk of over-automation far outweighs the benefits. In this live event we’ll dig a little deeper into why, as we cover the power and pitfalls of AI.

Where does the time go after you resolve an incident?

We were curious: once an incident is over, how long does it take companies to document, review, create learnings, finish clean-up items, and complete any other follow-up action items? We work with a wide variety of companies, from small start-ups to enterprises with thousands of engineers. But we wanted to know: where is their time spent after they resolve an incident? Here’s what we found!

How our data team handles incidents

Historically, data teams have not been closely involved in the incident management process (at least, not in the traditional “get woken up at 2AM by a SEV0” sense). But with the growing involvement of data (and therefore data teams) in core business processes, decision making, and user-facing products, data-related incidents are increasingly common, and more important than ever.

The Debrief: Debriefing on the CrowdStrike incident

In this episode, Norberto (VP of Engineering) and Lawrence (Product Engineer) delve into the recent CrowdStrike incident that began on July 19th. Rather than focus on technical specifics, they provide a thoughtful exploration of key aspects that matter to us at incident.io, such as effective communication, overall response strategies, and proactive problem-solving during crises.

A tough day for incident responders: lessons from the CrowdStrike update

Today marks a particularly challenging day for incident responders across the globe. As many of you may have noticed, a recent update from CrowdStrike has triggered widespread disruptions, causing chaos in various sectors. The ripple effects have been far-reaching and severe. The technical specifics of the issue are not the focus here (and indeed, there are experts better suited to dissect the cause); what's crucial is understanding the impact on those who manage such crises.

Time, timezones, and scheduling

Our On-call product has been in the wild for a few months now, and in this post I want to talk about building a time-sensitive system and what we did to handle some of the challenges. I’ll cover what our scheduler is responsible for, the basics of working with time, and talk a bit about how we tested our system.

The complexity of phone networks

Arguably the most important part of an on-call product is knowing that you will be notified when things break, wherever you are. When it comes to SMS and phone call notifications, we have to leave the familiar realm of the internet and JSON responses, and deal with systems that provide limited observability and insight into what’s gone wrong.

Building a multi-platform on-call mobile app

A significant part of being on-call is the ability to respond to pages and handle escalations on the go. In the early stages of developing incident.io On-call, we considered whether a Minimum Viable Product (MVP) could rely solely on SMS and phone calls. However, we quickly realized that a fully featured mobile app was going to be essential to the on-call experience. This led us to the question: how should we build this mobile app?