Building On-call: Our observability strategy

Aug 22, 2024 By Martha Lambert In Incident.io

At incident.io, we run an on-call product. Our customers need to be sure that when their systems go wrong, we’ll tell them about it—high availability is a core requirement for us. To achieve the level of reliability that’s essential to our customers, excellent observability (o11y) is one of the most important tools in our belt. When done right, observability improves your product experience from two angles.

Introducing: incident.io for Microsoft Teams

Aug 13, 2024 By Ed Dean In Incident.io

There’s a major outage. Support tickets are mounting. Everybody from engineering to legal is scrambling for information. You have more Teams notifications clamouring for attention than you do minutes to address them, and it’s hard to know where to begin. What comes next is a balancing act—mitigating the impact, updating colleagues, managing action items, or updating a status page that will be seen by millions.

Building On-call: Continually testing with smoke tests

Aug 9, 2024 By Rory Malcolm In Incident.io

With the release of On-call, our system’s reliability had to be solid from the outset. Our customers have high expectations of a paging product—and internally, we would not be comfortable with releasing something that we weren’t sure would perform under pressure. While our earlier product, Response, was the core of a customer’s incident response process after an incident was detected, we’re now the first notification an engineer gets when something’s wrong.

Redefining incident management: the power and pitfalls of AI

Jul 31, 2024 By Incident.io In Incident.io

Like it or not, AI is having a monumental impact on our lives. Most of the products we engage with today have AI features and functionality, aimed at assisting or completely replacing the actions normally taken by humans. When it comes to incidents, we’re firm believers in accelerating human actions, and believe the risk of over-automation far outweighs the benefits. In this live event we’ll dig a little deeper into why, as we cover the power and pitfalls of AI.

Where does the time go after you resolve an incident?

Jul 29, 2024 By Eryn Carman In Incident.io

We were curious: once an incident is over, how long does it take companies to document, review, create learnings, finish clean-up items, and complete any other follow-up action items? We work with a wide variety of companies, from small start-ups to enterprises with thousands of engineers. But we wanted to know: where is their time spent after they resolve an incident? Here’s what we found!

How our data team handles incidents

Jul 26, 2024 By Navo Das In Incident.io

Historically, data teams have not been closely involved in the incident management process (at least, not in the traditional “get woken up at 2AM by a SEV0” sense). But with a growing involvement of data (and therefore data teams) in core business processes, decision making, and user-facing products, data-related incidents are increasingly common, and more important than ever.

The Debrief: Debriefing on the Crowdstrike incident

Jul 24, 2024 By Incident.io In Incident.io

In this episode, Norberto (VP of Engineering) and Lawrence (Product Engineer) delve into the recent CrowdStrike incident that began on July 19th. Rather than focus on technical specifics, they provide a thoughtful exploration of key aspects that matter to us at incident.io, such as effective communication, overall response strategies, and proactive problem-solving during crises.

A tough day for incident responders: lessons from the CrowdStrike update

Jul 19, 2024 By Stephen Whitworth In Incident.io

Today marks a particularly challenging day for incident responders across the globe. As many of you may have noticed, a recent update from CrowdStrike has triggered widespread disruptions, causing chaos in various sectors. The ripple effects have been far-reaching and severe. While the technical specifics of the issue might not be the focus here (and indeed, there are experts better suited to dissect the cause), what's crucial is understanding the impact on those who manage such crises.

Time, timezones, and scheduling

Jul 18, 2024 By Henry Course In Incident.io

Our On-call product has been in the wild for a few months now, and in this post I want to talk about building a time-sensitive system and what we did to handle some of the challenges. I’ll cover what our scheduler is responsible for, the basics of working with time, and talk a bit about how we tested our system.

The complexity of phone networks

Jul 16, 2024 By Leo Sjöberg In Incident.io

Arguably the most important part of an on-call product is knowing that you will be notified when things break, wherever you are. When it comes to SMS and phone call notifications, we have to leave the familiar realm of the internet and JSON responses, and deal with systems that provide limited observability and insight into what’s gone wrong.
