The latest news and information on incident management, on-call, incident response, and related technologies.

How Abbott transformed its incident management process with Workflow Automation

Eliminating errors and streamlining the incident management process are top priorities for many ITOps, NOC, SRE, and DevOps teams. Because organizations use multiple tools in their IT stack, finding the right information at the right time is crucial during incident triage, and doing so manually is slow. By automating tasks and workflows, businesses can eliminate manual work that is time-consuming, repetitive, and prone to mistakes.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers

In our previous blog, we discussed the difficulty of capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common concrete example is a containerized application that is redeployed, perhaps to a prior version or to the same version, simply to resolve the immediate issue.

Reflecting on one of the biggest incidents in our history

We have to come clean. During KubeCon, we experienced an incident that we weren’t ready to discuss until now. This incident caused quite a disruption and, had it been left unresolved, would have had a massive snowball effect. At the time, we didn’t want to raise any alarms, so we kept it quiet while our team rallied to resolve it. And to be honest, most folks probably didn’t even realize that it happened since we moved so quickly.

It's time to rethink the way you do external comms

April was a month to remember at incident.io. Not only did we attend our second conference ever with KubeCon in Amsterdam, but we also very subtly released our brand-new Status Pages product. OK, it probably wasn't subtle. Both moments required months of preparation, feedback loops, iteration, and so much more behind-the-scenes work to get right. So if you ran into us at KubeCon, thank you for stopping by and meeting with our team.

Mastering IT Response Time

In today’s fast-paced digital landscape, businesses heavily rely on their IT departments to ensure smooth operations and deliver exceptional customer experiences. When it comes to IT support, one critical metric stands out: response time. A prompt and efficient response can be the difference between a satisfied customer and a frustrated one. In this blog post, we will explore strategies to improve IT response times, enhance customer satisfaction, and optimize overall productivity.

Sponsored Post

Scaling Site Reliability Engineering Teams the Right Way

Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when these teams may need to scale. However, it's important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let's unpack what scaling a team is all about: what the indicators are, what steps you can take, and how to know when you're done.

Forgot to declare an incident? Add it retroactively in FireHydrant.

Have you ever quickly worked through an issue with your team and later thought, “Huh. That probably should have been an incident.” It happened to us just a few weeks back. After one of our engineers surfaced a failed build, a few folks chimed in to problem-solve, and within 30 minutes things were up and running as normal. But we probably should have declared an incident.

New Features: Next-Generation Notifications UI, Take-On Call Widget, Alert Templates, Dynamic Policy Routing, Service Groups

This post highlights some of the features and improvements that we have released in the last two months. If you want to submit your own ideas or vote on existing feature requests, you can now use our public roadmap at roadmap.ilert.com.