
The latest news and information on incident management, on-call, incident response, and related technologies.

A seven-step framework for running incident debriefs

Ever wrapped up an incident, thought 'Phew, glad that’s over,' only to feel your stomach drop when you see the dreaded "Incident Debrief" on your calendar? We've all been there. Incident debriefs don't need to feel like sitting through your least favorite school subject. They can (and should!) actually be engaging and useful. At incident.io, we've found a simple, repeatable, and blameless framework.

How we responded to a 2+ hour partial outage in Grafana Cloud

On Tuesday, Feb. 18, 2025, we experienced an outage that lasted approximately 150 minutes and impacted roughly 25% of our Grafana Cloud services. To our customers: we are very sorry and more than a little embarrassed that we stepped outside our own processes and advice to cause this. You rely on us to help monitor and troubleshoot your environments, and this type of incident obviously makes it harder for you to do that.

Scientific Incident Management with Dan Slimmon

Dan Slimmon is an incident management veteran who's worked at Etsy, HashiCorp, and now leads consulting and training on pragmatic, non-bureaucratic incident response. In this episode, Dan shares his philosophy on "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems.

The Importance of Customer Experience for Business Success

In today’s customer-centric landscape, businesses must go beyond ensuring high availability and fast response times. Customers now expect seamless, personalized digital experiences with little to no disruption to service, and failing to meet these expectations can drive them to competitors. Studies show that companies prioritizing customer experience (CX) achieve significantly higher revenue growth and retention rates.

Welcome to The Fire Academy: Learn FireHydrant, Your Way

Getting started with any new platform can feel like a lot. We get it. That’s why we built The Fire Academy — our new Customer Learning Platform that makes getting started on FireHydrant as seamless as possible. Our goal is simple: we want you to feel confident customizing and configuring FireHydrant to fit your needs without having to dig for answers or wait for support. Everything you need is at your fingertips, so you can work at your own pace and get the most out of the platform.

EMEA Rundeck by PagerDuty Meetup - March 2025

Join us for an informal 1-hour virtual event where the open-source Rundeck by PagerDuty community comes together to share automation stories and use cases. Whether you're new to Rundeck or looking to elevate your automation game, this meetup is packed with valuable takeaways for everyone! On the agenda: "CERN Orchestrates with Rundeck."

ITSM vs ITIL: Differences and How They Align

Understanding ITSM and ITIL is essential to strengthen your IT service management. Although they are closely related and often used interchangeably, ITSM and ITIL have distinct purposes and methodologies. To gain efficiency and competitive advantage in IT management, understanding their differences while exploring how they complement each other is a must.

Silence during chaos: Why the X outage is a call to arms for proactive monitoring

When X (formerly Twitter) suffered a global outage on March 10-11, 2025, millions of users and businesses were left in the dark. Apart from a solitary post from CEO Elon Musk claiming a cyber-attack, X has remained silent. Yet Catchpoint’s Internet Sonar detected the crisis in real time—highlighting the critical role independent, proactive monitoring plays when vendor communication fails.