The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Better Incidents Winter Bonfire: Inside On-Call

Engineers are bombarded with pages left and right. There's uncertainty about how to escalate. A constant blur exists between what's urgent and what can wait. This never-ending ping-pong game takes a toll. Burnout creeps in, and before you know it, your engineering culture has taken a nosedive.

LLM Monitoring and Observability

Large Language Models (LLMs) are advanced artificial intelligence models designed to comprehend and generate human-like language. With millions or even billions of parameters, these models, like GPT-3, excel in natural language processing, understanding context, and generating coherent and contextually relevant text across various applications.

Lessons in Incident Response I Learned While Waiting Tables

Before I stumbled into the tech industry (a story for another day), I spent several years in the customer service world as a server and front-of-house manager in restaurants. It was in these jobs that I first honed some critical skills that would later lead me on the path to incident response.

Getting started with IT operations automation

Tech companies face a daunting challenge: a staggering 90% of their IT teams are stuck doing mundane, repetitive tasks, leaving only 10% to focus on strategic innovation. Companies know that automation is the solution to these repetitive, low-level incident response actions; however, many need support to begin automating.

The ultimate guide to incident management KPIs and metrics

IT incident management aims to swiftly identify, address, and resolve IT disruptions to restore normal service operations. Tracking IT incident management key performance indicators (KPIs) is a vital step toward minimizing disruptions for customers and users. But there are several different KPI and metrics choices, and it’s not easy to identify the right ones that can drive meaningful improvements in incident management.

Adobe Experience Cloud Outage: The Impact of Relying on Third-party Services

On December 8, 2023, Adobe's extensive customer base was impacted by a series of outages in the Adobe Experience Cloud, starting from 8:00 AM EST and continuing until 1:45 AM EST on December 9. We haven't seen a third-party outage of this magnitude since the DoubleClick outage of 2018.

The Debrief: Incident management for data teams

If you're on a data team, have you ever considered using an incident management tool to respond to pipeline issues? If the answer is no, then you might want to check out this episode. Here, we chat with Jack, Data Analyst at incident.io, to better understand why data teams can, and should, look to incident management tools like incident.io to manage issues. Read Jack's blog post about incident management for data teams.