Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Introducing Past Incident Feature | Incident Context and History | Squadcast

Introducing Squadcast's Past Incidents feature which helps incident responders by presenting them with past incidents related to the same service. It employs data science techniques to match and display a historical list of similar incidents from the same service you are currently investigating. This aids in expediting issue resolution by offering valuable insights, such as historical context, prior incident details, timing patterns, and past solutions.

Internet Sonar: A Game-Changer for Incident Detection

When outages cost you tens of thousands of dollars each minute, pinpointing the source of disruptions as quickly as possible becomes mission-critical. This is not a time for finger-pointing and hastily assembled war rooms searching for that needle in the haystack. You need simple, intelligent, trustworthy Internet health information to expedite your incident detection.

Speed, Scale, and Special Sauce: The Evolution of the PagerDuty Brand

At PagerDuty, our purpose is to empower teams with the time and efficiency to build the future. That means that our own teams are constantly building and relentlessly innovating to help organizations drive transformative change in the way they operate.

Do you need better cloud observability - or AI-powered cloud visibility?

Maybe you’re still using monolithic applications, built and refined over many years. You understand that shifting to microservices or containerized architectures is a huge and daunting task. You’re probably grappling with the limitations of legacy systems—maybe they’re slow, tough to update, or can’t scale as you’d like. And you’re likely using more traditional IT monitoring tools or even some cloud observability tools.

Kubernetes Incident Management: A Practical Guide

As more organizations embrace containerized applications, Kubernetes has emerged as the leading platform for orchestrating these containers. However, its complexity, combined with the inevitable reality of IT incidents, demands a well-defined strategy for managing disruptions. This article introduces Kubernetes incident management, describes common Kubernetes errors, and provides practical guidance to efficiently handle incidents.

AI-Generated Runbooks

AI-generated Runbooks lower the barrier to entry to new automation developers and speeds up the time to create new automation for experienced automation authors. This feature works seamlessly with the user’s preferred scripting language, offering a low-code solution for what used to be a high-code task. Watch how Runbook Automation users can write the task they wish to automate in plain-English and let AI build a template of automation for that particular task.

Avoiding a Major Incident with PagerDuty AIOps

A global retailer has a major incident occurring and the team doesn’t know it yet. Before PagerDuty AIOps, the NOC would get hit by alert storms and page multiple teams. This resulted in large conference calls and customer downtime. Now, a major incident right before Black Friday has been averted with PagerDuty AIOps. The result is better overall customer experience, no matter how stressed the system is.

Learning Flows: Bringing consistency to your post incident processes

To get the most out of your incident response processes, consistency is crucial. The more predictable you can be whenever issues crop up, whether a small bug or a major outage, the quicker and more confidently you can respond. In practice, incident response is equal parts knowing how to actually resolve the issue and having the confidence that the processes in place will help get you through without added stress.