Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Demo Roundup: PagerDuty Operations Cloud for Kubernetes

In this demo, Corbin Mills shows how to use the PagerDuty Operations Cloud to streamline and automate how a node failure is resolved. You’ll see how he uses event orchestration (in PagerDuty AIOps) to enrich an alert with pod names and automatically run a job to check the Kube API status, so that a responder has instant context. AIOps also groups and suppresses alerts. Then you’ll see how the responder can run more health status checks without the need to SSH into the environment or interrupt a co-worker for access.

Kubernetes Incident Management Best Practices

Standing up just any infrastructure on Kubernetes is not enough. You could apply a handful of basic configurations to get your application running for the time being, and it might work just fine. But incident response won’t always remain 100% reliable. You will run into new potholes, and that’s okay.

Understanding Blameless Postmortems

Progress in organizations is often accompanied by unforeseen challenges and mishaps. Traditionally, these setbacks resulted in finger-pointing, which hindered progress and created a negative work atmosphere. However, a blameless postmortem approach transforms how organizations respond to failure. In this blog, we will delve into the importance of cultivating a blameless postmortem culture when faced with setbacks.

Introducing Squadcast's Key Based Deduplication

We are excited to share another feature update with all our valued customers! We have recently gone live with our Key Based Deduplication feature, enabling you to define dedup keys using customizable templates for configured alert sources. With this feature, you can automatically group similar incidents and effectively deduplicate alerts.
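
The general idea of key-based deduplication can be sketched in a few lines. This is only an illustration, not Squadcast's template syntax or API: the template string, the alert fields (`source`, `service`, `alert_name`), and the function names are all hypothetical. Alerts that render to the same key are treated as duplicates and grouped together.

```python
from string import Template

# Hypothetical dedup key template; real products define their own syntax.
DEDUP_KEY_TEMPLATE = Template("${source}:${service}:${alert_name}")


def dedup_key(alert: dict) -> str:
    """Render the dedup key for one alert payload (extra fields are ignored)."""
    return DEDUP_KEY_TEMPLATE.substitute(alert)


def group_alerts(alerts: list[dict]) -> dict[str, list[dict]]:
    """Group alerts that share the same rendered dedup key."""
    groups: dict[str, list[dict]] = {}
    for alert in alerts:
        groups.setdefault(dedup_key(alert), []).append(alert)
    return groups


# Made-up sample payloads, for illustration only.
alerts = [
    {"source": "prometheus", "service": "checkout", "alert_name": "HighCPU", "value": 91},
    {"source": "prometheus", "service": "checkout", "alert_name": "HighCPU", "value": 95},
    {"source": "prometheus", "service": "payments", "alert_name": "HighCPU", "value": 88},
]

groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

Fields left out of the template (here, `value`) do not affect grouping, so choosing which fields go into the key decides how aggressively alerts are merged.
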
Sponsored Post

Best Practices for SaaS and Network Incident Management

Computer and network systems have become vital to business operations. Occasionally, SaaS or network incidents occur and these systems do not operate as needed. Enterprises want to minimize the potential damage and get their systems back online as soon as possible. Integrated incident management and a strong End User Experience Management (EUEM) platform that provides synthetic and real-user monitoring are the foundation for meeting that objective.

Why you need an internal status page

When we launched incident.io Status Pages a few months ago, we stressed the importance of communicating clearly with your customers about ongoing issues. To help with this, we spent a lot of time carefully designing a status page that’s easy to understand for everyone - whether they come from a technical background, work in a different area, or just want to get on with their day.

Trending: Automation in I&O Optimization according to the Gartner 2023 Hype Cycle

In this blog, we take you through the latest trends in I&O optimization. Gartner’s report, Hype Cycle for I&O Automation, 2023, predicts the widespread adoption of automated tools supporting IT infrastructure. This blog focuses on tools, such as OnPage’s incident alert management solution, that are likely to be widely adopted as a standard for I&O optimization in the near future.

The Unplanned Show, Episode 7: Death of the Single Security Pane of Glass with Heather Hinton

In this episode, Heather Hinton describes how security teams can evolve away from spending cycles on “silly little jobs” and scouring multiple sources to identify the kinds of unplanned interrupt work that need to be dealt with urgently. Instead, they can complete projects faster and take on more, because on-call rotations are spent getting work done (with the occasional interruption) instead of “seeking” the interrupt work. We also discuss how this fits in with encouraging employees more broadly to participate in security hygiene practices.