Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The New SEC Rules and You

The Securities and Exchange Commission published new rules for SEC registrants around disclosing incident details and response policies. Compliance with these new rules should be top of mind for any company – even if your org hasn’t hit the milestone of registering with the SEC, you should be prepared to be compliant when you take that step.

What you need to know about the Digital Operational Resilience Act (DORA)

The European Commission has introduced the Digital Operational Resilience Act (DORA) to bolster the digital infrastructure of the financial sector within the European Union (EU). As part of the EU's wider digital finance strategy, DORA's objective is to create a comprehensive framework governing digital operational resilience. Financial institutions must ensure full compliance with DORA by January 2025.

Mastering Root Cause Analysis: A Guide for Site Reliability Engineers

Site Reliability Engineers (SREs) play a vital role in ensuring the stability and performance of web services and are key in incident management. One of the core skills SREs need is the ability to conduct effective Root Cause Analysis (RCA) when issues arise. This guide is about how to improve your RCA skills for more effective post-incident analysis. Let's dive in.

The Unplanned Show, Episode 19: Cloud Security response with Ashley Ward

As organizations move to the cloud, where do security, IT, and engineering overlap? In this session, Dormain will sit down with Orca Security's Principal Technical Evangelist, Ashley Ward, to learn about how working practices have to evolve with the speed of change in the cloud.

What is IT incident management - and how can AIOps optimize it?

Imagine you’re in the middle of a critical project, and suddenly, your system crashes. Or perhaps it’s the middle of the night, and your server goes down, affecting countless users. Some IT incidents are inevitable, but the way you manage them makes all the difference in minimizing their impact. You know that proper incident management is critical – and that incidents can become costly.

How we manage incidents at Datadog

Incidents put systems and organizations to the test. They pose particular challenges at scale: in complex distributed environments overseen by many different teams, managing incidents requires extensive structure and planning. But incidents, by definition, break structures and foil plans. As a result, they demand carefully orchestrated yet highly flexible forms of response. This post will provide a look into how we manage incidents at Datadog. We’ll cover our entire process.

The Journey Into Automation: Optimizing Care Delivery

In a world where efficiency and precision are the cornerstones of progress, automation has become the unsung hero across diverse industries. From manufacturing floors to customer service, its transformative power has reshaped the way we work and deliver services. Today, we embark on a journey to explore the profound influence of automation on healthcare, where each automated process is a progressive step towards optimizing care delivery and reshaping the future of patient-centered care.

xMatters Support - Broadcast Groups

In xMatters, groups determine how and when people are notified using on-call schedules, escalation timelines, and rotations. But what if you don't use complex on-call schedules, or need to notify all members of the group simultaneously? Broadcast groups make it easier for customers who don't always need on-call schedules. Let’s take a look.