Incident Management

The latest news and information on incident management, on-call, incident response and related technologies.

Accelerating Root Cause Analysis of IT Incidents

The moment after an incident is resolved is perhaps the most relaxing for any IT team. When your system is finally functioning properly again, the entire organization is at ease, but the most daunting task is yet to come: root cause analysis (RCA). Much as football teams review previous plays to pinpoint areas for improvement, root cause analysis goes through the data to find what initially caused the incident.

Availability, Maintainability, Reliability: What's the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean? To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics.
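
As a rough illustration of how these terms relate, the sketch below uses the conventional steady-state formula, availability = MTBF / (MTBF + MTTR). Mean time between failures (MTBF) stands in for reliability and mean time to repair (MTTR) stands in for maintainability. This is a common textbook convention and an assumption here; the excerpt above does not show the article's own definitions, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up.

    mtbf_hours: mean time between failures (a proxy for reliability).
    mttr_hours: mean time to repair (a proxy for maintainability).
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


def expected_downtime_hours_per_year(avail: float) -> float:
    """Expected yearly downtime implied by a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical service: fails about every 999 hours, takes 1 hour to repair.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.4%}")
    print(f"downtime per year: {expected_downtime_hours_per_year(a):.2f} h")
```

Under this model, halving the repair time improves availability just as much as roughly doubling the time between failures, which is one reason the distinction is more than semantics.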

Add Datadog alerts to your xMatters incident workflows

xMatters provides flexible, smart tools for incident response and management. With configurable workflows that bring together data from sources like GitHub, Jenkins, and Zendesk, you can automate crucial tasks and send enriched notifications to streamline team communications.

Let's Talk AIOps: Part 2: Things to Think About & the PagerDuty Approach

This is the second in a two-part blog series about AIOps where I sit down with Julian Dunn, Director of Product Marketing at PagerDuty, to level-set on the hot DevOps topic. The first post discussed whether AIOps was just marketing fluff and whether ITOps actually has an AIOps problem. Let’s continue…

September 2020 Update: All-new Webhook and API Key Management

Our September update brings self-service API key management to the SIGNL4 web portal, so you can finally make full use of our comprehensive REST API. We also improved the management of outbound webhooks, which can update your systems with any information on Signl handling. You can now manage API keys for the SIGNL4 API in the SIGNL4 account portal: click the “Developer” menu item to manage them. Keys issued there can then be used to call SIGNL4 REST API functions.

Three Ways to Maintain IT Productivity During Difficult Times

As IT leaders, we are facing an era of unprecedented events. Not only are IT teams still adapting to working and living from home — with many companies now announcing their support for a remote workforce indefinitely — but they’re also facing a novel combination of heightened external pressures from family, friends and colleagues.

Adaptable Incident Response With Splunk Phantom Modular Workbooks

Splunk Phantom is a security orchestration, automation and response (SOAR) technology that lets customers automate repetitive security tasks, accelerate alert triage, and improve SOC efficiency. Case management features are also built into Phantom, including “workbooks,” that allow you to codify your security standard operating procedures into reusable templates.

SRE Leaders Panel: Testing in Production

Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed testing in production, how feature flagging and testing can help us do that, and how to get managers to be on board with testing in production. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

No More False Alerts at Night

Do you know this situation? You are on call, and in the middle of the night you get a phone call. Loud enough to wake you up. Loud enough to wake your wife up as well. You get up and check your email to see what the problem is. OK, you got it. Then you log on to the console of your monitoring tool and – green. Green? False alert? Why did you get the call then? After double-checking, still a bit sleepy, you realize that the problem has already resolved itself automatically.