The latest news and information on incident management, on-call, incident response, and related technologies.

What is the Mean Time to Resolution (MTTR)? Why It Matters and How to Resolve

How quickly can you restore service when an incident hits your system? Most IT teams are not slowed down by detecting incidents. The challenge starts after something breaks, when the goal is to bring services back online as quickly as possible. Modern systems are highly distributed. Alerts arrive from multiple tools, dependencies are complex, and it is often difficult to immediately understand what actually failed.

New in PagerDuty's Slack Experience: Dedicated Channels, Quick Declare & New On-Call Paging Commands

For teams that live in Slack, incident management is getting a whole lot smoother. EA planned for May includes dedicated incident channels, one-click escalation, centralized configuration, onboarding tutorials, and new commands to page responders without leaving Slack.

Humans aren't fast enough for 4 9's

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers. It’s easy to lose track of what’s meant when saying “99.95%” availability, and even more is lost when thinking how much harder it is to achieve 99.99% compared to 99.95%. On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime.
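The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration, not taken from the article: it assumes an average month of 365.25 / 12 days, and the function names `downtime_budget_seconds` and `format_budget` are hypothetical.

```python
def downtime_budget_seconds(availability_pct, period_days=365.25 / 12):
    """Allowed downtime in seconds for an availability target over a period.

    Assumption: the default period is an average month (365.25 / 12 days).
    """
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_pct / 100)


def format_budget(seconds):
    """Render a number of seconds as whole minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs:02d} s"


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {format_budget(downtime_budget_seconds(target))}")
# 99.9% -> 43 min 50 s
# 99.95% -> 21 min 55 s
# 99.99% -> 4 min 23 s
```

Under these assumptions, moving from 99.95% to 99.99% shrinks the monthly budget from roughly 22 minutes to under four and a half.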

AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. The original AWS region, located in Northern Virginia and known as us-east-1 or just “US East”, has been the subject of some of the internet’s most high-profile and destructive outages and remains Amazon’s least reliable region.

SLA Best Practices for Enterprise IT Teams

How to Draft, Customize, and Keep Service Level Agreements Defensible

Most enterprises do not discover the weaknesses in their SLAs during the drafting process. They discover them during an incident review, a customer escalation, or a contract dispute, when the language that seemed reasonable at signing turns out to be too vague to measure, too broad to enforce, or disconnected from the operational data that would make it defensible.

How to Set Up SIGNL4 in Under 5 Minutes | Quick Start Guide

Getting started with SIGNL4 is fast and simple. In this video, we show you how to set up a new SIGNL4 account in under 5 minutes so you can start receiving critical alerts and managing incidents right away. Whether you're new to incident management or looking for a faster way to implement mobile alerting and on-call scheduling, SIGNL4 makes onboarding effortless. Follow along step-by-step and see how quickly your team can be up and running.

KPI vs SLA: What's the Difference?

Why Confusing Them Costs You More Than a Missed Target

Every operations leader tracks KPIs. Every enterprise IT team has SLAs. Both involve targets, both involve measurement, and both surface in the same board reviews and vendor conversations. So it is not surprising that the two get treated as variations of the same thing.

How to Customize an SLA Template

A Practical Guide for Help Desk, IT Operations, and Enterprise SRE Teams

A service level agreement template is only useful if it can be customized. The version that ships with your ITSM platform was designed to be generic enough to apply anywhere, which makes it precise enough to apply nowhere. The teams that maintain defensible SLAs are not the ones with the most sophisticated legal language.

Sponsored Post

How to Reduce MTTR When Third-Party Services Go Down

Most MTTR guides assume the problem is in your infrastructure. For modern apps, it often is not: it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission. The lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.