The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How to Set Up SMS Alerting w/ OnPage

In this quick tutorial, learn how to set up SMS alerting in OnPage so your team never misses a critical notification. We’ll walk you through the process step by step. The setup uses redundancy rules for reliable message delivery, so important alerts reach the right person at the right time.

Why SIGNL4 Is the Right Alarm Management Software to Maximize Machine Availability

A plant runs at its best when equipment stays online, processes remain stable, tolerances are met, raw materials are delivered in time, and scrap stays low. That’s how operations teams hit production targets, meet customer SLAs, stay on schedule, keep costs under control, and maintain consistent quality. But does everything always run according to plan? Of course not.

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to operating systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

Amazon Web Services outage - February 10, 2026

On February 10, 2026, Amazon Web Services (AWS) experienced an outage that triggered widespread reports of CloudFront failures and DNS resolution issues. While AWS later acknowledged the incident, StatusGator detected the disruption earlier using Early Warning Signals, giving customers valuable lead time before the provider confirmed anything publicly.

4 on-call burnout signs (and how to address them)

Being on-call can sometimes feel overwhelming. If that feeling goes unnoticed for too long, it often turns into burnout. Early signs of burnout usually show up in how people respond to incidents or how they feel about the schedule. This guide walks through four such signs to watch for before on-call burnout sets in.

Claude outage - February 10, 2026

On February 10, 2026, Claude users around the world began reporting service failures affecting chat sessions, API integrations, and Claude Code workflows. The first verified outage report reached StatusGator at 19:33 UTC. StatusGator issued an Early Warning Signal at 20:24 UTC. Claude did not post an official “Investigating” update until 22:11 UTC. This incident clearly demonstrates the gap between real user impact and official status page updates.
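
To put numbers on that gap, here is a minimal sketch that computes the intervals between the three timestamps quoted above. It assumes all three times fall on the same calendar day (February 10, 2026, UTC); the variable and function names are illustrative only and do not come from StatusGator.

```python
from datetime import datetime

# Timestamps (UTC) as quoted in the summary above.
# Assumption: all three occurred on February 10, 2026.
first_report = datetime(2026, 2, 10, 19, 33)     # first verified outage report
early_warning = datetime(2026, 2, 10, 20, 24)    # Early Warning Signal issued
official_update = datetime(2026, 2, 10, 22, 11)  # official "Investigating" update


def gap_minutes(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes between two timestamps."""
    return int((end - start).total_seconds() // 60)


report_to_warning = gap_minutes(first_report, early_warning)
warning_to_official = gap_minutes(early_warning, official_update)
report_to_official = gap_minutes(first_report, official_update)

print(f"First report to Early Warning Signal: {report_to_warning} min")
print(f"Early Warning Signal to official update: {warning_to_official} min")
print(f"First report to official update: {report_to_official} min")
```

By these figures, the Early Warning Signal came 51 minutes after the first report and 107 minutes (1 h 47 min) before the official update, with 158 minutes (2 h 38 min) between the first report and the official acknowledgment.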

Incident Alerting: What We Believe It Should Do

Incident alerting is a critical part of modern operations, yet it’s often misunderstood or reduced to “sending notifications.” In reality, it is about ensuring that the right people are informed at the right time – and that incidents move from detection to action without confusion or delay. This page explains why fast, reliable alerting matters, where it fits between monitoring and incident response, and what best practices look like.

5 Offbeat on-call rotations that work

Most teams choose standard on-call patterns like weekly or daily rotations. But sometimes a less conventional rotation can solve a specific problem or just fit better with how your team works. This guide walks you through five offbeat on-call rotations. For each, we look at why it might work for you and the challenges involved. This helps you see the full picture before you decide to try them out. Let’s dive in!