
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

AWS Outage Incident Response: What July 24 Taught Us

Jul 25, 2026 By Falit Jain In Pagerly

On the morning of July 24, 2026, a large slice of the internet blinked out at once. An AWS outage centered on the US-West-2 region in Oregon rippled outward and took DoorDash, Reddit, Hulu, Apple Pay, Snapchat, Fortnite, and the PlayStation Network offline for millions of users. If your team runs anything on Amazon Web Services, this is the incident to study, because the hard part was never fixing AWS. The hard part was AWS outage incident response.

Azure outage on July 23, 2026: StatusGator detected it 1 hour before Microsoft acknowledged it

Jul 24, 2026 By Colin Bartlett In StatusGator

On July 23, 2026, Azure users around the world began hitting gateway timeouts, DNS failures, and unreachable virtual machines well before Microsoft posted anything on its status page. The first reports reached StatusGator at 15:06 UTC. By 15:28 UTC, StatusGator had sent an Early Warning Signal to subscribers. Microsoft did not acknowledge the incident until 16:29 UTC.

The July 23 2026 Azure West US Outage: IP Route Removal and Downstream Impact

Jul 24, 2026 By Hrishikesh Barua In IncidentHub

On July 23, 2026, Microsoft Azure experienced a connectivity outage in the West US region that blocked traffic entering or leaving the region for nearly five hours. Workloads that stayed entirely inside West US were not affected. Microsoft's preliminary Post Incident Review (PIR) attributes the failure to a bug in maintenance request conversion software that removed IP routes from more devices than intended during routine device maintenance.

Incident Response Communication: Why Ops Teams Own the Narrative

Jul 24, 2026 By OpsMatters In OpsMatters

Your monitoring stack flagged the outage in 90 seconds. A customer posted about it in 40. That gap is now the defining challenge of incident response communication. Ops teams have spent years driving down recovery times, yet very few track how quickly a public explanation takes shape. This article looks at how teams can monitor both timelines - and respond before speculation hardens into accepted fact.

The Failure Mode Your Runbook Probably Does Not Cover

Jul 24, 2026 By OpsMatters In OpsMatters

Operations teams rehearse plenty of scenarios: failed deployments, database corruption, certificate expiry, a region going dark, the on-call engineer who cannot be reached. What gets rehearsed far less often is the building losing power for eleven hours, because that feels like somebody else's problem, filed under facilities alongside the air conditioning and the parking barrier. It stops being somebody else's problem at the moment the UPS batteries drain and everything still running on premises goes down at once.

Ensuring Business Continuity in Adverse Conditions

Jul 24, 2026 By OpsMatters In OpsMatters

Businesses will always face disruptions. Whether it's a big storm, a broken supply chain, or a power outage, unexpected problems can bring operations to a halt, hurting your income, your reputation, and how much customers trust you. The difference between the companies that make it through these tough times and those that don't often comes down to one thing: resilience. Being a resilient organization isn't about building an unshakeable fortress. It's about being flexible, thinking ahead, and having the right systems to bounce back when disruptions occur.

Actionable Alerts: What Makes Alerts Truly Helpful?

Jul 23, 2026 By SIGNL4 In SIGNL4

There’s a big difference between alerts that are merely generated by continuous IT monitoring and production systems and alerts that actually help people solve problems. We call the latter actionable alerts.

3 Things IT Leaders Are Learning About AI-First Operations: Key Takeaways From PagerDuty on Tour 2026

Jul 23, 2026 By PagerDuty In PagerDuty

In December 2025, an AI coding agent at AWS suddenly decided to delete and rebuild an entire production environment, causing a 13-hour service disruption and a PR headache for Amazon. As rapid adoption of AI leads to more high-profile, revenue-impacting incidents, resilience has moved from a technical concern to a board-level financial risk.

The 13 Questions CEOs Ask After an Incident (And What IT Leaders Must Be Ready to Answer)

Jul 22, 2026 By PagerDuty In PagerDuty

It’s 2:47 p.m. Your checkout service has been down for 11 minutes. Customers are screenshotting errors and calling in. Your CEO walks into your office and starts asking questions. In this moment, there are two kinds of IT leaders: those who walk out with more budget authority (and executive trust) and those who walk out with less. Which one you are depends on your answers, and the infrastructure that supports them. But preparation isn’t just about surviving the incident. It’s actually a revenue opportunity.

AT&T Email-to-Text Is Gone - Need an Alternative?

Jul 22, 2026 By Derdack SIGNL4 In SIGNL4

If you relied on AT&T Email-to-Text to notify people on the go or on call, SIGNL4 is an easy-to-use replacement. Alerting people stays simple with SIGNL4. It's easy to get started and goes beyond basic text messages by helping ensure someone responds.
