The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

SIGNL4 Update: Centralize alerts. Automate response. Easier than ever.

Get ready for the new SIGNL4 update. The completely redesigned API makes it easier than ever to connect your systems and tools and consolidate alerts from every source – so nothing gets missed. With the new Automation menu, you can now manage automated alert routing and filtering from one central place, ensuring the right alerts reach the right person at the right time.

Best Practices in the Slack Experience

PagerDuty’s Slack experience is evolving to help your teams organize better and resolve incidents faster. Use Triage Channels to collect telemetry and updates from your systems. Create dedicated Incident Channels for coordination and resolution. Give stakeholders the updates they need in Announcements Channels. Everyone in your organization can get the information they need easily.

Safety Incidents Need Better Data, Not Just Faster Reactions

Most operations teams are very good at measuring speed. They know how quickly an alert was acknowledged, how long a service took to recover, how many incidents were closed in a quarter, and whether the response time improved compared with the last reporting period. The dashboard looks mature. The numbers look controlled. The team looks busy, responsive, and accountable. The harder question is whether the organization actually understands what happened.

Shopify outage on May 22, 2026 impacted merchants worldwide

On May 22, 2026, merchants using Shopify experienced a brief but widespread disruption that affected access to product pages, collections, and administrative tools. While the outage lasted less than an hour, it created immediate challenges for businesses that rely on Shopify to manage inventory, update products, and operate online stores. StatusGator detected the developing incident at 10:20 UTC using Early Warning Signals, 18 minutes before Shopify officially acknowledged the outage at 10:38 UTC.

Microsoft Fabric outage disrupted analytics workloads on May 18, 2026

On May 18, 2026, organizations using Microsoft Fabric experienced a multi-hour outage that disrupted analytics workloads, reporting systems, and access to platform services across several regions. StatusGator detected the developing incident at 14:00 UTC using Early Warning Signals, 37 minutes before Microsoft officially acknowledged the outage at 14:37 UTC.

The $600 billion wake-up call: New Splunk research reveals downtime is a systemic business crisis

$600 billion annual impact: Aggregate downtime costs for the Global 2000 have soared 50% in two years. $15,000 per minute: The average cost of downtime for organisations, highlighting the immediate financial impact of service disruptions. 3.4% stock price drop: The average decline in shareholder value following a single downtime incident.

Engineering teams in 2027

There's a conversation I keep having with our design partners at incident.io. It starts when I ask "what are you doing with AI internally?" and lands in a similar place every time. The shape of how their engineering teams work is changing fast. Not in vague "AI is transforming everything" ways, but in concrete, repeatable patterns. Different companies are building the same things. The frontier teams are six to twelve months ahead of the average, and they're describing the same future.

Alerting Software: 10 Must-Have Capabilities

Author: Matthes Derdack

Businesses rely on countless systems, applications, and services to operate without disruptions. Whether it is cloud infrastructure, manufacturing equipment, IoT devices, healthcare platforms, or enterprise applications, every second of downtime can impact revenue, customer trust, and operational efficiency.

How to Manage Complex On-Call Rotations and Schedules

A simple round-robin rotation works well when you have a small team with a single service and predictable incident patterns. It breaks down quickly when you have engineers across three continents, multiple services with different criticality levels, a mix of senior and junior responders, and a team that expects fair, sustainable coverage across weekends, holidays, and different time zones.

Slack Round Robin Assignment: Guide and Best Tools

Round robin assignment distributes incoming work equitably across a group of team members by cycling through the list in order. Each new item goes to the next person in the rotation, ensuring no one person accumulates a disproportionate share of the workload. In Slack, where teams receive support tickets, alert notifications, PR review requests, and customer issues as incoming messages, round robin assignment gives those items clear ownership the moment they arrive.
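
The rotation described above can be illustrated with a minimal Python sketch. It is not taken from any Slack tool or API; the `RoundRobinAssigner` class, the member names, and the item labels are all hypothetical, chosen only to show how each new item goes to the next person in the list and wraps around at the end.

```python
from itertools import cycle


class RoundRobinAssigner:
    """Hypothetical helper: hands each incoming item to the next member in order."""

    def __init__(self, members):
        if not members:
            raise ValueError("members must not be empty")
        # cycle() repeats the member list indefinitely, in its original order.
        self._rotation = cycle(list(members))

    def assign(self, item):
        # Pair the item with whoever is next in the rotation.
        return item, next(self._rotation)


# Illustrative data only: three made-up team members, four made-up items.
assigner = RoundRobinAssigner(["ana", "ben", "chen"])
items = ["ticket-1", "alert-2", "pr-3", "ticket-4"]
assignments = [assigner.assign(item) for item in items]

for item, owner in assignments:
    print(f"{item} -> {owner}")
```

With three members and four items, the fourth item wraps back to the first member. A real deployment would likely also need to persist the rotation position and skip people who are unavailable, which this sketch leaves out.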