PagerDuty User Group Toronto: Incident Enrichment, Automating Maintenance & New Event Capabilities

Recorded during the PagerDuty User Group Toronto, May 2026 - part of Toronto Tech Week. About PagerDuty User Groups: Connect with PagerDuty users, share your experiences, and learn new ways to maximize the power of digital operations. It's a space where technical leaders and practitioners come together to collaborate, solve challenges, and get inspired by each other's successes.

Customers over control: how we measure On-call reliability

Our On-call product has a lot of great features: configuring escalation paths, viewing rotas and schedules, requesting cover, etc. However, when framing its reliability, we reduce it down to two critical pieces of functionality: It’s not that we’re happy if only these parts are working, but they are the most important parts. In this post, I'll go into more detail on how we think about their reliability.

Every pilot is ready for engine failure: are your engineers? w/ Hamed Silatani (Uptime Labs)

Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident. Hamed Silatani, co-founder and CEO of Uptime Labs, and former Head of Reliability Engineering at IG Group, has spent two decades watching engineers learn incident response the hard way: alone, under pressure, with no training.

Root Cause Analysis: How Engineering Teams Fix Production Issues Faster?

When a production incident strikes (a sudden latency spike, a cascading API failure, a service returning 500s at scale), every minute of downtime has a cost. Root cause analysis (RCA) is the process that turns that chaos into a clear answer: what actually broke, and why. Not the symptom that triggered the alert. The underlying cause.

How BigPanda and ServiceNow are redefining agentic IT operations for enterprise IT

Enterprise ITOps leaders are realizing that legacy incident management processes are collapsing under the weight of today’s sprawling, hybrid-cloud enterprise environments. Monitoring and observability tools generate a relentless flood of alerts across cloud platforms, infrastructure, applications, and services. The signals are there, but the volume of noise makes it harder than ever to identify what’s urgent.

SIGNL4 Update: Centralize alerts. Automate response. Easier than ever.

Get ready for the new SIGNL4 update. The completely redesigned API makes it easier than ever to connect your systems and tools and consolidate alerts from every source – so nothing gets missed. With the new Automation menu, you can now manage automated alert routing and filtering from one central place, ensuring the right alerts reach the right person at the right time.

Best Practices in the Slack Experience

PagerDuty’s Slack experience is evolving to help your teams organize better and resolve incidents faster. Use Triage Channels to collect telemetry and updates from your systems. Create dedicated Incident Channels for coordination and resolution. Give stakeholders the updates they need in Announcements Channels. Everyone in your organization can get the information they need easily.

Shopify outage on May 22, 2026 impacted merchants worldwide

On May 22, 2026, merchants using Shopify experienced a brief but widespread disruption that affected access to product pages, collections, and administrative tools. While the outage lasted less than an hour, it created immediate challenges for businesses that rely on Shopify to manage inventory, update products, and operate online stores. StatusGator detected the developing incident at 10:20 UTC using Early Warning Signals, 18 minutes before Shopify officially acknowledged the outage at 10:38 UTC.

The $600 billion wake-up call: New Splunk research reveals downtime is a systemic business crisis

$600 billion annual impact: Aggregate downtime costs for the Global 2000 have soared 50% in two years. $15,000 per minute: The average cost of downtime for organisations, highlighting the immediate financial impact of service disruptions. 3.4% stock price drop: The average decline in shareholder value following a single downtime incident.

Microsoft Fabric outage disrupted analytics workloads on May 18, 2026

On May 18, 2026, organizations using Microsoft Fabric experienced a multi-hour outage that disrupted analytics workloads, reporting systems, and access to platform services across several regions. StatusGator detected the developing incident at 14:00 UTC using Early Warning Signals, 37 minutes before Microsoft officially acknowledged the outage at 14:37 UTC.

Engineering teams in 2027

There's a conversation I keep having with our design partners at incident.io. It starts when I ask "what are you doing with AI internally?" and lands in a similar place every time. The shape of how their engineering teams work is changing fast. Not in vague "AI is transforming everything" ways, but in concrete, repeatable patterns. Different companies are building the same things. The frontier teams are six to twelve months ahead of the average, and they're describing the same future.

Alerting Software: 10 Must-Have Capabilities

Author: Matthes Derdack. Businesses rely on countless systems, applications, and services to operate without disruptions. Whether it is cloud infrastructure, manufacturing equipment, IoT devices, healthcare platforms, or enterprise applications, every second of downtime can impact revenue, customer trust, and operational efficiency.

How to Manage Complex On-Call Rotations and Schedules

A simple round-robin rotation works well when you have a small team with a single service and predictable incident patterns. It breaks down quickly when you have engineers across three continents, multiple services with different criticality levels, a mix of senior and junior responders, and a team that expects fair, sustainable coverage across weekends, holidays, and different time zones.

Slack Round Robin Assignment: Guide and Best Tools

Round robin assignment distributes incoming work equitably across a group of team members by cycling through the list in order. Each new item goes to the next person in the rotation, ensuring no one person accumulates a disproportionate share of the workload. In Slack, where teams receive support tickets, alert notifications, PR review requests, and customer issues as incoming messages, round robin assignment gives those items clear ownership the moment they arrive.
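The cycling behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular Slack tool: the `RoundRobinAssigner` class, the member names, and the item labels are all hypothetical.

```python
from itertools import cycle


class RoundRobinAssigner:
    """Hands each incoming item to the next person in a fixed rotation."""

    def __init__(self, members):
        if not members:
            raise ValueError("rotation needs at least one member")
        # cycle() repeats the member list forever, in its original order.
        self._rotation = cycle(list(members))

    def assign(self, item):
        """Return (item, owner), where owner is the next person in line."""
        return item, next(self._rotation)


# Hypothetical team and incoming items, for illustration only.
assigner = RoundRobinAssigner(["ana", "ben", "chi"])
items = ["support ticket", "alert", "PR review", "customer issue"]
assignments = [assigner.assign(item) for item in items]
for item, owner in assignments:
    print(f"{item} -> {owner}")
```

With three members and four items, the fourth item wraps around to the first person again. A production version would also need to persist the rotation position and skip people who are unavailable, which this sketch leaves out.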

SSL Certificate Monitoring: Best Tools and Practices

SSL certificate monitoring is the continuous process of checking whether your TLS certificates are valid, correctly configured, and not approaching their expiry date. When SSL monitoring is absent or inadequate, the first signal you get that something is wrong is a browser security warning blocking your users from accessing your site. By then, the damage has already started.
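As a rough sketch of the expiry part of such a check, the Python standard library can fetch a certificate and compute the days remaining. The function names, the 30-day warning threshold, and any hostname you pass in are assumptions for illustration; a real monitor would also cover scheduling and alerting.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the peer certificate's notAfter string for host:port."""
    # The default context also verifies the chain and hostname, so an
    # invalid certificate raises an ssl.SSLError during the handshake.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds) and the notAfter timestamp."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Calling `needs_renewal(fetch_not_after(host))` on a schedule gives an early signal well before a browser warning would. A failed handshake is itself worth alerting on, since it usually means the certificate is already invalid or misconfigured.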

How to Assign Tasks to Slack Alerts Channels Guide

An alert fires in your Slack alerts channel. It sits there for four minutes while three engineers each assume someone else is going to respond. Nobody owns it. Nobody creates a ticket. By the time someone acts, the incident has escalated. This is the accountability gap that unstructured Slack alert channels create. Visibility without assignment is not enough.

How to Add On-Call Rotations to Google Calendar

Your on-call rotation lives in a scheduling tool or a spreadsheet. Your engineers' actual work schedules live in Google Calendar. When these two systems do not talk to each other, engineers are constantly context-switching to figure out who is on-call and when. They miss shift reminders. They schedule personal appointments during on-call windows. And handovers get messy because nobody has a single place to see the full picture.

What IT Incident Management Can Teach Workplace Safety

In most modern enterprises, the playbook for a production outage is well understood. An alert fires. An on-call engineer responds within a documented service level. The incident is triaged, assigned a severity, and worked through to resolution by a team that has rehearsed the steps. Afterward, a postmortem is written. The root cause is identified, blameless analysis is performed, and the findings flow back into runbooks, monitoring rules, and training materials. The cycle is closed.

Replace Verizon Email-to-Text with OnPage's Paging / Critical Alerting Capabilities

It’s 2:00 AM on a Saturday. An energy company’s thermal storage system temperature violently spikes past safe operating thresholds. The monitoring system instantly fires off an emergency alert via a standard Verizon email-to-text gateway. But instead of waking the engineer, the message is delayed by the carrier network. By the time the on-call responder sees the text hours later, the equipment has failed, resulting in catastrophic downtime.

Slack outage on May 14, 2026

On May 14, 2026, users across multiple regions began reporting problems with Slack, including messaging failures, sign-in issues, and problems loading attachments and images. While the outage did not affect every user, reports quickly showed the issue was widespread enough to disrupt business communication for organizations around the world. StatusGator identified the incident through customer outage reports and triggered an Early Warning Signals alert at 14:21 UTC.

When the Report Cannot Tell the Story: Building Incident Programs That Capture as They Respond

Two weeks after a payments outage took a regional bank offline for ninety-three minutes, the post-incident report landed on the CIO’s desk. It ran forty pages. It named the failed service, the ticket numbers, the restoration steps, and the engineers who paged in. It did not answer the question the board had actually asked, which was why the on-call team had spent the first forty-one minutes chasing a downstream symptom rather than the upstream cause.

Problem Management vs. Incident Management

Why Fixing Incidents Is Only Half the Work. Fixing an incident is not the same as solving a problem. In enterprise IT operations, that distinction carries significant operational weight. Organizations that treat every disruption as a discrete, isolated event to be resolved and closed will continue to encounter the same disruptions, on the same infrastructure, from the same root causes. The cycle does not end because the underlying problem was never addressed.

Jira Notifications Management: The Enterprise Guide to Routing, Reducing Noise, and Closing the Loop

Jira is the system of record for engineering work at nearly every enterprise that runs agile delivery. It tracks epics, stories, bugs, sprints, releases, and the long tail of technical debt that keeps platform teams awake. What Jira was never designed to be is an alerting system.

Why IT Teams Choose OnPage Over Opsgenie: 5 Key Benefits

With Atlassian announcing the sunsetting of Opsgenie, IT teams, MSPs, and cybersecurity professionals find themselves at a critical crossroads. Technical leaders are actively searching the market for reliable Opsgenie alternatives to keep their infrastructure running smoothly and minimize downtime. While migrating platforms can feel like a frustrating chore, it’s actually the perfect opportunity to upgrade your incident response strategy.

Product Update - May 2026

IncidentHub's latest product updates include a new Business plan with Teams support, early outage detection v1, and more integrations with ticketing systems. The public status now includes a disable feature. As before, many features are driven by feedback, and I am grateful to all our customers who have shared their feedback with us.

LLM Observability: Lessons From MLOps w/ Maria Vechtomova (Cauchy)

For nine years, Maria Vechtomova was shouting about monitoring. Nobody cared, until LLMs arrived. As co-founder of Cauchy, Databricks MVP, and one of the most followed voices in MLOps, Maria has watched the field evolve from hand-built experiment trackers to today's flood of observability tools, and her central claim might surprise you: globally, nothing has changed. The fundamentals are the same: track your code, data, and models so you can roll back when something breaks.

First Look at the Next-Generation OnPage Enterprise Web Management Console

Get a first look at the next-generation OnPage Enterprise Web Management Console, a modernized platform designed to help critical response and operations teams across IT, Healthcare, and other industries improve visibility, streamline communication workflows, and respond faster from one centralized interface.

New Features, Same Flow for Healthcare Professionals: Inside OnPage's Next-Gen Enterprise Web Console

You requested it, and we implemented it. OnPage’s new web console, with an improved and more modern interface design, is coming to you in the next few days! But we’re aware of how difficult it is to introduce change for healthcare organizations. That is not because clinicians and hospital admins are averse to learning new tools, but because they’re wary of anything that may come between them and their patients and take valuable time away from care delivery.

HIPAA-Compliant Messaging and Clinical Communication

In today’s fast-paced healthcare environment, patient outcomes rely entirely on immediate, accurate, and secure information transfer. Mismanaged communication is costly; industry data suggests that communication failures contribute to an estimated $12 billion in annual revenue loss and are linked to nearly 30% of malpractice claims.

What Is an Incident Commander? Role, Skills, and Best Practices

The fastest incident response teams treat coordination as a craft. Someone owns the call, drives the decisions, and keeps everyone moving in the same direction while the team puts the system back together. That person is the incident commander (IC), and getting the role right is what separates your 15-minute fix from a four-hour war room where nobody’s sure who’s making the call.

PagerDuty Appoints John DiLullo as Chief Executive Officer

Jennifer Tejada Transitions to Executive Chair of Board of Directors After Serving as CEO Since 2016. John DiLullo Brings Deep Enterprise, Product and Go-to-Market Leadership Experience to Lead Next Phase of Growth. Company Reaffirms First Quarter and Full Fiscal Year 2027 Guidance.

What is the Mean Time to Resolution (MTTR)? Why It Matters and How to Resolve

How quickly can you restore service when an incident hits your system? Most IT teams are not slowed down by detecting incidents. The challenge starts after something breaks, when the goal is to bring services back online as quickly as possible. Modern systems are highly distributed. Alerts arrive from multiple tools, dependencies are complex, and it is often difficult to immediately understand what actually failed.

Humans aren't fast enough for 4 9's

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers. It’s easy to lose track of what’s meant when saying “99.95%” availability, and even more is lost when thinking how much harder it is to achieve 99.99% compared to 99.95%. On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime.
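The conversion from an availability percentage to a monthly downtime budget can be sketched as follows. One assumption is mine rather than the post's: a "month" here is an average month of 365.25 / 12 days, which reproduces the 21 minutes 55 seconds figure when rounded to the nearest second. The function name `downtime_budget` is hypothetical.

```python
from datetime import timedelta

# Assumption: an average month of 365.25 / 12 days (about 30.44 days).
SECONDS_PER_MONTH = 365.25 / 12 * 24 * 60 * 60


def downtime_budget(availability_pct, period_seconds=SECONDS_PER_MONTH):
    """Allowed downtime for the period, rounded to the nearest second."""
    allowed_seconds = period_seconds * (1 - availability_pct / 100)
    return timedelta(seconds=round(allowed_seconds))


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget(target)} of downtime per month")
```

Under the same assumption, 99.99% leaves about 4 minutes 23 seconds per month, roughly a fifth of the 99.95% budget.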

New in PagerDuty's Slack Experience: Dedicated Channels, Quick Declare & New On-Call Paging Commands

For teams that live in Slack, incident management is getting a whole lot smoother. Early Access (EA), planned for May, includes dedicated incident channels, one-click escalation, centralized configuration, onboarding tutorials, and new commands to page responders without leaving Slack.

AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. The original region of AWS located in Northern Virginia, us-east-1 or just “US East” as it is known, has been the subject of some of the internet’s most high profile and destructive outages and remains Amazon’s least reliable region.

KPI vs SLA: What's the Difference?

Why Confusing Them Costs You More Than a Missed Target. Every operations leader tracks KPIs. Every enterprise IT team has SLAs. Both involve targets, both involve measurement, and both surface in the same board reviews and vendor conversations. So it is not surprising that the two get treated as variations of the same thing.

How to Customize an SLA Template

A Practical Guide for Help Desk, IT Operations, and Enterprise SRE Teams. A service level agreement template is only useful if it can be customized. The version that ships with your ITSM platform was designed to be generic enough to apply anywhere, which makes it precise enough to apply nowhere. The teams that maintain defensible SLAs are not the ones with the most sophisticated legal language.

SLA Best Practices for Enterprise IT Teams

How to Draft, Customize, and Keep Service Level Agreements Defensible. Most enterprises do not discover the weaknesses in their SLAs during the drafting process. They discover them during an incident review, a customer escalation, or a contract dispute, when the language that seemed reasonable at signing turns out to be too vague to measure, too broad to enforce, or disconnected from the operational data that would make it defensible.

How to Set Up SIGNL4 in Under 5 Minutes | Quick Start Guide

Getting started with SIGNL4 is fast and simple. In this video, we show you how to set up a new SIGNL4 account in under 5 minutes so you can start receiving critical alerts and managing incidents right away. Whether you're new to incident management or looking for a faster way to implement mobile alerting and on-call scheduling, SIGNL4 makes onboarding effortless. Follow along step-by-step and see how quickly your team can be up and running.

How to Reduce MTTR When Third-Party Services Go Down

Most MTTR guides assume the problem is in your infra. For modern apps, it's often not - it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission. The lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.

Turn Alerts into Action: Why Modern Operations Need More Than Monitoring

Modern ops stacks are very good at detecting problems. From IT infrastructure and cloud platforms to industrial systems, cybersecurity tools, and IoT environments, monitoring technologies generate alerts the moment something goes wrong. But there is a critical problem modern operations teams still struggle with: Detection does not ensure response. And that gap is becoming one of the biggest operational risks organizations face today.

AI matched or beat physicians on real-world clinical reasoning

A major new study from Harvard Medical School and Beth Israel Deaconess Medical Center has found that a large language model (LLM) outperformed physicians across a wide range of clinical reasoning tasks, including making emergency-room triage decisions from messy, real-world patient data. The findings, published April 30 in Science, represent one of the largest comparisons yet between AI and physicians on clinical tasks.

When an incident hits, who stays in the loop?

Your IT team gets alerted, but stakeholders? They’re left checking status pages or chasing updates. There’s a better way. With SIGNL4 Active Stakeholder Communication, everyone stays informed automatically, without adding extra work for your team. Send real-time updates instantly via push notifications. Create stakeholder groups for different scenarios. Track exactly who was notified, and when.

Resilience hinges on conversations as much as tooling

Too many businesses still treat resilience as a software procurement and IT operations issue. In reality, resilience lives in the mutual relationship between tech, business leadership, and culture. It goes deep: resilience is baked into the organization in a multitude of ways. Some are tech-enabled, some policy-driven, and some rest on culture or employee goodwill.

Introducing Shift-Based Schedules: Smarter, Faster, and Easier for Any Team

This blog post is part of PagerDuty’s ongoing series on how we’re helping customers navigate their journey towards autonomous operations. Read on to learn about how PagerDuty’s Shift-Based Schedules (planned GA in May) builds towards this vision. PagerDuty has long been the gold standard for on-call management, helping thousands of teams build the foundations of digital reliability.

Activate Your Continuous Learning Flywheel With Post-Incident Reviews in PagerDuty UI

Earlier this year at our H1 2026 launch, we announced PagerDuty’s vision for autonomous operations: a future where AI agents learn from every incident, prevent failures before they happen, and progressively automate so teams can focus on innovation instead of firefighting.

Why Dedicated Incident Channels are the Modern Standard for Slack-Based Incident Response

Where do your teams go during a critical incident? For distributed teams, that war room is a channel in Slack or Microsoft Teams. The question is: are you creating a dedicated space for each incident, or are responders scrambling across DMs, email threads, and general channels trying to piece together what happened? The answer matters. Using dedicated incident channels has become the industry standard for high-performing incident response teams.

How to reduce alert noise without missing what matters

Reducing alert noise involves drawing a line between incidents that need an immediate response and ones that do not. Get this distinction wrong and your team is either interrupted unnecessarily or misses something critical. In this guide, we’ll help you make that distinction clear. We’ll cover what counts as noise and how to reduce it without missing what matters.

Inside the .de DNS Outage: Real-World Data from UptimeRobot.

In the evening of May 5th, 2026, large parts of the German web briefly went dark. For a few hours, anyone trying to load a .de address through a major DNS resolver got errors instead of websites. Bahn.de, Amazon.de, and Spiegel.de were among the affected. Major brands like Telekom, DHL, and Sparkassen felt it too, along with hosting providers Hetzner, Strato, and Ionos.

PagerDuty's Slack App: New Incident Management Capabilities

We’ll be rolling out new Slack capabilities to eliminate more manual toil from your incident workflow: click once to promote any alert to an incident, get dedicated channels created automatically, page responders without leaving Slack, and manage all your settings in one place. This is part of our path to autonomous operations: reducing toil, protecting your capacity, and letting you stay in flow. If you’re only using PagerDuty for on-call scheduling, you’re missing the full picture.

New enhancements to PagerDuty's SRE Agent: triage faster without waking a human

AI promise and AI capabilities often diverge, with developers often reporting much faster code production, but not enough change in how incidents are handled. When the rate of change is faster than ever, but the rate of recovery from incidents isn’t moving, developers wind up stuck in firefighting mode. And, when these systems fail, it’s costly. According to PagerDuty’s State of AI-First Operations, over a third of surveyed companies report losing $500K per hour of downtime.

What is alert fatigue? (And how does it happen)

Alert fatigue doesn’t announce itself. It builds quietly over weeks and months until one day a critical incident triggers and nobody responds with the urgency it deserves. By that point, the damage is already done. This guide walks through what alert fatigue actually is, how it happens, and what you can do about it.

PagerDuty's Product Drop (May 2026)

PagerDuty’s monthly drops are here! May’s drop delivers innovation, helping teams work faster and smarter with four major updates: SRE Agent Enhancements: Triage just got turbocharged. New connectivity + new capabilities = faster resolution. Shift-Based Schedules (GA planned for May): Schedules are more flexible than ever. Quick start options, custom shifts, and multi-responder support for shadow training or increased coverage.

Post-Incident Reviews in the PagerDuty UI

Turn incidents into learnings and build resilient operations with real-time collaboration and actionable insights built directly into your PagerDuty workflow. Post-Incident Reviews in the PagerDuty UI are now in Early Access. Coming soon: AI-generated drafts and intelligent follow-up suggestions.

Faster incident investigation with BigPanda and ServiceNow Now Assist

When an incident occurs, an L2/3 engineer or SRE can spend 20–30 minutes investigating across alert consoles, combing through change records, and pinging teams on Slack or Microsoft Teams. When you multiply that time spent across thousands of incidents per year by the cost of an IT outage at $14,056 per minute, the cost is staggering. Enterprises can’t afford to waste time searching across disparate tools.

A guide to setting up alerts for a new service

When you launch a new service in production, you’re working with a lot of unknowns. You don’t yet know how it behaves under real traffic or which incidents are worth waking someone up for. That makes alerting for a new service a little different from what you’re used to with an established one. The goal in the early days isn’t to get everything perfectly configured. It’s to learn enough about the service to get your alerting right.

April 2026 Early Warning Signals

April saw widespread disruptions across SaaS platforms, developer tools, and cloud services, with login failures, pipeline issues, and general service outages among the most common problems. StatusGator’s Early Warning Signals consistently identified these incidents ahead of official provider updates. In several cases, the lead time was significant. Bitbucket pipeline failures were detected 1 hour 17 minutes before acknowledgment, while Claude performance issues surfaced 59 minutes early.

Prevent outages with PagerDuty incident retrospectives

Recurring incidents are a symptom of a broken process. Your teams are working hard to get services back online, but constantly battling the same problems is frustrating and not a sustainable approach. What’s reflected here is not a failure in engineering abilities, but a deficiency in the learning that should follow an incident. When incident analysis focuses on finding a single person or team to blame, it creates a culture of fear.