We Built an SRE Agent With Memory And It's Transforming Incident Response

If you feel like your incidents are multiplying while your stack gets more complex by the week, you’re not alone. Event volumes keep climbing, signals live in a dozen tools, and human responders are stretched thin. That’s exactly why we built the PagerDuty SRE Agent, a vendor-agnostic AI teammate that learns from every response so the next one is faster, smarter, and more reliable.

Too Late to Learn: Why Security Post-Mortems Fail and How AI Can Help

An effective post-mortem can turn a security breach into a blueprint for lasting resilience. But too often, in the stress of an incident, documenting what happened takes a back seat to containment and recovery. The resulting analysis relies heavily on memory, scattered notes, and competing narratives. Valuable context gets lost, timelines blur, and lessons that could strengthen defenses never become institutional knowledge.

Your Next Incident Has Already Started. You Just Haven't Noticed Yet.

The best way to minimize the impact of an incident is to catch it early, before small issues snowball into major disruptions. That requires maintaining healthy systems and ensuring sufficient resources are available when problems arise. But developers and IT operations pros working in large enterprises face a challenge: Complex systems operate in an inherently degraded state. In his essay “How Complex Systems Fail,” Dr.

Your Top Engineers Should Be More than Expensive Button-Pushers

The engineer you pay $200,000 a year just spent an hour copy-pasting data between dashboards. Again. Software engineers have critical skills that are in high demand. And yet, many world-class engineers spend too much of their time clearing tickets, routing alerts, and responding to the same types of incidents over and over again. This operational toil is costing you.

Meeting Developers Where They Work: PagerDuty + Spotify Portal for Backstage

From the beginning, PagerDuty has been built by developers, for developers. Our mission has always been to help development teams build faster and resolve incidents more efficiently by meeting them where they work. Building on PagerDuty’s existing plugin for Spotify for Backstage, we are thrilled to announce the PagerDuty plugin for Spotify Portal for Backstage to continue bringing enterprise-grade incident management into even more developer workflows.

PagerDuty Joins AWS QuickSuite: Connect Your Incident Management with 1,000+ Applications

Today, we’re announcing that PagerDuty is now available in AWS QuickSuite through the Model Context Protocol (MCP). This means PagerDuty’s incident management capabilities can now connect with the 1,000+ applications and data sources that QuickSuite integrates with, from AWS services to enterprise SaaS platforms, all accessible through natural language.

A Launch Day in the Life with AI Teammates

Alex, an SRE at Greenagonia, starts the day knowing there’s a big launch coming. Pre-orders suggest traffic will reach 5 to 10 times its normal level, which means coffee needs to be extra strong this morning. As Alex scans through overnight alerts, he realizes he’s completely forgotten about a dentist appointment that overlaps with his upcoming on-call shift. Six months ago, this would have meant frantic Slack messages or at least one phone call. Today? Alex’s AI teammate has it covered.

PagerDuty H2 2025 Release: 150+ Customer-Driven Features, AI Agents, and More

My first 6 months here at PagerDuty have been a thrilling ride! PagerDuty continues to set the pace in incident management. With our 16-year track record of helping companies forge a path towards modern operations, we’ve been trusted by over 32,000 companies as the incident management platform of choice. Over these years, we’ve continuously delivered value to our customers at a rapid pace. And our customers have been vocal with us about wanting more.

Introducing Runner Replicas: Scalable, Reliable Automation for Modern Ops

When you’re responsible for the reliability of complex systems, the execution layer of your automation is not something you want to think about—it should just work. Whether you’re deploying code, patching servers, or responding to an incident at 3 a.m., your automation engine should be as resilient and scalable as the infrastructure it’s operating on.