Who should be on-call

There usually isn’t a hard and fast rule about who should be on-call. Teams often look for criteria like seniority, experience, or expertise. While those factors certainly help, they might matter less than you think. It is often more useful to look at whether your processes are ready. When incident responses rely on memory and intuition rather than documentation, even experienced engineers can struggle. They might handle things through internal knowledge that isn’t available to everyone else.

The Incident Checklist: Reducing Cognitive Load When It Matters Most

In the previous post, we looked at what happens after detection, when incidents stop being purely technical problems and become human ones, with cognitive load as the real constraint. This post assumes that context. The question here is simpler and more practical: what actually helps teams think clearly and act well once things are already going wrong? One answer, used quietly but consistently by high-performing teams, is the checklist.

Part Two: Turning Event Intelligence into Action - Real-World Value for Financial Enterprises

Event Intelligence Solutions are redefining how organizations manage complexity and risk across digital ecosystems. Their true power lies not only in detecting anomalies or suppressing noise, but in providing actionable, explainable intelligence that connects IT events to business impact.

Enterprises don't fail because systems go down

They fail because human response breaks down under pressure. Over the past decade, organizations have invested heavily in monitoring, observability, and automation. Dashboards are everywhere. Alerts fire instantly. Tickets are created automatically. And yet, when a critical incident happens, the outcome is often painfully familiar. Someone doesn’t respond. Escalations stall. Ownership is unclear. Effort is wasted on follow-ups. And valuable time is lost.

Agentic IT operations, powered by BigPanda

BigPanda delivers the next evolution in AIOps solutions, featuring agentic automation for ITOps and ITSM teams, all in a single platform. Agentic IT operations from BigPanda keep the digital world running by transforming reactive, manual IT processes into proactive, intelligent automation. Our platform uses AI to detect, respond to, and prevent IT incidents at machine speed.

Engineering reliable AI agents: The prompt structure guide

The difference between an AI assistant that "almost" works and one that consistently delivers high-value results is rarely a matter of raw model capability. Instead, the bottleneck is typically the quality and structure of the instructions provided. For DevOps and SRE teams building automated workflows, "magical prompt tricks" are no substitute for a repeatable, engineered structure.

What is IT Alerting?

IT alerting means that responsible and on-call employees receive IT alerts about disruptions and anomalies in IT systems and infrastructure. These notifications can come directly from the systems themselves or from monitoring tools. The goal is to reduce downtime, service limitations, security breaches, and data loss by responding quickly. In many cases, the stakes are high: data loss, reputational damage with customers, or even disruption of critical business processes.

Event Intelligence Solutions - A New Era for IT Operations

In an era where digital performance defines business success, large enterprises are embracing Event Intelligence Solutions (EIS) to keep services available and resilient, and to protect customer-facing operations from disruption. According to Gartner, Event Intelligence Solutions use AI and advanced analytics to enhance and automate how organizations respond to signals generated by digital services.

5 things to do before you go on-call for the first time

Going on-call for the first time can feel a bit overwhelming, but a little prep work makes it smooth and stress-free. This guide covers five things to set up before you start your first on-call shift. They help you stay on top of your schedule, get on-call notifications, and have a backup in place. By the end, you’ll be ready to handle your first on-call shift with confidence.

AI Impact on software engineering (as I see it)

When I first started using AI (Cursor, to be more specific) for coding, I was very impressed by the high-quality code it could generate, and I understand why it's now one of the most widely used tools for software engineers. As I continued to use AI tools more regularly, I realized they are far from perfect. Their effectiveness depends heavily on how they are used and the context in which they are applied.

Verizon outage - January 14

When a major carrier like Verizon goes down, the impact is immediate and widespread. On January 14, 2026, thousands of users across the United States found themselves without cellular service, unable to make calls, send texts, or access data. While social media erupted with reports of “SOS mode” on iPhones, official acknowledgment from the provider lagged behind for hours.

What We Built in 2025, and Why It Matters Going Into 2026

As we move further into 2026, we wanted to pause for a moment and reflect on what the past year looked like for OnPage, not just in terms of features shipped, but in how the platform evolved to better support the way teams actually work in high-stakes environments. 2025 was a foundational year for us.

Why agentic AI is the future of IT change management

Every enterprise depends on continuous changes to its IT environment. New code releases, infrastructure updates, configuration changes, and security patches are all crucial to support continuous innovation. These same changes are also a leading source of operational risk and one of the most common causes of failures at the network, infrastructure, and software layers, resulting in outages.

Getting started with on-call

Setting up on-call is simpler than it seems. It comes down to a few clear decisions about your team and what your service actually needs. This guide walks you through those decisions. You’ll learn who to add to your rotation, how long shifts should last, when to hand off, and what coverage makes sense for your service. By the end, you’ll know exactly how to set up your first schedule and move from ad-hoc firefighting to organized incident response.

Why AI-driven automation in incident response is viable now

This article explains why AI-driven automation in incident response is feasible now. Teams can finally safely delegate repetitive and time-critical response tasks to AI Agents, which operate with contextual awareness and human oversight. The result is faster response, higher service uptime, and less alert noise – without losing control.

How to Monitor SaaS Status in 2026: A Complete Guide

This is an updated and expanded version of the older guide. According to the 2025 State of SaaS report, organizations use an average of 106 SaaS apps. Staying on top of your SaaS vendors' status is as important as monitoring your own services. The Cloudflare, AWS, Azure, and Google Cloud outages in 2025 were strong reminders of this fact.

Democratizing Reliability: Giving Non-Engineers Real Operational Power with Dileshni Jayasinghe

Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliability, Sylvain sits down with Dileshni Jayasinghe, VP of Technology at commonsku, to talk about what it really takes to introduce incident management in a mature, profitable SaaS that had never formalized it. From rolling out observability and incident tooling to practicing internal status updates before going public, Dileshni shares how her team built the right muscles before they were forced to.

PagerDuty Appoints Chris Ferro as Chief Legal Officer

PagerDuty, Inc. announces that Chris Ferro has joined the company as Chief Legal Officer. Ferro will oversee all legal functions at PagerDuty, including corporate, compliance, employment and product matters, with a focus on advancing business objectives while mitigating legal and regulatory risk.

AWS re:Invent 2025 - From Alert to Action: AWS + PagerDuty Agentic Ops

Hear how AWS and PagerDuty are transforming incident management with agentic & generative AI. Learn how agents within AWS Quick Suite and PagerDuty work together to detect, diagnose, and resolve incidents with less toil and less swivel-chair work. This session explores how AI collaboration is reshaping resilience across cloud environments.

How agentic IT operations transform IT Service Management (ITSM)

Enterprise ITOps leaders are realizing that legacy incident management processes are collapsing under the weight of today’s sprawling, hybrid-cloud enterprise environments. The fastest path from reactive firefighting to proactive, automated control is an agentic AI-powered incident assistant that can understand context, coordinate people, and take intelligent action at machine speed. Enterprise IT doesn’t look anything like it did even five years ago.

AWS re:Invent 2025 - Smarter Incident Response with Logz.io and PagerDuty

In this session, Jacky Leybman from PagerDuty and David Lotan Bolotnikoff from Logz.io showcase how PagerDuty and Logz.io combine generative AI with rich historical context to automate root cause analysis and accelerate incident response. By correlating real-time telemetry with prior incidents and runbooks, teams reduce manual toil and MTTR while maintaining human-in-the-loop oversight and transparent reasoning.

AWS re:Invent 2025 - AI-First Incident Management in Slack

Jacky Leybman from PagerDuty and Kaninie Knight from Slack share how their integration streamlines incident response and real-time collaboration. This session highlights practical workflows and measurable gains – such as faster triage and lower MTTR – achieved by connecting on-call operations directly in Slack.

From Ticket Creation to Human Acknowledgment: Closing the Incident Response Gap

Freshservice has become a trusted system of record for IT teams managing incidents, service requests, and operational issues at scale. Tickets are logged, categorized, prioritized, and tracked with discipline. SLAs are defined. Dashboards provide visibility. On paper, everything looks covered. Yet many teams still experience missed or delayed responses when incidents truly matter, especially after hours. The gap isn’t in ticket creation. It’s in what happens next.

A Recap of 2025

In the past, our yearly recaps were mostly about numbers: what we shipped, how much Spike grew, and a long list of stats. See past recaps: 2023, 2024. But 2025 felt different to me. It had many moments that shaped how Spike, as a product and as a company, looks today. Some of them were exciting. Some were uncomfortable, and all of them changed how I think about building Spike. We’re still bootstrapped and operating lean, with a team of fewer than ten people.