
Syncing PagerDuty Schedules to Slack Groups

We’ve posted before about how engineers on call at Honeycomb aren’t expected to do project work, and that whenever they’re not dealing with interruptions, they’re free to work on whatever will make the on-call experience better. However, all of our engineering rotations rely on hand-off meetings where they update the Slack groups with everyone who’s on call. During my last shift, a small problem kept causing friction for some of our incident management automation.

How Effective are Your Alerting Rules?

Recently, I came across a Reddit post highlighting the challenges of ineffective alerting rules. Here at OnPage, we have worked with various companies that have dealt with just that, so I felt I should share some of our top tips for creating effective alerting rules in this blog. Read on to discover…

How to build automatic remediation workflows in Grafana Cloud

When incidents occur, engineers must jump into action to get systems back to running at peak performance. However, there are a myriad of challenges that can prevent them from resolving the issues swiftly. Imagine a scenario where a team of DevOps engineers manages a cloud-based e-commerce platform that experiences occasional spikes in traffic during peak shopping seasons. During one of those major sales events, the team notices a sharp spike in CPU usage across several critical application servers.

Demo Roundups! Automation Standardization (Runbook Automation)

Solution consultants Asif Ahmad and Justyn Roberts show how PagerDuty's management and orchestration for the enterprise helps organizations connect and automate work across teams, systems, and environments. Level up your digital operations expertise with PagerDuty Demo Roundups — a series of live, interactive webinars where you can deepen your knowledge in the Operations Cloud and see how PagerDuty can work for you.

Create Round Robin Rotation in Slack using App

Pagerly, a Slack app designed for shift scheduling, makes it easy to create round-robin rotations for various teams. Whether it's a support team, engineering team, sales team, or any other department, Pagerly helps manage shift schedules and team rosters within your Slack workspace. The Pagerly app can be installed directly from the Slack App Directory, and it is a comprehensive rotation app designed to optimize scheduling in Slack.
Sponsored Post

Financial Benefits of Incident Management: Cost Savings and ROI

Have you ever assessed the financial impact of an hour of downtime on your business? If not, the results might be more alarming than you expect. For large enterprises, the cost can easily reach millions, and that's just the tip of the iceberg.

How AI is Revolutionizing SaaS and Cloud Software: Key Trends for 2025

In recent years, artificial intelligence (AI) has ceased to be a mere technological trend and has established itself as a foundational element shaping the future of Software as a Service (SaaS) and cloud-based software solutions. By 2025, AI's integration into these domains will not just enhance existing functionalities but redefine what is possible in ways we’re only beginning to comprehend.

Improve your observability strategy with AIOps

Change is the only constant in the IT landscape. These changes might involve adding new observability tools, retiring existing monitoring systems, establishing new business units, or integrating IT systems from acquisitions. Managing these changes can challenge even expert ITOps teams. Organizing your monitoring setup can seem overwhelming, especially with issues like monitoring gaps, observability redundancy, complex toolsets, or significant technical debt.

Runbook Automation and Rundeck v5.6 Release Notes

The Runbook Automation and Rundeck product team are back with release v5.6, featuring some security updates and fixes, plus lots of contributions from Rundeck’s amazing open source community. Plus, Forrest takes us through some of the projects that community members can contribute to themselves, including the documentation and plugins.

Achieving quick time to value with AIOps

AI is everywhere, and while it’s transforming industries, many organizations are still trying to identify how to use it to achieve tangible value. This is especially true for AIOps, where platforms often fall short of the promises to automate IT operations and improve incident response. As a result, many leaders are skeptical about whether AIOps can deliver measurable results quickly or provide outcome-driven value in IT operations.

How To Monitor Public Status Pages of Cloud Providers - a Step-by-Step Approach

Incident updates on the public status pages of your cloud providers are often the first indication that they might have an outage. Providers also post updates about upcoming and ongoing maintenance on their status pages. Thus, monitoring your cloud status pages becomes crucial to your business operations. This article will guide you through the process of effectively monitoring such status pages.

Trusting AI for Incident Response: The Role of AI in Modern Incident Management

In an age where every second counts, the swift resolution of IT incidents can mean the difference between maintaining business continuity and enduring significant operational setbacks. As businesses increasingly embrace digitalization, the complexity and volume of incidents rise exponentially. This new reality calls for innovative approaches to incident management—ones that can manage the unpredictability, scale, and urgency of modern IT ecosystems. Enter artificial intelligence (AI).

How to get Pagerduty Integration On-call on Slack?

This article explains how to bring PagerDuty's who-is-on-call information into Slack. Pagerly is one of the leading Slack apps for managing a company's digital operations, such as incidents, tickets, alerts, and on-call rotations, within Slack. Pagerly integrates with the PagerDuty platform and manages the entire lifecycle of on-call and incident management, all within Slack. With Pagerly, you can manage your PagerDuty incidents and assign tickets, messages, and incidents to the Slack users who are currently on call.

Unlocking Automation: A New IDC Report on Automation Standardization

Innovation in automation is transforming what’s possible in operational dynamics at an unprecedented pace. For modern enterprises, this shift is not just a technological evolution; it’s a strategic imperative. C-suite executives and boardrooms increasingly recognize the potential of technologies like GenAI as powerful tools for enhancing productivity, reducing risk, and optimizing costs.

Building a team for successful AIOps adoption

As pressure increases on enterprise IT teams to streamline processes and reduce downtime, many organizations are looking for new tools and strategies. Customers and stakeholders expect operational efficiency and service reliability. Tools within the AIOps industry can relieve the pressure by reducing alert noise, automating manual workflows, and reducing mean time to resolution (MTTR). However, the challenges don’t end at tool purchase.

Integrate Incident Alerts With Discord Using Webhooks

Staying on top of your third-party Cloud and SaaS service outages is crucial to maintain the reliability of your own applications. If Discord is your communication tool of choice, you can keep up with such incidents by pushing these events to a Discord channel. Discord webhooks allow external applications to send messages to specific channels within a Discord server. This article describes how to integrate Discord as a channel in your IncidentHub account using webhooks.
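As a rough illustration of what a webhook message looks like on the wire, here is a minimal Python sketch. It is not IncidentHub's implementation: the function name, the message wording, and the placeholder URL are all assumptions made for this example. It relies only on the fact that a Discord webhook accepts a JSON body with a `content` field sent via HTTP POST.

```python
import json
import urllib.request


def build_discord_request(webhook_url, service, status):
    """Build (but do not send) a POST request for a Discord webhook.

    The message wording is a made-up example, not a required format.
    """
    payload = {"content": f"[{status.upper()}] {service} reported an incident"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real one is generated in the Discord channel settings.
request = build_discord_request(
    "https://discord.com/api/webhooks/ID/TOKEN", "ExampleCloud", "outage"
)
# Passing `request` to urllib.request.urlopen would deliver the message.
```

Building the request separately from sending it keeps the payload easy to inspect before anything is posted to a channel.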

The human element of implementing AIOps

When implementing new tech, the challenges don’t end at tool selection, purchase, and initial deployment. You can have the best technology in the world, but it won’t help your organization if no one uses it. Many teams look to AIOps solutions like BigPanda to reduce noise, improve workflows, and resolve incidents faster through AI and automation. Bringing in a new platform is part of the equation. The other part is organizational change management to support platform adoption.

Enhancing Postmortem Reports with AI

Postmortem reports are essential in incident management, helping teams learn from past mistakes and prevent future issues. Traditionally, creating these reports was a slow, tedious process, requiring teams to gather data from multiple sources and piece together what happened. But with AI and Large Language Models (LLMs), this process can become faster, smarter, and much less of a headache.

Revolutionizing Remote-Location Operations With PagerDuty Automation

Consistency is key in today’s ultra-competitive retail environment. Whether a customer walks into a store in New York City, London, or Tokyo, or shops online, they expect the same seamless and personalized shopping experience, regardless of where they are. These consistent experiences are what create customer loyalty and keep customers coming back. From an IT perspective, delivering these experiences across multiple distributed locations presents unique challenges.

A Step by Step Guide to Checking if a SaaS is Down

Modern businesses depend heavily on Software as a Service (SaaS). Almost all aspects of business operations - accounting, HR, payroll, marketing, IT, sales, support - depend on one or more SaaS applications. SaaS is not limited to being used by software development teams. Given this dependency on SaaS applications, their uptime becomes tightly tied to a business's uptime. Any SaaS downtime can affect both a business's daily operations as well as the user experience.

Demo Roundups! Digital Operations Resiliency

Guest Chris Duke, DevSecOps Coach at BT, explores why PagerDuty is the perfect ally for turning his organization outage-ready and shares some of their Incident Management best practices in an "Ask me Anything" session with Solutions Consultant Tesh Ruparell. Solutions Consultant Nick Castle shows how PagerDuty's Enterprise Incident Management, combined with AIOps and Automation capabilities, ensures fast incident resolution by automatically dispatching the right teams for quick fixes at scale, creating a proactive approach that helps maintain SLAs, drive innovation, and protect revenue.

The Future of SLOs in DevOps: Navigating Common Pitfalls in SLO Management

As the technology landscape continues to evolve, so do the methods by which organizations ensure optimal service delivery. Service Level Objectives (SLOs) have emerged as one of the most critical metrics in DevOps and Site Reliability Engineering (SRE), acting as a bridge between reliability and performance. SLOs reflect the target reliability of a service from the perspective of the user, providing measurable standards to maintain quality.

Using LLMs for Automated IT Incident Management

Large language models are algorithms designed to understand, generate, and manipulate human language. State-of-the-art large language models include OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Meta’s Llama 3.1. They are built using neural networks with billions or even trillions of parameters. They are trained on vast datasets that can include text from the internet, books, code, and other information sources.

Jira and ServiceNow: A Comparative Analysis for Effective Incident Management

Incident management isn't just a buzzword—it's critical to keeping operations running smoothly. When systems fail, the ripple effects can be costly. For enterprises, maintaining service continuity and keeping customers satisfied depends on quick, efficient incident responses. That's where tools like Jira Service Management (JSM) and ServiceNow come in.

Preparedness as a Competitive Advantage: Building Resilience Year Round

The recent global IT outage is a stark reminder that even the most advanced organizations can have bad days. Major disruptions can have significant downstream impacts that can lead to disappointed customers, lost revenue, deferred processes and even legal action if the downtime is considerable. With the rapid pace of technological change and the continued digital transformation intensified by AI, disruptions are no longer “unexpected.” They are part of the normal course of business.

Reduce Noise through Intelligent Alert Grouping

In an ideal world, every alert would signal a unique and critical issue. However, in reality, alerts often come in waves. Alert noise refers to the overwhelming volume of notifications that incident response teams receive, many of which may be redundant or irrelevant. This can lead to alert fatigue, where critical issues might be overlooked due to the sheer number of notifications.

What does SLO stand for? A complete guide to Service Level Objectives (SLOs)

The world of tech is full of acronyms. SLOs are one of those that everyone talks about, but maybe not everyone fully gets. Whether you're nodding along in meetings or just hearing “SLO” for the first time, we’ve got you covered. In this post, we’ll break down what Service Level Objectives (SLOs) actually are, why they matter, and how they can help keep your systems (and your sanity) in check.

The ultimate guide to on-call schedules

An Ultimate Guide to on-call schedules? You might think this sounds overly grandiose for what’s essentially putting people into a list and rotating through them. But you’d be flat-out wrong. Getting your on-call setup correct is as real and as important as it gets, and getting things wrong can lead to prolonged incidents, burnt out employees, and damaged company reputation.

Custom Milestones: Empowering Enterprise Incident Management

Milestones have been central to our platform since day one, helping users track incident progress and drive automation. We're excited to introduce our enhanced Milestone feature, offering unparalleled customization. Now, you can fine-tune your incident management process to perfectly align with your organization's specific policies and workflows.

The Role of Technology in Enhancing Incident Response Call Etiquette

The interconnectedness of today's business environment has significantly heightened the complexity of incident response (IR). The need for immediate action, precise communication, and real-time collaboration is more critical than ever. However, beyond the technical precision required in solving problems, there lies an often overlooked aspect of effective IR management: the etiquette of incident response calls.

4 New Ways to Improve Incident Management with Event Orchestration

In an era where efficiency and smart technology integration are key, 71% of technical leaders report their companies are expanding their investments in artificial intelligence (AI) and machine learning (ML) this year. With the sheer volume of data coming into the enterprise and the need for timely response, monitoring every incoming alert around the clock is impractical, and human vigilance alone is too imprecise.

6 top incident management use cases for AI copilots

The news is filled with buzz about how companies approach AI. As a result, many organizations are trying to identify how AI can effectively support their business goals. There seem to be infinite use cases, but finding those that add the most value is often the first challenge. In the ITOps environment, generative AI copilots can effectively improve team efficiency, share knowledge, and support day-to-day tasks to deliver immediate value.

Myth vs. Reality: Lessons in Reliability from the July 19 Outage

It was 3AM at Newark Liberty International Airport. I was groggy, waiting in line to get my boarding pass, only to be met with a blue screen on the check-in kiosk. Needing some coffee, I learned the vendor was only accepting cash. There was clearly a big outage and I quickly checked our systems at PagerDuty. Major outages happen multiple times per year, so frequently that we have an internal dashboard (colloquially referred to as “the internets are broken”).

AlertOps Announces Integration with ServiceNow to Enhance Incident Management and Response

AlertOps announced its new integration with ServiceNow to enhance incident management and response capabilities for ServiceNow customers. This joint effort enables AlertOps to create better experiences and drive value for customers by providing real-time notifications, bi-directional data synchronization, and seamless integrations. ServiceNow’s expansive partner ecosystem and partner program are critical in supporting the Now Platform’s $275 billion forecasted market opportunity through 2026.

Achieving Faster Mean Time to Resolution MTTR with AIOps

In today’s fast-paced digital world, customer satisfaction is a top priority for nearly every business. To keep customers satisfied with your service and application at all times, businesses must reduce downtime and ensure quick resolutions. Excessive downtime can be costly for any business and damaging to its brand reputation. Hence, adopting practices that eliminate the issues responsible for downtime is crucial for maintaining seamless IT operations.

IT Outage Notification Templates and Incident Communication Examples

Outages cost businesses across different sectors millions, and even billions. For example, Amazon may lose up to $34 billion in sales within an hour of downtime, and a service outage back in March cost Meta nearly $100 million in revenue. However, that’s not all that was lost. Due to poor outage notifications and a lack of resolution details, many Meta users were kept in the dark about the outage. This Reddit thread shows many users were frustrated.

Navigating the Incident Management Lifecycle: A Complete Guide

Ever wonder why some IT teams can quickly resolve incidents while others struggle? The secret lies in mastering the Incident Management lifecycle. But don’t worry—this isn’t some dull, complicated process only experts can understand. The Incident Management lifecycle is simply a structured approach to handling incidents efficiently. And the best part? You can quickly get the hang of it.

Alert noise reduction: How to cut through the noise

ITOps and AIOps teams often face an overwhelming volume of notifications, many of which are false positives or low-priority alerts. The constant influx creates a chaotic environment. ITOps and AIOps teams can easily miss critical issues, potentially leading to system failures or prolonged downtime. Spending significant time sifting through irrelevant alerts reduces team efficiency and slows response. Focus on alert noise reduction to ensure that only meaningful and actionable alerts reach your teams.

5 ways teams used BigPanda during the CrowdStrike outage

In the weeks since the CrowdStrike outage brought millions of systems to a halt, countless articles have been written about the cause of the outage, its impact, and the costs companies incur during service disruptions. Nearly every large company had hosts offline due to the faulty update in CrowdStrike’s Falcon software. BigPanda customers were no exception. On July 19, between 04:00 and 07:00 UTC, the BigPanda systems logged an increase in shared incidents.

How to Automatically Remediate Incidents with Grafana IRM

Build automatic remediation workflows to preemptively resolve system issues and minimize downtime. With observability-native IRM, you can automate routine tasks, ensure consistent responses, and reduce the manual effort required to manage incidents. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

What is ISO 27001 Incident Management? Definition and Process

Managing incidents is crucial to maintaining the security and integrity of an organization's information systems. ISO 27001 Incident Management provides a structured approach to addressing and resolving incidents in a way that minimizes impact and prevents recurrence. This framework doesn't just help organizations respond to incidents—it helps them create a robust system that anticipates and mitigates risks before they escalate.

Avoid ITSM and NOC surprises with better context

Rapid, proactive responses to unexpected system behavior and swift, efficient incident remediation are hallmarks of great IT teams. But the most successful NOC and incident management teams share the following: The right context gives teams visibility across systems, helps them collaborate and share knowledge, and makes every team member more efficient.

Data quality testing

Data quality testing is a subset of data observability. It is the process of evaluating data to ensure it meets the necessary standards of accuracy, consistency, completeness, and reliability before it is used in business operations or analytics. This involves validating data against predefined rules and criteria, such as checking for duplicates, verifying data formats, ensuring data integrity across systems, and confirming that all required fields are populated.
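The kinds of checks listed above (duplicates, formats, required fields) can be sketched in a few lines of Python. This is only an illustrative example under assumed rules: the field names, the `YYYY-MM-DD` date convention, and the `check_quality` helper are hypothetical and not taken from the article.

```python
import re
from collections import Counter

# Hypothetical rules for illustration; real rule sets depend on the dataset.
REQUIRED_FIELDS = ("id", "email", "signup_date")
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_quality(rows):
    """Return a list of (row_index, problem) tuples for the given rows."""
    problems = []
    id_counts = Counter(row.get("id") for row in rows)
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                problems.append((index, f"missing {field}"))
        # Uniqueness: the same id must not appear more than once.
        if row.get("id") and id_counts[row["id"]] > 1:
            problems.append((index, "duplicate id"))
        # Format: dates are expected as YYYY-MM-DD.
        date = row.get("signup_date")
        if date and not DATE_FORMAT.match(date):
            problems.append((index, "bad date format"))
    return problems


rows = [
    {"id": "1", "email": "a@example.com", "signup_date": "2024-01-05"},
    {"id": "1", "email": "b@example.com", "signup_date": "05/01/2024"},
    {"id": "2", "email": "", "signup_date": "2024-02-10"},
]
problems = check_quality(rows)
```

Returning every problem with its row index, rather than stopping at the first failure, makes it easier to see how widespread an issue is before the data reaches analytics.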

Should You Get an Incident Management Certification? Top 4 Choices

In IT Service Management, the ability to manage incidents efficiently is crucial. Whether it’s a minor disruption or a major outage, having a skilled incident manager at the helm can make all the difference. But how do you become that go-to person in times of crisis? The answer lies in obtaining the right certifications. Incident Management certifications not only validate your skills but also equip you with the knowledge needed to handle any situation that comes your way.

How Does Incident Management Automation Work? A Complete Guide

Managing incidents efficiently is crucial to maintaining service quality. But handling every issue manually can be time-consuming, prone to errors, and overwhelming for your team. That's where Incident Management automation comes into play, revolutionizing the way IT teams respond to and resolve issues. Automation within Incident Management takes the guesswork out of the process, enabling faster response times and improving overall service delivery.

DevOps Incident Management: Streamline Your Processes for Resolution

In the world of DevOps, where development and operations blend seamlessly, incidents are bound to happen. But the way these incidents are managed can make all the difference. Imagine a high-stakes race where every second counts—this is what DevOps Incident Management feels like. It's not just about putting out fires; it's about learning from each one to prevent future flare-ups.

Top Features to Look for in Enterprise Incident Management Software

Are you tired of dealing with unexpected system crashes and the chaos they bring? You're not alone. For enterprise SREs, DevOps, and IT Operations teams, mastering incident management goes beyond just fixing problems; it’s about preventing them. According to a recent report, incident volume within enterprise companies rose by 16% during 2023, highlighting the growing complexity and risk in digital operations. This underscores the urgent need for robust incident management solutions.

Elevate your ITOps skills with BigPanda University

Are you ready to take your IT operations to the next level and unlock the full power of the BigPanda AIOps platform? Our engaging online learning platform empowers professionals like you with top-notch training and certification opportunities. Our carefully designed courses allow you to learn at your own pace and convenience through asynchronous learning. Whether you are a seasoned IT expert or just starting, our courses cater to all skill levels.

PIR in Incident Management: How to Conduct a Successful Review

Incidents are inevitable. No matter how well-prepared your team is, something will eventually go wrong. But what separates high-performing IT teams from the rest is how they handle these incidents after the dust settles. Enter the Post-Incident Review (PIR) in Incident Management—a crucial process that not only helps teams understand what went wrong but also ensures that they’re better prepared next time.

Introducing Statusy - An Open Source Status Page Aggregator

A quick walkthrough of Statusy—an open-source status page aggregator that centralizes service monitoring for your team. Created by Yash Jain at Squadcast, Statusy simplifies tracking with a unified dashboard and flexible notifications. Set up in minutes and keep your team informed! Statusy is fully open source.

What is Enterprise Incident Management? Process and Software

Enterprise Incident Management (EIM) is a game-changer for organizations that want to keep their IT operations running smoothly. Whether it's a minor glitch or a full-blown system outage, managing incidents efficiently is crucial to minimizing downtime and keeping your business on track. But what exactly is Enterprise Incident Management, and why should you care?

Getting Started with Ruby on Rails in 2024 - The Complete Development Environment Guide

Ruby on Rails is a web development framework written in Ruby that helps developers build websites and applications quickly. It uses an MVC (Model-View-Controller) structure to organize code and make everyday tasks easier by following simple patterns instead of complex configurations. Rails also helps with database management and includes security features to protect against common threats. It's famous for building websites and apps, especially for startups, and powers well-known platforms like GitHub and Shopify.