Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Suppressing Alert Noise during Scheduled Maintenance

Alert noise is a common problem for IT teams that monitor and manage complex systems. Excessive unactionable alerts triggered by various sources, such as applications, servers, network devices, etc., can cause alert fatigue. The higher volume of alerts can be overwhelming, reducing the ability to respond to critical alerts. One event of possible alert noise is during scheduled maintenance, awhich is a common practice in the digital realm.

6 Best Practices for Tuning Network Monitoring Alerts

Network monitoring and alerting provide the foundation for efficient IT operations and cyber resilience. By keeping track of the status and performance of network infrastructure and applications, network monitoring tools can automatically generate alerts when defined thresholds are exceeded or specific events occur. These network monitoring alerts allow IT teams to detect outages, performance degradation, and potential security incidents so they can respond swiftly to minimize disruption.

Sponsored Post

Taking down (and restoring) the Raygun ingestion API

In a world where Software as a Service (SaaS) products are integral to daily life, maintaining uninterrupted service for end-users is paramount. However, stuff happens. When it does, our most valuable response (other than restoring service ASAP) is to review the series of events that led up to the incident and learn from them. On August 25th, 2023, at 7:02 AM NZT, Raygun experienced a significant incident that impacted our API ingestion cluster, leading to an outage lasting approximately 1 hour and 15 minutes. While this wasn't fun for anyone involved, this incident did prove to be a valuable learning experience, shedding light on the importance of infrastructure management and resilience.

Status Pages That Deliver: Top 10 Favorites

Status Pages represent an invaluable asset for websites and SaaS businesses, particularly in today's environment with prevalent outages and heightened user expectations for seamless uptime. Integral to any robust website monitoring strategy, these pages serve as centralized hubs, offering users a singular, authoritative source for tracking the status of websites and applications.

Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

This blog post is adapted from my talk at SRECon EMEA 2023 - original slides are available here! Status pages are a simple yet underutilized element of incident communication. Done well, they’re a low-lift way to keep your customers and stakeholders informed when incidents impact them. But without a solid approach, updating status pages can easily become a tedious and often neglected task during incidents. In this post, we’ll cover some tips to get your status page right.

PagerDuty and Jeli Together Will Transform Incident Management

Today is an important day for us at PagerDuty, and for the larger ecosystem of incident management. We’ve signed a definitive agreement to acquire Jeli, a standout player in the incident management space. This deal represents a strategic alignment of visions, technologies and goals that will have a lasting impact on the industry and our customers.

Stop aiming for a 'perfect' monitoring and observability strategy - and start using AIOps

Change is the only constant in today’s continuously shifting IT landscape. Whether you’re adding new observability tools, retiring existing monitoring systems, establishing new business units, or onboarding IT systems from acquisitions, managing these non-stop changes can challenge even your expert ITOps team. Trying to get your monitoring house in order is a daunting task.

Basics of Incident Management

Life is full of unexpected incidents. From the coffee spill that disrupts your morning routine to the sudden traffic jam that transforms a 20-minute commute into an hour-long ordeal. Much like these challenges, most of our systems and infrastructure also constantly face these tiny glitches. If ignored, they can have a significant impact. Unlike minor inconveniences, these glitches we call Incidents have the potential to disrupt your business, frustrate customers, and eat into your revenue.

Set Responders Up for Success with New User Onboarding

Effective incident response plays a critical role in maintaining smooth operations at organizations of all sizes. When built up correctly, operational resilience–that ability to bounce back quickly after failure–can act as a shield that guards your customer experience, ensuring that even when incidents inevitably happen, you’re back online in no time.