November 2023

How to Route Alerts to Subject Matter Experts Using Squadcast Tagging & Routing Rules?

Effective Incident Management is crucial for ensuring customer satisfaction and brand loyalty. As systems grow more complex, efficiently directing alerts to the right teams becomes essential. This article delves into the challenges, implementation, and benefits of automating incident categorization.

How to improve your IT alert management: Understanding best practices

As an IT leader, you’re under significant pressure to control the constant alerts. Somehow, you must manage non-stop IT alerts while also ensuring ultra-high service availability. The task is far from easy, and even the most sophisticated teams struggle to keep up and turn alerts into action with tech stacks that are constantly growing in size and complexity. IT alert management is the first line of defense.

Your guide to better incident status pages

Your status page (or lack thereof) has the opportunity to signal a lot about your brand — how transparent you are, how quickly you respond to incidents, how you communicate with your customers — and ultimately, this all seriously impacts your reliability. After all, as our CEO Robert put it in a recent interview on the SRE Path podcast, you don’t get to decide your reliability; your customers do.

What is Incident Management? Unpacking the Complexity

In the increasingly digital world, tech-savvy professionals strive to maintain reliable and efficient operations that ensure customer satisfaction and uphold trust. Incident Management is an essential component in achieving those goals. This article delves into the complexities of Incident Management, highlighting essential tools and processes that contribute to effective response and resolution strategies.

Announcing the StatusCast Mobile App: A Game-Changer for Status Page Users

We are thrilled to introduce the latest innovation from StatusCast: our groundbreaking mobile status page application, which will be available on both Android and iOS platforms. This launch marks a significant milestone in the evolution of status page accessibility, offering unparalleled convenience and functionality to your power users, the subscribers.

Everbridge Webinar: Increased Terrorism Risks during the Holiday Season

Watch as Director of CEM Product Marketing Sean McDevitt and EMEA Risk Intelligence Regional Analyst James Burr discuss the increased risks at large public gatherings throughout Europe during the holiday season. They will also showcase key functionality Everbridge customers may utilize to keep their employees safe during the next several weeks.

#5 Rundeck by PagerDuty Community Meetup: Automate Kubernetes w/ Rundeck (Part 3)

Session III: Automate Kubernetes with Rundeck. Speaker: Justyn Robberts, Sr. Solutions Consultant @ PagerDuty. Get together with the Rundeck by PagerDuty Process Automation crew in this 5th Community Meetup and learn how automation is leading La Sapienza University of Rome and Application Performance's way to innovation and fast tracking business for the future.

Navigating the New SEC Data Breach Rule: A Blameless Blueprint for Compliance

The new SEC rule on material security breaches goes into effect on December 18, 2023 for larger publicly traded companies and all other public companies within 180 days. If you're not already in compliance, it’s important for you to prepare for the new rule now by developing a plan for incident response and disclosure.

Incidents are inevitable, but chaos is optional.

Ever wondered how to navigate through unexpected challenges without succumbing to chaos? Our short video explores the art of managing incidents effectively, showcasing practical strategies to keep chaos at bay. Dive into insightful tips and real-world examples that demonstrate how proactive planning and a resilient mindset can turn potential chaos into an opportunity for growth. Whether you're a business professional, student, or someone eager to enhance their problem-solving skills.

Are you down?

Discover the power of streamlined communication with StatusCast as we delve into how our platform can revolutionize the way you handle incidents and keep everyone on the same page. From status updates to incident resolution, this video is your gateway to seamless collaboration. Dive into real-world scenarios showcasing how Statuscast.com ensures that everyone stays informed, minimizing downtime and maximizing productivity. Learn how to turn potential setbacks into opportunities for growth with our intuitive platform.

What is ServiceNow change management - and how does AIOps optimize it?

Effective IT change management is essential for maintaining smooth operations in today’s fast-paced, agile IT environment. Given that 85%, or the vast majority, of incident-impacting alerts result from changes, optimizing your change management means improving your incident management and ensuring critical system reliability. So whether your organization uses ServiceNow for change management or is considering using ServiceNow, we’ll walk you through everything you need to know.

PagerDuty Named a Leader in GigaOm's Inaugural 2023 Incident Response Platforms Radar Evaluation

In a world where organizations of all industries increasingly rely on digital innovation and experiences to create differentiation in the market, it has never been more critical to ensure the integrity of their operations is safeguarded against unforeseen outages and incidents. Operational disruptions today can have a major impact on brand reputation, create negative revenue implications and impact customer loyalty.

Unlocking Visibility and Control: Introducing Squadcast's Service Graph Feature

To ensure efficient Incident Management, it is crucial to proactively anticipate and address potential disruptions. The need for a comprehensive, high-level view of the status of all services is paramount. Enter Squadcast's Service Graph – a feature designed to transform the way organizations approach Incident Management.

Comparing the Top 9 PagerDuty Alternatives in 2023

PagerDuty is a popular Incident Management platform that helps teams respond to alerts and incidents quickly and efficiently. However, its pricing structure can be complex and expensive for scaling businesses and Incident Response teams. In this blog post, we will compare the top 9 PagerDuty alternatives in 2023, and help you to choose the best one for your needs.

Engineering nits: Building a Storybook for Slack Block Kit

We care a lot about the pace of shipping at incident.io: moving fast is a fundamental part of our company culture, and out-pacing your competition is one of the best ways we know to win. In engineering teams, one way to ship fast is to invest in tools that make your team more productive. We've become good at identifying small pains and frustrations that slow us down over time and – after surfacing them to the rest of the team – find solutions for them.

Build Operational Resilience with Generative AI and Automation

For modern enterprises aiming to innovate faster, gain efficiency, and mitigate the risk of failure, operational resilience has become a key competitive differentiator. But growing complexity, noisy systems, and siloed infrastructure have created fragility in today’s IT operations, making the task of building resilient operations increasingly challenging.

Automate insights-rich incident summaries with generative AI

Does this sound familiar? The incident has just been resolved and management is putting on a lot of pressure. They want to understand what happened and why. Now. They want to make sure customers and internal stakeholders get updated about what happened and how it was resolved. ASAP. But putting together all the needed information about the why, how, when, and who, can take weeks. Still, people are calling and writing. Nonstop.

What is PagerDuty - and how does it work with BigPanda?

PagerDuty is an IT operations management platform and cloud computing company launched in 2009. They provide a suite of tools designed to help IT and DevOps teams detect and respond to infrastructure problems, streamline workflows, and improve operational reliability. The PagerDuty platform bridges different systems and the teams that maintain them, centralizing the detection and reporting of incidents. It allows organizations to minimize downtime and resolve issues efficiently.

Managing Databases on AWS: A Practical Guide

Amazon Web Services (AWS) provides a range of managed database services that provide multiple database technologies to handle various use cases. They are designed to free businesses from tasks like database administration, maintenance, upgrades, and backup. AWS databases come in several types to cater to different business needs.

Top 5 Incident Response Tools to Watch Out for in 2024

Having effective incident response tools is crucial for IT organizations. Your incident response process improves when you are equipped with an appropriate tool that includes intelligent features tailored to your needs. Whether you're just beginning your venture into efficient Incident Management or in search of the finest incident response tools, we present the top five options for your consideration.

Build custom monitoring and remediation tools with the Datadog App Builder

When you’re responding to an issue with your application in the heat of on-call, you need reliable, well-maintained tooling that’s painless to use. Otherwise, the time you’ll spend combing through monitoring data for context, connecting to hosts and other infrastructure resources, and pivoting between consoles for various managed services can add up quickly and slow your response.

Top SRE Tools for Enhanced Site Reliability

Site Reliability Engineering (SRE) stands out as a crucial discipline, ensuring the smooth operation and scalability of intricate software systems. SREs employ a diverse toolkit, automating tasks, monitoring system health, and proactively tackling potential issues. The goal? To elevate site reliability and keep downtime at bay. In this blog, we'll dive deep into the realm of SRE tools, breaking down what each tool brings to the table.

Your incident declaration form is (probably) too long: The power of concise reporting

It’s 10am, your coffee is ready and piping hot, and you have just been paged. Looks like is down, and customers are starting to notice. With no time to lose, you open up your organization’s incident declaration form and you spend the next thirty minutes filling out the fifteen required fields, while the incident grows bigger and more complex, messages are rolling in, and your coffee grows cold.

PagerDuty Copilot | Generative AI for PagerDuty Operations Cloud

Introducing PagerDuty Copilot: Your GenAI assistant for critical operations work. For scaling your teams. For sustaining customer experiences. For moving business forward – faster. Work more efficiently. Protect more revenue. Build greater operational resilience. PagerDuty Copilot is the AI assistant operations teams trust to help them manage business-impacting issues in seconds, not hours. From event to resolution, PagerDuty Copilot’s automations help you resolve issues faster, reduce risk, and control costs.

Improving Customer Support with Squadcast Webforms: A Smart Solution for MSPs

Managed Service Providers (MSPs) handle a multitude of customer support cases, each requiring efficient routing to the right team member. Squadcast's Webforms provide a solution to expedite issue reporting and streamline resolution. In this blog, we will explore how MSPs can leverage webforms to enhance the customer support experience.

Introducing Workflows: Enhancing Automation in Incident Response

At Squadcast, we advocate for the principles of Site Reliability Engineering (SRE), which emphasize the critical importance of automating routine tasks to boost efficiency in Incident Management. We're aiding organizations in implementing these principles with one of our newest features: 'Workflows'. Workflows has been designed to automate manual facets of your Incident lifecycle, all while ensuring human-in-the-loop execution for critical decisions.

Best Practices to Avoid Website Outages on Black Friday

The most frenzied shopping day of the year – Black Friday – is fast approaching, and businesses around the globe are bracing themselves. However, imagine this – a massive number of eager shoppers ready to snag the hottest deal, and just when your website should be working at its best, it crashes, leaving behind frustrated customers and potential revenue slipping through your virtual fingers. This scenario is not entirely fictional.

Resilience Engineering in 2024: Challenges, Trends, & Priorities

Is your organization ready to fortify, expand, and cultivate a robust resilience engineering culture in 2024? In this webinar, Chris Evans (Co-founder & Chief Product Officer, incident.io) and Courtney Nash (Internet Incident Librarian, The VOID) will delve into crucial considerations and top priorities for improving your organization’s ability to build safer and more reliable complex systems while unlocking insights for shaping your plans for 2024 and beyond.

Quick start guide to Unified Analytics dashboards

When it comes to observability, we’ve found that most organizations have ~20 tools installed in their IT environments. With so many tools, it’s difficult for IT leaders to gain insight into how their tools are performing and determine how much value ITOps is bringing to the organization.

Weathering Black Friday and Other Storms Reliably

If you work in eCommerce, you can see the storm on the horizon. Black Friday, the biggest shopping day of the year both online and off, is only a few days away. Your services are going to hit usage spikes you possibly have never seen before. And it will be all aspects of your services pushed to your limit – people won’t just be searching, or just buying, or signing up for programs, they’ll be doing all of these at once. Most crucially, everyone else is offering deals too.

Should data teams consider incident management tools to respond to pipeline issues?

Data teams are adopting more processes and tools that align with software engineering, and from talks at the dbt Coalesce conference in 2023, there’s clearly a big push towards adopting software engineering practices at enterprise scale companies. At the moment, there are a lot of tools in the data space for identifying errors in data pipelines, but no tools for responding to these errors, such as coordinating fixes. This is exactly where an incident management platform makes sense to implement.

Guide To Best Incident Management Software

Avoiding downtime is imperative. Incident Management tools help you withstand unplanned disruptions by ensuring quick response, efficient resolution, and minimal impact on operations. This blog aims to be your go-to guide for navigating the diverse landscape of Incident Management platforms.

Captain's Log: How we are leveraging CEL for Signals

As engineers, we didn't want to make Signals only a replacement for what the existing incumbents do today. We've had our own gripes for years about the information architecture many old companies still force you to implement today. You should be able to send us any signal from any data source and create an alert based on some conditions. We're no strangers to building features that include conditional logic, but we upped the ante when it came to Signals.

IAG Relies on PagerDuty Operations Cloud for Sustainable Growth

Part of the International Airlines Group (IAG), IAG Loyalty operates the loyalty programs for IAG’s airlines—British Airways, Iberia, Vueling and Aer Lingus—and 125+ global brand partners in travel, retail, and financial services. With the PagerDuty Operations Cloud, IAG Loyalty has built a framework that allows engineers to build products and services in a fast and safe way. This has laid the foundation for sustainable growth as a company. Hear more in this video from Colin Lewis, Head of Core Engineering at IAG Loyalty and James Headon, Cloud Operations Manager at IAG Loyalty.

Tip of the Day: Resend Notifications and Set Notification Preferences

Unlock the power of effective communication! Tune in to our latest Tip of the Day video on StatusCast.com, where we delve into 'Resend Notifications' and guide you on optimizing your experience by setting personalized notification preferences. Stay informed, stay empowered!

Status Pages and Incident Management for IT Enterprise

Ready to revolutionize your IT Enterprise? Look no further! Explore the dynamic world of StatusCast.com, where Status Pages and Incident Management come together to redefine how you handle IT disruptions. Why StatusCast.com? StatusCast.com is not just a tool; it's your strategic partner in maintaining the health and performance of your IT systems. Our platform offers a comprehensive solution for creating informative and visually appealing status pages, ensuring your users are always in the loop during incidents.

What is tool consolidation - and how can AIOps optimize it?

Tool consolidation is the process of analyzing which IT observability and monitoring tools to use, which to add, and which to retire. By carefully determining the usage and value of your current observability stack, your ITOps teams can consolidate redundant tools and those providing little value to reduce your operational costs. While the benefits of tool consolidation are clear, doing so is anything but.

Tame observability complexity: Understanding the observability tool landscape

Choosing, deploying, maintaining, and rationalizing observability and monitoring tools can be a constant challenge for ITOps, DevOps, and SRE teams. As teams monitor increasingly complex systems, the need for instrumentation that monitors those systems grows at the same rate, leading directly to a growing problem of observability data engineering, integration, and enrichment.

Strengthen operational resilience with Service Chain Mapping. Watch our 60 second overview.

Watch this short video to learn how Interlink’s Service Chain Mapping solution transforms the ability of banking and finance organizations to address regulatory demands, manage operational risk, and avoid technology failures that could disrupt key customer journeys.

Status Pages and Incident Management for SaaS Companies

Explore the critical importance of status pages and incident management for SaaS companies in our latest video. Learn how effective management enhances customer trust, minimizes downtime, and ensures a resilient and successful SaaS operation. Don't miss out on valuable insights to optimize your service delivery and elevate customer satisfaction!

New Features: AI-assisted postmortems, ilert Terraform updates, and expanded ChatOps capabilities

In incident management, staying ahead of the curve is crucial, and that's what we're doing with our latest suite of features designed to streamline your workflow and enhance your response capabilities. Furthermore, you have provided numerous excellent suggestions during this period. We value your feedback and invite you to reach out to us at support@ilert.com to share your experiences with ilert.

Incident Priority Matrix: A Comprehensive Guide

When multiple users are affected by an incident, it can quickly escalate into a chaotic situation. To effectively manage and prioritize such incidents, organizations need a robust incident priority matrix. An incident priority matrix is a tool organizations use to deal with critical issues quickly. It’s a roadmap for handling incidents efficiently.

What is Vulnerability Management?

Vulnerability management is a critical aspect of a cybersecurity strategy. It refers to the systematic and ongoing process of identifying, classifying, prioritizing, and addressing security vulnerabilities in a network environment. This proactive approach to network security aims to minimize the risk of exploitation by attackers. Vulnerability management is about staying one step ahead of potential threats.

Security - A Pillar of Reliability

When you think about making your service reliable, what standards and benchmarks are most important? The availability of services? Consistently fast responses? Accurate data? Prioritizing critical and common use cases? These are all important and deserve some focus, but today we’ll put the spotlight on an often overlooked pillar: security. Cybersecurity incidents can be the most devastating types of incident for your organization.

Unleash the potential of intelligent, context-aware automation with BigPanda and Ansible

Many ITOps organizations we speak with want a state of self-healing systems capable of identifying and resolving issues without human intervention. Thanks to the progress in AI and ML, AIOps has made significant advancements in areas that automate many of the steps involved with identifying and triaging incidents. We ask ITOps leaders why they aren’t taking the next step with auto-remediating incident response workflows.

Status Pages and Incident Management for Higher Education

Elevate your higher education experience with StatusCast! Watch our exclusive system outage video to discover crucial insights and proactive strategies to ensure uninterrupted operations in the dynamic landscape of academia. Learn from real-life scenarios and gain valuable knowledge on maintaining system reliability, minimizing downtime, and enhancing the overall efficiency of your educational institution. Stay ahead in the digital age of higher education with StatusCast – because your institution's success depends on a robust and resilient IT infrastructure!

Incident communication best practices for an elevated user experience

Downtime is unavoidable, and incidents happen. Organizations need to be rapid and transparent in communicating incidents with their customers. Lack of timely communication can jeopardize the entire incident management process and increase user frustration. This guide provides rich insights into what incident communication is, why it's important, and best practices for effective incident management. What is an incident, and why is incident communication important?

Understanding intelligent alerts in ITOps and alert management best practices

As an ITOps leader, you know managing enterprise IT can be challenging, with its mix of old and new, on-site and cloud-based systems. Closely monitoring each part of the system infrastructure and its many components is a constant struggle, forcing you and your team to juggle non-stop alerts and keep services up and running. How can you stop alert fatigue and gain clarity when alerts are incessant, unclear, and lack the necessary context? The answer lies in intelligent alerts.

Tip of the Day: How to Best Use Incident Templates

Welcome to Statuscast.com's latest video: "How to Best Use Incident Templates," hosted by our very own Director of Customer Experience Engineering! In this power-packed tutorial, Denise Joyal will guide you through the intricacies of optimizing your incident response using Statuscast's cutting-edge Incident Templates feature.

Incident management really can be for everyone

Incident management tools are often built for engineers to solve technical issues. On the surface, thinking of incident management as an engineering problem makes sense, and it’s an approach that’s widely used by many organizations from small startups to large enterprises. When there's a problem like a checkout page failure or a server crash, it’s natural for engineers to spring into action, declaring and resolving these incidents.

From Chaos to Actionable Insights with PagerDuty Integrations and Automation

It’s 2023. In today’s world, every company and individual, regardless of their industry, relies on software to increase productivity. Our users expect our technology to be available and reliable at all times. If your software serves businesses within a single country during regular working hours, they expect it to be available throughout that time. Easy, right?

A tool rationalization head start with BigPanda

Tool rationalization, sometimes called tool consolidation, is the systematic analysis of observability and monitoring tools, the consideration of onboarding new tools to fill gaps, and the retirement of unnecessary tools. Perhaps you and your IT team are struggling with constantly buying new tools to meet a very niche use case to unlock new capabilities.

What is ServiceNow IT Operations Management - and how does it work with AIOps?

Is your company using ServiceNow IT Operations Management or considering using it? If so, you know the importance of enhancing the visibility of your IT infrastructure and services, protecting against service disruptions, and enhancing your company’s operational flexibility. In this blog, we’ll discuss how ServiceNow ITOM works, improves visibility across the entire IT infrastructure, and streamlines operations. We’ll also discuss how ServiceNow ITOM is better together with AIOps.

7 Habits of Successful Generative AI Adopters

Generative AI is forecasted to have a massive impact on the economy. These headlines are driving software teams to rapidly consider how they can incorporate generative AI into their software, or risk falling behind in a sea-change of disruption. But in the froth of a disruptive technology, there’s also high risk of wasted investment and lost customer trust.

OnPage Releases Healthcare-Focused Slack Integration

In the healthcare realm, the need for communication platforms that meet HIPAA standards is undeniable. Enter Slack, a popular collaboration platform armed with robust security features. However, the real game-changer emerges through the integration with OnPage. This isn’t just an upgrade in collaboration; it’s a transformative shift in critical communication within healthcare—a field where every moment counts.

The Unplanned Show E20: LLM Observability w/Charity Majors & James Governor

Large language models (LLMs) are foundational to generative AI capabilities, but present new challenges from an observability perspective. Hear from observability thought leader and CTO/co-founder of Honeycomb, Charity Majors, and developer-focused analyst and co-founder of Redmonk, James Governor, in this discussion about LLM observability as more organizations are building business-critical features on LLMs.

How to Reduce MTTR: A Complete Guide

Organizations striving to improve their operational efficiencies must know how to reduce MTTR as it plays a key role in today’s fiercely competitive business landscape. Customer satisfaction is a top priority for most businesses and late response to their queries or issues can have a negative impact. To track the response and resolution time, businesses measure their MTTR score. MTTR is a key metric that gives insight as to how much time an organization takes to resolve an incident or issue.

How observability and AIOps work better together

If you’re juggling complex, cloud-based, containerized systems and aiming to meet high customer expectations, your old monitoring processes probably don’t cut it anymore. Increasing infrastructure complexity means you need to instrument more, log more, and monitor more. That leads to even more complexity. The answer is better observability, right? Yes and no. Observability and monitoring are critical, but they are only part of what you need for service awareness and availability.

Captain’s Log: A first look at our architecture for Signals

Welcome to the first Signals Captain’s Log! My name is Robert, and I’m a recovering on-call engineer and the CEO of FireHydrant. When we started our journey of building Signals, a viable replacement for PagerDuty, OpsGenie, etc, we decided very early that we would tell everyone what makes Signals unique, and what better way than to tell you how we’re building it (without revealing too much 😉). Let’s jump in.

The New SEC Rules and You

The Securities and Exchange Commission published new rules for SEC registrants around disclosing incident details and response policies. Compliance with these new rules should be top of mind for any company – even if your org hasn’t hit the milestone of registering with the SEC, you should be prepared to be compliant when you take that step.

The Unplanned Show, Episode 19: Cloud Security response with Ashley Ward

As organizations move to the cloud, where is there overlap between security and IT and engineering? In this session, Dormain will sit down with Orca Security's Principal Technical Evangelist, Ashley Ward, to learn about how working practices have to evolve with the speed of change in the cloud.

What you need to know about the Digital Operational Resilience Act (DORA)

The European Commission has introduced the Digital Operational Resilience Act (DORA) to bolster the digital infrastructure of the financial sector within the European Union (EU). As part of the EU's wider digital finance strategy, DORA's objective is to create a comprehensive framework governing digital operational resilience. Financial institutions must ensure full compliance with DORA by January 2025.

Mastering Root Cause Analysis: A Guide for Site Reliability Engineers

Site Reliability Engineers (SREs) play a vital role in ensuring the stability and performance of web services and are key in incident management. One of the core skills SREs need is the ability to conduct effective Root Cause Analysis (RCA) when issues arise. This guide is about how to improve your RCA skills for more effective post-incident analysis. Let's dive in.

What is IT incident management - and how can AIOps optimize it?

Imagine you’re in the middle of a critical project, and suddenly, your system crashes. Or perhaps it’s the middle of the night, and your server goes down, affecting countless users. Some IT incidents are inevitable, but the way you manage them makes all the difference in minimizing their impact. You know that proper incident management is critical – and that incidents can become costly.

How we manage incidents at Datadog

Incidents put systems and organizations to the test. They pose particular challenges at scale: in complex distributed environments overseen by many different teams, managing incidents requires extensive structure and planning. But incidents, by definition, break structures and foil plans. As a result, they demand carefully orchestrated yet highly flexible forms of response. This post will provide a look into how we manage incidents at Datadog. We’ll cover our entire process.

The Journey Into Automation: Optimizing Care Delivery

In a world where efficiency and precision are the cornerstones of progress, automation has become the unsung hero across diverse industries. From manufacturing floors to customer service, its transformative power has reshaped the way we work and deliver services. Today, we embark on a journey to explore the profound influence of automation on healthcare, where each automated process is a progressive step towards optimizing care delivery and reshaping the future of patient-centered care delivery.

xMatters Support - Broadcast Groups

In xMatters, groups determine how and when people are notified using on-call schedules, escalation timelines, and rotations. But what if you don't use complex on-call schedules, or need to notify all members of the group simultaneously? Broadcast groups make it easier for customers who don't always need on-call schedules. Let’s take a look.

Suppressing Alert Noise during Scheduled Maintenance

Alert noise is a common problem for IT teams that monitor and manage complex systems. Excessive unactionable alerts triggered by various sources, such as applications, servers, and network devices, can cause alert fatigue. A high volume of alerts can be overwhelming, reducing the ability to respond to critical alerts. One likely source of alert noise is scheduled maintenance, which is a common practice in the digital realm.

6 Best Practices for Tuning Network Monitoring Alerts

Network monitoring and alerting provide the foundation for efficient IT operations and cyber resilience. By keeping track of the status and performance of network infrastructure and applications, network monitoring tools can automatically generate alerts when defined thresholds are exceeded or specific events occur. These network monitoring alerts allow IT teams to detect outages, performance degradation, and potential security incidents so they can respond swiftly to minimize disruption.

Sponsored Post

Taking down (and restoring) the Raygun ingestion API

In a world where Software as a Service (SaaS) products are integral to daily life, maintaining uninterrupted service for end-users is paramount. However, stuff happens. When it does, our most valuable response (other than restoring service ASAP) is to review the series of events that led up to the incident and learn from them. On August 25th, 2023, at 7:02 AM NZT, Raygun experienced a significant incident that impacted our API ingestion cluster, leading to an outage lasting approximately 1 hour and 15 minutes. While this wasn't fun for anyone involved, this incident did prove to be a valuable learning experience, shedding light on the importance of infrastructure management and resilience.

Status Pages That Deliver: Top 10 Favorites

Status Pages represent an invaluable asset for websites and SaaS businesses, particularly in today's environment with prevalent outages and heightened user expectations for seamless uptime. Integral to any robust website monitoring strategy, these pages serve as centralized hubs, offering users a singular, authoritative source for tracking the status of websites and applications.

Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

This blog post is adapted from my talk at SRECon EMEA 2023 - original slides are available here! Status pages are a simple yet underutilized element of incident communication. Done well, they’re a low-lift way to keep your customers and stakeholders informed when incidents impact them. But without a solid approach, updating status pages can easily become a tedious and often neglected task during incidents. In this post, we’ll cover some tips to get your status page right.

PagerDuty and Jeli Together Will Transform Incident Management

Today is an important day for us at PagerDuty, and for the larger ecosystem of incident management. We’ve signed a definitive agreement to acquire Jeli, a standout player in the incident management space. This deal represents a strategic alignment of visions, technologies and goals that will have a lasting impact on the industry and our customers.

Stop aiming for a 'perfect' monitoring and observability strategy - and start using AIOps

Change is the only constant in today’s continuously shifting IT landscape. Whether you’re adding new observability tools, retiring existing monitoring systems, establishing new business units, or onboarding IT systems from acquisitions, managing these non-stop changes can challenge even your expert ITOps team. Trying to get your monitoring house in order is a daunting task.

Basics of Incident Management

Life is full of unexpected incidents, from the coffee spill that disrupts your morning routine to the sudden traffic jam that transforms a 20-minute commute into an hour-long ordeal. Much like these challenges, most of our systems and infrastructure also constantly face tiny glitches. If ignored, they can have a significant impact. Unlike minor inconveniences, these glitches, which we call Incidents, have the potential to disrupt your business, frustrate customers, and eat into your revenue.

Set Responders Up for Success with New User Onboarding

Effective incident response plays a critical role in maintaining smooth operations at organizations of all sizes. When built up correctly, operational resilience–that ability to bounce back quickly after failure–can act as a shield that guards your customer experience, ensuring that even when incidents inevitably happen, you’re back online in no time.