October 2023

xMatters Support - Dynamic Groups

Oct 31, 2023 By xMatters In xMatters

Dynamic groups are teams of users based on selected criteria. A dynamic group's members change depending on who matches the selected criteria at the time of an alert. For example, you can create a dynamic group that includes all users who have specific training (such as first aid or fire safety) in a particular physical location within your organization. You could base this on a custom user property that indicates the level of training each user has. As each user gains a certification, the group is updated to reflect that change.

View Video

xMatters

Incident Management

The Unplanned Show, Episode 18: Resilient architectures with Matt Stine

Oct 31, 2023 By PagerDuty In PagerDuty

We'll catch up with Matt Stine, the author of "Migrating to Cloud-Native Architectures" (O'Reilly), about his current thinking on resilient architectures.

View Video

PagerDuty

PagerDuty Operations Cloud Fall Launch 2023

Oct 30, 2023 By Inga Weizman In PagerDuty

Across the business landscape, 2023 has been called the “year of efficiency.” Organizations have had to deliver more growth and innovation, but with tighter budgets and headcount than in prior years. CIOs have needed to build strategies to mitigate the risk of operational failure and protect their brand’s customer experience.

Read Post

PagerDuty

Interlink's Service Chain Mapping solution: Helping Banking & Finance Organizations Strengthen Operational Resilience and Meet Regulatory Requirements

Oct 30, 2023 By David Arrowsmith In Interlink

Operational resilience is an increasing area of focus and scrutiny for regulators of the banking and financial services industry. In the European Union, the Digital Operational Resilience Act (DORA) looms on the near horizon - with equivalent regulatory frameworks slowly but surely rolling out across the globe.

Read Post

Interlink

Introducing Squadcast's Global Event Rulesets | Incident Management | Squadcast

Oct 30, 2023 By Squadcast In Squadcast

This video gives you a walkthrough of Squadcast's new feature, Global Event Rulesets, which helps you simplify alert routing and boost efficiency. Global Event Rulesets enable you to manage alert routing across services and automate actions based on predefined rulesets.

View Video

Squadcast

StatusCast vs Status.io: Status Page Comparison

Oct 30, 2023 By StatusCast In StatusCast

In the modern-day IT landscape, service reliability is of the utmost importance. Status pages serve as crucial interfaces, communicating any interruptions or issues to stakeholders. While several options are available, two notable status page providers stand out: StatusCast and Status.io. Here we dive into the various aspects of status pages and incident management for each service.

Read Post

StatusCast

Create a dedicated Microsoft Teams channel for an existing alert

Oct 30, 2023 By iLert In iLert

With the ilert Microsoft Teams integration, you can create a separate MS Teams channel for a specific alert, allowing quick collaboration. You can bring together your team members in a shared chat to discuss the issue, share findings, and coordinate your response. This feature is also helpful for reviewing incidents and creating postmortems.

View Video

iLert

New Features In Team Onboarding

Oct 28, 2023 By PagerDuty In PagerDuty

Get an inside look at two features designed to ensure that the people on your response teams are set up correctly on PagerDuty. Senior Product Manager Alex Quintana joins us to share how your people will successfully onboard onto the PagerDuty platform, and then we’ll look at a new report that shows you all of your users on the platform.

View Video

PagerDuty

Incident Management

Tips To Never Miss An Incident Notification With Squadcast Escalations Policies

Oct 27, 2023 By Chitra Bisht In Squadcast

Companies implement an Incident Response process to promptly resolve critical issues. Setting up escalation policies to notify engineers is a key step in this process. With traditional escalation policies, alert notifications still get missed, which results in longer response times and failure to meet SLAs. So, how can one ensure incident notifications are never missed?

Read Post

Squadcast

Opsgenie Alternatives: Finding the Right Fit for your Incident Management Teams

Oct 27, 2023 By Chitra Bisht In Squadcast

In the dynamic landscape of modern IT operations and Incident Management, choosing the right tool is paramount to ensuring the resilience of your organization. Opsgenie, a popular Incident Response and Alerting platform, has been a go-to choice for many. However, as businesses grow and requirements evolve, exploring Opsgenie alternatives becomes essential in the quest to find the perfect fit for your unique operational needs. In this blog, we'll embark on a journey to uncover and evaluate some compelling alternatives to Opsgenie, helping you navigate the vast sea of options and make an informed decision that aligns perfectly with your team's workflows and objectives.

Read Post

Squadcast

Fresh from FireHydrant October 2023: Updates to status pages, views, and analytics

Oct 27, 2023 By Joel Smith In FireHydrant

October might be a spooky month, but we’re doing our best to make incidents less scary. We released a number of updates this month that focus on two main areas. Let’s jump in.

Read Post

FireHydrant

What Should Your System Outage Notifications Say?

Oct 27, 2023 By OnPage Corporation In OnPage

System outages: they are an inevitable problem that every single IT team will encounter at some point. Whether they come about due to technical issues, act-of-god natural disasters, or simply random human error, system outages happen to the best of us. Though the cause of system outages is not always in your control, you can control your team’s processes for response and resolution.

Read Post

OnPage

Webinar: Streamlining Incident Management With Automation and Contextual Awareness

Oct 27, 2023 By Squadcast In Squadcast

In the modern context of distributed teams and complex digital infrastructure, major incidents whose impact spans multiple teams and services can cause a barrage of alerts. While a meticulously designed incident response strategy can help restore order, responders also need effective tools that offer contextual understanding and help them identify actionable alerts.

View Video

Squadcast

MSP's As NOC's, Handling Multiple Clients

Oct 26, 2023 By Chitra Bisht In Squadcast

A Managed Service Provider (MSP) should invest in an Incident Management platform to ensure seamless service delivery and customer satisfaction. Such a platform streamlines Incident Response, improves service reliability, and enhances communication among teams. It helps MSPs reduce Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR), thereby minimizing downtime and service disruptions.

Read Post

Squadcast

Understanding the ServiceNow CMDB - and how AIOps modernizes it

Oct 26, 2023 By Adam Blau In BigPanda

Navigating the complex world of ServiceNow’s Configuration Management Database (CMDB) can feel overwhelming. You might find yourself grappling with understanding the foundational aspects of the CMDB, or maybe you’re seeking effective ways to utilize and integrate it seamlessly into your IT processes. You want to extract the maximum value from your ServiceNow CMDB but need help figuring out how to start.

Read Post

BigPanda

Build Sophisticated Apps for Your PagerDuty Environment Using OAuth 2.0 and API Scopes

Oct 26, 2023 By Mandi Walls In PagerDuty

Many PagerDuty customers create their own apps to help them manage their PagerDuty environments. Teams might have any number of workflows that might benefit from a custom application. A PagerDuty admin might want to be able to load CSV files with new users and their contact information into PagerDuty when new teams join the platform, or load new services before they are released to production.

Read Post

PagerDuty

Elevating Incident Management: Leveraging automation and AI to put reliability on autopilot

Oct 26, 2023 By Blameless In Blameless

If your company operates in a modern digital environment, then there’s a good chance questionable reliability is hurting you competitively. On the other hand, every hour your engineering team spends on operations comes at the expense of developing your product. So, what are you supposed to do?

View Video

Blameless

RapidSpike + Squadcast: Routing Alerts Made Easy

Oct 25, 2023 By Vishal Padghan In Squadcast

RapidSpike is a website monitoring solution that focuses on all three key aspects of website health: performance, reliability and security in a single dashboard. If you use RapidSpike for your website monitoring requirements, you can integrate it with Squadcast, an end-to-end Incident Response tool, to route alerts from RapidSpike to the right users in Squadcast with ease.

Read Post

Squadcast

What is a Pull Request and Why You Need Them

Oct 25, 2023 By Anjali Udasi In Zenduty

As an engineer, you're probably familiar with version control systems like Git that let you track changes to your codebase. But are you using one of the most useful features of Git: pull requests? If not, you're missing out. Pull requests are one of the best ways to collaborate on projects and create better code. In this article, we'll go over what a pull request is, why you should be using them, and how to create your own.

Read Post

Zenduty

What is a Status Page and Why Do You Need One?

Oct 24, 2023 By OpsMatters In OpsMatters

If you run a service-providing business, then you probably know how much of a struggle it is to inform everyone when an issue occurs. You must notify the end users, of course, but also the responsible parties who will work on resolving the incident and restoring the services.

Read Post

OpsMatters

The definitive guide to event correlation in AIOps: Processes, tools, examples, and checklist

Oct 24, 2023 By Scott Stradley In BigPanda

Are you tired of sifting through a sea of IT events and alerts? Or perhaps you’ve found yourself overwhelmed by the volume of data flooding your monitoring systems and challenged to identify the incident root cause. There’s a better way to manage the chaos: using AIOps to unite disparate tools, data, and teams for event correlation.

Read Post

BigPanda

The Unplanned Show, Episode 17: PagerDuty's Backstage Plug-in

Oct 24, 2023 By PagerDuty In PagerDuty

Did you know there's a PagerDuty plug-in for Backstage? Learn more about what Backstage is, the PagerDuty plug-in, and what's new from Head of Product for Backstage at Spotify, Meg Watson, and PagerDuty Developer Advocate, Tiago Barbosa.

View Video

PagerDuty

Incident Management

PagerDuty for Customer Service Operations

Oct 24, 2023 By PagerDuty In PagerDuty

Provide relevant context to solve customer problems. Customer service representatives need relevant historical context in order to accurately and quickly resolve the issue at hand. Reduce the impact on your customers by layering monitoring data from technical resources across your organization with data from customer calls and other systems of record—so you have a holistic view of an issue and can identify the right solution quickly.

View Video

PagerDuty

Customers Choose PagerDuty for Real-Time Operations

Oct 24, 2023 By PagerDuty In PagerDuty

Organizations need a solution that’s designed for today’s dynamic digital reality. Hear customers like Carrefour Bank, IG, The Trevor Project, Vodafone, and Zoom explain how PagerDuty empowers them in an always-on, real-time world.

View Video

PagerDuty

Navigating the SRE Landscape | Better Incidents Podcast Ep. 9

Oct 24, 2023 By FireHydrant In FireHydrant

View Video

FireHydrant

Why Invest in Tooling? Benefits and Concerns

Oct 24, 2023 By Emily Arnott In Blameless

When looking to invest money in your engineering teams, what gives the best return? Hiring more staff to enable bigger projects and more diversified skill sets? Training engineers to uplevel their ability and productivity? Increasing salaries to retain the best talent? These are all great ideas that should be exercised often. But there’s one other investment worth considering that can offer huge benefits for relatively small amounts of money: tooling.

Read Post

Blameless

AIOps use cases: Technical, operational, and business examples

Oct 23, 2023 By Scott Stradley In BigPanda

ITOps is at a crossroads: Teams struggle to manage a high volume of alerts and coordinate between different tools and teams. Teams also must balance cloud technologies’ agility and on-premise solutions’ stability. The sheer speed of today’s IT demands both flexibility and visibility in development and harmonized tech stacks.

Read Post

BigPanda

Getting started on alerts with Escalation Policies

Oct 23, 2023 By Kaushik Thirthappa In Spike

Escalation policies are essential for making sure that incidents are quickly addressed and resolved. They provide a systematic approach to automate alerts, guaranteeing that no incident goes unnoticed. Let’s get you started, shall we? An escalation policy is a way to automate alerts and ensure that incidents are never missed. The first point of contact for an incident is an alert that is sent according to the escalation policy.

Read Post

Spike

Ready for DORA? Strengthen Operational Resilience with Service Chain Mapping by Interlink Software.

Oct 23, 2023 By Interlink In Interlink

Deliver a data-driven audit and compliance solution: Visualize, monitor and triage customer journeys, in near real-time - as they flow across interconnected systems and technologies.

View Video

Interlink

12 Best Practices to Improve Incident Management

Oct 23, 2023 By Guest Author In Netreo

Today’s fast-paced digital world can lead to system breakdowns and disruptions that strain organizational resources. What truly distinguishes successful organizations is their response when problems occur. Incident management serves this function. At its core, incident management involves teams managing unexpected disruptions quickly with minimal impact on users or business operations. The process is like a safety net that prevents further problems from developing into trust issues.

Read Post

Netreo

The price of building your own incident management tool is not what it seems.

Oct 23, 2023 By Asiya Gorelik In Incident.io

Build or buy? An age-old decision that gets made dozens of times a year. It’s quite possibly one of the most important decisions you make as a company. It impacts roadmaps, productivity, team structure, and customer satisfaction (you know, just a few little things). There are a lot of factors to consider, one of the most prominent being cost. So, what exactly are the costs you need to consider when building your own incident management solution?

Read Post

Incident.io

Create an alert template

Oct 23, 2023 By iLert In iLert

Create your own template for the alert summary and details using the integration's preset fields.

View Video

iLert

Coconala Uses PagerDuty to Reduce MTTA

Oct 20, 2023 By PagerDuty In PagerDuty

Coconala is a Tokyo-based online learning platform for easily buying and selling knowledge, skills, and experience. The marketplace relies on PagerDuty for incident management. Check out how PagerDuty helped them with rapid detection of incidents, reducing their MTTA from 10 minutes to under 1 minute.

View Video

PagerDuty

Zenduty - The What, Why and Where We're Going

Oct 20, 2023 By Zenduty In Zenduty

Zenduty is an end-to-end incident management platform that gives you greater control and automation over the incident management lifecycle.

View Video

Zenduty

Building a culture of Incident response

Oct 20, 2023 By Kaushik Thirthappa In Spike

Building a culture of incident response is not just about solving problems; it is about creating stronger teams, empowering individuals, and fostering a more resilient and thriving workplace. How do you achieve this culture and improve your incident management processes? Let’s dive in.

Read Post

Spike

How does SIGNL4 provide for truly reliable alerting?

Oct 20, 2023 By Ronald In SIGNL4

Of course, one expects an alerting solution to be reliable. This is important because a missed alert can have a significant impact on the business. It is about IT uptime, disruptions in production or other critical system conditions. Business processes, production workflows and therefore money, the reputation of the company or even the health of the employees are at stake. But what does reliable alerting actually mean and how is it achieved?

Read Post

SIGNL4

AWS Orchestration with Systems Manager & Runbook Automation

Oct 20, 2023 By Jake Cohen In PagerDuty

It is now the de facto standard for companies to operate across numerous regions and cloud accounts. The reasons for this vary, and depending on where you sit in the organization, these reasons may be more or less apparent to you.

Read Post

PagerDuty

Blue Matador + Squadcast: Alert Routing Simplified

Oct 19, 2023 By Vishal Padghan In Squadcast

Blue Matador is the fastest, easiest way to set up AWS infrastructure monitoring, allowing small teams to fully monitor their cloud operations with no manual setup. If you use Blue Matador for your cloud monitoring requirements, you can integrate it with Squadcast, an end-to-end Incident Response tool, to route alerts from Blue Matador to the right users in Squadcast with ease.

Read Post

Squadcast

NOC Success Like Never Before: Automation Strategies for All-new Incident Management

Oct 19, 2023 By Holly Pennebaker In Resolve

Network Operations might never be the same. But then again, why would anyone want it to be? The power of automation and orchestration can bring incredible value to the Network Operations Center (NOC), including the business-critical call to get proactive and ahead of the incident response and management game. It’s more than a towering volume of events – it’s the complexities involved, too.

Read Post

Resolve

Alerting with ilert and Pandora FMS

Oct 19, 2023 By Sancho Lerena In iLert

This post introduces the Pandora FMS monitoring solution and how to integrate it with ilert to establish reliable alerting. The guest post is written by Sancho Lerena, the CEO of Pandora FMS.

Read Post

iLert

Global AWS Orchestration with Runbook Automation

Oct 19, 2023 By PagerDuty In PagerDuty

It is common for companies to have multiple AWS accounts, and as it turns out, there are cases where certain operational tasks need to be performed on EC2 instances that reside in each account. Examples of this include standardizing practices for auditing, patching, and incident response, such as retrieving diagnostics or performing remediation. This demo showcases how Runbook Automation orchestrates commands and scripts on EC2 instances spanning numerous AWS accounts through an integration with Systems Manager (SSM).

View Video

PagerDuty

Squadcast Unveils Enhanced Status Pages

Oct 19, 2023 By Squadcast In Squadcast

Big News! Squadcast's Enhanced Status Page(s) are LIVE!

View Video

Squadcast

Status Page Demo: Build your OneUptime Status Page in under 10 minutes.

Oct 19, 2023 By OneUptime In OneUptime

Welcome to our step-by-step demo on building your own OneUptime Status Page in under 10 minutes. This video is designed to guide you through the process of setting up a fully functional status page. In this tutorial, we’ll walk you through the entire process, from signing up for a OneUptime account to customizing your status page to suit your brand’s identity. We’ll show you how to add services, incidents, and maintenance events, and how to manage notifications to keep your users informed about the status of your services.

View Video

OneUptime

Dell Technologies acquires Moogsoft

Oct 19, 2023 By Phil Tee In Moogsoft

Dear Current and Future Moogsoft Customers, I am happy to announce that Dell Technologies acquired Moogsoft on September 17, 2023. This is good news for existing and future Moogsoft and Dell customers. Earlier this year, Moogsoft embarked upon raising capital to accelerate growth.

Read Post

Moogsoft

4 Ways to Reduce Your Mean Time to Resolution

Oct 19, 2023 By Daria Yankevich In Auvik

Dealing with a high MTTR in your network? Auvik Network Management is a comprehensive network monitoring and troubleshooting solution. With over 50 pre-configured alerts, it keeps you informed about critical network events. Users have the flexibility to customize these alerts and control notification frequency so that they have all the essential context to be able to fix issues.

Read Post

Auvik

Panel Discussion: Modern Monitoring and Observability

Oct 18, 2023 By PagerDuty In PagerDuty

Struggling with effective monitoring for your services? Not sure how to handle the volume of information your environment creates? Join us for a panel discussion about Monitoring and Observability, featuring Jason Hand of Datadog, Ernest Mueller of Accenture, Steve McGhee of Google, and Peco Karayanev of PagerDuty. Hosted by PagerDuty DevOps Advocate Mandi Walls.

View Video

PagerDuty

Incident Management

Terraform Time - Leveraging PagerDuty Service Standards for better Terraform configuration

Oct 18, 2023 By PagerDuty In PagerDuty

We'll be exploring how to leverage Service Standards to follow best practices on PagerDuty Technical Services configuration.

View Video

PagerDuty

Introducing Past Incident Feature | Incident Context and History | Squadcast

Oct 18, 2023 By Squadcast In Squadcast

Introducing Squadcast's Past Incidents feature, which helps incident responders by presenting them with past incidents related to the same service. It employs data science techniques to match and display a historical list of similar incidents from the service you are currently investigating. This helps expedite issue resolution by offering valuable insights, such as historical context, prior incident details, timing patterns, and past solutions.

View Video

Squadcast

Internet Sonar: A Game-Changer for Incident Detection

Oct 18, 2023 By Mark Towler In Catchpoint

When outages cost you tens of thousands of dollars each minute, pinpointing the source of disruptions as quickly as possible becomes mission-critical. This is not a time for finger-pointing and hastily assembled war rooms searching for that needle in the haystack. You need simple, intelligent, trustworthy Internet health information to expedite your incident detection.

Read Post

Catchpoint

Speed, Scale, and Special Sauce: The Evolution of the PagerDuty Brand

Oct 18, 2023 By Jesse Purewal In PagerDuty

At PagerDuty, our purpose is to empower teams with the time and efficiency to build the future. That means that our own teams are constantly building and relentlessly innovating to help organizations drive transformative change in the way they operate.

Read Post

PagerDuty

Everything you need to know about IT Operations Analytics

Oct 18, 2023 By Jason Walker In BigPanda

Data is both a challenge and an asset for IT professionals, who rely on IT Operations Analytics (ITOA) to guide them towards operational excellence, system reliability, and swift incident resolution. So whether you’re seeking clarity on understanding what ITOA is and its connection to related technologies, are contemplating how to use it within your organization, or are curious about its enhanced efficiency and cost savings benefits, we’ve got you covered.

Read Post

BigPanda

Behold a brand New Incident Dashboard!

Oct 18, 2023 By Menahi Shayan In Zenduty

The incidents page, the most visited page on Zenduty, has an all-new look and feel! It's been completely redesigned from the ground up to be faster, easier to use, and more visually appealing. The Incidents list now dedicates more space for important information, such as the title, date, priority, and more. The UI is also more polished, shaving off whitespace where unnecessary. The avatars have been redesigned with more pastel shades, resulting in an overall design far more soothing to the eye.

Read Post

Zenduty

AI-Generated Status Updates

Oct 18, 2023 By PagerDuty In PagerDuty

PagerDuty’s AI-Generated Status Updates makes it easy to keep stakeholders in the loop during an incident. Learn how you can create your status update in seconds, leveraging the power of AI.

View Video

PagerDuty

AI-Generated Incident Postmortems

Oct 18, 2023 By PagerDuty In PagerDuty

PagerDuty’s AI-Generated Incident Postmortems help teams document and implement learnings from major incidents, faster. Watch this demo to learn how.

View Video

PagerDuty

Process Automation 101

Oct 18, 2023 By PagerDuty In PagerDuty

PagerDuty Process Automation is a self-hosted deployment of PagerDuty Runbook Automation that provides maximum flexibility for security configurations and custom integrations.

View Video

PagerDuty

Do you need better cloud observability - or AI-powered cloud visibility?

Oct 17, 2023 By BigPanda In BigPanda

Maybe you’re still using monolithic applications, built and refined over many years. You understand that shifting to microservices or containerized architectures is a huge and daunting task. You’re probably grappling with the limitations of legacy systems—maybe they’re slow, tough to update, or can’t scale as you’d like. And you’re likely using more traditional IT monitoring tools or even some cloud observability tools.

Read Post

BigPanda

Kubernetes Incident Management: A Practical Guide

Oct 17, 2023 By OnPage Corporation In OnPage

As more organizations embrace containerized applications, Kubernetes has emerged as the leading platform for orchestrating these containers. However, its complexity, combined with the inevitable reality of IT incidents, demands a well-defined strategy for managing disruptions. This article introduces Kubernetes incident management, describes common Kubernetes errors, and provides practical guidance to efficiently handle incidents.

Read Post

OnPage

Creating Effective Warnings For All Conference

Oct 17, 2023 By Everbridge In Everbridge

View Video

Everbridge

AI-Generated Runbooks

Oct 17, 2023 By PagerDuty In PagerDuty

AI-generated Runbooks lower the barrier to entry for new automation developers and speed up the creation of new automation for experienced automation authors. This feature works seamlessly with the user’s preferred scripting language, offering a low-code solution for what used to be a high-code task. Watch how Runbook Automation users can describe the task they wish to automate in plain English and let AI build an automation template for that task.

View Video

PagerDuty

Avoiding a Major Incident with PagerDuty AIOps

Oct 17, 2023 By PagerDuty In PagerDuty

A global retailer has a major incident occurring and the team doesn’t know it yet. Before PagerDuty AIOps, the NOC would get hit by alert storms and page multiple teams. This resulted in large conference calls and customer downtime. Now, a major incident right before Black Friday has been averted with PagerDuty AIOps. The result is better overall customer experience, no matter how stressed the system is.

View Video

PagerDuty

Learning Flows: Bringing consistency to your post incident processes

Oct 16, 2023 By Luis Gonzalez In Incident.io

To get the most out of your incident response processes, consistency is crucial. The more predictable you can be whenever issues crop up, whether a small bug or a major outage, the quicker and more confidently you can respond. In practice, incident response is equal parts knowing how to actually resolve the issue and having the confidence that the processes in place will help get you through without added stress.

Read Post

Incident.io

Read more about Learning Flows: Bringing consistency to your post incident processes

What is Prometheus Alertmanager?

Oct 16, 2023 By Anjali Udasi In Zenduty

Prometheus Alertmanager is a powerful tool designed to handle various alerts generated by Prometheus. It plays a vital role in the overall monitoring ecosystem, acting as a centralized hub for managing alert notifications. With Prometheus Alertmanager and its robust notification management capabilities, you can efficiently define alert routing and notification policies. This empowers you to take timely actions and mitigate potential issues before they impact your service availability.

Read Post

Zenduty

Read more about What is Prometheus Alertmanager?

Create a dedicated Slack channel for an existing alert

Oct 16, 2023 By iLert In iLert

The ilert Slack integration lets you create a dedicated Slack channel for an existing alert so that your team can collaborate immediately. Assemble your team members in a shared chat space to discuss the problem, exchange findings, and coordinate your response.

View Video

iLert

Read more about Create a dedicated Slack channel for an existing alert

After Hours Alerting for ConnectWise: Using SIGNL4 to Route CW Tickets to On-Call Engineers

Oct 13, 2023 By emily In SIGNL4

As a business owner or manager, you understand the importance of efficient operations and effective communication, particularly after hours. You want to equip your on-call engineers with all the information they need to resolve a ticket when not at their desk. If you are using ConnectWise to manage your service tickets, here is a great addition to help with your after-hours alerting.

Read Post

SIGNL4

Read more about After Hours Alerting for ConnectWise: Using SIGNL4 to Route CW Tickets to On-Call Engineers

Blameless Unveils New Terraform Provider to Elevate Workflow Management at Scale

Oct 12, 2023 By Blameless In Blameless

Leading Incident Management Solution Enhances Control, Automation, And Security Workflow With Terraform's Lightning-Fast Resource.

Read Post

Blameless

Read more about Blameless Unveils New Terraform Provider to Elevate Workflow Management at Scale

G2 Fall Report Positions Squadcast among the leading Incident Management, and IT Alerting Tools

Oct 12, 2023 By Sanjog Sandhu In Squadcast

Squadcast established itself as a Momentum Leader and High Performer across different regions in the Incident Management and IT Alerting tool categories. We have solidified our leadership in the Mid Market segment across various regions. This recognition stems from our dedicated customer base.

Read Post

Squadcast

Read more about G2 Fall Report Positions Squadcast among the leading Incident Management, and IT Alerting Tools

What are AIOps platforms?

Oct 12, 2023 By BigPanda In BigPanda

IT operations teams are challenged to keep pace with the rapid speed of digital transformation. As companies use more cloud-based apps, increase agile deployments, and develop new microservices-based applications, they add layers and complexity to their technology stacks, making life increasingly challenging for ITOps performance.

Read Post

BigPanda

Read more about What are AIOps platforms?

A Detailed Guide to Setting Up Effective On-Call Rotations

Oct 11, 2023 By Chitra Bisht In Squadcast

On-Call Schedules are predefined rotations or shifts that assign team members to be available for incident response at specific times. They are essential for ensuring round-the-clock support, swift incident resolution, and continuous service availability. Proper schedules serve as the backbone of reliable Incident Response and ensure your team is well prepared to address technical challenges effectively.

Read Post

Squadcast

Read more about A Detailed Guide to Setting Up Effective On-Call Rotations

The Debrief: Build vs buy

Oct 11, 2023 By Incident.io In Incident.io

Almost every organization will eventually face an important crossroads: should I build the tooling I need, or buy it? More often than not, the decision to buy is the most sensible one and will save you the most time, effort, and even money. But there are some edge cases where building can be the right choice. In this chat with Isaac, product engineer at incident.io, we dive into this nuanced debate and explain why buying is your best bet...most of the time.

View Video

Incident.io

Incident Management

Read more about The Debrief: Build vs buy

After Hours Alerting for ConnectWise

Oct 11, 2023 By SIGNL4 In SIGNL4

A short demo video on how to add After Hours Alerting with SIGNL4 to your ConnectWise PSA. We show you the complete workflow and what to keep in mind for seamless connectivity and targeted mobile alerting including duty scheduling for your teams.

View Video

SIGNL4

Read more about After Hours Alerting for ConnectWise

SLA vs. SLO vs. SLI: What's the Difference?

Oct 11, 2023 By Laura Clayton In Uptime Robot

When it comes to managing services effectively, terms like SLA, SLO, and SLI are often thrown around like confetti at a parade. They’re in meetings, in documents, and even in casual office conversations. But if you’re new to the field or simply haven’t had the chance to dig into these acronyms, they can feel like a bewildering alphabet soup. And they can’t be missing from an uptime monitoring blog such as ours! So, what do these terms really mean?

Read Post

Uptime Robot

Read more about SLA vs. SLO vs. SLI: What's the Difference?

A guide to post-mortem meetings and how we run them at incident.io

Oct 11, 2023 By Luis Gonzalez In Incident.io

You've just made it through a particularly tough incident. It was a short outage affecting a subset of customers, so not exactly the end of the world, but bad enough that it involved multiple people across a number of teams to resolve. Either way, the incident was well managed, and the dust has settled. Now what? Most guidance would say that putting together a post-mortem document is a good idea, given the severity of the incident. You've also done this, so what's next?

Read Post

Incident.io

Read more about A guide to post-mortem meetings and how we run them at incident.io

Introduction to ilert AI

Oct 10, 2023 By iLert In iLert

During the intensity of incident response, it is crucial to maintain concentration on resolving the problem promptly. At times, crafting a thorough and precise incident communication can be difficult, particularly when under pressure. This is where ilert's AI-powered incident communication feature becomes valuable.

View Video

iLert

Read more about Introduction to ilert AI

Three Ways to Better Appreciate your SREs and DevOps Engineers

Oct 10, 2023 By Emily Arnott In Blameless

DevOps engineers and Site Reliability Engineers are vitally important to the continued health of your product and business. We all know it’s true, and yet people in these roles often feel underappreciated and undervalued. This sort of work runs into the issue of “when process and infrastructure break, it gets shoved in the spotlight; but when everything works perfectly, no one notices.”

Read Post

Blameless

Read more about Three Ways to Better Appreciate your SREs and DevOps Engineers

The Unplanned Show, Episode 16: Resiliency with Sam Newman

Oct 10, 2023 By PagerDuty In PagerDuty

When the author of Building Microservices (O'Reilly) tweets asking for a "plurality of views" on resiliency, I, for one, am intrigued. In this episode, we'll hear from Sam Newman about his latest thinking on resiliency.

View Video

PagerDuty

Incident Management

Read more about The Unplanned Show, Episode 16: Resiliency with Sam Newman

How AIOps modernizes CMDBs to drive accuracy and value

Oct 10, 2023 By Blair Sibille In BigPanda

Maintaining your Configuration Management Database’s (CMDB) accuracy, keeping it fully updated, and improving its performance is a frustrating and elusive goal for ITOps and IT leaders. Aiming for this ‘golden’ CMDB standard can feel like running on a treadmill where you’re putting in a lot of work, but remain as distant as ever from your goal. Can IT leaders ever catch up?

Read Post

BigPanda

Read more about How AIOps modernizes CMDBs to drive accuracy and value

Bridging the ITIL vs DevOps Mindset: CI/CD Best Practices for ITIL Organizations

Oct 9, 2023 By Elik Eizenberg In BigPanda

DevOps practices in software development have revolutionized the way updates are released. However, many companies entrenched in ITIL practices find it challenging to seamlessly integrate with the DevOps practice of Continuous Integration and Continuous Delivery/Deployment (CI/CD). This is because ITIL focuses on stability, which suits older systems, while DevOps is ideal for modern setups with its agile, automated practices.

Read Post

BigPanda

Read more about Bridging the ITIL vs DevOps Mindset: CI/CD Best Practices for ITIL Organizations

Revolutionizing your Grafana setup with intelligent alerting

Oct 9, 2023 By emily In SIGNL4

Once upon a time, in the bustling city of DataVille, lived a team of dedicated IT professionals tirelessly working to maintain the city’s digital heartbeat. Their mission was to ensure the smooth operation of their city’s digital infrastructure, which was not limited to the daytime operations but extended beyond business hours. They were the unsung heroes, the guardians of the city’s data. Their tool of choice? Grafana, a powerful open-source platform for observability.

Read Post

SIGNL4

Read more about Revolutionizing your Grafana setup with intelligent alerting

What is HCAHPS: A Comprehensive Overview

Oct 9, 2023 By Halle Katz In OnPage

In the realm of hospitals and healthcare organizations, the term “HCAHPS survey” is a recurrent presence: Hospital Administrator A: “The latest HCAHPS survey results just came out, and patients seem satisfied with…” Hospital Administrator B: “Some of our past patients participated in the HCAHPS survey, but they expressed disappointment with…” You might be left wondering, “What exactly is the HCAHPS survey?” Allow me to elucidate.

Read Post

OnPage

Read more about What is HCAHPS: A Comprehensive Overview

Unified Incident Management: Merits of Combined On-Call and Incident Response | Squadcast

Oct 6, 2023 By Squadcast In Squadcast

In this session, we explore the crucial aspects of effective on-call management and incident response in product organizations. Squadcast combines On-Call and Incident Response into a single platform using automation capabilities for enhanced reliability, continuous learning, and better productivity.

View Video

Squadcast

Read more about Unified Incident Management: Merits of Combined On-Call and Incident Response | Squadcast

Choosing the Right Career Path in Tech: Software Engineering vs. Site Reliability Engineering (SRE)

Oct 6, 2023 By Anjali Udasi In Zenduty

The tech industry is booming, and there are many different career paths. Two of the most popular and in-demand roles are Software Engineering (SWE) and Site Reliability Engineering (SRE). SRE blends elements of software engineering with IT operations, focusing on reliability. Software Engineering, on the other hand, involves designing, developing, testing, and deploying software applications.

Read Post

Zenduty

Read more about Choosing the Right Career Path in Tech: Software Engineering vs. Site Reliability Engineering (SRE)

Alerting, Incident Management and the SDLC | Better Incidents Podcast Ep. 8

Oct 5, 2023 By FireHydrant In FireHydrant

In this episode we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.

View Video

FireHydrant

Read more about Alerting, Incident Management and the SDLC | Better Incidents Podcast Ep. 8

October 2023 Update - New layout, additional cross links, improved event filtering and much more

Oct 5, 2023 By René In SIGNL4

Our October update brings a new layout in the web portal, new additional cross-references from Signl details to linked entities, and improved grouping options for conditions in the distribution rules. As always, all the details are in this blog article.

Read Post

SIGNL4

Read more about October 2023 Update - New layout, additional cross links, improved event filtering and much more

What is Mean Time Between Failures - and why does it matter for service availability

Oct 5, 2023 By Amy Brennen In BigPanda

Mean Time Between Failures (MTBF) measures the average duration between repairable failures of a system or product. MTBF helps us anticipate how likely a system, application, or service is to fail within a specific period or how often a particular type of failure may occur. In short, MTBF is a vital incident metric that indicates product or service availability (i.e., uptime) and reliability.
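As a rough illustration of the metric described above, the sketch below computes MTBF as operating time divided by the number of failures. It then derives availability using the commonly used relation MTBF / (MTBF + MTTR). The function names and all the numbers are hypothetical and are not taken from the linked article.

```python
from datetime import timedelta


def mtbf(total_uptime: timedelta, failure_count: int) -> timedelta:
    """Mean time between failures: operating time divided by failure count."""
    if failure_count <= 0:
        raise ValueError("failure_count must be positive")
    return total_uptime / failure_count


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Steady-state availability, assuming the relation MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


# Hypothetical observation window: 720 hours, 3 failures, 6 hours of total repair time.
observed = timedelta(hours=720)
repair_total = timedelta(hours=6)
failures = 3

service_mtbf = mtbf(observed - repair_total, failures)  # uptime only, repairs excluded
service_mttr = repair_total / failures                  # mean time to repair
service_availability = availability(service_mtbf, service_mttr)

print(f"MTBF: {service_mtbf}, availability: {service_availability:.4f}")
```

Excluding repair time from the numerator is one common convention; some teams divide the whole observation window by the failure count instead, which gives a slightly larger figure.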

Read Post

BigPanda

Read more about What is Mean Time Between Failures - and why does it matter for service availability

Enhance Your Customer Service with PagerDuty for ServiceNow CSM

Oct 5, 2023 By Hadijah Creary In PagerDuty

In today’s fast-paced, digital-first landscape, delivering exceptional customer experience is paramount to business success. For customer service teams, that means maintaining service level agreements (SLAs) and ensuring swift responses to customer issues that can make or break your company’s reputation. Fortunately, PagerDuty has improved the way companies handle customer service teams and has built applications into ServiceNow’s CSM platform.

Read Post

PagerDuty

Read more about Enhance Your Customer Service with PagerDuty for ServiceNow CSM

The Rise of Generative AI

Oct 5, 2023 By Blameless In Blameless

Revolutionizing Business: The Rise of Generative AI - Actionable Strategies to Integrate Advanced AI Seamlessly into Your Engineering Operations.

View Video

Blameless

Read more about The Rise of Generative AI

Global Event Rulesets: Streamlining Alert Routing Across Services

Oct 4, 2023 By Vishal Padghan In Squadcast

In the fast-paced world of organizations handling numerous microservices and projects, tackling the challenges that arise can be a daunting task. As many of our customers come with infrastructures that include a large number of microservices, we set out to make it easier for them to streamline alert source management. Enter Global Event Rulesets (GER). This feature is designed to redefine the way you manage alerts.

Read Post

Squadcast

Read more about Global Event Rulesets: Streamlining Alert Routing Across Services

The Link Between Early Detection and Internet Resilience: A Lesson from Salesforce's Outage

Oct 4, 2023 By Madan Gopal N In Catchpoint

Almost every study examining the hourly cost of outages invariably leads to a clear and undeniable conclusion: outages are expensive. According to a 2016 study, the average cost of downtime was estimated at approximately $9,000 per minute. In a more recent study, 61% of respondents stated that outages cost them at least $100,000, with 32% indicating costs of at least $500,000 and 21% reporting expenses of at least $1 million per hour of downtime.

Read Post

Catchpoint

Read more about The Link Between Early Detection and Internet Resilience: A Lesson from Salesforce's Outage

Practicing SDLC the right way #shorts #incidentresponse #sre #softwareengineer

Oct 4, 2023 By FireHydrant In FireHydrant

View Video

FireHydrant

Read more about Practicing SDLC the right way #shorts #incidentresponse #sre #softwareengineer

The problem with noise in Alerting #shorts #incidentresponse #sre #softwareengineer

Oct 4, 2023 By FireHydrant In FireHydrant

View Video

FireHydrant

Read more about The problem with noise in Alerting #shorts #incidentresponse #sre #softwareengineer

Whose fault was it anyway? On blameless post-mortems

Oct 4, 2023 By incident.io In Incident.io

No one wants to be on the receiving end of the blame game—especially in the wake of a major incident. Sure, you know you were the one who made the final change that caused the incident. And hopefully, it was a small one that didn’t cause any SEV-1s. Still, the weight of knowing you caused something bad should be enough, right? Unfortunately, sometimes fingers get pointed, your name gets called, and suddenly, everyone knows that you’re the person who created more work for everyone.

Read Post

Incident.io

Read more about Whose fault was it anyway? On blameless post-mortems

Choosing the Right Metrics for Noiseless K8s Alerting

Oct 4, 2023 By Zenduty In Zenduty

Watch Ankur Rawal and Dheeraj Reddy talk about how to choose the right metrics for noiseless K8s alerting, with insights and suggestions based on the mistakes made by hundreds of companies while implementing Prometheus Alertmanager in their production systems, and learn how much bad monitoring could be costing you. This talk was delivered at PromCon 2023 in Berlin.

View Video

Zenduty

Read more about Choosing the Right Metrics for Noiseless K8s Alerting

Blameless Introduces The First Generative AI-powered, Automated Incident Communications With Comms Assistant

Oct 3, 2023 By Blameless In Blameless

Revolutionizing Incident Communications, Blameless Introduces Generative AI To More Fully Automate Incident Communication Workflows.

Read Post

Blameless

Read more about Blameless Introduces The First Generative AI-powered, Automated Incident Communications With Comms Assistant

What Is the Role of an Incident Commander?

Oct 3, 2023 By Eduardo Messuti In Statuspal

For most businesses, managing major incidents can be intimidating. With a swarm of information coming from different directions, keeping things organized and maintaining clear, effective communication is tough. It only gets worse when there's no defined process to follow. This disorganization confuses everyone, delays responses, and increases the incident escalation rate. Enter the incident commander (IC).

Read Post

Statuspal

Read more about What Is the Role of an Incident Commander?

Incident response and awareness acceleration: What we can learn from responders of Queenstown floods.

Oct 3, 2023 By Kaushik Thirthappa In Spike

I was visiting Queenstown, New Zealand last week amidst the horrible floods, which quickly escalated. As an incident responder myself, I was amazed at the operations and how fast responders on the ground acted in evacuating and clearing the grounds. Over 100 people were evacuated in the middle of the night with zero casualties. A commendable job. Here are some observations I made and what we can learn as incident responders ourselves.

Read Post

Spike

Read more about Incident response and awareness acceleration: What we can learn from responders of Queenstown floods.

A Journey through the Blameless Resource Library

Oct 3, 2023 By Emily Arnott In Blameless

From the very beginning of Blameless, we had two vital missions. First, to offer a solution to what we saw as a mounting crisis of reliability by offering a comprehensive, easy-to-use reliability platform. Second, to educate the companies facing this crisis on the fundamentals of incident management, cutting-edge best practices, and the cultural values that sustain learning and growth.

Read Post

Blameless

Read more about A Journey through the Blameless Resource Library

The Unplanned Show, Episode 15: PagerDuty APIs with Nakul Bhagat

Oct 3, 2023 By PagerDuty In PagerDuty

APIs are foundational to scaling operational efficiency. Tune in on October 2 to learn more about the different types of APIs supported by PagerDuty, why they matter, and what's new!

View Video

PagerDuty

Read more about The Unplanned Show, Episode 15: PagerDuty APIs with Nakul Bhagat

Terraform Time: Testing out OpenTofu alpha release with Rundeck Terraform provider

Oct 3, 2023 By PagerDuty In PagerDuty

We will be testing out PagerDuty's automation solution Rundeck via Terraform, but with the latest OpenTofu release.

View Video

PagerDuty

Read more about Terraform Time: Testing out OpenTofu alpha release with Rundeck Terraform provider

Working Effectively With Executives During an Incident

Oct 2, 2023 By Ashley Sawatsky In Rootly

You’re in the incident channel rocking yet another incident. Comms are flowing, resolution is in sight, the team is grinding, and you’re feeling good. Then…

Read Post

Rootly

Read more about Working Effectively With Executives During an Incident

The new principles of incident alerting: it's time to evolve

Oct 2, 2023 By Robert Ross In FireHydrant

In the ever-evolving world of software engineering, the landscape is constantly shifting. New technologies emerge, best practices evolve, and how we build and run software continues to change. However, when it comes to incident alerting, it often feels like we're stuck in the past.

Read Post

FireHydrant

Read more about The new principles of incident alerting: it's time to evolve

The Debrief: The connection between incident management and problem management

Oct 2, 2023 By Incident.io In Incident.io

In this video, we talk through some of the nuances of incident management and problem management, why it's better to think of them as one, and how having more responsibility on teams to build and run their software and systems makes sense.

View Video

Incident.io

Incident Management

Read more about The Debrief: The connection between incident management and problem management

What is incident management?

Oct 2, 2023 By Incident.io In Incident.io

Effective incident management involves not just responding to incidents but also detecting them early and preparing for future occurrences to minimize impact.

View Video

Incident.io

Incident Management

Read more about What is incident management?

Why incident.io will play a pivotal role in the growth of WorkOS

Oct 2, 2023 By Incident.io In Incident.io

In this snippet, Alon Levi, VP of Engineering at WorkOS, talks about why incident.io will be a key contributor in the growth of WorkOS.

View Video

Incident.io

Incident Management

Read more about Why incident.io will play a pivotal role in the growth of WorkOS

Incident management vs problem management

Oct 2, 2023 By Incident.io In Incident.io

The video explores the difference between incident management and problem management in modern organizations. It describes a common scenario where operations teams focus on immediate fixes, like rebooting systems, without addressing the root causes. Once the immediate issue is resolved, these teams pass the incident report to the developers, who are then responsible for digging deeper to prevent future occurrences.

View Video

Incident.io

Incident Management

Read more about Incident management vs problem management

Alon Levi, VP of Engineering at WorkOS, on his favorite incident.io features

Oct 2, 2023 By Incident.io In Incident.io

In this snippet, Alon Levi of WorkOS highlights some of his favorite and most-used incident.io features: follow-ups and Workflows.

View Video

Incident.io

Incident Management

Read more about Alon Levi, VP of Engineering at WorkOS, on his favorite incident.io features

Generative AI for IT Operations: Your Questions Answered

Oct 2, 2023 By Blair Sibille In BigPanda

IT leaders are thrilled about the potential of Generative AI for IT Operations. But they also want to know how it works, why it works, and what it will do for them before taking the leap and adopting this new technology. Allow me to share my perspective on the hype and the truth behind Generative AI. I’m the Field CTO for BigPanda, Operational Intelligence and Automation driven by AIOps.

Read Post

BigPanda

Read more about Generative AI for IT Operations: Your Questions Answered
