September 2023

Event Orchestration Tips & Tricks

Sep 30, 2023 By PagerDuty In PagerDuty

Principal Product Manager Frank Emery joins us to share some insights and best practices for PagerDuty's Event Orchestration tool. Get the most out of these powerful EO features!

View Video

PagerDuty

Incident Management


Observability Pillars: Exploring Logs, Metrics and Traces

Sep 29, 2023 By Chitra Bisht In Squadcast

The ability to measure the internal states of a system by examining its outputs is called Observability. A system becomes 'observable' when it is possible to estimate the current state using only information from outputs, namely sensor data. You can use the data from Observability to identify and troubleshoot problems, optimize performance, and improve security. In the next few sections, we'll take a closer look at the three pillars of Observability: Metrics, Logs, and Traces.

Read Post

Squadcast


Alternatives to SMS alerts

Sep 29, 2023 By Kaushik Thirthappa In Spike

While SMS alerts are handy, they also tend to be tricky. Across 120+ countries, we continuously deal with compliance requirements and regulations from vendors, governments, and phone carriers. Other alert channels similar to SMS are a lot less cumbersome and have higher delivery rates. Let’s take a look at the available options to switch from SMS.

Read Post

Spike


How incident.io enables the confidence to declare more incidents

Sep 29, 2023 By Incident.io In Incident.io

In this snippet, Alon Levi, VP of Engineering at WorkOS, talks about how his team has gained the confidence to declare more incidents with incident.io.

View Video

Incident.io

Incident Management


How incident.io's intuitive UI is great for non-technical responders

Sep 29, 2023 By Incident.io In Incident.io

In this snippet, Alon Levi, VP of Engineering at WorkOS, talks about how non-technical responders have been able to confidently declare incidents thanks to incident.io's intuitive UI.

View Video

Incident.io

Incident Management


How WorkOS has benefitted from incident.io's high-quality support

Sep 29, 2023 By Incident.io In Incident.io

In this snippet, Alon Levi, VP of Engineering at WorkOS, talks about the quality of support his team has received from incident.io.

View Video

Incident.io

Incident Management


How incident.io helped WorkOS transform its incident response featuring VP of Engineering, Alon Levi

Sep 29, 2023 By Incident.io In Incident.io

View Video

Incident.io

Incident Management


Blameless Demo 2023

Sep 29, 2023 By Blameless In Blameless

View Video

Blameless


Blameless Announces New Google Docs and Google Drive Integration to Help Engineering Teams Enhance Their Incident Management and Retrospectives

Sep 28, 2023 By Blameless In Blameless

Leading Incident Management Solution Enables Enterprises & Their Engineering Organizations To More Efficiently Produce, Collaborate And Share Retrospectives Through Automation.

Read Post

Blameless


Unveiling Past Incidents: Accelerating Incident Resolution with Historical Context

Sep 28, 2023 By Vishal Padghan In Squadcast

Having the context of how similar issues were handled in the past can be invaluable. It can help incident responders grasp the nature of recurring problems, their causes, and effective solutions that have worked in the past. Introducing Squadcast’s Past Incidents feature that assists incident responders by presenting them with a list of similar past incidents related to the same service they are currently investigating.

Read Post

Squadcast


Introducing Grafana OnCall shift swaps: A simpler way to exchange on-call shifts with teammates

Sep 28, 2023 By Joey Orlando In Grafana

A family member’s birthday, that concert you’ve waited all year to see, an impromptu weekend getaway with friends — there are a lot of reasons software engineers might want to switch on-call shifts. And rather than have to frantically send Slack messages to your teammates, wouldn’t it be nice to automate the process and quickly find the coverage you need?

Read Post

Grafana


Product Spotlight: Enhancing Incident Resolution with Blameless' Microsoft Teams Integration

Sep 28, 2023 By Aaron Lober In Blameless

In today's fast-paced digital landscape, swiftly responding to incidents is paramount for engineering teams. Downtime is not just costly; it can tarnish your organization's reputation. The pressure felt by engineering operations, DevOps, and SRE leaders to architect and run an effective incident response process is immense. Fortunately, over the last several years, effective engineering organizations have developed a standard toolkit for running a good incident response process.

Read Post

Blameless


The importance of testing emergency warning systems

Sep 28, 2023 By Brian Toolan In Everbridge

On Oct. 4, 2023, the Federal Emergency Management Agency (FEMA) plans a nationwide mobile alert test which will send an emergency SMS to all cellphones in the United States. In coordination with the Federal Communications Commission (FCC), the national test will be administered at approximately 2:20 p.m. ET on Wednesday, Oct. 4. It will consist of two portions that will test Wireless Emergency Alerts (WEA) and Emergency Alert System (EAS) capabilities.

Read Post

Everbridge


Better learning from incidents: A guide to incident post-mortem documents

Sep 27, 2023 By Luis Gonzalez In Incident.io

If you’re just starting out in the world of incident response, then you’ve probably come across the phrase “post-mortem” at least once or twice. And if you’re a seasoned incident responder, the phrase probably evokes mixed feelings. Just to clarify, here, we’re talking about post-mortem documents, not meetings. It’s a distinction we have to make since lots of teams use the phrase to refer to the meeting they have after an incident.

Read Post

Incident.io


Status Pages 101: Everything You Need to Know About Status Pages

Sep 26, 2023 By Sanjog Sandhu In Squadcast

Status Pages are critical for effective Incident Management. Just as an ill-structured On-Call Schedule can wreak havoc, ineffective Status Pages can leave customers and stakeholders adrift, underscoring the need for a meticulous approach. Two examples: Matsuri Japon, a non-profit organization, and Sport1, a premier live-stream sports content platform, both integrate Squadcast Status Pages to enhance their incident response strategies discreetly. You may read about them later. Crafting these Status Pages demands precision, offering dynamic updates and collaboration.

Read Post

Squadcast


Why automated Root Cause Analysis matters for driving down MTTR

Sep 26, 2023 By Joel McKelvey In BigPanda

Finding the root causes of IT anomalies can be challenging, but the rewards are worth it. By identifying the root cause or causes of an incident or critical failure, response teams can resolve incidents faster and determine the best steps to avoid having them recur. This can drive down both the frequency of service interruptions and their duration.

Read Post

BigPanda


Clouds, caches and connection conundrums

Sep 26, 2023 By Ben Wheatley In Incident.io

We recently moved our infrastructure fully into Google Cloud. Most things went very smoothly, but there was one issue we came across last week that just wouldn’t stop cropping up. What follows is a tale of rabbit holes, red herrings, table flips and (eventually) a very satisfying smoking gun. Grab a cuppa, and strap in. Our journey starts, fittingly, with an incident getting declared... 💥🚨

Read Post

Incident.io


Accelerate change alert discovery and incident resolution with Root Cause Changes

Sep 26, 2023 By Elli Dugger In BigPanda

Today, the majority of organizations operate under a hybrid cloud structure. Due to this, operations are consistently met with daily infrastructure and software changes and updates, which are also the primary cause of incidents and outages. Long gone are the days when a tech stack could be represented by a single dependency model. Microservices, CI/CD, and containers across multi-cloud make it extremely difficult to track all the changes and connect them to incidents.

Read Post

BigPanda


The Ultimate Guide to DORA Metrics for DevOps

Sep 25, 2023 By Anjali Udasi In Zenduty

In the world of software delivery, organizations are under constant pressure to improve their performance and deliver high-quality software to their customers. One effective way to measure and optimize software delivery performance is to use the DORA (DevOps Research and Assessment) metrics. DORA metrics, developed by a renowned research team at DORA, provide valuable insights into the effectiveness of an organization's software delivery processes.

Read Post

Zenduty


incident.io workflows and integrations - as told by Pleo

Sep 23, 2023 By Incident.io In Incident.io

View Video

Incident.io

Incident Management


How we've made Status Pages better over the last three months

Sep 22, 2023 By Asiya Gorelik In Incident.io

A few months ago we announced Status Pages – the most delightful way to keep customers up-to-date about ongoing incidents. We built them because we realized that there was a disconnect between what customers needed to know about incidents, and how easily accessible this information was. As we built them, we focused on designing a solution that powered crystal-clear communication, without the overhead — all beautifully integrated into incident.io.

Read Post

Incident.io


Extend Incident Alert Management to ServiceNow ITSM (Two-way integration)

Sep 21, 2023 By OnPage In OnPage

Discover how OnPage's incident alert management solution can be seamlessly extended to ServiceNow's ITSM solution to provide a more efficient and streamlined service delivery experience. The two-way integration ensures that high-priority alerts are given top priority and reach the right team member in a timely manner. And, that's not all -- IT teams gain synchronization across audit trails, alert statuses, and notes, eliminating the need for app hopping and providing all the necessary information in one location.

View Video

OnPage


How incident io thinks about learning from incidents

Sep 21, 2023 By Incident.io In Incident.io

An overview of how incident.io thinks about incidents, and how they promote learning in a smaller organisation.

View Video

Incident.io

Incident Management


The struggles of actually applying incident theory

Sep 21, 2023 By Incident.io In Incident.io

Chris explains his thoughts on the theory of learning from incidents, and why work needs to be done to close the gap and help folks actually trying to get their job done.

View Video

Incident.io

Incident Management


Underneath the Surface of Incident Cost

Sep 21, 2023 By Blameless In Blameless

View Video

Blameless


Enhance emergency alerts with Device-Based Geo-Fencing

Sep 21, 2023 By Everbridge In Everbridge

In today’s fast-paced and interconnected world, the importance of efficient and effective public warning systems cannot be overstated. As we face a multitude of natural disasters, civil unrest, and health crises, the ability to swiftly communicate impending threats to the right individuals at the right time has become a matter of paramount importance.

Read Post

Everbridge


The Unplanned Show: Things every PagerDuty user should know

Sep 21, 2023 By PagerDuty In PagerDuty

What are the secrets to successfully adopting PagerDuty? Dormain will sit down with Senior Manager of Customer Success Engineering, Matt Linebarger, to get his list of things he wishes every PagerDuty user knew.

View Video

PagerDuty

Incident Management


What's wrong with MTTR?

Sep 21, 2023 By Incident.io In Incident.io

Taken from our full debrief on "Learning from incidents is not the goal", Chris walks through MTTR, the justifiable bad rap it has, and his thoughts on it as a measure.

View Video

Incident.io

Incident Management


Active and passive learning from incidents

Sep 21, 2023 By Incident.io In Incident.io

In this video, Chris shares his thoughts on the difference between active learning: writing and sharing debriefs, meeting to walk through an incident, etc., and passive learning: running incidents in the open, dynamic collaboration, reviewing past incidents.

View Video

Incident.io

Incident Management


The Debrief: Learning from incidents is not the goal

Sep 21, 2023 By Incident.io In Incident.io

In this video, incident.io co-founder and CPO Chris Evans walks through his blog post "Learning from incidents is not the goal". We cover why he wrote this, his thoughts on the gap between theory and practice, and how people can really learn from incidents.

View Video

Incident.io

Incident Management


Top 5 Resiliency Trends of 2023

Sep 20, 2023 By Rohit Ghumare In Rootly

In today’s world, resilience is no longer a conditioned desire or methodology to try but has become a necessity for sustained success in software development and IT operations. As DevOps and Agile teams keep moving forward to cross boundaries, come up with new methodologies, and drive innovation, it is now important to have the ability to quickly recover from failures, adapt to changing conditions, and maintain high performance under pressure.

Read Post

Rootly


Twelve Key Learnings from PagerDuty People Team's Generative AI HackWeek

Sep 20, 2023 By Joe Militello In PagerDuty

Sometimes innovation requires ideas unconstrained by traditional structures and removed from day-to-day responsibilities. It was in this spirit that PagerDuty’s People HackWeek–a friendly competition to explore how generative AI might impact the future of HR–was born.

Read Post

PagerDuty


How to use the REST API to manage SIGNL4 categories

Sep 20, 2023 By Ronald In SIGNL4

The categories in SIGNL4 are a powerful tool to make it clear to users at first glance what a particular alert is about. For example, colors, icons, location and predefined texts can be configured here. Categories can be created and edited manually.

Read Post

SIGNL4


The balancing act of reliability and availability

Sep 19, 2023 By incident.io In Incident.io

As consumers, we expect the products and software we buy to work 100% of the time. Unfortunately, that’s impossible. Even the most reliable products and services experience some disruption in service. Crashes, bugs, timeouts. There are a ton of contributing factors, so it's impossible to distill disruptions down to a single cause. That said, technology is becoming more and more sophisticated, and so is the infrastructure that supports it.

Read Post

Incident.io


The Unplanned Show, Episode 13: Jake Cohen and Generative AI for Automation

Sep 19, 2023 By PagerDuty In PagerDuty

On the heels of the public beta opening for AI-generated runbooks in Runbook Automation, we asked Jake Cohen from product management how this differs from generating code with something like ChatGPT or the various AI-powered code completion tools available. We get into prompt engineering, managing output quality, and privacy and security concerns.

View Video

PagerDuty


PagerDuty Appoints Eric Johnson as Chief Information Officer

Sep 18, 2023 By PagerDuty In PagerDuty

Industry Veteran to lead Information Technology and CIO Community Engagement.

Read Post

PagerDuty


A better Grafana OnCall: Delivering on features for users at scale

Sep 18, 2023 By Devin Cheevers In Grafana

Enterprise IT is just a different animal. Whether it’s operating at scale, undertaking massive migrations, working across scores of teams, or addressing tight security requirements, engineers at these organizations can face different obstacles than their counterparts at smaller organizations and startups.

Read Post

Grafana


Transformation in Travel: Our Q&A with TUI's Head of Technology

Sep 18, 2023 By Lisa Duckrow In PagerDuty

The travel industry is experiencing an unprecedented surge in demand from people seeking adventure and eager to explore new destinations. Given an abundance of choice and the desire to have a personalized experience, customers are turning to tour operators to remove complexity from planning so they can focus on the holiday and not on the process of planning it.

Read Post

PagerDuty


TUI Powers Outstanding Digital Experience for Customers with the PagerDuty Operations Cloud

Sep 18, 2023 By PagerDuty In PagerDuty

PagerDuty Operations Cloud is essential infrastructure for TUI, enabling agility and cost efficiency to deliver outstanding digital experiences for customers. With PagerDuty’s AI and automation capabilities, TUI has streamlined incident management—reducing downtime and boosting customer bookings. Hear more in this video from Yasin Quareshy, Head of Technology at TUI.

View Video

PagerDuty

Incident Management


Implementing Zero Trust: A Practical Guide

Sep 15, 2023 By Emily Arnott In Blameless

According to the Harvard Business Review, 2022 saw more than 83% of businesses experiencing multiple data breaches. Ransomware attacks, in particular, were up 13%. With cyber security being such a hot topic for business owners, it’s no surprise implementing a zero trust policy has become so important. In this guide, we’ll cover how to implement zero trust and why it’s important for your business to do so. Let’s get started.

Read Post

Blameless


Mastering Incident Resolution: Process and Best Practices

Sep 15, 2023 By Emily Arnott In Blameless

For DevOps and IT teams, incident resolution is an important aspect of predicting, resolving, and documenting service disruptions. It refers to the part of the incident management process where responders restore the service to functioning. Modern technology has come a long way, but it’s not without flaws. When businesses suffer from cyber-attacks, system crashes, and network outages, it impacts the organization on many levels.

Read Post

Blameless


The connection between incident management and problem management

Sep 15, 2023 By Luis Gonzalez In Incident.io

Sometimes, two concepts overlap so much that it’s hard to view them in isolation. Today, incident management and problem management fit this description to a tee. This wasn’t always the case. For a long time, these two ITIL concepts were seen as distinct—with specialized roles overseeing each. Incident management existed in one corner and problem management in the other. Then came the DevOps movement and the lines suddenly became blurred. So where do they stand today?

Read Post

Incident.io


What Is GitOps and Will It Eliminate Incident Management?

Sep 15, 2023 By Gilad Maayan In OnPage

Incident management is a critical aspect of IT service management (ITSM) that revolves around restoring normal service operations as swiftly as possible after an unplanned interruption or reduction in quality. Also referred to as “incidents,” these interruptions could range from a minor issue like a single user being unable to access a service to a significant problem such as a server crash or network outage affecting many users.

Read Post

OnPage


Slack & OnPage: Extend Critical Alerting and on-call management to Chat Collaboration

Sep 15, 2023 By OnPage In OnPage

View Video

OnPage


The Unplanned Show, Episode 12: "Houston, we have a problem": Crisis Response with Jason Flint

Sep 15, 2023 By PagerDuty In PagerDuty

We discuss going beyond “checkbox” fire drills, the value in cross-functional planning, and engaging workforces to improve preparedness.

View Video

PagerDuty

Incident Management


Inside Prezi's cost-saving switch to Grafana Alerting, Grafana OnCall, and Grafana Incident from PagerDuty

Sep 15, 2023 By Alexander Koehler In Grafana

Alexander is Senior SRE at Prezi, a video and visual communications software company. As a team, the Prezi SREs provide multiple services within the company. One of those is the observability stack where Prezi heavily relies on Grafana. Companies are always evolving to run more smoothly, serve their customers better, and operate in a way that is cost-effective.

Read Post

Grafana


Streamlining Incident Management with our latest feature update: Merge Incidents

Sep 14, 2023 By Nakul Shetty In Squadcast

Hey folks! We’re back with another nifty feature for your Incident Management tool arsenal. You now have the ability to merge incidents with a few clicks! With this latest update you can reduce the noise while dealing with a complex incident by merging incidents across services under a parent incident. Typically this is useful when multiple incidents stem from the same underlying issue or root cause.

Read Post

Squadcast


Extend Critical Alerting and OnCall Management to Slack

Sep 14, 2023 By OnPage In OnPage

View Video

OnPage


Journey from Junior to Senior SRE: Key Insights and Strategies

Sep 14, 2023 By Anjali Udasi In Zenduty

As Site Reliability Engineering (SRE) continues to grow in popularity, many professionals are looking for ways to advance from junior to senior roles. While there is no one-size-fits-all approach, the transition from junior to senior SRE is marked by a gradual increase in experience and a set of key skills. In this blog, we will explore the valuable insights and strategies shared by experienced SREs.

Read Post

Zenduty


10 Benefits of Effective Incident Communication

Sep 14, 2023 By Eduardo Messuti In Statuspal

In today's digital landscape, most people understand that no system is perfect and data is never 100% safe. Incidents are bound to happen. How people learn about those incidents often influences their reactions. Mishandled incident communication can have drastic consequences for your company. For starters, it can drag out the incident response and harm your bottom line.

Read Post

Statuspal


What's the Difference Between an Agile Retrospective and an Incident Retrospective?

Sep 14, 2023 By Ken Gavranovic In Blameless

Blameless Chief Operating Officer Ken Gavranovic recently sat down with Lee Atchison, a renowned expert in system reliability, to discuss the topic of conducting effective incident retrospectives. You can watch their engaging, informative discussion below, or read on for our overview of the greatest hits from their talk. Agile development and incident management are the backbones of any tech-driven development cycle. At the heart of these practices lies the art of retrospectives.

Read Post

Blameless


Empowering Hyper Local Resilience - Everbridge + Samdesk Podcast

Sep 13, 2023 By Everbridge In Everbridge

Organizations today face a myriad of threats in the form of civil unrest, cybersecurity, severe weather events, and more. Visibility into emerging events and potential threats extremely early in the crisis lifecycle enables security teams to take proactive measures to protect lives and reputation and to reduce liability. Proactive intelligence is critical to optimize preparation, improve response efficacy, and speed recovery.

View Video

Everbridge


Blameless Garners Acclaim in Industry Reports from G2 and Gartner for Site Reliability and Incident Management

Sep 12, 2023 By Blameless In Blameless

Leading Incident Management Solution Named by G2 as a High Performer in the Incident Management Category; Included in Gartner Hype Cycle for Monitoring and Observability 2023.

Read Post

Blameless


PagerDuty Helps Customer Service Teams Resolve Issues Faster and More Efficiently with Workflow Automation and Private Status Pages

Sep 12, 2023 By PagerDuty In PagerDuty

New PagerDuty Operations Cloud capabilities bridge the gap between customer service and technical teams for reducing costs and improving customer satisfaction.

Read Post

PagerDuty


Seven Models of Cloud Native Applications

Sep 12, 2023 By Rajiv Srivastava In Squadcast

In today's cloud-driven landscape, organizations are transitioning from legacy monolithic systems to agile, scalable, and secure cloud-native solutions. Some are even forging new cloud-native applications. However, the concept of cloud-native design remains subjective, lacking a universal blueprint. This blog aims to provide clarity and guidance for designing precise cloud-native applications and container deployment.

Read Post

Squadcast


More than downtime: the cultural drain caused by poor incident management

Sep 12, 2023 By Robert Ross In FireHydrant

The costs of lackluster incident management are truly far-reaching. We’ve learned they go beyond explicit costs, like lost revenue and labor expenses. And that they go beyond the opportunity cost of engineers being diverted from building revenue-building features. The final area of incident cost that’s often overlooked is cultural drain.

Read Post

FireHydrant


OnPage's Automation in I&O Optimization Predictions (Inspired by Gartner Hype Cycle for I&O Automation, 2023)

Sep 12, 2023 By Halle Katz In OnPage

The release of the Gartner® Hype Cycle™ for I&O Automation, 2023 has inspired us here at OnPage to provide our insights on the latest trends in I&O optimization. In this blog, OnPage will predict the widespread adoption of technologies that can further automation efforts and thus contribute to I&O optimization.

Read Post

OnPage


The Future of ITSM: Exploring the Potential of AI-Powered Service Management

Sep 11, 2023 By Arun Prasath R In Infraon

IT Service Management (ITSM) constantly evolves, introducing new technologies and tools. But if you have noticed recently, there have been some constants. One of the most promising developments is leveraging Artificial Intelligence (AI) to power IT service management. The fact that AI has the potential to revolutionize ITSM is not exactly breaking news. But what continues to slip under the radar of many ITOps teams is how to unlock AI's true potential. To do this, there's a dire need to understand the already critical and soon-to-be popular use cases.

Read Post

Infraon


The power of Everbridge 360

Sep 10, 2023 By Everbridge In Everbridge

Everbridge 360™ represents our relentless dedication to provide customers with the most comprehensive and unified interface to manage critical events across one single platform so they can know earlier, respond faster, and improve continuously. More effectively manage critical events, minimize communication delays, and improve overall organizational resilience through the industry’s most advanced and unified dashboard.

View Video

Everbridge


How to Set Up an IT War Room

Sep 9, 2023 By Anjali Udasi In Zenduty

IT issues can happen at any time and significantly impact an organization. Hence, it's essential to have a plan to handle these issues quickly and efficiently. And one way to do this is to create an IT war room. An IT war room is a dedicated space for teams to collaborate and resolve issues. Establishing an IT war room enhances an organization's capacity to swiftly and efficiently address IT problems, ultimately reducing their impact on the business.

Read Post

Zenduty


Enhancing Incident Management: Seven Integrations to Complete Your Ticketing Systems

Sep 8, 2023 By Chitra Bisht In Squadcast

Squadcast offers some powerful integrations to simplify Incident Management processes and make your work easy. These integrations enhance Incident Management processes and complete your ticketing systems, ensuring seamless collaboration and timely issue resolution.

Read Post

Squadcast

Read more about Enhancing Incident Management: Seven Integrations to Complete Your Ticketing Systems

PagerDuty 101 Series, Part 3: Setting Up Schedules & Escalation Policies

Sep 8, 2023 By PagerDuty In PagerDuty

In Part 3 of the PagerDuty 101 series, learn how to create schedules and escalation policies for your teams in PagerDuty.

View Video

PagerDuty

Incident Management

Read more about PagerDuty 101 Series, Part 3: Setting Up Schedules & Escalation Policies

PagerDuty 101 Series, Part 2: Adding PagerDuty Users and Setting Up User Profiles

Sep 8, 2023 By PagerDuty In PagerDuty

In Part 2 of the PagerDuty 101 series, learn about adding users to the platform and setting up their profiles.

View Video

PagerDuty

Incident Management

Read more about PagerDuty 101 Series, Part 2: Adding PagerDuty Users and Setting Up User Profiles

PagerDuty 101 Series, Part 4: Setting Up Services & Integrations

Sep 8, 2023 By PagerDuty In PagerDuty

In the final section of the PagerDuty 101 series, learn how to set up services and integrations for your PagerDuty instance.

View Video

PagerDuty

Incident Management

Read more about PagerDuty 101 Series, Part 4: Setting Up Services & Integrations

Practical guidance for getting started as a site reliability engineer

Sep 8, 2023 By Ben Wheatley In Incident.io

At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move. With only some high-level knowledge of what the company and its systems looked like prior to this point, it’s fair to say that I didn’t have much certainty in what exactly I’d be working on or how I’d deliver it.

Read Post

Incident.io

Read more about Practical guidance for getting started as a site reliability engineer

PagerDuty 101 Series, Part 1: Introduction to the Power of PagerDuty

Sep 8, 2023 By PagerDuty In PagerDuty

In this brand new PagerDuty 101 series, Part 1 introduces you to what's possible in PagerDuty.

View Video

PagerDuty

Incident Management

Read more about PagerDuty 101 Series, Part 1: Introduction to the Power of PagerDuty

Blameless Announces New CommsFlow Upgrade to Elevate Incident Management Communication

Sep 7, 2023 By Blameless In Blameless

New Enhancements to Blameless CommsFlow Help Engineering Teams Modernize Their Incident Response Process, Deliver Higher-Quality Retrospectives at a Faster Pace.

Read Post

Blameless

Read more about Blameless Announces New CommsFlow Upgrade to Elevate Incident Management Communication

Incident Priority Matrix: From Chaos to Clarity

Sep 7, 2023 By Eduardo Messuti In Statuspal

IT leaders often find themselves under pressure to support business outcomes while also trying to manage help requests. An incident priority matrix makes the incident management process much more seamless. It helps companies handle priority incidents within reasonable resolution times while ensuring other concerns are met. In this blog post, we delve deep into the concept of the Incident Priority Matrix, its significance, and how it can transform your incident management processes.

Read Post

Statuspal

Read more about Incident Priority Matrix: From Chaos to Clarity

Multi-Org takes FireHydrant for enterprise to the next level

Sep 7, 2023 By Joel Smith In FireHydrant

Too often, complexity means confusion — and confusion is your worst enemy when it comes to efficient incident response. We recently found that poor incident management practices (like confusion about what to do or how to escalate an incident) can cost companies as much as $18 million a year.

Read Post

FireHydrant

Read more about Multi-Org takes FireHydrant for enterprise to the next level

Hospital Discharge Best Practices

Sep 7, 2023 By Zoe Collins In OnPage

Establishing an effective hospital discharge process is a crucial part of a patient’s stay and can significantly impact the success of their recovery. Patients, families, and subsequent care providers require a detailed education on continued treatment, aftercare processes, and required medications, to avoid any complications that may surface during recovery.

Read Post

OnPage

Read more about Hospital Discharge Best Practices

SIGNL4 Onboarding: Alert Escalation

Sep 7, 2023 By SIGNL4 In SIGNL4

The SIGNL4 Onboarding series walks users through the processes of SIGNL4 from Signup to Alerts to Settings. Today's video focuses on escalating alerts both manually and automatically inside of SIGNL4. This video is packed with helpful tips to help you get the most out of your account.

View Video

SIGNL4

Read more about SIGNL4 Onboarding: Alert Escalation

SIGNL4 Distribution Rules Deep Dive

Sep 7, 2023 By SIGNL4 In SIGNL4

A quick video showcasing the power of SIGNL4's new distribution rules and how they can be utilized to help you ensure the right information is going to the right people at the right time.

View Video

SIGNL4

Read more about SIGNL4 Distribution Rules Deep Dive

Failure Metrics & KPIs for IT Systems

Sep 6, 2023 By Chrissy Kidd In Splunk

The game in enterprise IT is this: delivering amazing services to your customers while also reducing costs. That means the time it takes to respond to an incident is critical. Incidents can ruin service delivery and destroy your budget. Certain incidents almost surely deliver a poor customer experience. Response times, you hear? Yep, we’re talking about MTTR, but that’s not all.

Read Post

Splunk

Read more about Failure Metrics & KPIs for IT Systems

How generative AI is increasing cyber risk & what to do to make sure you're ready

Sep 6, 2023 By Noam Morginstin In Exigence

Generative AI is all the buzz these days with the popularity of platforms and tools such as ChatGPT, Bard, Scribe, Jasper, and others experiencing exponential growth. This is a technology that has come to the fore with the force of a runaway train that’s bringing us headlong into the future at the speed of light. It is transforming everything we do from writing code to making travel plans. And cybersecurity is no exception.

Read Post

Exigence

Read more about How generative AI is increasing cyber risk & what to do to make sure you're ready

How to Ace Your Services with PagerDuty

Sep 6, 2023 By Débora Cambé In PagerDuty

It’s finals week for the US Open, one of the most celebrated sports events in the world. Tennis is my favorite sport to watch as I’m fascinated by the strength, composure and endurance each player displays while standing by themselves on the court, sometimes during incredibly long matches – the current record is 11h05.

Read Post

PagerDuty

Read more about How to Ace Your Services with PagerDuty

Reliably receive a call when an organ donor is matched

Sep 6, 2023 By Ritika Bramhe In OnPage

Within the broader context of organ transplantation, time is of the essence. Lives hang in the balance, waiting for that life-changing call announcing a matched donor organ. For organ transplant recipients, the waiting game is often a test of patience and resilience. However, with the advent of modern technology, a solution has emerged to alleviate this uncertainty – OnPage.

Read Post

OnPage

Read more about Reliably receive a call when an organ donor is matched

Streamlining Incident Investigation

Sep 6, 2023 By Honeycomb In Honeycomb

Honeycomb Customer Success Manager Josh Levin explains how to troubleshoot production incidents using Honeycomb's telemetry data: metrics, traces, and logs. While these data forms have separate interfaces, you can investigate seamlessly within Honeycomb. Josh highlights the key role of the "retriever" service in data ingestion and querying and demonstrates cross-validating tracing data with metrics to spot anomalies in pod deployments and resource usage, presented in a separate dataset. He also uses effective log filtering and searching for keywords like "update status."

View Video

Honeycomb

Read more about Streamlining Incident Investigation

SLA vs. SLI vs. SLO: Understanding Service Levels

Sep 6, 2023 By Shanika Wickramasinghe In Splunk

In our service-driven world, businesses must provide the best user experience possible. Great service helps you retain long-term customers while also growing your customer base. To keep tabs on service performance, a few key metrics and signals come into play.

Read Post

Splunk

Read more about SLA vs. SLI vs. SLO: Understanding Service Levels

OnPage-ServiceNow Bi-Directional Integration

Sep 6, 2023 By OnPage In OnPage

View Video

OnPage

Read more about OnPage-ServiceNow Bi-Directional Integration

Enhancing Code Blue Workflow for Improved Survival Rates

Sep 5, 2023 By Ritika Bramhe In OnPage

In critical healthcare scenarios, swift response is the linchpin to saving lives. Enter code blue workflows – a set of protocols that guide healthcare teams in high-stress scenarios. When a patient’s life is at stake due to cardiac arrest, respiratory failures, or other life-threatening conditions, these workflows ensure a rapid, synchronized response.

Read Post

OnPage

Read more about Enhancing Code Blue Workflow for Improved Survival Rates

Gimme 5 with Checkout's Alexia Loizides

Sep 5, 2023 By Stephanie Gonzalez In FireHydrant

Gimme 5 by FireHydrant is a look inside incident management at some of the world's most forward-thinking DevOps teams. In this episode, we talk with Alexia Loizides, Senior Manager of IT Service Management for payments platform Checkout.

Read Post

FireHydrant

Read more about Gimme 5 with Checkout's Alexia Loizides

Celebrating Our Nine New G2 Awards

Sep 5, 2023 By JJ Tang In Rootly

We’re proud to share that we've been recognized as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter in the G2 Summer 2023 Report! In total, Rootly received nine G2 awards in the Summer Report.

Read Post

Rootly

Read more about Celebrating Our Nine New G2 Awards

6 Best Practices for Seamless Notifications with International SMS

Sep 5, 2023 By Cristina Dias In PagerDuty

There’s no denying it: in today’s interconnected world, Application-to-Person (A2P) SMS notifications have become an integral part of our daily lives. Whether it’s receiving crucial banking alerts, getting updates from our favorite retailers, or even surfacing a notification from PagerDuty when your service is down–SMS keeps us informed and connected. But have you ever wondered about the intricacies behind this seemingly straightforward technology?

Read Post

PagerDuty

Read more about 6 Best Practices for Seamless Notifications with International SMS

Starting with Incident management career

Sep 4, 2023 By Kaushik Thirthappa In Spike

As businesses and organisations become increasingly reliant on technology for their operations, the significance of alerting platforms has become paramount. Alerting platforms encompass the processes that enable organisations to acknowledge, respond to, and reduce various types of incidents that can impact their services. Incident alerts enable prompt responses at the right time and minimise potential damage.

Read Post

Spike

Read more about Starting with Incident management career

Building Trust with our Customers with PagerDuty for PagerDuty: Crisis Response Management Operations

Sep 4, 2023 By Jason Flint In PagerDuty

A critical partner in your supply chain just went down. An earthquake just hit your main operations hub. Breaking news about your organization just hit social media. Bad news first—there’s always another crisis or existential threat to your organization on the horizon. If you don’t have an established Crisis Response process and team in place, you’re running a high risk of failure.

Read Post

PagerDuty

Read more about Building Trust with our Customers with PagerDuty for PagerDuty: Crisis Response Management Operations

SLO Driven Incident Response: Service Level Objectives for Effective Incident Management | Squadcast

Sep 4, 2023 By Squadcast In Squadcast

In today's tech-driven landscape, effective Incident Management is vital for seamless service and customer satisfaction. This webinar explores how Service Level Objectives (SLOs) structure incident response processes and act as a compass, guiding incident prioritization and resolution to minimize customer impact and downtime. The webinar will help you demystify SLOs, understand their data-driven role in incident decision-making, and prioritize incidents to lessen customer impact by identifying the critical ones.

View Video

Squadcast

Read more about SLO Driven Incident Response: Service Level Objectives for Effective Incident Management | Squadcast

Grafana Incident auto-summary: AI in Grafana Cloud

Sep 1, 2023 By Grafana In Grafana

Check out a fun demo of Grafana Incident auto-summary, which uses generative AI to suggest a helpful synopsis that captures key details from your incident timeline with a single click. Grafana Incident auto-summary marks the first feature enabled by the new OpenAI integration in Grafana Incident. Simply bring your own OpenAI API key to get started in Grafana Cloud.

View Video

Grafana

Read more about Grafana Incident auto-summary: AI in Grafana Cloud

Manage incidents, real-time alerts, and oncall from Microsoft Teams

Sep 1, 2023 By Spike In Spike

Welcome to Spike.sh’s Microsoft Teams bot! At the heart of every successful team lies efficient communication and swift problem resolution. That’s precisely what our bot brings to the table – a dynamic toolset that empowers you to tackle incidents seamlessly. Features: Our new Microsoft Teams bot alerts are not only prompt but also smartly updated as the situation develops. It achieves this by seamlessly integrating incident management into Microsoft Teams, providing you with real-time alerts the moment an incident surfaces.

View Video

Spike

Read more about Manage incidents, real-time alerts, and oncall from Microsoft Teams
