Operations | Monitoring | ITSM | DevOps | Cloud

June 2023

Incident Management vs Problem Management

In the dynamic landscape of IT service management (ITSM), two concepts reign supreme - Incident Management and Problem Management. They might seem similar, and many use these terms interchangeably, but they serve distinct purposes. Through this article, we’ll navigate the nuanced differences between Incident Management and Problem Management, and apply these concepts in our own approach to incident management.

Synchronizing mental models

In the heat of an incident, having a clear and shared understanding of what’s going on is absolutely crucial to effective response. But often what actually happens is that people involved in incidents build their own picture and narrative of the event, shaped by their own expertise, their past experiences, and what they’re seeing and hearing as the incident develops. The picture and perspective each person builds is often referred to as a mental model.

Strengthen Your DORA Metrics with PagerDuty

For technical teams, the findings from DORA provide a model for measuring and improving performance. With almost a decade of data gathered from more than 33,000 professionals worldwide, the capabilities and frameworks detailed by the research help teams pinpoint areas for improvement and areas to celebrate. The team at DORA categorizes capabilities into three sections: Technical Capabilities, Process Capabilities and Cultural Capabilities.

The Art of Alert Management

With the ever-growing landscape of digital technology and the internet of things (IoT), businesses are becoming increasingly reliant on complex systems to monitor and manage their operations. This dependency has resulted in an explosion of alerts and notifications, overwhelming IT teams and affecting overall productivity. It’s never been more critical to have an effective alert management strategy in place to ensure the smooth running of your organization.

Announcing Catalog - the connected map of everything in your organization

One of the most painful parts of incident response is contextualizing the problem and understanding how and where it fits within your organization. If responders are unable to answer basic questions, you waste valuable time talking to the wrong people or solving the wrong problems, ultimately extending impact and hurting your response. It’s a common issue that, up until now, didn’t have a clear solution or workaround.

From Expense to Excellence: Transforming ITOps in 2023 through Strategic IT cost optimization

Most organizations view their tech and network operations center and their budgets as simply the cost of running their internal and external IT services. However, through IT cost optimization, you can improve how your Ops center team responds to service issues and save valuable resources too. So, what specifically is IT cost optimization?

Upgraded role-based access control brings more visibility - and control - to incident management at your organization

We’ve long believed that incidents (and technical team cultures) improve when more people are empowered to declare, follow, and contribute to their resolution. But not everyone in an organization needs to be able to manage the processes, rules, and settings a company enforces for their incident programs.

Welcome To xMatters - Ep3 - Sending Messages

There’s nothing better than a smoothly run operation but life is full of unexpected surprises. When things don’t go to plan, and help is urgently needed, no time can be wasted. Getting a message to a resolver on time is just as important as having a resolver to call in the first place! And letting people know that help is on the way is especially important to keep the situation calm until they arrive.

How our product team use Catalog

We recently introduced Catalog: the connected map of everything in your organization. In the process of building Catalog as a feature, we’ve also been building out the content of our own catalog. We'd flipped on the feature flag to give ourselves early access, and as we went along, we used this to test out the various features that Catalog powers.

Services are not special: Why Catalog is not just another service catalog

As you may have already seen, we’ve recently released a Catalog feature at incident.io. While designing and building it, we took an approach that’s a tangible departure from a traditional service catalog. Here’s how we’re different, and why.

Azure Incident Management with Escalation Policy

These days, businesses heavily rely on cloud services like Microsoft Azure to power their operations. While Azure provides robust infrastructure and services, occasional issues and incidents can still occur. Serverless360 provides enhanced capabilities to monitor and manage Azure incidents in a system. But to ensure seamless operations and timely resolution of problems, it is crucial to have a well-defined escalation policy in place for Azure Incident Management.

The Unplanned Show, Episode 3: LLMs and Incident Response

A software engineer, a data scientist, and a product manager walk into a generative AI project… Using technology that didn’t exist a year ago, they identify a customer pain point they might be able to solve, build on teammates’ experience with building AI features, and test how to feed inputs and constrain outputs into something useful. Hear the full conversation here.

incident.io Catalog hands on lab

The incident.io Catalog is a connected, navigable map of "things" that exist in your organization. We can use it to describe an organization as a connected graph, and use that graph to drive powerful workflow automations during incidents. In this hands-on training session, we'll work through an example of building a catalog for a mock organization. We'll then use the catalog to solve some real business problems, including automated incident data attribution, and some realistic workflows which outline how it works and what it enables in the context of incident management.

How AIOps Revolutionizes Observability for TechOps Teams

Managing over 1000 services and applications is daunting for any organization’s IT and Tech operations team. With a diverse mix of on-premises legacy systems and modern cloud stacks, the sheer volume of activity can overwhelm even the most skilled ITOps teams. The task is made more difficult by the fact that observability is fragmented. On average, organizations depend on 21 systems that produce metrics, logs, traces, and alerts for various services.

Sponsored Post

Squadcast's Improved Mobile App for Better Incident Response

The 2020 pandemic has definitely changed the way teams operate across the globe. Many of you may have already experienced moving from 100% office work to 100% remote work, and now that it has been almost three years since the pandemic started many of us have resorted to hybrid models. We at Squadcast value the importance of efficient communication, reaching the right people during a crisis, and the freedom to resolve critical incidents from anywhere, anytime. Keeping that in mind, we have made major improvements to our mobile app to help you effectively partake in Incident Response activities anytime from across the globe.

Cyberattack Prevention with AI

Cyberattack prevention involves proactive steps organizations take to protect their digital assets, networks, and systems from potential cyber threats. Preventive measures, such as a combination of best practices, policies, and technologies, are employed to identify and mitigate security breaches before they can cause significant damage.

There Are No Repeat Incidents

People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with nearly identical failure modes, impacting the same components, and with no significant action items as repeat incidents. However, when we look at the responses and their variations, we can find key distinctions that show the incidents as related, but not identical.

Fast Track Video Series: See a demonstration of BigPanda's Incident Intelligence and Automation Platform

BigPanda transforms millions of events into a small number of actionable alerts, no matter where they originate. How? Watch this video to learn more. The video shows how BigPanda allows you to normalize tag values across all tools, aiding event enrichment and correlation. The open integration manager then makes it easy to pre-process the event data helping to filter unwanted events from the feed. The filtering strips out duplicate and low-relevancy events and keeps them from cluttering up the console.

Bitrix24 + Squadcast Integration: Simplifying Alert Routing

Bitrix24 is a cloud-based business management and collaboration platform that provides a suite of tools for managing various business processes. If you use Bitrix24 for your collaboration and CRM requirements, you can integrate it with Squadcast, an end-to-end Incident Response tool, to route alerts such as creating a lead on Bitrix24 CRM or creating a task in Bitrix24 to Squadcast.

MSP Guide to Navigating an Impending Recession

Amidst mounting pressure from macroeconomic headwinds, businesses must prepare for declining consumer spending, less investment, and tighter credit conditions to survive. Managed Service Providers (MSPs) play a valuable role in helping businesses to navigate upcoming economic downturns, from optimizing costs to providing scalable solutions.

How to Create a Runbook Template for DevOps (With Examples)

A DevOps runbook is a little like a recipe book. Instead of rules for cooking, it’s a compilation of rules and procedures designed to maintain software systems and other applications. The purpose of each runbook is to cross-educate your entire team with the same knowledge base and provide easy-to-follow instructions in time-sensitive situations like incidents. Runbook templates are guides outlining a standard for the documentation of operations and development.

Endtest + Squadcast Integration: Alert Routing Made Easy

Endtest is a low code test automation platform enabling organizations to efficiently build automated end-to-end tests for web and mobile applications. If you use Endtest for your test automation requirements, you can integrate it with Squadcast, an end-to-end Incident Response tool, to route detailed alerts from Endtest to the right users in Squadcast.

FireHydrant Private Incidents & Runbooks: more control for you, more security for your customers

Ensuring the privacy and security of sensitive information is crucial no matter your company's size or industry. So when an incident comes up that includes sensitive information — Personal Identifiable Information (PII), financial data, accidental data breaches, or legal matters requiring privileged communication — your response process might need a higher level of security and discretion.

PagerDuty External Status Pages

External Status Pages offer public audiences a unified source of truth about your infrastructure’s health. This feature can be customized to fit your brand’s look and feel, and you can define different views and sets of Business Services to display. Product Manager Jacky Leybman joins the stream to show off how customers can stay informed about ongoing incidents and read status updates, or subscribe to your status page to receive notifications via email.

Addressing the dynamic incident communication challenges of the enterprise with CommsFlow

At enterprise scale, effective flow of incident awareness requires sharing many distinct pieces of information with many unique stakeholders serving different roles in the organization at precise moments in time. The creation of these dynamic communications and their delivery is constantly put to the test by the pressure of knowing that for every minute the incident is allowed to persist, potentially hundreds or thousands of customer businesses are being harmed.

PagerDuty Operations Cloud Product Demo

Check out the PagerDuty Operations Cloud in action. It detects and analyzes event data from across your digital operations, automates infrastructure and workflows, and mobilizes the right team members to minimize the impact of disruptive events on customers, employees, and brand reputation. It will help your teams free up time and reduce operations costs so you can deliver seamless experiences for your customers.

Ping Test for Network Connectivity: Simple How-To-Guide

Reliable network connectivity is paramount for uninterrupted communication and efficient data transmission. The ping test is a valuable tool to assess network connectivity, identify potential issues, and troubleshoot them effectively. If you're seeking to troubleshoot network issues or test connectivity between hosts, this comprehensive guide offers step-by-step instructions and valuable insights for performing an effective ping command test.

The "people problem" of incident management

Managing incidents is already tricky enough, and you want to get to mitigation as quickly as possible. But sometimes it feels like organizing everything surrounding an incident is more difficult than solving the actual technical problem and you end up getting delayed or sidetracked during mitigation efforts. We call that scenario the “people problem” of incident management.

Unlocking effective emergency response

The duty to care for employees and protect them from undue risk has never been more important. Each year, the U.S. experiences an estimated 240 million calls made to 9-1-1. Because of this, lawmakers have enacted federal regulations like Kari’s Law and the RAY BAUM’s Act. These laws help ensure certain protections are provided during emergency situations.

SIGNL4 Onboarding: Routing Alerts to Teams using Distribution Rules

The SIGNL4 Onboarding series walks users through the processes of SIGNL4 from Signup to Alerts to Settings. Today's video focuses on sending alerts to the right users via distribution rules. Learn how to create distribution rules and route alerts to different teams using criteria included in the events. This video is packed with helpful tips to help you get the most out of your account.

Squadcast Named Category Leader in IT Alerting by G2 | Squadcast

🚀Squadcast has been recognized by G2 as a Category Leader in the IT Alerting category! Backed by immense customer love, advanced features, and the highest possible scores 💯— Squadcast has made it to the Leader Quadrant! This video offers all the related updates!

Our lessons from the latest AWS us-east-1 outage

In case you missed it, AWS experienced an outage or "elevated error rates" on their AWS Lambda APIs in the us-east-1 region between 18:52 UTC and 20:15 UTC on June 13, 2023. If this sounds familiar, it's because it's almost a replay of what happened on December 7, 2021, although that outage was significantly more severe and took longer to restore.
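
As a quick worked example of the arithmetic behind that window, the sketch below computes the length of the reported period of elevated error rates from the two UTC timestamps quoted above. The variable names are illustrative only.

```python
from datetime import datetime, timezone

# Start and end of the elevated-error-rate window quoted in the excerpt (UTC).
start = datetime(2023, 6, 13, 18, 52, tzinfo=timezone.utc)
end = datetime(2023, 6, 13, 20, 15, tzinfo=timezone.utc)

# Subtracting two datetimes yields a timedelta; convert it to minutes.
duration_minutes = (end - start).total_seconds() / 60
print(duration_minutes)  # 83.0
```

The reported window works out to 83 minutes, or 1 hour 23 minutes.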

Synthetic monitoring as Code with Checkly and ilert

This post will introduce Checkly, the synthetic monitoring solution, and their monitoring as code approach. This guest post was written by Hannes Lenke, the CEO and co-founder of Checkly. First, thanks to Birol and the ilert team for the opportunity to introduce Checkly. ilert recently announced discontinuing its uptime monitoring feature and worked with us on an integration to ensure that existing customers could migrate seamlessly. So, what is monitoring as code and Checkly?

Top 5 Use Cases for Custom Fields on Incidents

Chasing down critical information in disparate systems of record while trying to resolve an incident can make an already stressful situation even more taxing. Extra clicks, extra logins, copy/paste, socializing that information with other responders–it all wastes time and introduces more room for human error. Now PagerDuty customers can use Custom Fields on Incidents to enrich their incident data.

Featured Post

The Top 5 Trends on SRE Leaders' Minds in 2023: Insights from a Seasoned Executive

I've spent most of my career trying to solve big problems for people. In the early days at New Relic, we were trying to help people scale their systems without compromising on performance, cost, or the customer experience. Not an easy feat, but we gave them a solution that allowed them to accomplish their goals. The key was religiously listening to our customers talk about their wants, needs, hopes and fears. While I am rarely the smartest person in the room, which my partner rarely misses a chance to lovingly remind me, I always do my best to listen to what the brilliant folks in my sphere are talking about.

New related incidents functionality brings order to the chaos of highly complex incidents

We’ve all been there. You’re working through some rather frustrating blockers during an incident only to discover that you don’t own the dependency at fault. Or, you’ve been pounding away at an issue when a fellow engineer reaches out and asks if your service is affected by some particularly gnarly database failure they’re seeing. But then what? Do you merge efforts and work in parallel or head for a coffee break while the issue gets attacked upstream?

Understanding Major Incident Management: Beginners Guide

A major incident represents a critical event that poses a real or potential threat to an information system's confidentiality, integrity, or availability. Major incidents can disrupt normal operations, impact your customers, and may compromise the security of sensitive data.

Kubernetes Simplified: Understanding its Inner Workings

Kubernetes has revolutionized the world of container orchestration, providing organizations with a powerful solution for deploying, managing, and scaling applications. However, the complexity of Kubernetes can be daunting for newcomers. In this blog, we will demystify Kubernetes by breaking down its core components, revealing its operational principles, and guiding you through the process of running a pod.

What is Zero Trust Security and Why Should You Care?

Automation has become a game changer for businesses seeking efficiency and scalability in a rather unclear and volatile macroeconomic landscape. Streamlining processes, improving productivity, and reducing the incidence of human error are just a few benefits that automation brings. However, as organizations embrace automation, it’s crucial to ensure modern security measures are in place to protect these new and evolving assets.

The Unplanned Show, Episode 2: Hadijah Creary Demystifies Customer Success vs Customer Service

In this episode, Hadijah Creary breaks down what Customer Service teams are versus Customer Success teams. What do they care about? How can they each get more proactive to improve the overall customer experience? And why is it PagerDuty Customer Service Operations and not Customer Success Operations?

We can now notify you through PagerDuty

When we detect a problem with your site, we can notify you via mail, a Slack message, a webhook, or any of our other notification channels. This is enough for most of our users, but those who work in larger teams often need more flexibility. Today, we are launching our PagerDuty integration. PagerDuty is a cloud-based incident management platform that helps organizations improve operational reliability by providing real-time alerts, on-call scheduling, and incident tracking.

What is MTTR? Calculation and Reduction Strategies

In the fast-paced world of software development, every minute counts. When disruptions occur, whether they are minor or major system failures, organizations need to bounce back to maintain seamless operations. That's where MTTR (Mean Time to Repair) steps onto the stage as a game-changing metric. Are you ready to unlock the secrets behind reducing downtime, boosting performance, and ensuring software reliability?
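
As a rough illustration of the calculation the title refers to, MTTR is commonly computed as total repair time divided by the number of incidents. The sketch below assumes each incident is recorded as a (started, repaired) pair of timestamps. The function name and the sample data are hypothetical and are not taken from the article.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Return the average repair duration for (started, repaired) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum((repaired - started for started, repaired in incidents), timedelta())
    return total / len(incidents)

# Hypothetical sample data: one 30-minute repair and one 90-minute repair.
incidents = [
    (datetime(2023, 6, 1, 9, 0), datetime(2023, 6, 1, 9, 30)),
    (datetime(2023, 6, 8, 14, 0), datetime(2023, 6, 8, 15, 30)),
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Teams differ on when the clock starts and stops (detection, acknowledgement, or full restoration), so the timestamps fed into a calculation like this should follow whatever definition your organization has agreed on.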

IT Incident Management - What is it and how to do it?

Are you tired of dealing with IT incidents that seem to pop up at the worst possible times? Do you find yourself struggling to keep track of all the moving pieces involved in resolving incidents? If so, it’s time to revitalize your incident management strategy. In this article, we’ll explore the key pillars of incident process management, best practices, and how technology can help streamline your process.

Which Software Stack is best for IT service management?

IT-Incident Management - a hot topic and more important than ever in the digital age. Companies are increasingly relying on technology to maintain their operations, as any downtime can have catastrophic consequences. On average, one minute of downtime costs $9,000. Therefore, an efficient and especially organization-specific incident management system is essential. However, there are many components and options in incident management, so what software stack should you use?

Common Causes of Outages and Tips to Prevent Them

Recently, Ron DeSantis used Twitter Spaces to launch his presidential campaign. At least, he tried to. As you may have heard, the event was marred with technical difficulties, resulting in false starts, confused hosts, glitches, echoes, and the “melting” of servers. Of the more than 600,000 Twitter users who initially tuned in, less than half remained by the time they relaunched the event using a different account.

On-call management on the go: Introducing the Grafana OnCall mobile app

We’ve all been there: Sleeping peacefully in bed over the weekend, finally getting rest after a long week at your computer making AI-generated memes, er, writing code. Then at 3 a.m., your phone makes an ungodly sound, and you wake up startled, frazzled, and confused. When you finally type in your passcode to unlock your phone (because facial recognition doesn’t register your bleary-eyed, squinty face), you see an alert, and all dreams of sleep are over.

Maximizing Your Returns: The Proven ROI of Organizational Resilience

Recent years have been marked by a series of critical events that have challenged the resilience of organizations across the globe. From cyberattacks to natural disasters, these events have demonstrated the importance of strengthening organizational resilience. Companies that fail to prioritize resilience and prepare for the unexpected can face severe consequences, including lost revenue, damaged reputation, and even failure.

Streamline Incident Response with Komodor and Squadcast

With the growing popularity of Kubernetes as a container orchestration platform powering the microservices revolution comes greater complexity in managing, monitoring, and responding to incidents at scale. Challenges with real production environments include full visibility into your clusters and environment’s health, alongside real-time incident management and response.

Using DORA metrics Mean Lead Time for Changes to deliver iterations faster

Raise your hand if you like shipping changes quickly. (Yes, let's assume that everything you're shipping has value and isn't a vanity project). Chances are, you, the person reading this now, agreed with the above. When you start on a project, big or small, you want to keep any changes moving along and avoid getting stuck. The less time between the beginning and end of a project, the faster you can shift your focus to other things.

AWS CloudTrail vs CloudWatch: Features & Instructions

In today’s digital world, cloud computing is necessary for businesses of all types and sizes, and Amazon Web Services (AWS) is undoubtedly the most popular cloud computing service provider. AWS provides a vast array of services, including CloudWatch and CloudTrail, that can monitor and log events in AWS resources. This article will compare AWS CloudWatch and CloudTrail, looking at their features, use cases, and technical considerations.

AIOps and Automation: A Conversation Featuring Guest Speaker Carlos Casanova, Forrester Principal Analyst

At the beginning of 2023, I had a great conversation with Carlos Casanova, a Forrester Principal Analyst, in a webinar about how AIOps can help drive successful organizational change. In our conversation, Carlos divided the AIOps market into two camps: technology-centric (primarily APM/Observability players) and process-centric. PagerDuty is a process-centric solution leveraging multiple technologies.

Featured Post

After action reports: post-incident investigations

When something unexpected happens within the digital operations remit, software engineers put on their deerstalker hats and wax their fussy little moustaches, metaphorically. It's their time to play detective as they unravel the evidence and create the reports to explain the recent IT incident. But unlike with a hat-wearing Sherlock Holmes or a hirsute Hercule Poirot, cliff-hanger endings are not encouraged in software engineering.

Understanding Kubernetes Logs and Using Them to Improve Cluster Resilience

In the complex world of Kubernetes, logs serve as the backbone of effective monitoring, debugging, and issue diagnosis. They provide indispensable insights into the behavior and performance of individual components within a Kubernetes cluster, such as containers, nodes, and services.

What Is Root Cause Analysis?

Root Cause Analysis (RCA) is a systematic process designed to uncover the fundamental, underlying issues that lead to IT incidents. These 'root causes' are often masked by surface-level symptoms, making them challenging to identify without a systematic approach. Root Cause Analysis serves as a metaphorical excavation, drilling past the initial problems to discover deeper, hidden issues.

Incident Analysis: Understanding Importance and Benefits

Incidents and accidents can occur in various domains, from information technology and cybersecurity breaches to workplace accidents and transportation mishaps. When faced with such incidents, it becomes crucial to conduct a thorough analysis to understand the underlying causes and implications. Incident analysis goes beyond problem-solving; it offers valuable insights into preventing future occurrences and improving systems and processes.

Introducing powerful APIs and webhooks for Grafana Incident

Grafana Incident, Grafana’s powerful incident response tool, comes with a range of integrations out of the box, including Zoom and Google Meet spaces, GitHub and JIRA issues, and even a Google Doc template for post-incident review documents. However, every team has unique needs and workflows, and you may need to integrate with other systems not currently on our roadmap or even use your own in-house tools.

Unplanned, Episode 1: Damon Edwards Rages Against the Ticket Machine

In this, the inaugural episode of “Unplanned”, Dormain Drewitz talks to Damon Edwards about the “capacity conundrum” where everyone is working so hard, but everything takes too long and costs too much. We talk about the “coordination overhead” costs of getting unplanned work done, how generative AI is both adding complexity and offering to accelerate automating as much as you can, and four steps to creating capacity.

Proactive IT: Disaster Recovery Testing

In today's business environment, the continuity of IT systems is crucial to the success of an organization. Unforeseen disasters, such as infrastructure failures or cyber attacks, can severely impact the productivity of your organization. To mitigate these risks, IT departments must develop and implement robust disaster recovery (DR) plans. But how can you ensure that these plans work effectively in times of crisis?

The PagerDuty Operations Cloud | Strategic Overview

In this two-minute video, learn more about the PagerDuty Operations Cloud - the platform used by modern digital enterprises to automate and accelerate mission-critical operations work. The PagerDuty Operations Cloud is essential infrastructure that detects and diagnoses disruptive events, mobilizes the right team members to respond, and automates workflows across your digital operations - so that your business moves forward, faster.

Callable Flows - xMatters Support

In xMatters Flow Designer, you can use callable flows to initiate a major incident process in any workflow. Instead of including the same sequence of steps in each workflow, such as posting to a status page or opening a help desk ticket, you can build the sequence once as a separate workflow and then include that as a step in any of your workflows.

Generative AI for the PagerDuty Operations Cloud

When it comes to keeping your business’s lights on, you need to manage and orchestrate your operational activities, prioritize high-impact and urgent work, and maintain day-to-day precision. Trust is paramount during mission-critical, time-sensitive crisis response and the narrow margin for error means there is little room and low acceptance for generative AI hallucinations or false positives.

Using PostgreSQL advisory locks to avoid race conditions

The first moments of incident response can be among the most crucial, which in turn can also make them among the most stressful. There are many ways to ensure incidents are kicked off smoothly, but a recent focus of ours was to ensure they could be kicked off quickly. After all, the faster you're able to start mitigating your incident, the more successful you'll be!

The 5 Incident Severity Levels - And a Free Matrix

Just as a red flag warns of imminent danger, incident severity levels in IT Service Management (ITSM) act as crucial indicators that alert organizations to potential problems. By understanding and leveraging them, businesses can swiftly and effectively respond to incidents, minimizing their impact on operations. In the dynamic business operations landscape, unexpected disruptions are an unavoidable reality.