January 2022

No capes: the perils of being a hero-engineer

When I first started out as an engineer I really leant in to the idea of what’s often called “being a hero”; I would get to the office a bit early to make sure I could fix anything that had gone wrong overnight. I loved the camaraderie of someone outside engineering bringing their laptop over with a critical process broken for me to fix (even if I’d been the one to break it!). Being a hero feels really good for a while, but over time, it loses its shine.

Getting Started with Playbooks

It’s 2022: You’re good at your job, you’re maintaining modern systems, now you want to level up your team based on a solid foundation of their collective expertise. You want to standardize and centralize process documentation and make execution as easy and effective as possible so that everything runs smoothly, every time.

What's New: Updates to Event Intelligence, On-Call Management, Automation, Mobile, and More!

We’re excited to announce a new set of updates and enhancements to the PagerDuty platform. Recent updates from the product team span On-Call Management, Event Intelligence, and Mobile Products, along with PagerDuty Community & Advocacy Events.

Reliability Through Automation for Your Infrastructure and Applications at Scale

As technology becomes more SaaS-based and organizations deploy applications in multiple clouds, there are requirements for more visibility into the cloud environment and better incident response and resolution automation capabilities. The two elements required to achieve this are integrations and workflows in an incident response software solution and effective experimentation, research, and testing in the cloud and on-premise.

Intelligent Service Design

Hello and welcome to the fourth post in our EI Architecture series focusing on Intelligent Alert Grouping. Previously we have talked about how to train Intelligent Alert Grouping using incident merges (here) and how to configure your alert titles to improve default matching. In this post, we’re going to cover how service design can also impact your experience with Intelligent Alert Grouping as well as the PagerDuty app in general.

DevOps Tools (All of the Tools Your Team Needs)

Wondering about DevOps Tools? We explain the best tools for every step of the DevOps development process. What are DevOps Tools used for? DevOps relies on effective tools to help teams manage the entire software development lifecycle. These tools can automate tasks, monitor applications, and facilitate sharing of information between teams.

Training Intelligent Alert Grouping

We’re continuing on with our third piece about how to utilize and improve your Intelligent Alert Grouping (IAG)! In case you missed it, the first two blog posts describe the feature (here) and explain how it uses merging to group alerts (here). We alluded to today’s post at the end of last: today we’ll be discussing how to use alert titles to improve IAG matches.

What Your System Outage Notifications Need To Say

System outages happen to the best of us. Communicating with your customers and other stakeholders effectively during downtimes is vital to maintaining a solid relationship with them. When a system outage occurs, technical teams are tasked with swiftly locating the cause and resolving the issue, while communications teams are tasked with notifying stakeholders and customers about the outage to maintain transparency.

Using Event Orchestration to reduce noise and trigger next best action

We often hear from customers that they’re dealing with unmanageable levels of noise and complexity, which makes it harder to pinpoint root cause and get to resolution quickly. All this effort spent on sifting through noise, processing events, and gathering context results in a lot of wasted time. That’s why we’ve launched Event Orchestration, which became generally available to our Event Intelligence and Digital Operations customers on Monday.

Announcing our newest integration: Confluence

Using FireHydrant’s Runbooks, incident and retro data can be automatically sent to Confluence at any point in the incident lifecycle. For example, the moment you’ve resolved an incident FireHydrant can create a fresh Confluence page with all of the critical incident information stored in FireHydrant. When utilizing Runbook conditions, you can choose the perfect moment to send your FireHydrant retro to a Confluence workspace.

Sponsored Post

Five Ways Developers Can Help SREs

Reliability is a team game. The more developers and SREs collaborate, the greater the success of the product. In this blog, we have listed five best practices that developers can adopt to make the SRE's life easier. It is not easy to be a site reliability engineer. Monitoring system infrastructure and aligning it with the key reliability metrics is quite a daunting task. A software engineer's job, by contrast, is to deliver high-quality software.

Introducing CommsFlow for Context-Rich and Timely Updates to All Stakeholders

We’re so excited to announce our latest platform feature, CommsFlow™! This addition to the core Blameless product offering allows teams to keep stakeholders updated as the reliability of services and applications changes. With our new automated and customizable communication flows, on-call, engineering, and business teams feel a sense of accomplishment and, of course, stay informed.

Get Paid to Write About Mattermost Playbooks

Mattermost Playbooks help software engineering teams orchestrate their work across all tools and teams to plan projects and hit milestones by uniting your tech stack through a single point of collaboration. We want to see how our community is leveraging Playbooks in their own tech stack and share your creations with everyone so the whole community benefits. We’re doing this by launching a new effort to commission original blog articles that show Playbooks in action.

Episode 2: Mooving to Remix: Code You Will be Happy With

Episode 2 of Mooving to… dives into a new tool called Remix, a framework to help create front-end code you’ll love. This episode focuses on a new web framework that helps streamline your processes and eliminate downtime to the best of your ability. Thom Duran and Andrew Leonard of Moogsoft are joined by Kent C. Dodds, Director of Developer Experience at Remix.

Respond to incidents faster than ever with the New Mobile Incident Details Redesign

We’re working from anywhere, are you? With the PagerDuty mobile app, you’re always just a tap away from all the incident response tools you need. The new mobile Incident Details screen provides you with a more compelling visual experience and easier access to all your favorite features during incident response. Run a play, add a priority or note, post a status update, and more with the new carousel.

Event Orchestration Demo: Reduce Noise & Manage Event Routing with PagerDuty

Say hello to the next generation of event rules and cut down on manual event processing. With Event Orchestration, you can create custom logic with nested rules to enrich, modify, and control routing or trigger automation actions based on event conditions at scale. (This feature is only available to Event Intelligence and Digital Operations plans).
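To make the idea of nested rules concrete, here is a minimal Python sketch. It is not PagerDuty's rule syntax or API: the `Rule` class, the `evaluate` function, and the service names are all hypothetical, chosen only to illustrate how a matching parent rule can enrich an event and then hand it to more specific child rules that may override the route.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical rule: a condition, optional enrichment, optional route, nested children."""
    condition: Callable[[dict], bool]
    enrich: dict = field(default_factory=dict)
    route_to: Optional[str] = None
    children: list = field(default_factory=list)


def evaluate(rules, event, default_route="default-service"):
    """Walk the rule tree: at each level the first matching rule wins, then its children are tried."""
    enriched = dict(event)  # copy so the caller's event is not mutated
    route = default_route
    current = rules
    while current:
        match = next((r for r in current if r.condition(enriched)), None)
        if match is None:
            break
        enriched.update(match.enrich)
        if match.route_to:
            route = match.route_to
        current = match.children
    return enriched, route


# Example rule set (all names are made up for illustration).
rules = [
    Rule(
        condition=lambda e: e.get("source") == "db",
        enrich={"team": "data"},
        route_to="database-service",
        children=[
            Rule(
                condition=lambda e: e.get("severity") == "critical",
                enrich={"priority": "P1"},
                route_to="database-major-incidents",
            )
        ],
    )
]

enriched_event, route = evaluate(rules, {"source": "db", "severity": "critical"})
unmatched_event, unmatched_route = evaluate(rules, {"source": "web"})
```

In this sketch an event that matches no top-level rule falls through to the default route unchanged, which is one simple way to keep unclassified events from being dropped.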

AWS Re:Invent 2021 - Accelerate Your Cloud Migration for Financial Services

Cloud migration and modernization projects for financial services are very complex initiatives with added challenges of visibility and incident response. Here’s how we can help accelerate cloud adoption while reducing customer impact and streamlining and automating incident response.

Communicating to Users During Incidents

Imagine you're having a regular day at work, opening up your browser, double checking something for a client in that web app your team built for them, when suddenly, you see this screen: You hit refresh a few times, just to be sure. Nope. Still down. What happens next depends on how well your team has planned for incidents like this (some folks call it unplanned downtime).

Improving your team's on-call experience

Your engineers probably dislike going on-call for your services. Some might even dread it. It doesn't have to be this way. With a few changes to how your team runs on-call, and deals with recurring alerts, you might find your team starting to enjoy it (as unimaginable as that sounds). I wrote this article as a follow-up to Getting over on-call anxiety.

Getting over on-call anxiety

You've joined a company, or worked there a little while, and you've just now realised that you'll have to do on-call. You feel like you don't know much about how everything fits together, how are you supposed to fix it at 2am when you get paged? So you're a little nervous. Understandable. Here are a few tips to help you become less nervous.

Get Started with Playbooks Permissions

The goal of Mattermost Playbooks is to help teams consistently orchestrate any and all recurring workflows. A Playbook is a prescribed, repeatable process that a team has agreed on and formalized as a collaborative checklist saved on their Mattermost server. We at Mattermost use Playbooks for incident collaboration, customer onboarding, and product releases, along with many other complex processes.

A single pane of glass for automatic incident response for Bridgeport Public School District

“I have been doing this for 20+ years and have been using literally every product out there. Derdack is unique at how issues are addressed and communicated out because of the seamless integration, maturity and flexibility of the platform. Working with Derdack has been a game changer for us and helped us to do more with less.” Jeff Postolowski, Director Information Technology Services, Bridgeport Public School District

Sponsored Post

What is Incident Response?

When a service is down, a system is failing, or a security issue is in the midst of occurring, organizations need a solid incident response process to get up and running again. Incident response isn't just for high severity, lights out incidents either; if you've rebooted your computer to fix a problem, you've been an incident responder yourself! Incidents happen, and any successful organization knows that instead of pretending that one day nothing will ever go wrong, it's far more useful to develop a comprehensive operational response plan. And to do so, you need to know what incident response is! Let's get into it.

Improve Incident Response by Getting Control of Your (Unintelligent) Swarm

Incidents happen. Things go wrong. Systems fail. Sometimes they fail in unexpected and dramatic ways that create Major Incidents. PagerDuty makes a very specific distinction between an incident and an Incident. Your organization may also make such a distinction. Determining if an incident is major or not can come down to a number of factors, or a specific combination of factors, like the number of services affected, the customer impact, and the duration of the incident.

Achieving Maximum Patient Satisfaction Through Effective Clinical Communications

Judit Sharon, CEO and founder of OnPage Corporation, sits down with Healthcare Innovation to discuss how advanced, effective clinical communication systems help teams achieve ultimate patient satisfaction. How has the landscape around time-sensitive communications between and among clinicians and others in patient care delivery, evolved in the past few years?

Benefits of Enterprise Alert's Mobile App

Being in touch with your customers is key to any business. We at Derdack pride ourselves on being customer-first when it comes not only to product enhancements and features but also to support and building a customer/vendor relationship that lasts for years. We recently took a trip to Texas to visit several customers and the feedback was invaluable! We learned a lot more from meeting face-to-face than we would have if the meetings had been held virtually, like over Teams.

Presenting Role-Based Access Control for Squadcast users

Role-Based Access Control is an effective means to enforce authorization and ensure only authorized personnel have access to sensitive data within the platform. This blog explains how to implement RBAC in your organization's Squadcast account to achieve maximum security and confidentiality during Incident Management. We recently released this new functionality, called RBAC, into Squadcast to help organizations fine-tune the access control provided to users within our platform.
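As a rough illustration of the role-based idea, and not of Squadcast's actual roles or API, the following Python sketch maps roles to permissions and checks whether a user holding some roles may perform an action. The role names, permission names, and `is_allowed` helper are all assumptions made up for this example.

```python
# Hypothetical role-to-permission mapping; not Squadcast's real role model.
ROLE_PERMISSIONS = {
    "admin": {"read_incident", "update_incident", "manage_users"},
    "responder": {"read_incident", "update_incident"},
    "stakeholder": {"read_incident"},
}


def is_allowed(user_roles, permission):
    """Return True if any of the user's roles grants the requested permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in user_roles
    )


# Example checks with made-up users.
can_stakeholder_update = is_allowed(["stakeholder"], "update_incident")
can_responder_update = is_allowed(["responder"], "update_incident")
```

Unknown roles grant nothing in this sketch, so access is denied by default unless a role explicitly lists the permission.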

Canary Deployments | The Benefits of an Iterative Approach

At Blameless, we want to embrace all the benefits of the SRE best practices we preach. We’re proud to announce that we’ve started using a new system of feature flagging with canaried and iterative rollouts. This is a system where new releases are broken down and flagged based on the features each part of the release implements. Then, an increasing subset of users are given access to an increasing number of features.
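One common way to give "an increasing subset of users" access to a flagged feature is percentage-based bucketing. The Python sketch below is an assumption about how such a rollout could work, not a description of the Blameless system: the function names and the hashing scheme are illustrative only. Each user is hashed into a stable bucket from 0 to 99, and a feature is enabled when the bucket falls below the current rollout percentage.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Enable the feature for users whose bucket is below the rollout percentage."""
    return rollout_bucket(user_id, feature) < rollout_percent


# Raising the percentage only ever adds users; nobody who already had the
# feature loses it, because each user's bucket never changes.
users = [f"user-{i}" for i in range(1000)]
canary_group = {u for u in users if is_enabled(u, "new-dashboard", 10)}
wider_group = {u for u in users if is_enabled(u, "new-dashboard", 50)}
```

Including the feature name in the hash means different features get different canary groups, so the same users are not always the first to see every change.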

Want to accelerate your organization's digital innovation in 2022? Here's three ways to do it.

After two years of sky-high spending on cloud and related technologies, 2022 is the crunch point for corporate IT and digital leaders. Investments in technology helped facilitate the rapid shift to mass hybrid working and supported businesses to embrace the digital-first models of the new normal. But beyond merely investments to support new working styles, leaders also must ensure their organization continues to innovate.

What is a Workflow?

Workflows are no stranger to the DevOps world. But where did this term come from, and what does it really mean? Perhaps it’s no surprise that workflows originated from the industrial revolution, which brought powerful machinery for mobilizing huge workforces unlike ever before. To maximize the potential of these new industrial tools, people had to first figure out the best way to use them to get work done as efficiently as possible.

Effective Incident Management: How to Improve Collaborative Software Development

Are you using Azure DevOps as the starting point of your delivery process on the Azure cloud? Join this webinar to learn advanced tips and tricks for simplifying and accelerating your CI/CD pipelines with Azure DevOps and the JFrog Platform. Sharing a detailed demo of a real-world release pipeline triggered from Azure DevOps, we’ll review best practices and hard-won lessons for how you can streamline your end-to-end process and ensure it meets the security and quality requirements of large-scale enterprise delivery.

PagerDuty Named a Leader in the Latest G2 Grid for AIOps Platforms

At PagerDuty, we are committed to championing the customer — it’s a core company value. Our product has to provide great value, we have to provide excellent service, and we need to make it simple to do business with us. The Winter 2022 G2 Grid for AIOps Platforms Relationship Index showcases these values and highlights PagerDuty as a leading player in the AIOps space.

xMatters Ninja Release Updates - xMatters Demo

Join Belinda Joseph, Sr. Director of Marketing Events, and Corey Blakeborough, Solutions Architect, as they highlight and walkthrough some of the fantastic new features that rolled out during the xMatters Ninja release. Some of these great new features include the service dependencies map, the automation of digital and business response, and brand new unified alert reports!

Equitably distribute on-call responsibility and streamline incident response with Round Robin Scheduling

PagerDuty is excited to introduce Round Robin Scheduling. Round Robin Scheduling allows teams to equitably distribute on-call shift responsibilities amongst team members. Automatically assigning new incidents across different users or on-call schedules on an escalation level ensures that teams are resolving incidents as efficiently as possible. And, by balancing the workload across multiple users, there’s less risk of burnout.
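The round-robin idea itself is simple to sketch. The Python below is a generic illustration, not PagerDuty's implementation or API: the `make_round_robin_assigner` helper, the responder names, and the incident IDs are all hypothetical. Each new incident goes to the next person in a fixed rotation, wrapping around at the end.

```python
from itertools import cycle


def make_round_robin_assigner(responders):
    """Return a function that assigns each new incident to the next responder in turn."""
    if not responders:
        raise ValueError("need at least one responder")
    rotation = cycle(responders)  # endless iterator over the responders, in order

    def assign(incident_id):
        return incident_id, next(rotation)

    return assign


# Example with made-up responders and incident IDs.
assign = make_round_robin_assigner(["alice", "bob", "carol"])
assignments = [assign(f"INC-{n}") for n in range(1, 6)]
assignees = [who for _, who in assignments]
```

With three responders and five incidents, the fourth and fifth incidents wrap back to the first two responders, so no one receives a second incident before everyone has received a first.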

What exactly is Digital Operations?

IT modernization (for example, cloud computing), digital optimization, and the creation of new digital business models are all examples of digital transformation. The concept of combining company processes with agility, intelligence, and automation to build operational models that delight consumers while also improving performance is known as digital operations.

Intelligent Swarming vs. Tiered Support: How Customer Service Teams can use PagerDuty to Swarm Critical Issues

Most support organizations today adopt some form of the traditional tiered support model. It is one that is based on a process of escalations and customer handoffs. Under this model, customer issues get escalated through multiple levels of a support hierarchy, with three tiers being a common workflow.

Learn how PagerDuty can help address critical work across all departments

PagerDuty’s Operations Cloud helps organizations with critical work across the entire business, from IT teams to customer service to human resources, marketing, sales, and more. With PagerDuty, organizations can prioritize accurately, respond efficiently, and reduce operational overhead. In this blog post, we’ll share examples of how PagerDuty can be used for critical work in all departments, not just IT, using our new Solution Guides for Business.

SRE and the Practice of Practice

Part of the trepidation of being on-call is encountering unfamiliar emergency scenarios where we are surprised by suddenly not knowing how to do our jobs. We feel lost and alone, complicated by the world around us, powerless to resolve or even mitigate the problem. On-call need not be a solo affair full of fear and anxiety. There are ways we can employ practice and open collaboration outside of incidents to prepare us better.

What the Ideal Incident Lifecycle Should Be

Today’s organizations are managing increasingly complex IT ecosystems and are pressured to deliver on innovation, all while trying to maintain service performance and reliability to keep up with the always-on digital economy. With IT complexity growing exponentially, incidents have become a common, if not day-to-day, struggle for many businesses. Incident management is the process or method that modern organizations use to prepare for and respond to service disruptions.

The Universal Language: Reliability for Non-Engineering Teams

We talk about reliability a lot from the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application is something that impacts the entire business with significant costs. A mindset of putting reliability first is a business imperative that all teams should share.

Building an SRE Team with Specialization

As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization.

The Human Side of Being On-call: 5 Lessons for Managing Stress, Anxiety, and Life While Being On-call

Within DevOps, we talk a lot about the on-call process—but what about the human side of being on-call? For example, what are effective ways of managing stress and anxiety during a shift? How can one manage life situations that make being on-call difficult—such as being responsible for watching the kids during an on-call rotation? And how can an empathic team culture help prevent burnout and turnover?

Stakeholder Notifications

With the AlertOps ServiceNow integration, you can automatically send updates to stakeholders. Set each update to use the notification channel you choose (email, voice, SMS, mobile app, and chat). Set triggers to send alerts on any condition, such as SLA breaches, status changes or any custom field change. Automatically send updates at time points that you set. AlertOps also logs all activities in ServiceNow so you can track everything in one place.

Major Incident Notifications

With the AlertOps ServiceNow integration, during a major incident, you can automatically send notifications to targeted groups of users (managers, stakeholders, customer service). Each group can have its own unique status update fields, so you can send contextual information with dynamic updates to each group at regular intervals, and a final message when the incident is resolved. Set each notification to use the notification channel you choose (email, voice, SMS, mobile app, and chat).

Squadcast + Amazon EventBridge: Routing Alerts Made Easy

Amazon EventBridge is an AWS serverless event bus service making it easier to build event-driven applications. It uses events generated from your applications, integrated Software-as-a-Service (SaaS) applications, and other AWS services. It delivers a stream of real-time data from event sources to target services like AWS Lambda. You can also set up routing rules to determine the destination where you wish to send the data and build decoupled application architectures.

Fairwinds: Kubernetes Guardrails and Governance to Enable Developers and Reduce Risk

Customers of both PagerDuty and Fairwinds Insights can generate and customize PagerDuty incidents for critical issues in their Kubernetes clusters. This capability includes over 100 checks that have been built into Fairwinds Insights for things like container vulnerabilities, insecure workload configurations, runtime security events, and resource usage, as well as custom user-defined policies for compliance and internal requirements.

Enterprise Alert 9.2 Update Brings Great Flood Protection Enhancements

We have released another update for Enterprise Alert 9 (version 9.2) which enhances the flood protection mechanism. This will help you to set up scenarios where you do not want the flood protection to be active for every notification channel. Read all the details in this article.