October 2021

Panel: Handling Incident Response - Dash 2021 (Datadog, PagerDuty)

When customer-impacting downtime happens, it’s crucial that responders are prepared and can resolve these issues as quickly as possible. Knowing the right tools to use, wherever you are working, helps you put a well-defined strategy in place to come together as a team, work the problem, and get to a solution quickly. In this roundtable discussion, PagerDuty and Datadog engineers chat about incident response and how we use all the tools at our disposal to respond quickly and effectively.

Is it a ghost or is it Flow Designer?

Maybe it’s the time of year or the change in temperature, but sometimes using xMatters Flow Designer can seem a little… spooky? Maybe it’s the unlimited capability it offers, or maybe it’s that it can make changes for you without you being aware they’re taking place. But every once in a while, we’re not sure whether we’ve just set up workflows too effectively or whether something a touch paranormal is happening with xMatters.

Improve your on-call experience with Datadog mobile dashboard widgets

Life happens—even when you’re on-call. You can’t take your laptop everywhere, but whether you’re on the train, at dinner, or at the gym, you can count on the Datadog mobile app for access to key data about the status and performance of your applications. Now, you can use Datadog mobile widgets to build an on-call mobile dashboard directly on your phone’s home screen, so it’s even easier to track the data you care about from anywhere.

Differences between Site Reliability Engineer Vs. Software Engineer Vs. Cloud Engineer Vs. DevOps Engineer

The evolution of Software Engineering over the last decade has led to the emergence of numerous job roles. So how do a Software Engineer, DevOps Engineer, Site Reliability Engineer, and Cloud Engineer differ from one another? In this blog, we drill down and compare the differences between these roles and their functions.

SRE and Fighting Games

When learning SRE, you might find its principles a bit unintuitive. For example, it might be difficult to learn why aiming for 100% reliability is wasteful, or how reliability isn’t the same as availability, or why failure ought to be celebrated. Believe it or not, there is a method to these ideas. My goal in this article is to shed light on the principles and to leave you a believer, such that you’ll take steps towards starting SRE practices.

Customer Service Ops & PagerDuty Zendesk Integration v3 Full Case Ownership Use Case

PagerDuty's Zendesk Integration enhances communication between engineering and support teams by providing visibility to high-impact incidents via the PagerDuty Status Dashboard that is integrated into the Zendesk interface. Automate workflows for a fast-paced support team and provide the right level of information so they can interact knowledgeably with their customers while also reducing time and effort.

PD, Salesforce Service Cloud, Slack: Proactive Case Escalation & Slack-First Intelligent Swarming

Learn about and see how PagerDuty, Salesforce Service Cloud, and Slack empower collaboration across your organization to accelerate time to resolution. Proactively improve customer satisfaction in real time and break down silos to connect customer service teams with engineering teams to address incidents quickly when seconds matter. Enjoy greater control when resolving issues and anticipating customers' needs through an incident command console that gives customer service agents and stakeholders instant updates on critical, customer-impacting issues.

Five steps to better customer communication

When you’re deep into an incident and there are alerts firing, decisions to be made, and people to escalate to, it’s easy for outward communication with your customers to fall off the priority list. In many regards this makes sense; it seems natural to put all of your focus and energy into minimising the impact and getting things back on track as soon as possible.

Announcing our $1.9M round of funding

It is with a great deal of anticipation and excitement that I’m announcing our $1.9M round of funding, led by StartupXSeed Ventures along with participation from marquee enterprise SaaS investors Powerhouse Ventures, Secure Octane fund, Kwaish Ventures, Supermorpheus, Titan Capital, 100X Entrepreneurs, Viral Bajaria (CTO, 6Sense), Premal Shah (SVP, 6Sense), Hitesh Chawla (CEO, SilverPush), Sumit Jain (CTO, BirdEye) and existing investors Anand Chandrasekaran (EVP, Five9), Rajesh Sawhney (GSF), Ashish To

What's New: Extending our Datadog Capabilities With New PagerDuty Widgets

In the last two years, we have seen the rise of remote and hybrid work, and with that, a proliferation of tools and apps needed to support critical communication and collaboration. Finding that app-life balance has become increasingly complex, so simplifying “how” we work is key for every organization.

Strategies to Reduce Hospital Readmission Rates

The Centers for Medicare & Medicaid Services (CMS) scrutinizes hospital readmission rates across the U.S. each year, and it levies financial penalties on organizations that overshoot acceptable hospital readmission rates. As healthcare systems across the country embark on a journey to introduce patient-centric models to their organizations, they must align their resources with ever-changing regulations for them to thrive.

Now Available: Private Slack Channels

Ever heard the saying “Too many cooks”? If you’ve responded to incidents, you’ll likely understand the parallels. There are cases when incident command on a public channel isn’t the best option. Whatever your reason, we’ve got you covered. Users can now spin up a private Slack channel for an incident. Read more about how to do this here.

Service Profile: Activity Tab Updates

PagerDuty's new service profile enhancements allow you to better command and control incidents directly from the Service Profile. Now you can perform bulk actions on incidents like acknowledge or resolve, search by incident ID, add and view change integrations, browse resolved incidents, view related escalation policies from the service profile header, and more.

Why ChatOps & Incident Management are the Perfect Pair

ChatOps has become an integral part of software development and IT operations, as teams rely on automated notifications to take the place of manual alerts. In the past, if there was an alert, someone would need to manually find that notification. Then, they would have to contact team members one by one so they could start working on a resolution. In this complex network of communications, it was easy to lose information, duplicate work, and simply waste time coordinating the team.

Next Generation Slack Migration Tool and Stakeholder Updates Demo

Learn more about PagerDuty's Collaboration Applications that help you streamline incident remediation. Enjoy these demos of our latest updates to our PagerDuty Slack and Microsoft Teams Applications including the Webhook Migration Tool, Stakeholder Updates, and Resolution Notes.

Automated Diagnostics for Incident Response Demo

Learn how you can speed up resolution times with Automated Diagnostics. Automate away as much manual toil as possible so teams can work more productively. Learn how teams across the organization can embrace workflows that help to diagnose and remediate incidents.

Runbook Automation: Rundeck Service Ownership Demo

Learn how PagerDuty Runbook Automation enables developers and service owners to equip other engineers, such as operations engineers or other developers, with mechanisms to help them support their services. Service owners can let other team members help support their services via automated runbooks that enable others to apply short-term fixes, reducing escalations to service owners.

When built-in alerting is not enough

Many ITOM or ITSM tools come with built-in features for alerting and notifications and are able to send at least an email or text notification to operations teams when incidents occur. But is this reliable enough to respond to and handle major and critical incidents? Recently, we have been surprised to see more and more monitoring tools listed as alerting tools on review platforms like G2.

Postmortem Pitfalls

Last week, we spent some time talking to Gergely Orosz about our thoughts on what happens when an incident is over, and you're looking back on how things went. If you haven't read it already, grab a coffee, get comfortable, and read Gergely's full post Postmortem Best Practices here. But before you do that, here's some bonus material on some of our points.

A developer's guide to programmatically overcome fear of failure

People are more than happy to talk about their successes, but if you ask them about their failures, they can be much more hesitant to share. Failure is a subject that, interestingly enough, is entangled with the emotion of shame. Yet it’s integral to achieving anything novel, and the learnings that come from failure are unparalleled. So, let’s find ways to get more comfortable with failing, and figure out why people fear it.

Incident Management Metrics That Matter - 2021

What are the key incident management metrics and KPIs? How important is it to track your team’s performance? If you are not doing so already, the time is right to get your finger on the pulse by better understanding and managing your organization’s key incident management metrics. How a company manages IT incidents matters and, most importantly, the process has the power to impact sales: recent studies indicate 52% of U.S.

OnPage Clinical Communication and Collaboration Platform

Modern healthcare teams require a modern solution to streamline clinical communications and medical workflows. In life and death situations, it’s critical that physicians receive immediate alerts and messages to provide patient care promptly. OnPage is the industry’s most trusted clinical communications platform. OnPage is more reliable and secure than traditional pagers. The system enables care teams to easily communicate and achieve maximum patient satisfaction.

How Important is SaaS Reliability? 90% of Business Leaders Say "Very Important"

A couple of weeks back, Blameless attended SaaStr 2021, the go-to event for any business Go-to-Market (GTM) team, which has been running since 2012. Our decision to sponsor was made in early 2020. Back then, we had no idea how long the pandemic would last or that it would be a full 18 months before we’d be able to do a physical event.

Uptime/SLA calculator: what is an SLA and how to calculate it?

A Service Level Agreement (SLA) is a document that details the expected level of service guaranteed by a vendor or product. This document generally sets out metrics such as uptime expectations and any compensation owed if these levels are not met. For example, if a provider advertises an uptime of 99.9% and exceeds 43 minutes and 50 seconds of service downtime in a month, technically the SLA has been breached and the customer may be entitled to some type of remuneration depending on the agreement.
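The arithmetic behind that figure can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's calculator: the function names `allowed_downtime` and `sla_breached` are hypothetical, and it assumes the measurement period is an average month (365.25 days divided by 12), which is what yields roughly 43 minutes 50 seconds at 99.9%. A real SLA defines its own measurement window and exclusions.

```python
from datetime import timedelta

# Assumption: the SLA is measured over an average month (365.25 days / 12).
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(uptime_percent: float,
                     period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Return the downtime budget implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The budget is the fraction of the period that may be spent down.
    return period * ((100 - uptime_percent) / 100)


def sla_breached(uptime_percent: float,
                 observed_downtime: timedelta,
                 period: timedelta = AVERAGE_MONTH) -> bool:
    """Return True when observed downtime exceeds the downtime budget."""
    return observed_downtime > allowed_downtime(uptime_percent, period)


if __name__ == "__main__":
    budget = allowed_downtime(99.9)
    print(f"99.9% uptime allows about {budget.total_seconds() / 60:.2f} minutes per month")
    print("44 minutes down breaches the SLA:", sla_breached(99.9, timedelta(minutes=44)))
```

Passing a different `period`, such as `timedelta(days=365)`, gives the yearly budget for the same percentage.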

Unlocking Resilience - Episode 1 - William Shatner - Resilience Makes the Leader

Everbridge CEO David Meredith talks with award-winning entertainer William Shatner about the role of resilience in leadership. Shatner reveals insights into his experience with numerous leaders and CEOs. From discussing his own experience as a leader in film, television, and beyond, to developing his personal brand of questing for knowledge, Shatner and Meredith touch on leaders of popular companies such as Priceline and Amazon, just as Shatner heads into space on Blue Origin with Jeff Bezos. Listen for more about Shark Week, Rescue 911, and Shatner’s upcoming album.

Intelligent Alert Grouping: What It Is and How To Use It

It’s 2 AM and you’re paged when you’re still awake – how well can you find what you need to fix the latest mistake? When the incident begins it might only be impacting a single service, but as time progresses, your brain boots, the coffee is poured, the docs are read, and all the while the incident is escalating to other services and teams whose alerts you might not see if they’re not in your scope of ownership.

People Risk Management and Agile Organizational Resilience

As COVID-19 has changed the landscape of global business travel, organizations must respond with agile, comprehensive plans that can account for continually evolving risk environments and regulatory requirements. It has become necessary for many organizations to revise old outlines and plans to match the new realities.

What Operational Maturity Looks Like Today With PagerDuty's Kyle Duffy

Companies that underwent accelerated digital transformations during the past 18 months are looking to understand how they can improve their operational maturity to handle the increase in complexity. This is paramount to an organization’s future success.

Ray Baum's Act

RAY BAUM’s Act requires that first responders have the information needed to pinpoint the “dispatchable location” and quickly reach a 9-1-1 caller regardless of the device they dial from or their exact location inside a large building. Whether the calling device is wired, wireless, on-premises, or remote, if it connects to an MLTS it will fall under the FCC’s enforcement.

What is "Dispatchable Location"?

All businesses in the United States must now comply with Section 506 of RAY BAUM’s Act. This requires organizations to automatically provide emergency call centers, or Public Safety Answering Points (PSAPs), with “dispatchable location” information alongside every emergency call placed from their network. However, defining exactly what that information must include can be a challenge.

4 Pressures at Tech Companies xMatters Can Help Relieve

Technology companies are at the forefront of innovation, changing the way consumers and the general public interact with their everyday lives. As the late Stan Lee so wisely stated, “with great power comes great responsibility,” and this heightened pressure often leaves little room for error when an issue arises—which happens more often than you’d think.

OnPage for Clinical Communication and Collaboration

Modern healthcare teams require a modern solution to streamline clinical communications and medical workflows. In life and death situations, it’s critical that physicians receive immediate alerts and messages to provide patient care promptly. OnPage is the industry’s most trusted clinical communications platform. OnPage is more reliable and secure than traditional pagers. The system enables care teams to easily communicate and achieve maximum patient satisfaction.

Process binds technology and people in cloud maturity success

This is the final blog in our series focusing on CloudOps maturity, where we’ve been looking at the key findings from a recent IDC study, commissioned by PagerDuty. In our previous blogs, we discussed the people-based transformations and the technological changes that organizations must undergo to mature their CloudOps practices.

Kari's Law

Under Kari’s Law, any calling device within your U.S. organization must be able to directly dial 9-1-1, without a prefix. All devices that can dial a phone number must have these capabilities. Failure to comply with this 2020 law could result in penalties from the United States Federal Communications Commission (FCC). Luckily, it’s not difficult to reach 100% compliance with the right guidance and technology. This article will show you how to get up to speed quickly.

Sponsored Post

AIOps - What It Is, Why It Matters, and Advice for Adopting It

The link between DevOps and artificial intelligence for operations (AIOps) has only started to become clear within the last few years. Monitoring and alerting have evolved from a "black box approach," where you don't actually know what's happening, into observability, where you have access to data that provides everything you could possibly need to know about your IT systems. How does AIOps come into play? AIOps is the practice of applying artificial intelligence, machine learning, and advanced analytics to automate and improve IT operations. Since it entered as a formal discipline with Gartner in 2016, IT teams have been trying to figure out how to employ it to make their lives easier.

Should you care about AIOps? Obviously.

There's a lot of hype in the marketplace about AIOps right now, and there are a lot of people who've got some interesting ideas about what it should be. The most common idea that I hear is that it's essentially a layer of AI magic that sits across everything that you've got in your IT tooling today, makes sense of all of that for you, and then will decrease the number of incidents you have and reduce your MTTR...

Developing A Disaster Recovery Plan

While it may seem like a disaster recovery plan and a business continuity plan are the same, businesses must consider their differences. A Business Continuity Plan (BCP) is an umbrella program comprised of various smaller parts that aim to keep operations running smoothly before, during, and after a disaster occurs. A Disaster Recovery Plan (DRP), on the other hand, zeroes in on how to remediate the disaster as it transpires.

Incident Management Process: 6 Tips to Better Prepare Your IM Process for the Holiday Season

Holiday retail sales are likely to increase between 7% and 9% in 2021, according to Deloitte’s annual holiday retail forecast, with holiday sales totaling $1.28 to $1.3 trillion during the November to January timeframe. Deloitte also forecasts that e-commerce sales will grow by 11-15%, year-over-year, during the 2021-2022 holiday season.

How Patient-Centered Care Improves Patient Outcomes

The patient-centered care (PCC) model enhances the way providers interact with patients during the care delivery process. Clinicians that show compassion and empathy toward patients are more likely to achieve meaningful, positive doctor-patient relationships. Indeed, care teams that prioritize PCC have a proven approach to improving patient satisfaction and increasing patient retention.

Evaluating Opsgenie Alternatives

Atlassian’s Opsgenie is a leading incident alerting and on-call management tool, helping businesses manage their incident response and resolution needs. As part of the Atlassian product suite, Opsgenie has become one of the most popular solutions in the industry. But it’s not the only incident management tool on the market, and it’s vital, when looking at Opsgenie and its alternatives, that you do a deep dive into their features and abilities.

How Your ITSM Tool & PagerDuty Make a Dynamic Duo for Real-Time Work

There’s an incident. Your teams need to communicate with the development team that owns the service, but that team is too busy to stop and chat. Meanwhile, you in central IT have business leaders asking for updates, angry internal users calling the help desk, and customer service representatives asking for information. You have hundreds of tickets all pertaining to the incident in your ticketing system.

What SREs Can Learn from Facebook's Largest Outage

Facebook’s October 2021 outage was the type of event that gives SREs nightmares: A series of critical business apps crashed in minutes and remained unavailable for hours, disrupting more than 3.5 billion users around the world and costing about 60 million dollars. As incidents go, this was a pretty big one.

Enterprise Resilience During a Severe Weather Crisis

Since 2019, there has been a 40% increase in weather-related events, causing a staggering $80B in insured losses, according to the Allianz Risk Barometer. As the world emerges from a global pandemic, more and more business leaders are no longer just preparing for 'one big disaster,' but rather preparing their companies to be agile against several severe weather threats.

Evolution of Mass Notification into Critical Event Management

According to the 2020-21 MIM Annual report, 73% of respondents felt their companies did not invest enough into Major Incident management. While your organization likely has a stand-alone mass notification tool, it is often no longer enough to handle a critical event in the most effective way. This is where a Critical Event Management platform comes in.

PagerDuty Integration Spotlight: Honeycomb

Honeycomb delivers observability for modern engineering and DevOps teams to observe, debug, and improve production systems efficiently. The PagerDuty + Honeycomb integration uses Honeycomb Triggers to notify on-call responders based on alerts sent from Honeycomb. This integration is maintained and supported by Honeycomb. Liz Fong-Jones from Honeycomb joined us live on Twitch to share more about how Honeycomb and PagerDuty can be used together to help your teams and to do some live investigation into Honeycomb’s own performance data.

Everbridge Public Warning Leader Wins 2021 Stevie® Award Honoring Female Executive of the Year

Following recent wins in the United Kingdom, Estonia and with one of Europe's most populous countries, as well as the launch of Australia's next-generation national alerting system, Everbridge leads with more countrywide Public Warning deployments than any other provider across the Americas, EMEA, and APAC regions.

Monthly Moo Update | October 2021

There are a number of monitoring and observability solutions on the market today. It almost reminds me of the automobile market and the endless number of automobiles available. Sure, they all get you from point A to point B, in some way. But some automobiles do it faster, smoother, more efficiently, with guidance, more comfort, storage space, perhaps towing capability, and even autonomously. Moogsoft is the automobile you’ve been dreaming about in the monitoring and observability market.

FireHydrant expands Reliability Platform with Service Catalog

Today, we are happy to announce the launch of Service Catalog to help you better manage, query, and learn about the services that exist in your infrastructure. At FireHydrant, we envision a world where all software is reliable, and we’re on a mission to help every company that builds or operates software get closer to 100% reliability. Service Catalog helps you get closer to 100% reliability.

4 xMatters Use Cases That May Surprise You

xMatters is part technology, part service reliability, and a little bit of magic. If you’ve spent time on the xMatters website, you’ll likely have seen a number of valuable use cases for the platform—it can alert SREs when there’s a website outage, it can accelerate product development for DevOps teams, it can manage on-call schedules and alerts for support teams.

Incident Response: A Step-by-Step Guide to Managing Incidents

Looking into Incident Response? We explain incident response, the end-to-end process, the teams involved, and steps to take to avoid friction and slow-down. The goal is to manage the incident as efficiently as possible in order to restore or resume the service to its expected operational state.

The Cost of Increasing Incidents: How COVID-19 Affected MTTR, MTTA, and More

Digital transformation accelerated for many companies during the last 18 months. While it may have been on the agenda prior to COVID-19, teams were pushed to extreme speeds to digitize and meet the rising online demand. During this time, organizations learned important lessons that they’ll carry on with them into this new future. Leaders can take these learnings and use them to build better products, healthier and more efficient teams, and a happier customer base.

PagerDuty Integration Spotlight: InfluxData

InfluxData is an Open Source Platform built for metrics and events — a platform that is purpose-built for time series data. The essential time series toolkit — dashboards, queries, tasks and agents all in one place. InfluxDB is even more programmable and performant with a common API across OSS, cloud and enterprise editions. Send events to PagerDuty to keep your teams informed. Check out InfluxData’s integration.

Facebook, Instagram, and WhatsApp's Outage - Understanding MTTR

Yesterday the most used social media platforms in the world were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has an intense incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is, outages and performance issues are bound to happen in any network.

PagerDuty Integration Spotlight: HashiCorp Terraform

Manage your PagerDuty account objects with Terraform! Reap all the benefits of infrastructure as code and give your teams the flexibility they need to manage their services in real time. As infrastructure stacks grow increasingly complex and involve an ever-growing number of services and systems, teams have looked to abstract configuration to its own layer of code. This concept of configuring infrastructure as code is gaining traction throughout the industry for a variety of reasons.

The Aftermath of the Facebook 6-Hour Outage

Less than 24 hours ago, the world came to a “social standstill” as Facebook and its sister companies, WhatsApp and Instagram, became unavailable, leaving their 3.5 billion users in a flap. The outage, which lasted almost 6 hours, shut off access for users and businesses all over the world and caused ripple effects that we will likely continue to see in the immediate (and perhaps not-so-immediate) future.

Australia Successfully Goes Live With Everbridge Public Warning Platform Countrywide, Representing Official Launch of the Australian Government's Next-Generation National Population Alerting System

Everbridge powers Australia's Emergency Alert system, providing population-wide alerting to inform and protect the continent's 34 million residents and annual visitors (once borders are re-opened). The live deployment of Australia's national alerting system reinforces Everbridge's leadership in Public Warning solutions, with widespread countrywide contracts across the Americas, EMEA, and APAC regions, capable of reaching over 2 billion citizens and travelers globally.

Evaluating Splunk On-Call Alternatives

Splunk On-Call (formerly VictorOps) is a popular incident response and on-call management platform that allows engineering and operations teams to collaborate with ease and resolve issues faster. As part of the Splunk Observability Suite, Splunk On-Call is combined with related products to bring monitoring, troubleshooting, and investigation into a single, comprehensive view, simplifying the process from incident detection to resolution.

PagerDuty Integration Spotlight: LogDNA

LogDNA’s Cloud logging platform helps your DevOps teams find and fix production issues faster so your teams can get back to doing what they do best: building amazing products. Send incident alerts from LogDNA directly to PagerDuty. Check out the LogDNA integration with PagerDuty to get started.

How Service Catalog Increases Productivity

Productivity is defined by measuring the amount of output over a given time frame. However, this discounts the quality of output, which is crucial in moving toward a more complete definition of productivity. For services, increases in productivity generally reflect the number of feature releases over time. This leaves out the critical measurement of quality compared to quantity. This is where a Service Catalog can greatly enhance true productivity within an engineering organization.

The Value of Digital Transformation in Financial Institutions

According to a 2020 survey done by Boston Consulting Group, 75% of executives regard digital transformation as an urgent priority in light of the recent COVID-19 crisis and 65% said that they were considering increasing their investment to that end. Standardized and automated threat detection and IT incident response across siloed operational risk groups are required to have the agility, reliability, and efficiency to establish organizational resiliency in financial institutions.

Time For Change: Managing A Successful Future Of Work

A successful return-to-work strategy involves transparency, flexibility, responsiveness, and support of employee wellness. Leaders at every level of an organization must recognize and communicate with distressed employees and respond to signs of trouble. This not only fosters a positive workplace for the employee to return to but also has an impact on the business itself. Higher retention rates and lower levels of stress often equate to highly productive business operations.

Learn where you rank and how it affects digital service resilience

We evaluated where enterprises are positioned in the Incident Management Spectrum and in their journey to digital service resilience and found that incident management needs its own transformation. In the report, you'll learn which approach to incident management is the best for meeting today's business imperatives.

Digital Transformation Secrets: Balancing Innovation and Uptime

Providing a superior digital customer experience is a critical component of business success for technology and digital service providers. But an enjoyable, effective, and reliable customer experience demands new IT architectures and places new expectations on the way SREs, development teams, ITOps, executives, and other previously siloed groups work together. And at what cost? To understand, we asked over 300 DevOps, ITOps and business leaders for their perspectives.