August 2021

Situation Room: On-Call Team Faces Worst Case of Sunday Scaries

Picture this: it’s Sunday night. You’re relaxing in bed, in that sweet spot where you’re geared up for Monday, but the fun of the weekend hasn’t yet faded. As you idly scroll through content on your phone, you see a message preview pop up. It’s to your work email. That’s bad. It’s from the hosting company you contract. That’s really bad. They’re saying they accidentally deleted the production database. That’s “jump out of bed” bad.

What Does Everbridge Crisis Management Do for Your Organization?

Everbridge Crisis Management provides organizations a single solution for business continuity, disaster recovery and emergency communication. In one application, crisis teams can coordinate all response activities, teams and resources to accelerate recovery times and maintain command and control when crises evolve into unanticipated scenarios.

How to Structure an IT Help Desk

Managed service providers (MSPs) need an IT help desk to address and answer the technical questions of clients. In the modern MSP environment, the IT help desk is the primary source of contact between customers and knowledgeable, responsive support personnel. Successful help desks are customer oriented and encourage clients to report IT incidents when they occur.

Monthly Moo Update | September 2021

This has been quite the summer to remember as we continue to witness our customers achieve remarkable efficiencies through automation, such as deep integrations with change pipelines to suppress alerts during maintenance windows and alert correlation that creates incidents with dynamic, evolving descriptions that dramatically improve incident management processes.

Has the firefighting stopped? The effect of COVID-19 on on-call engineers

With digital becoming the primary channel for work, education, shopping, and entertainment in the last 18 months, it’s no surprise that workloads for technical teams and on-call engineers have increased. Data from PagerDuty’s inaugural platform insights report, The State of Digital Operations, highlights this reality. As of July 2021, the average number of events managed daily by PagerDuty is 37 million, with 61,000 of those being critical incidents.

The Value of Hyper-local Risk Intelligence

Every enterprise has a unique risk profile. This is based on a wide range of factors including geographic disposition, sector, the scope of security and resiliency plans, organizational size and structure, supply chain, and much more. Without the right customized tools tailored for your organization in place, it’s challenging to stay ahead of threats and disruptions to your people, places, operations, and digital systems.

New feature: Templates for Incident Management

At Spike.sh, we are obsessed with making incident management more accessible to dev teams everywhere. With this goal in mind, we are always looking for ways to reduce the friction while setting up the Spike.sh platform. When we saw customers asking our advice for creating effective on-call schedules and escalations, we knew we had to do more than just good documentation - we needed a way to share best practices with our customers in the product itself.

3 Key Insights to Help You Build the Workplace for Today & Tomorrow

Everbridge sat down with two leading experts to discuss how innovative technologies are improving worker safety and operational functionality, and how firms can keep up. With such demanding times for the business world, it’s easy for companies to become fixated on survival, rather than thriving. But businesses that use unprecedented circumstances as a time to innovate and invest in new technology as well as rescoping the use case of their existing technologies, will emerge stronger than ever.

Balancing Healthcare Resilience with the Patient Experience

For healthcare systems, building resilience for the future is learned from adapting and responding to critical events and factoring in circumstances that are often unique to the communities they serve, such as the patient population, size of the hospital and/or community, and scope of services.

Safety Experts Plan for Fall

Everbridge recently hosted a Safety Experts Plan for Fall webinar, with an expert panel comprised of Dr. Rashid Chotani (Chief Medical Director/Senior Scientist, IEM), Steven J. Healy (President and CEO, Margolis Healy), Marisa R. Randazzo, Ph.D. (CEO and Founder, SIGMA Threat Management Associates) and James Podlucky (Industry Solutions Manager, Everbridge). The panel was moderated by Dan Pascale, Executive VP, Margolis Healy.

How to Mitigate the Effects of Floods on Your Supply Chain

Floods may now be an unfortunate counterweight to the wildfires that have come to characterize summers worldwide. In 2021 alone, floods wreaked havoc in Western Europe, China’s Henan province, and Tennessee and North Carolina in the United States. Hundreds of lives were lost, property damage ran in the billions, and global supply chains were thrown into disarray.

MTBF Is an Integral Part of Business Operations - Here's Why

In today’s fast-paced digital world, your customers expect your services to be available 24 hours a day, seven days a week. If your services are unreliable, these customers will likely take their business elsewhere — and spread the word. To retain their business, you must understand and optimize your service and system health to ensure your services are reliable. Gauging your service and system health requires much more than knowing whether they’re on or off.

What's new: Updates to Event Intelligence, mobile, and more!

As we near the end of the Summer season, we’re excited to announce a new set of updates and enhancements to the PagerDuty platform. These updates will help our users and customers: Make sure to view the latest PagerDuty Pulse or learn more from our community team and developer advocates who have launched new programs to help you learn more about our latest products and best practices.

Call Handling - Relieve the burden of your service desk and on-call staff

These days, I keep encountering inquiries from various customers on the topic of call handling. Due to the current transformation, triggered by the increased use of home offices, it is becoming more and more important to make on-call staff more accessible. Often the already overloaded service desk is used for this purpose. Of course, this leads to a) a deterioration in the quality of the service desk and b) delays between the receipt of the problem and the start of problem resolution.

Automate your LogDNA + PagerDuty Incident Workflow

LogDNA integrates with your PagerDuty instance to help trigger incidents based on log data coming in from your ingestion sources. This allows your teams to quickly understand when there are issues with your application, and where in the logs you can investigate to understand root cause. To help further accelerate your team’s ability to understand the state of your applications, we are introducing the ability to automatically resolve those PagerDuty Incidents directly from LogDNA.

Self-Compassion Instead of Self-Blame

The tech industry is competitive and not without challenges. People are always growing and improving by pushing their limits. Innovation comes in many forms. In order to foster a healthy culture while allowing people to flourish, organizations must carefully enact policies. Growth should be encouraged while discouraging competition and comparison. One of the core policies organizations implement to achieve these goals is blamelessness.

Best practices to help retailers make the grade for the holiday season

It’s hard to believe we’re already talking about the return to school, but it’s set to be a big one. In fact, this year promises to be the biggest in the last five years. The National Retail Federation expects back-to-school spending to reach $37.1B, up from $33.9B last year. Back-to-college spending is also expected to rise, reaching $71B this year. This increase is buoyed by parents and students gearing up for their first in-person classes after a year of virtual learning.

Introducing the Spike.sh Alert Reliability Engine

At Spike.sh, our mission is to help dev teams understand and resolve production issues faster. At the core of this is our Alert Reliability Engine, whose job is to make sure that a team member always gets an alert on their preferred channel. Currently, we support 7 channels - phone call, SMS, mobile push notifications, email, Slack, Microsoft Teams and Discord. We wanted to give you a peek into how we achieve high deliverability across these channels.

How MBTA modernized incident response to reduce alert fatigue and improve collaboration

Citizens utilize mobile and consumer-facing applications in everyday life, so it’s no surprise that they demand seamless access and high availability of government services online. Whether it’s making payments or applying for benefits, citizens and constituents alike expect these services to be available around the clock.

How Squadcast Benefits On-call Engineers - Part 1

It is difficult to stay completely reliable in an always-on world. So it's very important to choose the right Incident Management solution that can solve your problems. In this blog, we have highlighted the benefits of Squadcast and why you should adopt it. “Being on-call sucks!” Often incident response teams use this phrase when talking about their on-call experiences. Despite using best practices for managing infrastructure, incidents do occur from time to time.

Dynatrace and xMatters Make Seamless Efficiency Possible - xMatters Demo

How can organizations integrate their tools into a platform that maximizes uptime and simplifies operations? Is it possible for the tools you already rely on to be more efficient? With Dynatrace and xMatters in tandem, the answer is yes! Join Rob Jahn, Technical Partner Manager at Dynatrace, Eric Maxwell, Solution Architect at xMatters, and Rutuja Rajwade, Partner Marketing Manager at xMatters, as they discuss how Dynatrace and xMatters can work together to make incident management and development processes more efficient.

How the technology you choose influences CloudOps maturity

As the world becomes increasingly digital-first, it’s more important than ever for organizations to keep services always-on, innovate quickly, and deliver great customer experiences. Uptime is money, so it’s no surprise that many have made the shift to cloud in recent years in order to make use of its flexibility and scale—while controlling costs. And while 2020 wasn’t easy for any organization, those that are thriving have embraced the digital mindset.

DevOps & SRE Words Matter: How Our Language has Evolved

As the tech world changes, language changes with it. New technologies will always introduce new terms and descriptions to provide clear understanding. For example, the emergence of the cloud introduced language to describe the changing relationship between servers and clients. Then, of course, product providers will also dictate how their products are to be described, i.e. describing services as “cloud-native”.

WIRES and xMatters: Efficient Collaboration On a National Scale

An update on how the xMatters service reliability platform is improving animal rescue response times through WIRES in Australia. We are extremely grateful for xMatters’ support and are excited to share this update with the xMatters community. We have made so much progress with our wildlife rescue response systems since the devastating bushfires of 2019 and 2020, despite the continuing challenges of COVID-19.

Managed Service Provider - How AlertOps Helps MSP Scale Digital Transformation Initiatives.

In an era where speed, productivity, and user experiences matter most, what incident management capabilities do managed service providers need most to grow, transform and mature their digital operations and processes, and to serve more organizations faster and more efficiently? Many of today’s enterprises still have operations that are largely manual and reactive, and lack the in-house resources and expertise to undertake a digital transformation initiative.

What's New: Introducing Delay Notifications to Control Alert Fatigue

The OnPage team is pleased to announce a new feature to the enterprise web console: Delay Notifications. With this new addition, organizations have the option to queue messages for specific time periods, delivering messages at the end of the Delay Notification schedule. The latest feature is designed to alleviate alert fatigue and improve work-life balance for incident respondents.

The Top 4 Key Levers to Build Towards Long-Lasting Digital Operations Maturity

Digital operations maturity is a journey. The first step is to understand where you are, where you want to get to, and what’s keeping you from getting there. Only then can you make strategic decisions and lay out a plan for how to approach any hurdles and land where you want your organization to be. For many organizations, upleveling operational maturity requires investment in driving cultural change with fundamental shifts to operating models.

Full-cycle observability with the Elastic Stack and Lightrun

An application running in production is a difficult beast to tame. Most experienced developers, the ones who have spent enough late nights or Saturday mornings trying to break apart a nasty production bug, will try to create the clearest possible picture for their later selves while writing their code, so that they can understand what’s actually going on in the system during an incident.

Chapter Ten: In Which Sarah Resigns from Animapanions and Heads Off to Start Up a Competitor

This is the tenth chapter in The Observability Odyssey, a book exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our DevOps Engineer, Sarah, throws in the towel at C&Js and moves on to build her own business.

Chapter Eleven: In Which James Speaks with the Industry Analysts

This is the eleventh chapter in The Observability Odyssey, a book exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our IT Ops Leader, James, speaks with the analysts about what’s happening in the AIOps space.

Getting Started with Site Reliability Engineering

Site Reliability Engineer (SRE) is one of the fastest growing jobs in tech, with LinkedIn reporting 34% growth YoY in 2020 and over 9000 openings in their Emerging Jobs Report. If you’re new to SRE and exploring it as a career path, understand that it can be a challenging but rewarding experience. Here are some quick tips on how you can get started with SRE and jump-start a rewarding career.

Strategies to Strengthen Nurse Mental Health and Safety

No job is easy, but the job of a nurse is even more challenging, especially during a global health crisis. Nurses are at a higher risk of developing burnout due to the psychological trauma and cognitive overload that comes with the nursing profession. The situation is further exacerbated when nurses take on more responsibility during a pandemic or other large-scale incidents.

SLOs, SLIs, and where to find them with Jacob Plicque III

Identifying the right Service-Level Indicators is mission-critical for any SRE team responsible for meeting Service-Level Objectives and reporting on them. Find out how to sift through mountains of metrics and fill gaps in your data in order to visualize SLIs that actually matter for effective error budget tracking and actionable alerts in Grafana. Presented by: Jacob Plicque III, Senior Engineer at Grafana Labs, at Grafana East Coast Virtual Meetup - August 2021.

Real-time digital operations management puts connected vehicles on the road to success

As technology advances and applications for the Internet of Things (IoT) continue to expand, industrial and manufacturing companies are embedding more digital systems into their operations. From smart factories and intelligent shipping to automation and 3D printing, Industry 5.0 has arrived.

Lone Workers vs. Remote Workers: Knowing the Difference and Keeping Both Safe

The Covid-19 pandemic increased opportunities for remote work four to five times more than before, according to a report from McKinsey & Co. Although many office-based workers had no choice but to leave their desk jobs and make the move to work from home in early 2020, remote work appears to be here to stay. The rapid transformation brought forward by the pandemic has muddied the definition of remote workers versus lone workers, but it’s essential not to confuse the two.

How to Avoid the Executive 'Swoop and Poop' and Other Best Practices for Operational Maturity

We’re eating at restaurants again. We’re seeing family after too long apart. Some of us may even be returning to the office. But that doesn’t mean the pressure is off for digital services, and growing in operational maturity remains top of mind. While digital transformations have been taking place for the last two decades, COVID-19 added pressure to speed up initiatives.

3 Focus Areas for Improving Business Resilience

More than 2,800 senior executives in organizations of all sizes across 29 industries and 73 countries weighed in on their 2020 crisis response plans in PricewaterhouseCoopers’ (PwC) annual impact survey. This is a valuable insight into resiliency planning, business operations, and the future of the workplace.

Are You Spending Enough on Cybersecurity?

Cybercriminals do not discriminate against the organization, people or industry they target. These actors look to exploit vulnerabilities in resources to intercept valuable data from small and medium-sized businesses (SMBs). Cyberattacks are inevitable, and organizations must have the right controls and information security systems to mitigate the impact of an attack.

Improving your team's on-call experience

Your engineers probably dislike going on-call for your services. Some might even dread it. It doesn't have to be this way. With a few changes to how your team runs on-call, and deals with recurring alerts, you might find your team starting to enjoy it (as unimaginable as that sounds). I wrote this article as a follow-up to Getting over on-call anxiety.

SREview Issue #16 August 2021

We’re kicking off August with some thrilling news: Blameless has closed a $30M Series B fund raise! Learn more about how we’re entering the next phase of our journey to advance reliability for engineering teams here. We’re so grateful to our customers, collaborators, and the entire SRE community for their support! Let’s dive in with our favorite content for the month!

Supercharging incident response with runbook automation

The global pandemic is estimated to have accelerated digital transformation by at least seven years—and it’s showing no signs of stopping. In fact, companies are investing even more into software-driven experiences. A recent Gartner forecast points to worldwide IT spending increasing 8.4% to $4.1 trillion in 2021, with much of that spend on mission-critical, customer-facing services.

We've raised a $23M Series B to help us get to a world where all software is reliable

At FireHydrant, we envision a world where all software is reliable, and we’re on a mission to help every company that builds or operates software get closer to 100% reliability. Today, we’re thrilled to announce that we’ve raised $23 million to help us further our goal.

Timely Delivery with Enterprise Alert

Murphy’s Law states that anything that can go wrong, will go wrong. The challenge for most businesses is putting the right method of communication in place for when the inevitable happens. The only way to handle this is to expect the worst and then prepare for it. A key question when deciding on any alerting solution is: can my team be notified properly when a major outage happens?

Announcing our $6M investment to double down on IT incident and Reliability needs

When Squadcast was founded back in 2018, we had a concise yet clear goal—we wanted to make it as easy as possible for companies to manage their IT incident and reliability needs. In the spirit of continuing that mission, today I’m excited to announce our $6M fundraise led by DNX Ventures and backed by Wipro Ventures, Nexus Ventures, and Chiratae Ventures. We’re also pleased to announce the addition of DNX and Q Motiwala to our Board of Directors.

Everbridge is the place to be

Culture is about more than just a fancy office, benefits or team activities. It’s about the people. Our Bridgers build and own the company culture and enforce our values, and their passion fuels our continued innovation and growth. We wouldn’t be where we are without them, and our growth and great culture are because of what we’ve achieved as a team together! Individually we are amazing but together we are remarkable.

What's the ROI? How Operational Maturity Improves Customer and Team Satisfaction

Are we looking at the new normal now? In the last 18 months, organizations all over the world were compelled to undergo a rapid digital transformation and mature their operations to support services that were under unprecedented strain. Digital transformation allows companies to embark on large-scale cloud migrations and adopt modern development methods like DevOps and Agile.

Demystifying DevOps and SRE

How different are DevOps and SRE? Are they related to each other? In this blog, James Samuel sheds light on the similarities & differences between SRE & DevOps followed by the possible ways to structure an SRE team in your organization. One of the terms that people often find confusing is SRE and DevOps. People often ask, should I hire a DevOps Engineer or a Site Reliability Engineer? What is the difference between SRE and DevOps and which one do I need? In this post, I attempt to shed some light.

How PagerDuty Helps Manage Hybrid Infrastructure and Complex Ops Across Industries

If there’s one thing we learned from the 80+ sessions from Summit 2021, it’s that across the industries, companies are continuing to accelerate innovation in a bid to meet growing customer expectations of always-on services across all channels. In financial services, disrupting traditional banking or rethinking access to advisory services comes with operational and regulatory challenges.

Contextual Intelligence and Observability: Without the Former, You Really Don't Have the Latter

Observability is a hot term in the industry, but don’t let it fool you: having visibility into your organization's apps and services only gives you partial clarity into a system’s overall performance. To get a full understanding of your monitoring data, you need to apply contextual intelligence.

New Product Integration! Microsoft Teams Video

On the heels of our Microsoft Teams integration release to streamline incident management, we’re excited to share that we now support Microsoft Teams Video capabilities. We generate Microsoft Teams video conference links for each Blameless incident for fast and easy collaboration. Microsoft Teams Video joins Zoom, Google Meet, and GoToMeeting in our video integration suite.

Hear From Product PagerDuty for Customer Service Operations Lightning Talk

Learn about what's new with PagerDuty for Customer Service Operations from the Summit 2021 Launch. Our Product team shares how you can benefit from our latest updates and enhancements and enjoy demos that were recorded live from Summit 2021 featuring the PagerDuty Salesforce Service Cloud Integration V3, New Customer Service SKU, and Round Robin Workflows (Round Robin Scheduling).

PagerDuty Pulse Q1 FY22 Full Webinar

In this edition of PagerDuty Pulse, you’ll get to view our most recent platform updates and enhancements (March 2021 – June 2021) that extend from AIOps and automation to a variety of new integrations. Teams must leverage PagerDuty and Modern Digital Operations to automate the day-to-day toil of repetitive tasks, master modern operations with full-service ownership, seamlessly collaborate across the organization, and accelerate enterprise-wide response by enabling customer service operations and business stakeholders.

Less is more: Incident management and monitoring in hybrid IT infrastructures

Many companies are continuously modernizing their infrastructure – but there is no standard path to the perfect IT infrastructure. Still, hybrid architectures have become the status quo in enterprises. Almost all organizations have migrated at least parts of their assets to the cloud or run applications as cloud services. At the same time, businesses want to dovetail their IT architecture with software development and are therefore embracing dynamic infrastructures.

Resilience in Action E9: Vulnerability, Compassion, and Post-Incident Reviews in the Emergency Room with Dr. Al'ai Alvarez

What can software engineers learn from post-incident reviews that physicians do in the emergency room? In our ninth episode, Christina, member of the Blameless strategy team, guest-hosts the podcast to interview both Kurt Andersen and Al'ai Alvarez, MD (@alvarezzzy). Dr. Alvarez is an assistant clinical professor of Emergency Medicine at Stanford. Clinically, he’s an emergency physician.