November 2022

How to Help Teams Create Optimal Infrastructure for Availability

Teams are locked into a cycle of suffering characterized by the feeling that they are sprinting just to stay still. This morale- and productivity-destroying state is caused by an inability to find time to save time. Our new research, The State of Availability Report 2022, discovered that teams know what they want to do (harness cloud and DevOps practices and tools to advance digital transformation), but something’s getting in the way.

Improving Incident Management with Automation

Incident management is your organization’s first line of defense. When incidents occur, internal teams must be ready to respond quickly. While incidents can happen anytime, it’s unrealistic to expect incident managers to be prepared to perform manual root cause analysis. Manually monitoring and analyzing applications on multiple servers is extremely difficult, which is why human reaction times have traditionally limited the speed of incident management.

What's New: Updates to Incident Response, PagerDuty Process Automation Software & PagerDuty Runbook Automation, Mobile App Experience, and More!

We’re excited to announce a new set of updates and enhancements to the PagerDuty Operations Cloud in addition to the November Product Launch announcements made earlier this month. Recent development and app updates from the product team include Incident Response, PagerDuty® Process Automation, the PagerDuty Mobile App, Integrations, as well as Community & Advocacy Events updates.

7 Incident Management Best Practices to Improve Business Efficiency

Think about the last time your IT systems had an outage: How did your team react to it? Were they organized with a clear idea of how best to resolve the issue? Or was it chaotic, with people firing questions from all directions and customer service channels ablaze with requests for help? Digital technology disruptions are typical (and even expected) in the workplace, but they don’t have to be chaotic, with teams rushing around to extinguish the metaphorical fire.

Slash MTTR, avoid costly downtime with improved cross-team Collaboration.

Every second counts when IT teams are called upon to resolve business-impacting issues. In modern enterprises, poor communication, fragmented toolchains and spiralling IT complexity can conspire to slow down incident response, putting service availability and ultimately customer satisfaction in peril.

Just Maintaining Availability? Try Building Stability

Today’s customers see availability as a given. What do they really want? Bigger, better technology with new features and faster platforms. But, according to our recently released Moogsoft State of Availability Report, teams burn their time, money and energy on incident management. In fact, engineers overwhelmingly report that incident management takes up most of their time.

Incident Innovation: ITSM Incident Management vs FEMA Incident Command System - Goals

The FEMA Incident Command System responds to wide-area disasters such as earthquakes, fires, floods, hurricanes, and tornadoes, while ITIL is used for digital services and applications. In large organizations, there is the facilities team and the data center team. FEMA is associated with the facilities team and ITIL with the smaller data center team. What characteristics are shared between the two and what are the main differences?

PagerDuty at re:Invent 2022 Launches Automated Diagnostics for AWS that Enables Organizations to Resolve Incidents Faster So They Can Innovate More

It’s that time of the year! PagerDuty is coming back to Sin City for AWS re:Invent 2022! The global conference brings together organizations of all sizes and is set to explore themes of modernization, automation, and resiliency in the cloud. With current economic conditions, enterprises are looking to scale operations and optimize costs while delivering always-on digital experiences to their customers. Automation plays a key role in helping support operational and cost efficiency.

Postmark + Squadcast Integration: Simplifying Alert Routing

Postmark is a simple email delivery system used to send transactional and marketing emails, and it ensures they are delivered to the inbox on time, every time. It also helps reduce email delivery time considerably. If you use Postmark for your email delivery requirements, you can integrate it with Squadcast, an end-to-end incident response tool, to route detailed alerts from Postmark to the right users in Squadcast. The steps below will help you set up the Postmark and Squadcast integration.

Unify Your Incident Management Process With the Fundamentals

In a perfect world, technology stays on and runs flawlessly. But we all know this isn't the case. Like any organization, xMatters sometimes experiences unplanned incidents. What we can control is how we respond to them. To resolve incidents quickly, it's important to coordinate an organized response.

How Incident Commanders Benefit from Actionable Insights

Knowing who is in charge helps teams avoid confusion about who to turn to during a crisis, allowing them to focus their efforts where needed. When the pressure is on, an incident commander should have an established response plan to ensure that responders act quickly and coordinate efficiently, and actionable insights make this possible.

The Incident Retrospective Ground Rules

I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was especially excited about was the opportunity to see Honeycomb’s incident retrospective process from the inside. I wasn’t disappointed! The first retrospective I took part in was for our ingestion delays incident on September 8th.

Global bank transforms incident alert management & communications

One of the top 10 largest financial services companies in the world, with 200,000+ employees worldwide, serving tens of millions of customers. With operations in more than 60 countries, the Interlink Incident Alert Management app serves an audience of thousands of service owners and business stakeholders across 20+ global markets.

Integrations on Rails: How we build and deploy integrations at FireHydrant

Implementing integrations without a mountain of technical debt can be challenging. But it doesn’t have to be all bugs, burnout, and outages when shipping integrations at a high volume. We’ve unlocked a pattern at FireHydrant to rapidly build and release integrations without swiping the technical debt credit card each time, and that gave us a fast lane to building premier integrations.

Building a metrics backend (time series db) with PostgreSQL and Rust

At ilert, customers are already benefiting from our easy-to-set-up private or public status pages and auto-generated SLA uptime graphs for their business services. However, we decided to push the graph topic a bit further with custom metrics. Using ilert metrics, customers can showcase additional business data and insights into their services on their status pages.

Services with a Smile: Service Graph and Service Standards

Principal Product Manager Davis Godbout joined the HowTo Happy Hour to talk about the Service Graph and Service Standards features in the PagerDuty platform. Service Graph gives your teams a visual representation of the relationships among your technical and business services. Service Standards guides teams to the benefits of PagerDuty’s features like integrations and configurable incident urgency.

CircleCI + Squadcast Integration: Alert Routing Made Easy

CircleCI is a continuous integration and continuous delivery (CI/CD) platform that helps in implementing DevOps practices. It is used to build, test, and deploy projects by automating pipelines with jobs. If you use CircleCI for implementing your DevOps practices, you can now integrate it with Squadcast to route detailed alerts to the right users in Squadcast. The steps below will help you set up the CircleCI and Squadcast integration.

Demystifying Availability KPIs - and What Most Companies Miss

Most engineering teams are no strangers to key performance indicators (KPIs), those metrics tracking progress toward critical goals and targets. Ideally, tech leaders design KPIs to focus teams on what matters and prove their contribution to the company’s overall performance. Of course, KPI data should also uncover critical information that guides informed decision-making. For engineering teams tasked with managing the customer experience, KPIs often track availability.

New features + new CI: Metrics, Status Page Widget, PandoraFMS, Automation rules, Alert report export

This post highlights some of the features and improvements that we have released in the last month. If you want to submit your own ideas or vote on existing feature requests, you can now use our new public roadmap at roadmap.ilert.com.

Reducing MTTR for DevOps and SREs with PagerDuty Process Automation and InfluxDB

Mean time to resolution (MTTR) is a metric that transcends industry and technology. It’s a measure of how quickly, on average, support teams identify, act on, and resolve IT issues and incidents. Because MTTR directly relates to service quality, maintaining a low MTTR is a critical goal for DevOps and SRE teams. These teams have a vested interest in resolving issues quickly because escalating incidents to higher levels of the support team increases response and resolution times.
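
As a rough illustration of the metric described above, MTTR can be computed as the average time between an incident being opened and being resolved. The sketch below is a minimal example under stated assumptions: the record layout, the field names `opened_at` and `resolved_at`, and the sample data are hypothetical and are not taken from PagerDuty Process Automation or InfluxDB.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved_at - opened_at) across incidents, as a timedelta."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual resolution durations, starting from a zero timedelta.
    total = sum(
        (i["resolved_at"] - i["opened_at"] for i in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical sample data, for illustration only.
INCIDENTS = [
    {"opened_at": datetime(2022, 11, 1, 9, 0), "resolved_at": datetime(2022, 11, 1, 9, 30)},
    {"opened_at": datetime(2022, 11, 2, 14, 0), "resolved_at": datetime(2022, 11, 2, 15, 30)},
]

mttr = mean_time_to_resolution(INCIDENTS)
print(mttr)
```

Here one incident took 30 minutes and the other 90 minutes, so the average is one hour. Real tooling would usually also decide how to treat incidents that are still open; this sketch simply assumes every record has both timestamps.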

My Most Surprising Discoveries from The SRE Report 2023

I’ve had the honor and privilege of authoring The SRE Report for the last three years. For the 2023 version, this included working with some amazing individuals like Anna Jones, Kurt Andersen, and Steve McGhee. Download The SRE Report 2023 here (no registration required).

How to implement a mature incident response strategy

In 2021, the Biden administration issued an executive order outlining that the government and private sector need to work together to combat cyberthreats and improve the nation’s collective cybersecurity stance. As cyberattacks become more common and more costly, the United States — like other nation-states — needs to do everything it can to prevent attacks and rapidly respond to them when they occur, which requires modernizing its approach to incident response.

Best In Resilience - Dow Chemical's Perspective

Scott Whelchel, Chief Security Officer, on the value of resilience. It’s time to look at #resilience in a new way. For Dow Chemical Company, resilience is an important part of sustainability and innovation. Dow strives to build resiliency through the responsible care of its people, and the communities and environments in which it operates.

A Deep-Dive Into PagerDuty's New Incident Workflows

It doesn’t matter if you’re a startup or in the Fortune 500: cost optimization, tool consolidation, and efficiency efforts are top of mind. Removing toil and automating more often during the incident response process doesn’t only help teams resolve faster, it also helps them become more efficient. In a resource-strapped world, protecting developer and responder time and focus is critical to reducing total cost of operations and optimizing customer experience.

What's New: PagerDuty Mobile Home Screen Experience

Hybrid and remote work is now the status quo. Companies campaigning for workers to return to the office are facing resistance, with some employers finding that they’re losing employees to jobs that give prospective hires the flexibility they want. Flexible work models have become a competitive advantage in a strained labor market. According to the latest Future of Work report from Accenture, 63% of high-growth companies have adopted a “productivity anywhere” workforce model.

OAuth Authentication - xMatters Support

OAuth is an open standard that uses tokens to grant access to systems or information without using a password. OAuth authentication authorizes requests to the xMatters REST API by passing a token in the header of your requests. This means you don’t have to store user names or passwords in your applications, keeping your users’ information secure.
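
To make the idea of passing a token in the request header concrete, here is a minimal sketch using only the Python standard library. It assumes the common OAuth 2.0 bearer-token convention (`Authorization: Bearer <token>`); the URL and token are placeholders, the helper name `build_authorized_request` is hypothetical, and none of this is taken from the xMatters documentation. The request is only constructed, never sent.

```python
from urllib.request import Request


def build_authorized_request(url, access_token):
    """Build (but do not send) a request that carries an OAuth bearer token."""
    request = Request(url)
    # Assumption: the API accepts the standard bearer-token Authorization header.
    request.add_header("Authorization", f"Bearer {access_token}")
    request.add_header("Accept", "application/json")
    return request


# Placeholder values, for illustration only.
req = build_authorized_request("https://example.invalid/api/resource", "PLACEHOLDER_TOKEN")
print(req.get_header("Authorization"))
```

Because the application holds only a token, which can be scoped and revoked, it never needs to store the user's name or password.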

MTTD: An In-Depth Overview About What It Is and How to Improve It

In this post, we'll learn all about the incident metric mean time to detect (MTTD). We'll see how to measure it and look at its relationship with other incident metrics like MTTR (mean time to recover). Both metrics give useful insights into your incident recovery ability.

A multi-billion-dollar software giant leverages Exigence to improve incident management collaboration & outcomes

A global leader in SaaS-based and on-premise software solutions that power innovative digital experiences was looking to replace the internal tool that was being used for resolving outages, service degradation, data center connection loss, and other incidents.

Incident Management and Status Pages for Enterprise IT Departments

The Incident Management and Status Page solution that lets you organize your enterprise IT team and communicate with users for a coordinated response that restores services rapidly. StatusCast works as an Incident Management platform to increase employee productivity inside organizations. There’s a lot you can do with StatusCast status pages to create the brand look you are seeking.

How to detect anomalies in logs, metrics, and traces to reduce MTTR with Elastic Machine Learning

Elastic Observability has extensive machine learning capabilities that support and improve analysis in APM. Learn techniques for correlating and detecting anomalies of telemetry data from APM agents for a particular application.

Blameless culture drives incident learning and other key insights from Catchpoint's 2022 SRE Report

SRE is a constantly evolving field, responding to the challenges of increasing reliance on tech and the opportunities of its evolving abilities. Reliability has to remain a step ahead of the cutting edge, whether it’s navigating remote work, implementing AI assistance, or optimizing internal processes. But how do we know that SRE is keeping up? We’re proud and excited to announce the results of the SRE Survey we ran in partnership with Catchpoint.

Expanding Incident Response with Microsoft Teams

Last week we launched a number of features across the PagerDuty Operations Cloud portfolio to help teams minimize downtime and protect customer experience. One of the areas where PagerDuty continues to invest is collaboration and communication during incident response to ensure that all impacted stakeholders across the business are updated in real-time.

Managing a Slew of Monitoring Tools? Here's How to Make Them Talk.

Engineering teams use a lot of single-domain monitoring tools. In fact, the average team manages and maintains 16 monitoring tools — and up to 40 — according to Moogsoft’s State of Availability Report. While IT leaders select and implement these tools to save teams time, our research finds they do quite the opposite. Engineers spend far and away more time on monitoring than they do on any other task — innovative, value-creating tasks included.

The Importance of Role-Based Messaging in Healthcare

Do you remember the classic board game where you have to go back and forth with your opponent deducing which characters on the board you’ve each selected? It’s still played by children today, and unfortunately by healthcare teams as well. Every day, healthcare teams without systems in place for role-based messaging are forced to play a game of “Guess Who?” to work out who is on call.

Early stage data teams: a balancing act

Most well-established data teams have a clear remit and a well-defined structure for what they work on and when: from the scope of their role (from engineer to analyst) to which part of the business they work with. At incident.io, we have a two-person data team (soon to be three), with both of us being Product Analysts.

Empower the SREs - Conclusions from The SRE Report 2023

Let's be honest, nobody loves surveys. Ok, well I sure don't. But surveys satisfy a huge need in our demand for insights into complex human-computer, sociotechnical systems. It turns out that we've been measuring the computer part pretty well, but the humans – not as easy to keep track of. When Google SRE first defined toil as a metric we wanted to reduce, we spent far too long trying to quantify it numerically based on tooling and insights from computer systems.

Building an incident management process - incident.fm

In this podcast, our panellists discuss the foundations that any team needs to put in place when designing their incident management process. Starting from the basics of defining what we really mean by an incident, to how to set your severity levels, roles and statuses, Chris and Pete share their tips for building solid foundations to run your incidents.

3 questions to ask in the build vs buy debate for incident response tooling

As a former incident responder and now as a responder advocate for FireHydrant, I’ve seen the “build vs. buy” debate play out many times. In fact, I even supported the tool that former employers used for managing incidents for years before they decided to buy (more on that in a future blog post).

Webinar: Real talk: automation for ITOps

IT operations move fast. If you’re an ITOps leader, you need to be moving just as fast to make sure your team has what it needs. Positioning your team for success isn’t easy: complexity in IT is increasing every year and can reach a point where it exceeds a person’s capacity to keep pace. In the face of massive growth, ITOps teams can face major challenges with productivity, burnout and efficiency.

For incident management, should you build or buy?

Is your incident response held together by a thread? Are you manually recording incident updates in a shared doc? Do you struggle to juggle the incident management workload with your other responsibilities? Does everyone on-call report data the same way? These are all common problems faced by DevOps teams still relying on homegrown incident management tooling.

[Report:] The true costs of modern IT outages

If you’re in IT, no doubt you’ve heard the age-old statistic that an average minute of downtime costs $5,600. It turns out that information is a bit outdated and does not reflect the real and nuanced costs of a modern IT outage. BigPanda suspected this and wanted to uncover the true numbers behind outage costs so ITOps can have a better understanding of costs, causes and “cures” of an IT outage.

AppExchange Mavericks: PagerDuty Empowers Customer Service Agents to Resolve Cases Quicker

Jonathan Rende, SVP of Products at PagerDuty, and AppExchange Mavericks’ Barb Dietz, a Salesforce MVP, discuss how PagerDuty is working to empower customer service agents to resolve customer-impacting issues faster. BONUS: In this video, you will get a front-row seat to PagerDuty’s product demo. See what's in the video.

PagerDuty November 2022 Product Launch - Product Highlights Demo

Learn how PagerDuty's latest capabilities can help you solve critical, unplanned work faster in this new product highlights demo. Our host of new capabilities helps you improve team productivity, avoid escalations, and optimize digital services. Features highlighted in this video are the following new PagerDuty features and more.

Service Level Management Process Explained (with Examples)

Service Level Management, or SLM, is defined as the process of negotiating Service Level Agreements and ensuring that they are met. Service Level Management is a fundamental part of SRE and DevOps. It encompasses the expectations and perceptions that both the business and the customer have about the service and its performance. Service level management will include existing and new services as they are added, with the service level agreements (SLAs) being modified accordingly.

Getting started with severity levels

An incident can take many forms. It can look like a small issue that locks a few customers out of their accounts or a huge catastrophe that brings down your entire product for a full day. How you respond to the incident should vary based on the impact of the incident. And that’s where severity comes into play. Defined severity levels are crucial to any good incident management program.

4 New Product Announcements to Help Teams Do More with Less

Incidents are costly. It’s not just revenue that takes a hit every time you have an outage–brand reputation and client satisfaction are also on the line. To protect current and future revenue, companies have to deliver on customer expectations. Innovation alone is no longer enough: digital experiences must also be fast, flawless, and highly available. This means teams have to get more proactive with real-time, unplanned work.

Interlink Software Achieves Cyber Essentials Certification

Cyber Essentials is a UK government-backed scheme, developed by the National Cyber Security Centre. Since its inception, the scheme has become the benchmark for IT security, helping organizations to deploy technical controls to guard against the common types of cyber-attacks and improve data security.