
February 2022

xMatters Out Run Release Recap: Service-centric Automations, Callable Flows, and More!

What’s one of the fundamental principles of DevOps? Automation. There are many ways to leverage automation to facilitate DevOps practices for enabling consistency, reliability, and efficiency within the organization. That’s why we’re taking serious strides to ensure that xMatters can allow full automation and coordination of the many tools we use to make incident management easier and more efficient for front-line responders.

Creating Subscription Forms - xMatters Support

In xMatters, you can use subscriptions to ensure that you are always informed about certain events. These subscriptions will send you notifications whenever an event occurs that matches your pre-determined criteria, even if you are not directly targeted to receive a notification for that event.

Traditional vs Modern Incident Response

An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident Response is an organization’s effort to detect, analyze and correct the hazards caused by an incident. In most cases, a mention of incident response relates to security incidents. The terms incident response and incident management are sometimes used more or less interchangeably.

Finding a pricing model that's just right

Getting your pricing right is critical to the success of any SaaS company, but finding a model that works can be tough. Price too high, you won’t close enough deals - your business will fail. Price too low, your business model will be unsustainable - your business will fail. To add to the complication, when you’re a new startup your goals are evolving.

Putting the "Action" in Actionable Intelligence

AIOps combines machine learning and people to deliver technical outcomes in IT operations. The promise of this capability continues to drive new contenders to the market. AIOps has become a core messaging component for all the major event management players. Many have just rebranded their products to specifically highlight AIOps features. Emerging event management players have arrived and tried to also claim the AIOps space.

Can Endpoint Protection Keep up With Modern Threats?

Endpoint protection is a security approach that focuses on monitoring and securing endpoints, such as desktops, mobile devices, laptops, and tablets. It involves deploying security solutions on endpoints to monitor and protect these devices against cyber threats. The goal is to establish protection regardless of the endpoint’s location, inside or outside the network.

Incident severity and priority 101

Severity and priority can be challenging for a company to nail. When an incident is declared, it's essential to have a system to define the impact and how urgently it should be handled. Incident severity and priority are the two knobs teams can leverage to define scope and urgency, and eventually, the appropriate process to take action. But how should we define them, and what are the differences?

Sponsored Post

What Is a DevOps Toolchain and How Does It Work?

Picture yourself trying to resolve a code error when you notice an additional issue outside your realm of expertise that's making matters worse. Your instinct is to get in touch with the right contact as quickly as possible to resolve the issue so that there's no further impact on the system's uptime. But what if you can't get in touch with them immediately, or don't know who to contact? Instead of trying to solve the problem without support, a DevOps toolchain could have mitigated this chain reaction from the start.

Major IT Outage 2021 Recap

We saw that no one is immune from major IT outages in 2021, not even mega titans like Google, Facebook, and Amazon AWS. The following is a recap of some of the major IT outages with widespread impact for 2021. Amazon Web Services’ (AWS) historic outage occurred on December 7, 2021 and lasted roughly 6 and a half hours. The breadth of Amazon and its reach caused not only their warehouse and delivery operations to stop.

Slack outage

Slack, a popular enterprise communications platform, faced a 5-hour system outage from 9:25 AM to 2:24 PM EST on February 22, 2022. Slack services affected included: messaging, search, link previews, apps/integrations/APIs, posts/files, workspace/org administration, login/SSO, notifications, connections, and calls. AlertOps was NOT affected by this outage.

Cloud Incident Management Guide

It is a well-established fact that companies looking to grow in the digital age can facilitate this mission by adopting the cloud. When pursued with the right intent and implementation strategy, cloud adoption acts as a powerful force multiplier, yielding a cutting-edge IT powerhouse for businesses and helping them grow and innovate at an accelerated pace. Organizations that adopt a cloud-first strategy must safeguard themselves from critical, service-disrupting incidents.

PagerDuty Receives Financial Services Competency From AWS

We are excited to announce that PagerDuty is now an approved AWS Financial Services Competency Partner. We’re looking forward to expanding our global reach and helping financial services organizations accelerate their cloud migration and digital acceleration journeys. This will allow us to further streamline and automate financial service companies’ digital operations while helping them reduce risk and manage compliance requirements.

Episode 3: Mooving to... Stability: The Role of Catastrophic Failure in Software Design

In this episode of Mooving to… Stability: The Role of Catastrophic Failure in Software Design, we had the opportunity to chat with Jeff Atwood, yes, that Jeff Atwood, of Coding Horror, Stack Overflow, and Discourse (Chief Happiness Officer). Jeff started writing 911 software in Boulder, Colorado for a small company, which was a crash course in writing code for software that has real consequences. With this unique and deep perspective, B.J.

Starting projects at incident.io

We’re a small startup (10 people at time of writing) with big ambitions, particularly when it comes to our product. With so many things we want to do, it’s important for us to be structured in the way we approach our work, without being so process-driven that we lose all the benefits of being small and nimble. As we’re still new, and the team is growing all the time, very little is set in stone.

Everything you need to know about Squadcast and Microsoft Teams Integration

Microsoft Teams is one of the most versatile tools in terms of providing collaboration and chat solutions to numerous enterprises. We at Squadcast understand how important Microsoft Teams can be for your organization. Hence, we bring you this blog on Squadcast-Microsoft Teams integration that will tell you how this integration can help in improved incident management, effective collaboration and a lot more.

Sprint planning - How to prioritize urgent production issues?

Small engineering team members wear a lot of hats while working on a product. It becomes hard to prioritize and deal with issues that arise during production when a sprint is already planned and put in place. This not only makes sprints harder to plan but also reduces accountability. How do you tackle this problem and make sure your engineering team does not burn out at the same time? Let’s list a couple of characteristics of this engineering team that are quite common across the board.

Designing your incident severity levels

We wrote this article in response to a question asked in our Slack Community. We know a thing or two about incident response. As such, we're often asked to advise when companies are designing their incident response processes. A common question is "How do you design your incident severity levels?". It's a great question given how central they are to incident response!

Why and How SREs Can Benefit from Feature Flags

When you think of who uses feature flags, your mind most likely goes to developers. In general, feature flags are closely associated with software engineering. But Site Reliability Engineers, too, can benefit from feature flags. SREs may not be the ones to create feature flags, but they should work closely with developers to ensure that the applications their teams support include feature flags.

Prepare Your Organization for a Hurricane

Hurricanes pose immense risk to the safety of an organization’s people, the continuity of operations, and the connectivity of communications systems. During a hurricane, critical event managers must be able to communicate crucial safety information to the people for whom they are responsible. In addition to hurricane preparedness, critical event managers should ready their business for any severe weather event.

An easier way to create runbooks

Runbooks have been a game changer for many incident response teams, and we just made it easier for you to get up and running with them. Runbooks reduce toil for responders and ensure consistency in your incident management processes. In the thick of trying to resolve an issue, remembering things like emailing customers is likely the last thing on responders' minds, yet forgetting to do so can be detrimental.

February 2022 Update - Centralized and time-based notification patterns

With our February update, it is now possible to centrally configure how Signls should be notified. And of course, each team can have a different configuration of their notification preferences. This also includes response and escalation settings. In addition, it is now possible to set different notification patterns per day and time of the day, e.g. to notify via different channels at night than during office hours.
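As a rough illustration of a time-based notification pattern, the sketch below picks channels depending on whether an alert arrives during office hours. It is not SIGNL4's configuration format or API: the function name, the office-hours window, and the channel names are all hypothetical.

```python
from datetime import time

# Hypothetical office-hours window; a real setup would be configured per team.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)


def channels_for(now, weekday):
    """Return notification channels for a local time and weekday (0 = Monday).

    Channel names are placeholders, not identifiers from any real product.
    """
    is_workday = weekday < 5
    in_office_hours = OFFICE_START <= now < OFFICE_END
    if is_workday and in_office_hours:
        # During office hours, quieter channels are assumed to be enough.
        return ["push", "email"]
    # At night and on weekends, escalate to a channel that is harder to miss.
    return ["push", "voice_call"]


print(channels_for(time(10, 0), 1))  # Tuesday morning
print(channels_for(time(2, 0), 1))   # Tuesday, 2 AM
```

Keeping the rule in one function mirrors the idea of a central configuration: each team could supply its own window and channel lists without changing the lookup logic.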

A Day in the Life of a DevOps Engineer

In the past five years, DevOps adoption has almost doubled. In fact, 74 percent of companies now use DevOps in some form. As a growing number of organizations seek to implement DevOps practices, the need for qualified DevOps engineers is soaring. But what exactly does a DevOps engineer do, and what skills are required to succeed in this in-demand role?

Customer Service Ops - New Features Release

Over the last few years, our world has become increasingly digital, from streaming and shopping to work and health care. Customers want these digital experiences to be seamless. This has become a key priority for all businesses as well, as they depend on happy customers to drive sales and brand reputation. To ensure these seamless digital experiences, technology teams have doubled down on reliability, user experience, and building new features.

Cloud Complexity - Bringing Resources together in Multi-cloud Environments

The world is still getting used to operating within the cloud. Moving to the cloud is challenging for many organizations. So why do we see a rise in the adoption of multicloud strategies? In this blog, we will explore why this trend is worth considering for your organization, as well as look at the challenges that it brings.

Customer Success at an early-stage B2B SaaS company

Based on our newfound data feet, we’ve started consistently tracking the adoption rate of our latest features. As it happens, we’ve been impressed with the results! For example, we were delighted to see that our new tutorial flow was completed end-to-end by 35% of our users (against an industry average of less than a quarter for 6-step product tours like ours). I know, I know: being at such an early stage means it is arguably easier to hit customer needs on the head.

How We Define SRE Work

At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared my initial experiences and impressions in this post and thought it would make sense to check back in now that I’ve had the opportunity to spend time learning about the team, the culture, and the code base more in depth.

Exploring the Importance of Change Management in Healthcare

Change management is an organized, structured approach with methods that enable healthcare organizations to transform workflows seamlessly. Organizational change management requires the collective involvement of C-level executives and stakeholders to successfully implement changes within a care facility. Change is required when individuals, processes, teams, and tools cannot keep pace with the ever-changing needs and expectations of the organization.

Improved routing for Jira Cloud and Jira Server tickets with multi-project support

If you love Jira then you probably love customization, and we’ve made your integration with Jira Cloud and Jira Server even better with multi-project support! You can now route your incident tickets and follow-up work to remediation teams' Jira projects directly from FireHydrant, saving you valuable time and clean-up work. Let’s take a look at what has changed and some additional use cases unlocked with this integration.

New Native Slack functionality from PagerDuty - Available Now

At PagerDuty we invest a significant part of our time listening to our customers. From what we have learned from those conversations we are adding a new set of features to our Slack Integration. These features will make leveraging PagerDuty from Slack even more seamless and allow Incident Responders to conduct their work without switching context, expediting response times, and ultimately maintaining high customer satisfaction.

The three pillars of great incident response

There’s no one-size-fits-all incident response process. Depending on your organisation’s shape and size, you’ll have different requirements and priorities. But the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.

It's not ready for production until it has an Operational Readiness Checklist

Maintaining the reliability of complex services just got easier with Operational Readiness Checklists. Service owners and engineering leaders can now evaluate and maintain the production readiness of the services their users rely on every day: spot risks in your service dependencies before they cause incidents, and respond quickly if they do. Before you put a new service into production, readiness checklists help you dot your i's and cross your t's.

Integration Options with SIGNL4

SIGNL4 integrates with various backend systems like IT monitoring, service management, IoT systems, sensors, etc. to automatically alert users and teams about certain incidents. A list of selected tools along with integration descriptions is available in our integrations section. How can you integrate SIGNL4 with your own tools? In the following we list some options offering different levels of sophistication.

12 ways to ace customer communications during a system outage

System outages are the worst nightmares for IT support teams, but they also provide an opportunity to stand out. During a major service outage, customers are often impacted a lot more because they have much less information about what is happening. Some of the biggest outages that affected users all over the world last year include those of Slack, PlayStation, Airbnb, FedEx, and Amazon.

The Math & Fun Behind Nesting Event Rules with Event Orchestration

PagerDuty Senior Product Manager Frank Emery joins us on Twitch to talk about Event Orchestration, a new feature in the PagerDuty Platform. We found in our data that 20% of incidents are resolved - by human responders - in under 5 minutes. Why are team members being interrupted for these alerts? Automation is a better answer. Event Orchestration utilizes powerful, flexible rules to turn alerts into automated activities so your team can keep working and avoid unnecessary interruptions!

SauceLabs & PagerDuty Notifications Channel for API Tests & Monitors

APIs are the backbone of the apps and web services that run the world, yet most companies don’t have a true understanding of their functional uptime and reliability. Sauce Labs collects those insights by leveraging functional and integration tests as monitors. This provides a single source of truth for uptime and detailed reporting for when problems occur with functionality or performance. With PagerDuty, Sauce Labs' users gain granular control over notifications to ensure compliance with company policies while centralizing test and incident response processes among developers, testers, and product owners.

Squadcast Earns a Spot on G2's Top 50 Best Software Awards for IT Management Products 2022

We are thrilled to announce that G2 has recognized Squadcast as a High Performer in the Incident Management space and rated us as one of the Best Software for IT Management Products. Over the last three years, G2 has acknowledged our impact in the IT Incident Management space, which led to us being recognized as a Momentum Leader in the Incident Management and IT Alerting categories. Thanks to our learnings from customer feedback, we have been able to shape our product vision and grow further.

Three Common Incident Response Process Examples

What makes an engineering team? Communication, collaboration, process, order, and common goals. Otherwise, they would just be a bunch of engineers. The same is true of their tools. Connectivity and process turn a bunch of tools into a DevOps toolchain. If you need a DevOps toolchain, you can use it to easily build an incident response process.

Slash MTTR, avoid costly downtime with improved cross-team Collaboration

Every second counts when IT teams are called upon to resolve business impacting issues. In modern enterprises, poor communication, fragmented toolchains and spiralling IT complexity can conspire to slow down incident response, putting service availability and ultimately customer satisfaction in peril.

Use your words: the importance of clear writing in product development

The role of an engineer at a startup is a tangled web: as well as writing code, you have to be your own product manager, QA tester, customer support and designer. But there’s another hat that you have to wear which you might not have thought about: copywriter. All products have copy, from welcome messages to text on a submit button. At incident.io, we have to put on our copywriting hats every time we add a new feature.

Sponsored Post

What is MTTR? Resolve incidents faster through ops, alerting and documentation

When downtime strikes any distributed software deployment or platform, it's all hands on deck until the lights are green and service is restored. This process, from the recognition of a problem to a deployed solution, has most commonly been defined as MTTR - mean time to resolution. In just the last few years, DevOps and site reliability (SRE) professionals have developed sophisticated new models for how they work and audit their successes. In 2022, MTTR is one of the most widely used software performance success metrics.
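To make the metric concrete, here is a minimal sketch that computes MTTR as the average time between recognizing a problem and resolving it. The function name and the two sample incidents are invented for illustration; teams differ on exactly which timestamps mark the start and end of an incident.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up sample data: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2022, 2, 1, 9, 0), datetime(2022, 2, 1, 9, 30)),
    (datetime(2022, 2, 3, 14, 0), datetime(2022, 2, 3, 15, 30)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Because MTTR is a plain mean, a single long incident can dominate it, so it is often read alongside the individual durations.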

Now You Can Invoke PagerDuty Rundeck Actions Within the PagerDuty Slack Integration

Last year, we released PagerDuty Rundeck Actions, a PagerDuty add-on product that connects responders to automated diagnostics and remediation for common problems directly in the PagerDuty incident response workflow. After working with our customers and listening to the community, we are excited to announce that PagerDuty Rundeck Actions now integrates with PagerDuty’s Slack integration.

The startup guide to sensible incident management

If you’re working at an early stage startup and looking to get some good incident management foundations in place without investing excessive time and effort, this guide is quite literally for you. There’s an enormous amount of content available for organisations looking to import ‘gold standard’ incident management best practices – things like the PagerDuty Response site, the Atlassian incident management best practices, and the Google SRE book.

Announcing Grafana Incident, smart incident management for your teams

A huge challenge when dealing with incidents is the coordination and communication needed to put things right. What’s happened so far? Who has tried what query? Did we remember to keep stakeholders informed? What is the severity of the incident? Does this affect customers? Figuring this out requires a lot of back and forth as new team members join the incident.

Grafana Incident: First look at the smart incident management tool

Announcing Grafana Incident, the smart incident management tool for your teams. Grafana Incident allows teams to start collaborating immediately by automatically setting up all the essential spaces and resources needed for incident response, from Zoom meetings and Slack channels to a tracker for important tasks and TODO items. A chatbot offers a command-line interface for managing incidents, and provides the ability to instantly embed Grafana queries, dashboards, and metadata, GitHub issues and pull requests, and more. Grafana Incident is available in preview for Grafana Cloud users.

Grafana OnCall is now generally available on Grafana Cloud, with a generous free tier

Today we’re announcing the general availability of Grafana OnCall on Grafana Cloud for all paid and free plans. A big part of delivering great software is ensuring the right people get the right information when the inevitable incidents occur. We want to help you do that with Grafana OnCall, an easy-to-use, developer-first on-call management tool that’s built on top of the Grafana stack you know and love.

Top tips to make Round Robin Scheduling successful for your team

You may have heard of Round Robin Scheduling before and thought to yourself, is this right for my team? Understanding how Round Robin Scheduling can be used and what teams it works best for is important when considering this method of on-call. Additionally, it comes with some pitfalls you’ll want to avoid, as well as best practices to adopt. In this blog post, we’ll share everything you need to know about Round Robin Scheduling within PagerDuty and how to get started.

What is Crisis Management?

Crisis Management is an organization’s process- and strategy-based approach for identifying and responding to a threat, an unanticipated event, or any negative disruption with the potential to harm people, property, or business processes. Preparing for any event that could become a crisis requires a crisis management plan.