March 2022

Heartbeat check and more - is your monitoring still alive?

SIGNL4 is a cloud-based mobile alerting and incident response service. Third-party systems like monitoring tools, control systems or IoT sensors detect abnormalities and transmit events to SIGNL4 over the Internet. What if your systems cannot transmit critical events anymore? That might happen when the Internet is down or when the tool itself has a problem. In this case, SIGNL4 would miss critical events and could not turn them into alert notifications to your IT admins, technicians and experts.
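A common way to catch this failure mode is a heartbeat check: the monitoring tool checks in at a regular interval, and the receiving side raises an alert when the check-ins stop. The sketch below shows only that idea. The `HeartbeatMonitor` class and the 300-second window are assumptions made for illustration, not part of any SIGNL4 API.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when it is overdue.

    Hypothetical helper for illustration; not a SIGNL4 API.
    """

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_seen = clock()  # treat creation time as the first heartbeat

    def beat(self):
        """Record that the monitored system just checked in."""
        self._last_seen = self._clock()

    def is_overdue(self):
        """Return True when no heartbeat arrived within the allowed window."""
        return self._clock() - self._last_seen > self.max_silence_seconds


monitor = HeartbeatMonitor(max_silence_seconds=300)
monitor.beat()            # called whenever the monitoring tool checks in
if monitor.is_overdue():  # polled periodically by the alerting side
    print("No heartbeat received: monitoring may be down")
```

The clock is injectable so the timeout logic can be exercised without waiting for real time to pass.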

Lunch and Learn: Optimizing Outreach

This is the latest in a series of sessions exploring how Everbridge can be used for one key aspect or stage of the response lifecycle. The more people that receive critical life safety messages at the proper time, the safer the community will be. Though that calculus sounds simple, multiple complexities, such as technology or population behavior, prevent most jurisdictions from reaching a significant portion of their population with notifications and alerts.

What's New: Updates to On-Call Management, Incident Response, Event Intelligence, Process Automation, and More!

We’re excited to announce a new set of updates and enhancements to PagerDuty’s Digital Operations Platform. Recent updates from the product team range from On-Call Management and Incident Response to Process Automation and PagerDuty Community & Advocacy Events. New capabilities enable users and customers to resolve incidents faster, and more.

PagerDuty: Event Intelligence for AIOps - Demo!

Noisy alerts and manual remediation can be things of the past. In this video, learn how your team can leverage Event Intelligence, a powerful AIOps solution from PagerDuty that helps teams harness machine learning to reduce alert noise, create context for faster resolution, and remove toil by automating repetitive tasks.

Six Stages of the Business Continuity Management Lifecycle

Business continuity is a crucial part of any scalable operations plan, but many businesses fail to realize how important it is until their first critical emergency. Only then does business continuity management come to the forefront of planning exercises, and stakeholders are forced to reflect on what went wrong, why it went wrong, and determine if they can avoid it happening again, or be better prepared if it does. The true business continuity management lifecycle begins long before an incident.

Building your First Flow - xMatters Support

Creating flows with the xMatters low-code workflow builder, Flow Designer, is simple. To execute a flow you need two things: a trigger and at least one step. The trigger will initiate the flow, and the step will perform an action. However, flows can get as large and complex as your incident process requires and can be used to automate and add intelligence to your resolution processes.

Five Considerations for Choosing Self-Managed Automation vs. SaaS Automation

Sometimes heritage is better than new. Some people favor Coca-Cola Classic over New Coke, and heirloom tomatoes over regular tomatoes. Some Luddites might say the same thing about cloud computing. “I won’t put my (app/data) in the cloud! It will be more (secure | reliable | cheaper) if I run it myself in my own data center.”

SRE vs. Platform Engineering: The Key Differences, Explained

Site Reliability Engineering (SRE) teams and Platform Engineering teams share similar goals -- like maximizing automation and reducing toil -- and similar methodologies. But they have different priorities, and use somewhat different tools to achieve them. What are SREs, what are platform engineers and how is each role similar and different? This article explains.

Flow Designer Overview - xMatters Support

The xMatters low-code workflow builder, Flow Designer, lets you build and execute multi-step processes simply by dragging, dropping, and connecting steps. Steps perform an action in your resolution process. That action could be enriching data from an external tool, creating a ticket in a service desk, or sending an actionable notification. The possibilities are endless.

How to build a strong incident response process

When building an incident response process, it’s easy to get overwhelmed by all the moving parts. Less is more: focus first on building solid foundations that you can develop over time. Here are three things we think form a key part of a strong process. I’d recommend taking these one at a time, introducing incident response throughout your organisation. Just being honest: we’re a startup selling incident management software.

Rundeck + Squadcast Integration: Simplifying Alert Routing

Rundeck is an automation tool that helps to make existing automation, scripts, and commands more secure, auditable, and easier to run. It is a software job scheduler and runbook automation system that automates routine processes across development and production environments. It brings together task scheduling, multi-node command execution, and workflow orchestration. It also logs everything that happens in the system. Squadcast is an end-to-end incident response tool.

Zenduty iOS and Android v3.0

We constantly update the platform to provide the best-in-class experience to our users. These updates are not just something that we feel is right for the client; they are based on the user data, behavior, and requests that our users provide. We are always excited to bring new updates and share them with people, but this one is special! We bring to you Zenduty iOS and Android v3.0!

Quick reads from the Value and Adoption team: The Importance of Self-Healing (and How It Works)

Most folks familiar with BigPanda know that automation is a foundational block of our technology. Our platform automates the entire events pipeline with functions including standardizing and deduplicating alerts, cutting down on the volume of incidents, and automated enrichment that provides better context and alert payloads. But these are all part of an inbound flow of events through integrations.

OAuth Authentication - xMatters Support

OAuth is an open standard that uses tokens to grant access to systems or information without using a password. OAuth authentication authorizes requests to the xMatters REST API by passing a token in the header of your requests. This means you don’t have to store user names or passwords in your applications, keeping your users’ information secure.
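As a rough illustration of passing a token in the request header, here is a minimal sketch using only the Python standard library. It assumes the common OAuth 2.0 convention of an `Authorization: Bearer <token>` header. The URL and token are placeholders rather than a real xMatters endpoint or credential, and the request is only built, never sent.

```python
import urllib.request


def build_authorized_request(url, access_token):
    """Return a request that carries the OAuth access token in its header."""
    request = urllib.request.Request(url)
    # The token travels in the Authorization header instead of a password.
    request.add_header("Authorization", f"Bearer {access_token}")
    return request


# Placeholder values for illustration only (hypothetical URL and token).
request = build_authorized_request(
    "https://example.invalid/api/people", "PLACEHOLDER_TOKEN"
)
print(request.get_header("Authorization"))  # prints "Bearer PLACEHOLDER_TOKEN"
```

Because only the token is attached, the application never needs to hold a user name or password.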

SolarWinds Orion + Squadcast: Alert Routing Made Easy

SolarWinds Orion is a scalable infrastructure monitoring and management platform. It is designed to simplify IT administration for on-premises, hybrid, and software as a service (SaaS) environments, in a single pane of glass. SolarWinds Orion ensures you do not have to struggle with numerous incompatible point monitoring products, as it consolidates the full suite of monitoring capabilities into one platform with cross-stack integrated functionality. Squadcast is an end-to-end incident response tool.

Continuous Availability: How It's Changed, and Why It's Critical

Remember when Slack went down in early January? The three-hour outage, set off by AWS capacity issues, cost the company an untold amount of money. And the effects rippled across the enterprise. The outage devalued the company’s stock and seemed to send all 142,000 of its customers to Twitter to gripe. This high-profile outage is just the most recent of many outages highlighting the critical nature of continuous availability. And there’s only one answer to the problem.

Sponsored Post

Orchestration vs Automation: Which Does Your Business Need?

Digital transformation is accelerating rapidly to include virtually all enterprise functions. Organizations of all sizes, across all industries, are leveraging digital technology to enhance customer service and improve work efficiency. Integrating automation into core business functions has become a must to stay aligned with the ongoing digital revolution. The growing migration to the cloud has resulted in the distribution of company data and applications across multiple locations. This means that many complex business processes must leverage IT resources from the cloud and on-premises. This is where automation and orchestration can greatly improve the performance and efficiency of these complex tasks.

Video: How to configure and customize Grafana OnCall

Managing your on-call rotations just got a little less stressful. With Grafana 8.0, we introduced unified alerting, which centralizes alerting information into a single, searchable view. With the introduction of Grafana OnCall, an easy-to-use on-call management tool available in Grafana Cloud, you can now extend the alerting workflow in Grafana to ensure that the right notifications reach the right people at the right time using the right method.

The Six Trends Overwhelming IT Ops - and What to Do About Them

IT Operations is experiencing lightning-fast change right now. From the emergence of cloud computing to the explosion of data—not to mention ever-present cyber threats—every day is a new day for IT Ops. At BigPanda, we’re laser-focused on making life easier for IT Ops teams, which means we’re staying on top of all this change to help IT Ops keep up.

PagerDuty Runbook Automation Joins the PagerDuty Process Automation Portfolio

Spring is blooming here at PagerDuty, and so is our automation product line. We’re thrilled to share some exciting product announcements. First, we’ve officially rebranded our automation product line, Rundeck®, as PagerDuty® Process Automation. Fundamentally, everyone who buys Rundeck becomes a PagerDuty customer, so we decided to make it less confusing.

What's a fair compensation for being on-call?

For the vast majority of organisations, it’s necessary to have some form of round the clock cover to support the business. Whilst it’s most commonly a concern for engineering, it’s increasingly common to have folks from various disciplines available out-of-hours. Irrespective of role, compensating people fairly is an important factor of running a healthy and effective on-call system.

Top 5 Takeaways From HIMSS 2022

The HIMSS 2022 conference, hosted in Orlando, was an enjoyable, insightful experience with sessions that covered many healthcare topics, such as business operations, data and information, care delivery, policy and technology. Education sessions were led by renowned thought leaders who examined challenges and trends in modern healthcare. Here are five takeaways from our HIMSS 2022 experience.

Exploring Enterprise Alert User Types

Enterprise Alert offers the ability to get in touch with your users in multiple ways to fit your business needs. We at Derdack pride ourselves on being customer first when it comes to not only product enhancements and features but also support and building a customer/vendor relationship that lasts for years. We want to ensure that you and your users have the abilities needed to handle any situation that may arise.

Honeycomb + Squadcast Integration: Routing Incident Alerts Made Easy

Honeycomb is an application monitoring tool that helps DevOps and SRE teams to operate more efficiently by offering rich observability solutions and intuitive team collaboration. It helps understand complex relationships within your distributed systems and troubleshoot issues accordingly. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all the incident response activities.

Salesforce Cloud + Squadcast Integration: Routing Detailed Incident Alerts

Salesforce Cloud is one of the leading cloud-based customer relationship management (CRM) solutions. It provides a shared view of your customers and their relationship with the business. With Salesforce Cloud, users can automate service processes and streamline workflows. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all the incident response activities. Squadcast aligns your teams towards a common organizational goal of better reliability.

A B2B sales stack from Seed to Series A

I joined incident.io recently to lead Sales, after having set up my own company. In both startups, one of the first questions I’ve landed on was: “What sales tools should we use as we scale?”. In this post, I’ll walk through our sales stack, and by extension, what I think most B2B SaaS startups can get away with using when they have fewer than ~100 employees.

Closing the Gap: Deploying Automation the Right Way

Automation in the enterprise is nothing new. Engineers have been working with automation tools and frameworks for decades. From configuration management tools, to continuous integration and delivery pipelines to cloud formation, you name it—automation is part of the fabric of nearly any technology use case in the business landscape. If the previous statement is true, then why does automation still seem to pair with so much manual work?

All or none; Things to consider before major code refactoring

You just hired a bunch of superstars who are in tune with the latest industry trends. They roll up their sleeves and get to work. In a few days, they point out some systemic issues in your codebase that are making it difficult to expand current capabilities, and chalk out a plan for significant refactoring. But there’s a caveat - the refactor work will be a blocking issue for your near future roadmap.

The Anatomy of a Rollback Deployment Workflow

Your new release tested fine on staging, but it’s not playing nicely with applications and services in the wild. Your monitoring application notices something going wrong and raises the alarm. But often raising the alarm isn’t enough – to solve complex issues, you might need to roll back to the last good deployment while you figure out the root cause and get multiple people working together on the solution.

Configuring an External Conference Bridge - xMatters Support

The Conference Bridge page is used to configure externally hosted conference bridges for use in your workflows. An external conference bridge is hosted through a third-party provider rather than xMatters. You can use an external bridge to connect to Webex, InterCall, GoToMeeting, or other conference hosting solutions.

FireHydrant is now on Microsoft Teams

Engineering teams can now manage incidents in Microsoft Teams. You’ll have the consistent process and automation of FireHydrant right in the messaging tool you use every day. Effectively run through the entire incident response lifecycle: declare and manage incidents, collaborate with stakeholders, and resolve incidents faster when you integrate FireHydrant with Microsoft Teams.

Severity Levels (What They Are & Why They Matter)

Wondering about severity levels? We explain what incident severity levels are, how to classify them, and how they will affect your incident management process. What are severity levels? Incident severity levels are the measure of the impact an incident will have on a system. In general, a lower number severity level, such as SEV-1, denotes a higher impact on the system.

Lightstep Incident Response: Helping teams reduce downtime

Downtime—especially in customer-facing services—can cost businesses thousands of dollars an hour and incalculable customer trust. No company can afford to pay this price. To reduce downtime, software engineering teams must act quickly and decisively. But that’s easier said than done. With Lightstep® Incident Response, generally available from ServiceNow today, we're unlocking speed, agility, and productivity for your engineers and your software-powered business.

FireHydrant is now free for small teams

We envision a world where all software is reliable, and today we’re making that vision more of a reality for small teams. Available today, our new Free Tier helps smaller teams wrangle their reliability challenges with our enterprise-grade Incident Management, Service Catalog, and communications products. Our new package also has every feature that makes FireHydrant great with generous limitations.

Overheard at Bamboo Lounge: Making sense of IT Ops KPIs

Every IT Ops team uses key performance indicators (KPIs) to track metrics that keep them accountable, improving, and contributing to long-term success. But it’s easy for teams to lose sight of what KPIs to use, how many they should use, and how to derive meaning from them. To shed light on what constitutes a meaningful KPI, Sterling Nostedt, BigPanda’s Value and Adoption advisor, hosted a community conversation which spanned across multiple industries.

Whiskey and wisdom: AIOps as a strategy

Whiskey and Wisdom is a monthly executive-only forum where ITOps leaders can network independently and discuss high-level AIOps and ITOps strategies with their industry peers. In our most recent session, the discussion was geared specifically towards AIOps—its hype and its reality. Here are some quick value takeaways from the conversation.

Rolling out Roles

We’ve been pretty lucky at incident.io to be able to avoid dealing with more complex authentication issues for quite a while, because we piggy-back on Slack to know who you are and which organisation you work in. Whole companies have been built around doing authentication and user profiles really well, so it was pretty neat to be able to avoid doing most of that work for so long!

When to hire an Incident Commander

What comes to mind when you hear the term 'incident commander'? You are not alone if you think about fancy, tri-cornered hats, well-polished shoes, and a uniform weighed down by medals. The roles of incident commander, incident manager, or technical escalation manager have been typical in large organizations but are gaining popularity in smaller companies. For the purposes of this article, we will use the term 'incident commander,' but any of the above titles could work.

What Does AIOps Mean for SREs? It's Complicated.

If you’re an SRE, you might view AIOps with great excitement. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier. Alternatively, SREs may choose to view AIOps with disdain. They might think of AIOps as just a fancy buzzword that doesn’t live up to its promises, and that can become a distraction from the SRE tools that really matter. Which perspective is right?

Handoff Communication in Healthcare

Handoff communication occurs when a patient is transitioned from one care setting to another. Communication is central to handoffs, and clinical staff are expected to share comprehensive details about the patient’s health to the next care provider in charge. During handoff, sensitive information is passed in real time to another care provider during changes in shift or care setting.

Metrics for Problem and Incident Report Managers

Many IT professionals are familiar with the popular metrics and measures of IT operational success. Such metrics as Customer Satisfaction, Average Handle Time and First Contact Resolution are typically memorized by service desk managers and stored for quick reference during planning and other types of meetings. But how do we measure the effectiveness of the processes that support those teams?

What You May Not Know About Major Incident Management

You likely deal with major incidents regularly, but do you know who first coined the term? You also probably use the best tools on the market to help you fix those incidents, but do you know what some of the first tools were? When incident management is part of your day-to-day, it’s easy to think you know it all. But we have a hunch that there are some interesting facts that haven’t crossed your mind yet!

How Digital Operations Empower Value Stream Management

Reliability, scalability, and innovation are three terms at the forefront of any discussion about how businesses can achieve long-term success. When you put those three together, you create a business that’s capable of producing the best possible product with the least amount of waste, known simply as a lean enterprise. Being a lean enterprise is the ideal state for most organizations but becoming one can be an ambitious all-hands-on-deck undertaking. The best way to do this?

What Is Microsoft Azure Sentinel and Why Is It Important?

Microsoft Azure Sentinel is an intelligent, next-generation security information and event management (SIEM) solution designed to detect threat anomalies. Azure Sentinel is also categorized as a security orchestration automated response (SOAR) service that expedites the incident detection and event response process for cybersecurity teams. Azure Sentinel provides an extra layer of security to protect critical resources across an organization.

Three communications best practices for incident handlers

The importance of well-managed communications when handling IT and security incidents cannot be overstated. If updates are not communicated in a timely and accurate manner, misunderstandings, misalignment, and costly errors will occur. Not to mention, resolution will be prolonged. And if highly sensitive information is communicated to those who should not be privy to such, then the risk of legal ramifications is high, as would be the damage.

ServiceNow + Squadcast Integration: Automate IT Ticketing and Project Tracking

ServiceNow is a workflow automation platform used by organizations for their IT ticketing and project management needs. In contrast, Squadcast is an end-to-end incident management and SRE platform that is used by organizations for their reliability requirements.

What SREs Can Learn from Capt. Sully: When to Follow Playbooks

When are you smarter than your playbooks, and when are your playbooks smarter than you? That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested, and that we should stick to them at all costs.

Incident Response Lifecycle | A Complete Explanation

Wondering about the incident response lifecycle? We explain what it is, and how each phase helps lead to effective incident resolution. What is the incident response lifecycle? The incident response lifecycle is an organization’s framework for responding to an incident that disrupts service. The incident response lifecycle contains the following phases.

Monthly Moo March 2022

What a start to 2022 it has been for us all. We are incredibly proud of the continuous innovation, velocity and delivery of new features and functionality. We’ve heard success story after success story from our brilliant customers, each unique in their own way, and we continue to collaborate with them on our roadmap. So, this March update is for you, along with a massive thank you. We couldn’t do it without you, and it’s been our honor to be part of your success.

xMatters Overview - xMatters Demo

Join Stephen Walters, Solutions Architect and DevOps Institute Ambassador, and Daniel Topham, Solutions Architect, as they guide you through a high-level demo of the xMatters solution. See how xMatters sends alerts to the right users at the right time and enriches notifications with relevant data. And, learn how easy it can be to use Flow Designer to integrate different tools and software to create innovative workflows with drag and drop capability.

Workflow Form Layout - xMatters Support

In xMatters, the form layout is where you customize the content and options that are available to the message sender. You can use the form layout to do things like predefine recipients for your messages, add a conference bridge, attach documents, specify a customized sender display name, or add a map that the sender can use to target users at specific sites.

Amplify Artifactory and Distribution Changes Through PagerDuty

When automated software delivery runs smoothly, it can whisper, and quietly attend to itself. But when your delivery and distribution pipeline runs into a problem, it must shout. Boosting the volume of Artifactory and Distribution change events and issues through PagerDuty can help ensure they’re heard by everyone whose job it is to monitor your software delivery pipeline.

Kubernetes Health Check Using Probes

Kubernetes is an open source container orchestration platform that significantly simplifies an application's creation and management. Distributed systems like Kubernetes can be hard to manage, as they involve many moving parts and all of them must work for the system to function. Even if a small part breaks, it needs to be detected, routed and fixed. These actions also need to be automated. Kubernetes allows us to do that with the help of readiness and liveness probes.

Mastering Digital Operations Across the Enterprise

I’m excited to announce that today, PagerDuty is taking our automation capabilities to new scale and scope as we enter into a definitive agreement to acquire Catalytic. With their technology and talented team we accelerate the delivery of enterprise-wide process automation that manages no-code workflows across the business, broadly applicable to any workflow, for any employee.

Postmortems Now Called Retrospectives in Blameless

Something big happened at Blameless this month — our “Postmortem” feature was updated to its new name, “Retrospective”. To the naysayer, I suppose you’re thinking, This seems trivial. Different teams call it different names anyway, so why bother making the change? First let me say, thank you for reading our blog and I hope you finish this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Customizing Error Pages (Nginx Ingress Controller)

The most common way to do it, which is part of the official solution, is to create a Docker image for a server capable of responding to any request with 404 content, except /healthz and /metrics. This could be an Nginx instance. /healthz should return 200. /metrics is optional, but it should return data that is readable by Prometheus in case you are using it for k8s metrics. Note: Nginx can provide some basic data that Prometheus can read. / returns a 404 with your custom HTML content.
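The routing rules described above can be sketched as a small standalone server. This is a minimal illustration using Python's standard library rather than the Nginx instance the post mentions. The HTML body, the `default_backend_up` metric name, and port 8080 are all assumptions chosen for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder page; replace with your own custom HTML.
CUSTOM_404_HTML = b"<html><body><h1>404 - Page not found</h1></body></html>"


def route(path):
    """Map a request path to (status, content type, body)."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path == "/healthz":
        return 200, "text/plain", b"ok"
    if path == "/metrics":
        # Optional endpoint: one illustrative gauge in Prometheus text format.
        return 200, "text/plain; version=0.0.4", b"default_backend_up 1\n"
    # Everything else, including "/", gets the custom 404 page.
    return 404, "text/html", CUSTOM_404_HTML


class DefaultBackendHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, content_type, body = route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the server; the default port is an assumption for this sketch."""
    HTTPServer(("", port), DefaultBackendHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the routing in a plain `route` function makes the 200/404 behavior easy to check without opening a socket.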

Alert Fatigue in SRE: What It Is & How To Avoid It

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume they receive, and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.

The BigPanda ScaleUp Journey: Human/AI Collaboration, Predictive Accuracy, and Scale Power in AIOps

At the beginning of the COVID-19 pandemic, we anticipated a slow-down in IT-related spending. In reality, the opposite occurred. Companies massively expanded their digital offerings using the same IT staff they’d had pre-pandemic, even as the teams lost access to many of their existing tools while working from home. This acceleration put immense pressure on IT teams everywhere, resulting in messy incident management, outages, and a huge shortage of talent.