Operations | Monitoring | ITSM | DevOps | Cloud

April 2022

7 Skills Leaders Must Master for Effective Response to Critical Events

Some critical events may be familiar to organizations; they may happen repeatedly or even on a set schedule. Others may present new challenges that responders haven’t seen or experienced before. In a worst-case scenario, events could even happen concurrently, forcing responders to split their attention while trying to anticipate and account for the combined effects.

Interlink Software: Enterprise AIOps Platform Mobile App

To protect the availability of the services your customers rely on, AIOps adoption is an imperative for large enterprises. Interlink Software’s AIOps platform applies machine learning to automate ITOps: reducing alert noise and performing event correlation, anomaly detection, and root cause determination. As the world emerges from the Covid-19 pandemic, organizations are increasingly embracing the flexibility of home and hybrid working.

Post-Incident Review | Why It's Important & How It's Done

Curious about the post-incident review process? We give a complete explanation of post-incident reviews and why they are important and discuss best practices. What is a post-incident review? A post-incident review is an evaluation of the incident response process. The goal of the process is to have clear actions to improve the incident response process and to also help prevent further incidents.

AlertOps And BMC Partner To Reduce Incident Resolution Times

Chicago, IL – April 27, 2022 – AlertOps, a major incident response orchestration platform, today announced a technology integration partnership with BMC Helix, a service management platform. This new relationship empowers Helix users with intelligent alerting, advanced escalation policies, schedule management, and workflow automations for complex enterprise teams to rapidly remediate major incidents.

Monthly Moo | April 2022

We are well into 2022 and are busy bringing new exciting features to market. Our customers continue to provide input into our product roadmap and many new features are based on this collaborative effort. A big thank you to our valued customers. Throughout the year we will continue to drive innovation and allow our customers, of all sizes, to implement the most advanced AIOps solution in the shortest time possible.

Logbook: Team Discussion and Full Incident History

We've launched a feature that will help you fix errors and performance issues as a team! 🎉 With Logbook you get the full incident history. Read and leave team comments, see which notifications were sent at what time, and see team activity for changes in incident states. It's now easier than ever to see what the current state of an incident is.

Get more from your Jira integration with custom field support

When FireHydrant originally launched our Jira Cloud and Jira Server integrations, we did not support custom fields. This prevented customers who rely on Jira epic ticket types or other custom required fields from getting full value from our Jira integrations. That has changed with the launch of Jira custom field support. We now support the most common type of Jira epic tickets and field-level mapping of Jira custom fields with FireHydrant incident data.

mooving to... Practical Post-Mortems | Thomas Duran, Senior Manager of Productivity, from Panther

Post-mortems are a common practice amongst many organizations, but not everyone knows how to make the most out of an incident. Thomas Duran, Senior Manager of Productivity, from Panther joins us to discuss how to leverage post-mortems to effectively learn from failure.

Jira Integrations with Blameless platform

In this video, our Solutions Engineer walks you through the steps of creating Jira tickets and follow-up actions in Blameless. You'll learn how to leverage our Slack integration for quick ticket creation and also how to create tickets from within the Blameless platform. You'll also see how closing a ticket in Jira will automatically close a ticket in Blameless. Additionally, you'll discover how to manage open tickets and incidents in an organized Blameless dashboard.

Service dependencies help you instantly discover all services impacted by an incident

When an incident happens, most organizations have a way of identifying all affected services. The trouble is, it’s often a human-centered process that depends on the knowledge of key individuals or manually updated documentation. There might be a version in your alerting tool, a version in your corporate Wiki, and a different version still in your team’s head.

Collaboratively author retrospectives with our new Google Docs integration

When it comes to learning from incidents, your tools should adapt to the way your organization works. Many of you conduct your retrospectives in rich-text document editing tools, like Google Docs. That’s why we’ve introduced the option to export your retrospectives to Google Docs. Retrospective export to Google Docs can be automated as part of your incident management process with a Runbook step.

Continuous Availability vs. Continuous Change

All companies are going through some form of cloud adoption - whether cloud migration for the first time, hybrid cloud adoption, or extending cloud-native with newer microservice architecture. But, according to a recent survey by Aptum, only 39% of companies are completely satisfied with their current rate of digital transformation. Cloud adoption projects create a continuous state of change for engineering teams juggling to keep things up and running while limiting the impact on customers.

Sponsored Post

Infrastructure monitoring using kube-prometheus operator

Prometheus has emerged as the de facto open source standard for monitoring Kubernetes implementations. In this tutorial by Squadcast, Kristijan Mitevski shows how to install and configure infrastructure monitoring for your Kubernetes cluster using the kube-prometheus operator, display metrics with Grafana, and configure alerting with Alertmanager. The blog also covers how the Prometheus Alertmanager cluster can be used to route alerts to Slack using webhooks.

Build custom API integrations with incident.io

We’re building incident.io as the single place you turn to when things go wrong. When an issue is disrupting your business-as-usual, the last thing you want is to start opening ten different tools to diagnose and fix it! As your central incident hub, we need to give you two powers: Workflows cover the former. Workflows are like a mini incident.io Zapier.

How to Pick the Best Incident Response Software

With the rising complexity of our digital ecosystems, incidents are occurring at an unprecedented rate. To combat the additional strain, incident responders are looking to software to help them establish a scalable, repeatable incident response process that reduces toil and noise and gets the right people on the scene at the right time. The best incident response software addresses the entire lifecycle of an incident.

AIOps' certainty in an uncertain future

BigPanda’s recent coronation as a Unicorn has prompted its leaders to look to the future of IT Operations and how it relates to artificial intelligence (AI) and machine learning (ML). What is BigPanda’s role in improving IT Ops? How can AIOps contribute to greater achievement in global enterprises? These are questions a VP of Product Marketing like BigPanda’s Mohan Kompella, who has spent 15+ years in IT Operations, has been asking.

Sponsored Post

Your Goals Could Be Holding Your DevOps Teams Back

In the era of Agile, organizations are increasingly moving their IT service management teams toward a DevOps world. There are significant challenges to transforming ITSM to DevOps, but one of the most significant is goal setting. In today's fast-paced business environment, it's important to establish the parameters for measuring success and determine which objectives teams need to meet to accomplish business goals.

SRE: From Theory to Practice | What's difficult about on-call?

We launched the first episode of a webinar series to tackle one of the major challenges facing organizations: on-call. SRE: From Theory to Practice - What’s difficult about on-call sees Blameless engineers Kurt Andersen and Matt Davis joined by Yvonne Lam, staff software engineer at Kong, and Charles Cary, CEO of Shoreline, for a fireside chat about everything on-call. As software becomes more ubiquitous and necessary in our lives, our standards for reliability grow alongside it.

Accelerate AIOps Scalability With New Self-Service Incidents API

BigPanda offers a diverse set of APIs to enterprises looking to move faster and scale incident response workflows seamlessly. APIs are core to automating repeated incident response workflows that enable IT Ops to keep up with the pace of change and innovation agile teams need to thrive. In Q4 of 2021, BigPanda announced the general availability of new self-service APIs including an updated Incidents API.

How Well Does Your Infrastructure Support Major Incident Management?

Effective major incident management depends on many things, including planning, precise execution, effective communication, and applying learnings from previous incidents to update those plans. Traditional major incident management wisdom addresses the importance of the remediation process, but it doesn’t speak on the issue of configuring your IT infrastructure.

SRE Adoption | A 2-Year Retrospective (From A Business Point-Of-View)

This month I hit my 2-year anniversary with Blameless and as our industry progresses and matures, I thought it would be a good opportunity to look back and review how far we have come and also ruminate on where we’re headed. Our shared vision at Blameless is to help engineering teams adopt reliability practices with ease and advance to a resilient culture.

Featured Post

The State of Incidents and Site Reliability: Q&A with Blameless SRE Architect Kurt Andersen

In the latest of an occasional series, today we hear from Kurt Andersen, SRE Architect at Blameless, discussing the evolution of incident management, current trends in site reliability affecting engineering teams, as well as an update on how Blameless is addressing the needs of SRE and DevOps.

Podcast: Break Things on Purpose | JJ Tang: People, Process, Culture, Tools

For this episode we’re continuing to “Build Things on Purpose” with JJ Tang, co-founder of Rootly, who joins us to talk about incident response, the tool he’s built, and his many lessons learned from incidents. Rootly is aiming to automate some of the more tedious work around incidents, and keeping that consistency. JJ chats about why he and his co-founder built Rootly, and the problems they’re trying to fix and eliminate when it comes to reliability.

Service level objectives: How SLOs have changed the business of observability

Forget the latest tech gadgets and the newest products. One of the most talked about trends in observability right now? “SLOs have really become a buzzword, and everyone wants them,” said Grafana Labs principal software engineer Björn “Beorn” Rabenstein on a recent episode of “Grafana’s Big Tent,” our new podcast about people, community, tech, and tools around observability.

Improve IT Operations with Response Analytics

Your IT team just finished resolving a complex incident, customer service finished their last call about the issue, and your business is back to being fully operational. Now that the storm has passed, you should be planning a postmortem to determine the cause of the incident and lessons learned. Postmortems require specific data that can highlight where your team is succeeding and where they can improve.

What's behind BigPanda's customers' success?

As the Regional VP of Customer Success for the West and Central Region at BigPanda, Chris LaPierre gets a unique opportunity to see first-hand how BigPanda customers use their AIOps platform. Charged with ensuring every BigPanda customer derives high value and return on investment from the solution, BigPanda’s customer success teams make certain customers leverage the AIOps platform to increase their bottom line.

Outage Alert: Top 5 Outages of Q1 2022

By now it’s no secret that system outages and website downtime are more widespread and frequent than ever. In fact, the frequency of outages jumped 9% in just the first week of 2022. This can be attributed to a rapid increase in traffic and reliance on tech infrastructures – resulting in connectivity, server, and other technical issues that are alternately unforeseen and unavoidable.

Managing Burnout | Tips To Minimize The Impact

Burnout is real. Today, the source of burnout can be anything from pandemic fatigue, to the onslaught of political divisiveness, or simply the pace of life worldwide. Whatever the culprit, we’re living in a stressful time. People working in cloud native environments definitely feel burnt out. Silicon Valley investor Marc Andreessen famously said, “Software is eating the world,” and that seems to be quite true. High demand is fueling churn. System and cloud operators feel pressure.

Accelerate incident investigations with Log Anomaly Detection

Modern DevOps teams that run dynamic, ephemeral environments (e.g., serverless) often struggle to keep up with the ever-increasing volume of logs, making it even more difficult to ensure that engineers can effectively troubleshoot incidents. During an incident, the trial-and-error process of finding and confirming which logs are relevant to your investigation can be time consuming and laborious. This results in employee frustration, degraded performance for customers, and lost revenue.

Product update: ensure consistent data across all your retros with two new features

FireHydrant captures your incident, from declaration through remediation, and gives you a framework to run your retrospectives. But retrospectives are only as effective as their inputs. Now we're delivering a better way to learn from and analyze retrospectives by guaranteeing consistent, structured, and sufficient data from your team.

OnCallogy Sessions

Being on call is challenging. It’s signing up to be operating complex services in a totally interruptible manner, at all hours of the day or night, with limited context. It’s therefore critical to have proper on-call on-boarding procedures, offer continuous training sessions, and continuously improve documentation. We also need to make sure people feel safe by providing ways to reduce their stress, and make room for questions to surface all sorts of uncertainties around our operations.

The Pros and Cons of Embedded SREs

To embed or not to embed: That is the question. At least, that’s one of the questions that companies have to answer as they decide how to implement Site Reliability Engineering. They can either embed SREs into existing teams, or they can build a new, separate SRE team. Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

Conflict Management and the Major Incident Management Process

Major incidents are, by their very nature, stressful and intense. The ITIL 4 definition of a major incident is: High-stress situations can cause conflict that left unchecked could delay the fix effort. Since we already have a definitive guide on incident management, this blog post will focus specifically on the major incident management process.

xMatters remains a G2 Grid Report Leader

Worldwide businesses and their technical resources use G2, the leading business solution review platform, to analyze software, gather user feedback, and make informed decisions about technology. Although we value all the recognition we’ve earned on G2 over the years, there’s one that always stands out and makes us feel extra proud of what we’ve accomplished so far.

Debug issues and automate remediation with Shoreline and Datadog

Shoreline is an incident response automation service that enables DevOps engineers and site reliability engineers (SREs) to quickly debug and remediate issues at scale and develop automated routines for incident management. Using Shoreline’s proprietary Op language, customers can run debug commands across all their hosts simultaneously and then deploy custom scripts via Actions to trigger automated remediations.

How to use PagerDuty with Blameless

Blameless integrates with PagerDuty so you can notify teams and key stakeholders during an incident. We also help you search escalation policies and on-call rotation schedules. In this video, our Solutions Engineer walks you through navigating the initial setup and configuration in the Blameless UI. He'll then demonstrate how the integration works in real-time. If you use Slack or Microsoft Teams for internal communications, you'll also learn how to access and manage the PagerDuty integration from within those tools.

Making Go errors play nice with Sentry

Here at incident.io, we provide a Slack-based incident response tool. The product is powered by a monolithic Go backend service that serves an API powering Slack interactions, serves an API for our web dashboard, and runs background jobs that help run our customers' incidents. Incidents are high-stakes, and we want to know when something has gone wrong. One of the tools we use is Sentry, which is where our Go backend sends its errors.

Four Use Cases for Optimizing Your Cloud Operations With PagerDuty Runbook Automation

The cloud is easy and powerful—until it’s not. Once companies have customers, commitments, and compliance concerns, they often have to create cloud operations teams to manage the cloud on behalf of their fellow employees. Often, organizations that migrate to the cloud find themselves hampered by inefficient cloud operations if they haven’t standardized their IT procedures for operability.

Keep Stakeholders Informed During Major Incidents

During major incidents, it’s crucial that all stakeholders are provided with the status updates they need. Those communications, however, need to be tailored to what the stakeholder actually needs, and provided in a streamlined format that works best for them. Just like alert fatigue, communication fatigue can be detrimental during an outage or other service reliability issue.

What BigPanda's recent funding means for our customers

The effects of BigPanda’s most recent round of funding—amounting to $190 million—will be reverberating throughout the company for years to come. And it’s not just BigPanda employees who have experienced a surge of enthusiasm in the wake of our Unicorn status. Our customers are thrilled at the prospect of more innovation from our team and new products that help them automate and evolve.

Incident management best practices: before the incident

When incidents inevitably occur in your software stack, managing them well could be the difference between losing customers and building trust with them. In this article, we’ll give you and your team some best practices on how to prepare for managing incidents. It’s crucial to define service ownership and a declaration process, and to practice all of it. With a little planning now, you'll be able to cut your incident response time drastically.

Freshdesk + Squadcast: Enabling Streamlined Incident Response for Enterprises

Freshdesk is a cloud-based customer service platform used by enterprises that provides a centralized help desk (with the help of support tickets) across multiple channels, including email, phone, chat, and social media. Squadcast is an incident management platform that integrates with major monitoring, ChatOps and project management tools to provide a centralized place for reliability.

Sponsored Post

Show character with Blameless Postmortems (part one)

This is Part 1 of a two-part series on Blameless Postmortems. Today, we'll discuss why blameless postmortems are so important and their implications for your team; the second part will go into detail on how to set them up as a process and make them successful. Somebody wise may have once told you that how we handle adversity shows our character. Being able to acknowledge and admit mistakes is the first step towards learning - it's a key part of success both in personal relationships and in large companies.