October 2022

On-call compensation in IT

On-call is a special working hour arrangement under employment law. It comes into effect when the employee is obliged to be contactable at least by phone, so they can start work in an emergency. On-call duty is generally counted as time specifically meant for work purposes. In practice, this means that employees are normally not allowed to work while on-call. However, there may be exceptions. For example, on-call employees may also work from home if they can be reached through their work device.

Ensuring visibility with monitoring tools in 2023

Not long ago, monitoring tools were nice-to-have additions with few uses. However, as technologies scaled up and became more complex, keeping track of all the systems and their health became a huge challenge. As more and more brands started offering new digital services and migrating their existing platforms, competition skyrocketed, and staying on top of system health and proactively resolving potential incidents became crucial.

Ghouls and Goblins Beware: You Do Not Stand a Chance Against AIOps

It is getting spooky out there, folks! Every year on October 31, we don our spookiest (or silliest) garb, an evolution of old practices where people would dress up to ward off ghouls, goblins and all manner of things that go bump in the night. After all, people believed these pesky spirits stirred up trouble. While pieces of this spooky tradition persist, just a few other things have changed in the past 2,000 years. For starters, we are a digital society.

Why 'owning Services' is critical for effective Incident Response

There is a famous quote that goes like this…‘For every minute spent organizing, an hour is earned.’ At least in the world of incident response, nothing is more apt than this. Digital infrastructure these days is made up of multiple services, and an outage could result from one impacted service or several. So it's essential to have a catalog of all the services, along with the point of contact (service owner) responsible for maintaining each one.
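A service catalog of this kind can be sketched in a few lines. The example below is a minimal illustration, not how any particular product implements it: the `Service` class, the `CATALOG` entries, the `contacts_for` helper and all names and addresses are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner: str    # team or person responsible for the service
    contact: str  # how to reach the owner during an incident


# Hypothetical catalog entries; names and addresses are placeholders.
CATALOG = {
    service.name: service
    for service in (
        Service("checkout", "payments-team", "payments-oncall@example.com"),
        Service("search", "discovery-team", "discovery-oncall@example.com"),
        Service("login", "identity-team", "identity-oncall@example.com"),
    )
}


def contacts_for(impacted_services):
    """Return (contacts to notify, services missing from the catalog)."""
    contacts, unowned = [], []
    for name in impacted_services:
        service = CATALOG.get(name)
        if service is None:
            unowned.append(name)  # no owner on record: a gap worth fixing
        elif service.contact not in contacts:
            contacts.append(service.contact)
    return contacts, unowned


contacts, unowned = contacts_for(["checkout", "login", "billing"])
print(contacts)
print(unowned)
```

Returning the services that have no catalog entry alongside the contacts makes ownership gaps visible during an outage instead of silently dropping them.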

incident.fm, post-incident processes, and Crocs

As usual, it’s been all systems go at incident.io this month. New joiners, new features and new swag (yes, you heard right!). But most excitingly, we launched our new podcast this week. We had a blast recording it - we hope you enjoy listening to it just as much. Here’s a round-up of some of this month's highlights…

What are the Best Practices to Improve the Incident Management Process?

DevOps and IT operations teams employ the incident management process to respond to an unanticipated event or service outage and return the service to operational status. In the ITIL framework, it is a mechanism that links end-users and the IT department for more effective incident response. A robust incident management system in any company will allow an employee to raise a ticket detailing the issue they are facing.

Routing alerts from AWS Elastic Beanstalk via CloudWatch

Amazon Web Services (AWS) offers 100+ services, each focusing on a specific area of functionality. However, it can be challenging to pick the right services for the task and also to provision them. AWS Elastic Beanstalk lets you easily deploy and manage applications without the need to learn about the underlying infrastructure that runs these applications.

Everbridge Live: Using AI Ops to Filter Out the Noise

IT and digital teams around the globe are constantly being asked to reduce downtime, do more with less and increase their productivity. Sometimes that is more than an already considerable workload allows. Using a combination of xMatters and Moogsoft you can avoid outages, meet SLAs, reduce error budgets, and help accelerate the digital transformation of your business using AI and machine learning by identifying and resolving incidents before they become service incidents.

What's New: Updates to Incident Response, PagerDuty Process Automation Software & PagerDuty Runbook Automation, Integrations, and More!

We’re excited to announce a new set of updates and enhancements to the PagerDuty Operations Cloud. Recent development and app updates from the product team include Incident Response, PagerDuty® Process Automation, as well as Community & Advocacy Events updates. We continue to help customers further automate to optimize cloud operations and reduce the number of issues escalated to other teams.

Fast track video series: Extracting alert data from emails

With BigPanda’s self-service Email Parser, extracting alert data from emails has never been simpler. In our latest video in the Fast track series, we explore the benefits of this tool. This parser is ideal for monitoring tools and systems that do not support a REST API and/or rely solely on email to generate and send alerts. So no matter what tools your organization utilizes, this feature can help you turn all of those alert emails into actionable incidents within BigPanda’s platform.

Do You Understand Your Essential Business Processes?

Before you can choose the proper tools for your organization, you have to understand its essential business processes. Once you know an essential business process, you can review software applications that will help make your organization more efficient and accurate. Unfortunately, many organizations do not understand their essential business processes. This makes it nearly impossible for them to streamline their organizations, which puts them at a disadvantage in the marketplace.

A deep-dive into event correlation

Event correlation is a powerful capability that can help reduce IT noise, detect incidents in real-time, and improve the performance of critical applications and services. Read on for a deep dive into event correlation as we explore everything from its origins to its current state-of-the-art techniques. We’ll also discuss how event correlation fits into the bigger picture of integrated service management.

The Roblox Outage

Just before Halloween 2021, Roblox engineers experienced a horror story: a service outage that also took down critical monitoring systems. It seemed like the issue was a hardware problem, but it wasn’t. Users were frustrated, and the clock was ticking. After three full days of downtime, service was finally restored on Halloween day. While the incident itself was an IT nightmare, Roblox’s detailed technical post-mortem several months later was an excellent way to bounce back.

How to build a successful on-call team - incident.fm

In this podcast, our panellists discuss what it means to build a successful on-call team. Drawing on their experiences at fast growing start-ups and scale-ups, incident.io co-founders Pete and Chris cover everything from who should be on the rota and how to build a compassionate on-call culture, to compensation structures and tips for operationalising on-call.

PagerDuty and DataOps: Enabling Organizations to Improve Decision Making with Better Data

Many organizations have been digitally transforming their operations and the majority of them are moving to the cloud. With this transformation, data teams have to analyze ever larger and more complex data sets to allow downstream teams to make faster and more accurate decisions on a daily basis. Consequently, most organizations need to work with: customer data, product data, usage data, advertising data, and financial data.

Sponsored Post

Introduction to Automation Testing Strategies For Microservices

Microservices are distributed applications that are deployed in different environments, may be written in different programming languages, use different databases, and involve a large amount of internal and external communication. A microservice architecture depends on multiple interdependent applications for its end-to-end functionality. This complexity calls for a systematic testing strategy that ensures end-to-end (E2E) testing for any given use case. In this blog, we will discuss some of the most widely adopted automation testing strategies for microservices, using the testing triangle approach.

From checklist to playbook: Creating structure for your processes

Playbooks aim to be a super-powered checklist for repetitive tasks. Before you can get to the “super-powered checklist,” though, you need to identify the process that you’ll use to build your first playbook and create a structured process as a Playbook checklist. Let’s go on that journey today.

Enterprise Alert 9.4 Update introduces Remote Actions for hybrid scenarios and proxy support for MS Teams

We have released another update for Enterprise Alert 9 (version 9.4) which enhances the cloud bridge and MS Teams integrations. This will help you to set up scenarios where you wish to activate your Enterprise Alert remote actions from within the Signl4 app, and it allows you to use a proxy to configure the MS Teams integration. Read all details in this article.

Webinar: AIOps in healthcare

Healthcare around the world is constantly evolving. The amount of data being generated daily from every appointment and interaction, no matter how small or large, needs to be processed and analyzed in order to improve patient outcomes. The data must be accurate, stored, accessible and secure. Without a core infrastructure of smart IT, any outcomes are extremely challenging to generate, and data must be available in seconds for doctors to make life-saving decisions. The bottom line?

Point Solution Monitoring vs. Domain-Agnostic AIOps. Which is Right for You?

Just consider how much of your day relies on online digital technologies. Perhaps you hopped on an app to pre-order your morning coffee and then logged onto a platform to book a car to work. Or, perhaps you stayed home to work, using digital tools to connect with your colleagues and exchange information.

Sponsored Post

Network Performance Monitoring Is Only Step One

Incident response aims to identify, limit, and mitigate an incident. Whether such an occurrence is a security breach or a hardware failure, formulating and continuously strengthening an incident response strategy has become vital for all businesses in the digital age. Your incident response strategy consists of the processes your organization follows to handle incidents, such as network outages and service-impacting bugs, and the steps taken to mitigate them.

Key takeaways from MIM Expo 2022 for incident management professionals

The MIM Expo (Major Incident Management) always delivers, and this year’s recent gathering was no exception. At this annual event, we always get a unique opportunity to hear about what’s top of mind with major incidents and SRE professionals from all over the world.

7 ways teams are using incident.io's Decision Flows

One of my favourite features in incident.io is Decision Flows. With it, you can create a series of questions which eventually lead to a decision based on what you’ve answered. You can pull up this flow during an incident and it’ll guide you through the questions. It’s like having an experienced on-caller calmly guide you through what to do when a crisis hits. This is complementary to incident.io’s Workflows feature.
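To make the idea of a question-driven flow concrete, here is a minimal sketch of one as a nested decision tree. It is not incident.io's implementation or API; the `FLOW` structure, the questions, the decisions and the `decide` helper are all assumptions made for illustration.

```python
# Hypothetical flow: inner dicts are questions, plain strings are decisions.
FLOW = {
    "question": "Are customers affected?",
    "yes": {
        "question": "Is customer data at risk?",
        "yes": "Declare a major incident and page security",
        "no": "Page the on-call engineer",
    },
    "no": "Handle during working hours",
}


def decide(flow, answers):
    """Follow 'yes'/'no' answers through the flow until a decision is reached.

    `answers` maps each question text to "yes" or "no".
    """
    node = flow
    while isinstance(node, dict):
        node = node[answers[node["question"]]]
    return node


decision = decide(
    FLOW,
    {"Are customers affected?": "yes", "Is customer data at risk?": "no"},
)
print(decision)
```

Keeping the flow as plain data means it can be reviewed and changed without touching the code that walks it.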

FireHydrant is now more powerful across the entire incident lifecycle

FireHydrant has partnered with incredible companies to transform incident response inside their organizations, but our goal has always been to support the full incident lifecycle. That’s because we know that investing in good incident management can kickstart your reliability efforts when it includes both a streamlined incident response process that helps you recover faster and the ability to learn from incidents and then feed those insights back into your system.

3 Ways You Might Have an NOC Process Hangover

NOC, or network operations center, processes have been set in stone for decades. But it’s time for some of these processes to evolve. Digital transformation and the cloud era have led to the rise of DevOps, and with it, service ownership. Service ownership means that developers take responsibility for supporting the software they deliver at every stage of the life cycle. This brings development teams closer to their customers, the business, and the value they deliver.

4 Challenges Facing CXOs in A World of Digital Everything

As a busy executive, taking time to attend an event and listen to sessions is a luxury. And yet, I know that many of my best breakthrough ideas on how to lead my teams have come from taking those moments to tune into new ideas. The challenge is figuring out where the hidden nuggets of wisdom are buried in a mountain of content.

ITIL, ITSM and incident management. What are they and how do they fit together?

You’ve probably heard the terms ITIL and ITSM, but the distinction between the two can be a little unclear. Throw incident management into the mix, and the whole thing can feel pretty confusing. This article aims to explain what they are, the differences between the three, and importantly how they fit together. First, let’s establish what each of the terms actually means.

The modern incident management software stack

We’re fortunate enough to speak to a huge number of companies about their incident management processes. In doing so, we’ve noticed an emergent trend in how modern companies are using software to support their incident management processes, and a common set of challenges faced by them too.

SaC - How to build status pages as code with Terraform

Status pages are a clever solution to bundle all your services and see their status at a glance. We at iLert took this one step further: why not build your status page as code using Terraform? We want to show you how we make it possible, and how you can set it up for your own infrastructure - a real SaC solution.

What Metrics and KPIs Really Matter in Availability?

In our inaugural State of Availability Report, we discovered that not only do metrics matter but the way we use them also does. Our research found that teams with fewer KPIs were more likely to meet their Service Level Agreements (SLAs) and provide their customers with higher levels of availability. The problem with having too many KPIs is that they cause information overload and noise.

A Guide to Incident Severity Levels

Maintaining IT infrastructure is a consistent challenge for system administrators, site reliability engineers (SREs), supporting developers, and technicians. Several factors can impact system performance, cause outages, or impact customer experience. On top of that, not all incidents are created equal. The impacts and severity of a system outage affecting 10% of your users are different from an outage impacting 90%.
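As a rough illustration of why a 10% outage and a 90% outage are treated differently, the sketch below maps the share of affected users to a severity label. The `severity_for` function, the SEV labels and the thresholds are assumptions chosen for the example, not an industry standard; real schemes usually weigh other factors as well.

```python
def severity_for(affected_fraction):
    """Map the share of affected users (0.0 to 1.0) to a severity label.

    The thresholds below are illustrative assumptions only.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0.0 and 1.0")
    if affected_fraction >= 0.5:
        return "SEV1"  # most users impacted
    if affected_fraction >= 0.1:
        return "SEV2"  # a significant minority impacted
    if affected_fraction > 0.0:
        return "SEV3"  # limited impact
    return "none"


for share in (0.10, 0.90):
    print(f"{share:.0%} of users affected: {severity_for(share)}")
```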

PagerDuty Named a G2 Leader for Enterprise Incident Management Software

With the announcement of their Fall ’22 Review awards, PagerDuty has been named a G2 Leader for Incident Management Software for the sixth quarter in a row. We owe a special thank you to our customers who have consistently given PagerDuty high satisfaction scores that take into account their likelihood to recommend PagerDuty, our ability to meet their requirements, and the overall ease they’ve found in doing business with us.

Monthly Moo | October 2022

Summer has passed and it’s time for fall - cue transitioning leaves, cozy blankets, and all the pumpkin-themed things your heart could ever desire. As we move into the new season, we are excited to announce our fall product releases across Moogsoft Cloud that enable engineers to detect incidents earlier, resolve them faster, and work as a team across the entire lifecycle. Moogsoft’s Fall product updates enable you to: … and so much more! Read on for deeper details.

Event types and use cases for event correlation

As organizations grow and become more complex, so does the need to monitor and troubleshoot issues across the entire IT infrastructure. Event correlation is a powerful technique that can help make sense of the huge volume of alert data generated by monitoring systems and identify problems as they occur. In this blog, we’ll look at event types, use cases for event correlation and approaches that organizations can use to get the most out of this valuable tool.
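One simple form of event correlation is time-window grouping: alerts from the same service that arrive close together are folded into a single incident candidate. The sketch below assumes alerts are dictionaries with hypothetical `service` and `time` keys and uses an arbitrary five-minute window; production systems typically combine several such rules with topology and pattern matching.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts per service when each arrives within `window` of the previous one."""
    groups = []
    latest_group = {}  # service name -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["time"]):
        group = latest_group.get(alert["service"])
        if group and alert["time"] - group[-1]["time"] <= window:
            group.append(alert)  # close enough in time: same incident candidate
        else:
            group = [alert]      # otherwise start a new group
            groups.append(group)
            latest_group[alert["service"]] = group
    return groups


# Hypothetical alert stream.
alerts = [
    {"service": "db", "time": datetime(2022, 10, 5, 10, 0)},
    {"service": "db", "time": datetime(2022, 10, 5, 10, 3)},
    {"service": "web", "time": datetime(2022, 10, 5, 10, 4)},
    {"service": "db", "time": datetime(2022, 10, 5, 10, 20)},
]

groups = correlate(alerts)
print(f"{len(alerts)} alerts reduced to {len(groups)} groups")
```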

How we do realtime response with incident.io, Sentry & PagerDuty

Like most tech companies, we use an on-call rota and various alerting tools. We do this to respond to incidents before they’re reported. Proactively identifying issues and communicating to customers helps us provide great experiences and fosters trust. Internally, we’ve been using these alerting tools in tandem with our auto-create incidents feature. We’ve found that it’s made responding to the pager much smoother - it’s one less thing to do when you get paged at 2am.

iLert is now a verified integration with HCP Consul

More than 16 months ago we provided a solution to integrate HashiCorp Consul with our alerting and on-call management platform by using consul-alerts - a dedicated application that allows for communication between a deployed Consul instance and an existing iLert account. With more code infrastructure being moved to the cloud to ensure better security and availability, we too have ensured that our service integrates with the HashiCorp Cloud Platform (HCP).

Powering Resilience with Critical Event Management

Businesses and communities are experiencing a growing number of disruptions from threats like severe weather, civil unrest, theft and vandalism, pandemics, and cyber-attacks. These disruptions have left many organisations concerned about the safety of their people and operations.

The Blameless Complete Guide to Incident Management

Incidents are inevitable. As your service expands and becomes more complex, you are more likely to encounter outages, slowdowns, errors, and other disruptions to healthy operation. At the same time, as your service becomes more popular and relied on by users, the cost of incidents becomes higher. Studies have shown that the cost of downtime is high, and growing fast in the digital-first world. Since you can never fully prevent incidents, it's important to resolve them as efficiently as possible.

PagerTree 4.0 is finally here!

Today I am excited to announce we have officially shipped PagerTree 4.0! Here are the highlights: This effort has been a year and a half in development and I sincerely want to thank each and every one of our customers for the constructive feedback, ideas, and countless hours on Zoom calls. Without you this journey wouldn’t be possible. We are excited to get this major release shipped, just in time for the holidays. You can check out the full details of the upgrade below.

How Many SREs Does Your Company Need? Here's How to Decide

So you’ve decided to take advantage of Site Reliability Engineering by hiring SREs for your company. Now, you have a second decision to make: Exactly how many SREs to hire. Do you need just one or two SREs? Or should you build a sprawling SRE team, with a dozen or more SREs on hand to support your organization’s reliability needs? The answers to these questions will, of course, vary; every business’s needs are different.

Webinar: Making the case for AIOps

Over the past few years, artificial intelligence for IT Operations (AIOps) has risen in popularity within the technology landscape. It’s become a buzzword in the marketing world, and while there are many ways to define AIOps, the best way to start thinking about it is through the lens of outcomes, correlation and strategy—it’s all about the data.

Public Safety

For over 20 years, Everbridge has been a trusted partner to governments worldwide. From fires and floods to terrorist attacks, we’ve monitored potential hazards, prepared for and responded to incidents, and provided the right people with the right information. Be it a country-wide emergency or a neighborhood outage, communities rely on Everbridge to keep them informed and safe.

Why you should ditch your overly detailed incident response plan

When critical incidents happen — which they inevitably do 😅 — and you’re in the middle of trying to figure out what the best thing to do is, it can feel comforting to know that you’ve got a pre-prepared list of instructions to follow, commonly known as an “incident response plan”: In theory this sounds quite simple, and a typical flow you might envision is: It might be tempting to think that the hardest part of running incidents is finding or writing a checkl

Announcing Incident watchers: Subscribe to incidents and receive incident updates in real-time

Hey folks, We’re back with another feature update for all our customers! We have recently gone live with the incident watchers feature which nests within an incident details page. This blog will outline how you can access the feature, its primary functionalities and how we foresee it helping improve your incident management process. Note: This feature will be available to pro, premium and enterprise plan users only.

New reports stress the importance of strategic incident management practice

Engineers have been managing incidents for as long as they’ve been building software, but the idea of incident management as a strategic practice in its own right is still finding its place. We’re starting to see big shifts in that area, though — more companies are dedicating headcount, resources, and tools to help them better prepare for, respond to, and learn from their incidents.

How to Put Software Development Security First

What are the keys to building software development security into the early stages of product development? And what are the costs of ignoring security? In this article, xMatters Product Manager Kit Brown-Watts provides his insights on the matter. Every investment decision comes with trade-offs, usually in the form of cost, quality, or speed. The CQS Matrix, as I like to call it, captures the dilemma most product people face.

Beating the odds: How log data helps detect and lower MTTR

Depending on your business, MTTR stands for mean time to repair or mean time to recovery – but it can also mean resolution, resolve, or restore. No matter how you define it, the basic measurement is the same: it’s the time it takes from when something goes down to when it is back and fully functional. This includes everything from finding the problem to fixing it. For ITOps teams, keeping MTTR to an absolute minimum is crucial.
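The measurement described above reduces to simple arithmetic: the average time between an outage starting and service being fully restored. Here is a minimal sketch with made-up incident timestamps; the `incidents` data and the `mean_time_to_recovery` helper are hypothetical and only illustrate the calculation.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (went_down, fully_restored) timestamps.
incidents = [
    (datetime(2022, 10, 3, 9, 0), datetime(2022, 10, 3, 9, 45)),
    (datetime(2022, 10, 12, 22, 10), datetime(2022, 10, 12, 23, 40)),
    (datetime(2022, 10, 27, 14, 30), datetime(2022, 10, 27, 14, 45)),
]


def mean_time_to_recovery(records):
    """Average downtime per incident, from outage start to full restoration."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((restored - down for down, restored in records), timedelta())
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(mttr)
```

The three sample outages last 45, 90 and 15 minutes, so the result is a mean of 50 minutes.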

Differentiating Between SLO vs. SLA vs. SLI: What They Are and How to Improve Them

Recently, technology roles have become more generalized—cloud computing, for instance, requires a broader knowledge of technologies like storage and network. As technology has continued to evolve over the decades, many job positions have blurred into many roles or even morphed into new roles with new responsibilities.

Building great developer experience at a startup

At incident.io, our number one priority in engineering is pace. The faster we can build great product, the more feedback we can get and the more value we can deliver for our customers. But pace is a funny thing. If you optimise for pace over a single month, you’ll quickly find yourself slowed down by the weight of your past mistakes.

Kubernetes alternatives to Spring Java framework

Spring Cloud and Kubernetes complement each other in building a cloud-native platform and running microservices in Kubernetes containers. Kubernetes provides many features that are similar to those of Spring Cloud and Spring Config Server. The Spring framework has been around for many years. Even today, many organizations prefer to go with Spring libraries because they provide many features. It's a great deal when developers have total control over cloud configuration along with business logic source code.

The Monitoring Problem: Too Many Tools + Too Much Time = No Room for Innovation

Continuous availability and unceasing innovation are prerequisites for today’s digital businesses. So it makes sense that business leaders invest heavily in teams and tools to monitor digital apps and services. In theory, these tools should also free up time for engineers to push new functionalities that wow customers. But do these investments actually result in more uptime and customer-delighting innovations?

What Metrics Should Be Tracked Within Incident Management?

As digital services have become increasingly important to businesses and organizations, reducing downtime and service disruptions has become a critical objective for business operations. This means management reporting and KPIs are now crucial to quality management, providing the insight to let you improve incident remediation over time.

Introducing Squadcast Premium

For the last few years, Squadcast has been building out a market-leading on-call and alert management solution. Over the past few quarters, we have significantly enhanced our on-call product by releasing and improving features related to Incident Response - including Slack / MS Teams integration, Runbooks, Postmortems, Service Level Objectives, and Status Pages. We believe that a reliability platform involves both on-call and incident response - one cannot work effectively without the other.

There's a better way: how an incident management tool helps you conquer response challenges

As a solutions engineer for FireHydrant, I speak with a wide variety of companies about their incident management programs — from start-ups with a handful of employees to large enterprise companies with thousands of engineers. Whether they’re looking to establish their incident management program or mature it, the same questions remain.