December 2023

Runbook vs Playbook: What's the difference?

What's the difference between a runbook and a playbook? We'll end this confusion once and for all today. If you find yourself worrying about forgetting the detailed process behind the incident your team just resolved, you're not alone. This is where documentation like runbooks and playbooks comes into play. Runbooks and playbooks serve as organizational guides, providing essential information and instructions for teams to navigate tasks and processes effectively. They not only help your team help themselves but also free up your time for your ever-growing to-do list.

2023 Rewind: Squadcast Year-End Review

Hold the confetti, everyone, because it's time to POP the champagne! 2023 was a year where Squadcast truly leveled up. We dropped some remarkable features that got our hearts racing (and alerts under control!), snagged some fantastic recognition for our impact, and even gave our website a stunning makeover. And we couldn't have done it without you! Buckle up for a rewind of it all. Let's get started.

Public Safety - Everbridge

For over 20 years, Everbridge has been a trusted partner to governments worldwide. From fires and floods to terrorist attacks, we’ve monitored potential hazards, prepared for and responded to incidents, and provided the right people with the right information. Be it a country-wide emergency or a neighborhood outage, communities rely on Everbridge to keep them informed and safe.

G2 Winter Report 2023: Squadcast Maintains Leadership in IT Alerting and Incident Management

2023 has been a year of significant growth for Squadcast, with an expanding presence in both Mid-Market and Enterprise segments across IT Alerting and Incident Management categories. And with the release of the G2 Winter Report '23, it's an opportune moment to share some of our key achievements.

On-Call Software Engineer Roles and Responsibilities

Most software engineers know that they are typically tasked with on-call shifts, but new software engineers entering the field may be asking themselves – What do I even do if I get scheduled for an on-call shift? This is a common question that often doesn’t get answered until that first on-call shift, and unfortunately that can be overwhelming for a young professional who is nervous about their first on-call shift, let alone their first incident.

A Little Resilience Goes A Long Way

Let’s call this the mother of all understatements. If you’re reading this blog, there’s a good chance that you: a.) Agree wholeheartedly with this sentiment and think it should go without saying, AND… b.) Are surrounded by folks who pay lip service to this idea while not taking it as seriously as they should.

Reflecting on a momentous 2023 at incident.io

2023 at incident.io was a year to remember. While it's easy to be cynical about proclaiming that every year was better than the last, a few things stand out that made 2023 truly a year for the books. TL;DR, a lot happened! Especially when you consider that a lot of things didn't make the list above. So as we turn the page to 2024, we wanted to take a moment to reflect on the transformative year that was 2023, not only for us but for our customers as well.

The Debrief: Incident management for data teams

If you're on a data team, have you ever considered using an incident management tool to respond to pipeline issues? If the answer is no, then you might want to check out this episode. Here, we chat with Jack, Data Analyst at incident.io, to better understand why data teams can, and should, look to incident management tools like incident.io to manage issues.

The Debrief: A year in review-2023 at incident.io

What a year 2023 was at incident.io! While it's hard to summarize 365 days, a few things stand out. So as we close the curtain on 2023, we sat down with the three co-founders of incident.io to do a bit of reflection on the wild ride that was this year. In this episode you'll hear them discuss challenges, big wins, moments of growth, what's next for us, and most importantly, what the three co-founders like most about one another.

How To View Previous Incidents To Gain Helpful Context During Incident Triage?

Picture this: you're knee-deep in resolving a P1/P0 incident, urgently seeking answers. What if you could tap into past incidents to get important incident insights and streamline your troubleshooting process? In this blog, we dig into the practical aspects of leveraging Squadcast's Past Incidents feature to help you enhance your Incident Management process.

Setting the foundations for on-call that's fair, balanced, and human-focused

Whenever you're providing a service to businesses or individuals that they rely on, it's important to make sure that it's up and running as much as possible without disruptions. But the reality is that, despite your best efforts, downtime does happen. Regardless of when incidents strike, whether it’s 2 PM in the middle of the working day or 2 AM, it's important to have people available to diagnose and resolve issues as soon as possible.

SRE Essentials: Building a Team and Culture

What differentiates tech companies that weather digital storms with unwavering resilience? In many cases, the answer lies in a deeply ingrained SRE culture, which fosters proactive approaches to system reliability. Site Reliability Engineering (SRE) culture extends beyond mere tech tools and automated scripts. It emphasizes proactive care, shared responsibility, and continuous improvement, leveraging incident management software as a vital component in fostering these core values of SRE.

Tracking developer build times to decide if the M3 MacBook is worth upgrading

All incident.io developers are given a MacBook which they use for their development work. That meant when Apple released the M3 MacBook Pros in October, people naturally started asking questions like “wow, how much more productive might I be if my laptop looked that good?” and “perhaps we’d be more secure if our machines were Space Black 🤔”. The response from Pete, our CTO, was “if you can prove it’s worthwhile, we’ll do it.”

BigPanda's latest Unified Console features unveiled

In the fast-paced realm of incident management and response, the need to stay ahead is more vital than ever. In recognition of this, BigPanda has significantly enhanced the Unified Console, introducing a suite of new features designed to revolutionize incident handling. Let’s explore these transformative updates and how they can redefine your approach to incident management.

Year in Review: Key Trends in Critical Event Management

As we approach the end of 2023, it’s vital to reflect on the transformative year in the field of critical event management. Throughout the year, we’ve witnessed escalating geopolitical tensions, a surge in security threats encompassing both physical and cyber domains, and growing concerns over the intensifying impacts of climate change-induced severe weather events.

What is a multi-cloud management platform?

As an IT leader, you’re acutely aware of the struggles of juggling multiple cloud environments, from integration headaches to holistic incident management to monitoring multiple clouds at once. Seeking a more efficient multi-cloud management solution is crucial to alleviate these pressures and streamline your cloud operations.

Episode 23: Zero-Downtime Updates with Todd Whitney

With limited error budgets and low user tolerance for maintenance windows, the ability to execute routine updates without a maintenance window is an increasingly important socio-technical capability. Hear from Todd Whitney, who recently spoke at HashiConf about how PagerDuty performs updates while upholding its promise to customers of taking zero maintenance windows.

How MSPs and MSSPs can reduce risk and liability for their clients

For 83% of companies, a cyber incident is just a matter of time (IBM). And when it does happen, it will cost the organization millions, coming in at a global average of $4.35 million per breach. Add to that stringent data protection laws and the growing frequency and reach of ransomware and other sophisticated attacks.

All I want for Christmas... from Slack

When declaring and responding to an incident with incident.io, most of your interactions with our product will go via Slack. You might configure your forms in our web dashboard, but the responder using them to declare an incident is most likely doing so from a Slack modal, and the incident announcement will be posted as a Slack message. This means a lot of our product design falls within the constraints of what we can build using Slack’s Block Kit.

Impressions from Gartner IOCS 2023

Gartner’s IT Infrastructure, Operations & Cloud Strategies Conference (IOCS) is an annual event that attracts ITOps, SRE, and DevOps leaders from around the world. As Gartner explains, IOCS “brings the world’s technology leaders together to hear top trends, find objective answers, and explore topic coverage in addition to best practices. Gain the insights and guidance to create an effective pathway to the future and network with your peers.”

Why monitoring your application is important

Effective monitoring and observability tools are critical for modern enterprises. Daily operations, digital transformation, moving to a cloud-native architecture, and an ever-evolving tech stack all require ITOps, DevOps, and SRE teams to monitor increasingly complex systems. So what happens if your applications suddenly cease to function? Every moment of downtime translates to lost income, decreased customer satisfaction, and harm to your company’s reputation.

APAC Retrospective: Learnings from a Year of Tech Turbulence

Throughout 2023, one thing has become abundantly clear: regardless of an organization’s size or industry, incidents are inevitable. Recently across the APAC region, we’ve seen numerous regulatory bodies clamp down on large companies who are failing to provide acceptable service, with some handing out quite severe penalties. For many, the cost of an incident is no longer just lost revenue and customer trust, but financial penalties and business restrictions.

The Debrief: A year in review-2023 at incident.io

What a year 2023 was at incident.io! While it's hard to summarize 365 days into just a few sentences, a handful of moments stood out from this transformative year. So as we close the curtain on a momentous 2023, we sat down with the three co-founders of incident.io (Chris, Stephen, and Pete) to do a bit of reflection on the wild ride that was this year.

Understanding ServiceNow Incident Management: A comprehensive guide

You’re focused on swiftly identifying, analyzing, and resolving disruptions in IT services. And you know all too well that correctly deploying and adopting incident management holds the key to delivering a more reliable and responsive IT environment for your applications and services. That’s why you’re using or are considering using ServiceNow’s incident management to ensure a structured and efficient approach to handling your IT service incidents.

Better Incidents Winter Bonfire: Inside On-Call

Engineers are bombarded with pages left and right. There's uncertainty about how to escalate. A constant blur exists between what's urgent and what can wait. This never-ending ping-pong game takes a toll. Burnout creeps in, and your engineering culture has taken a nose dive before you know it.

Automated incident response in ITOps: Here's everything you need to know

If you’re like most IT leaders, you realize that automating repetitive, low-level incident response actions is key to unlocking enhanced workforce productivity, improved IT services, minimized downtime, better user experiences, cost savings, and the freedom to focus on innovation. Yet you don’t know where to start – or maybe aren’t sure of the best approach.

BookMyShow's Cinematic Product Journey - Incidentally Reliable Podcast with Viraj Patel

Grab some popcorn and catch Viraj talk about his experiences and BookMyShow's journey from its inception in the early 2000s to the entertainment behemoth it is today, their stints innovating at the forefront of the mobile and e-commerce revolutions, and their harmony with reliability engineering in the colourful, ever-changing yet challenging world of movies and online ticketing. Exclusively on The Incidentally Reliable podcast — made by SREs for SREs, hosted by Zenduty.

LLM Monitoring and Observability

Large Language Models (LLMs) are advanced artificial intelligence models designed to comprehend and generate human-like language. With millions or even billions of parameters, these models, like GPT-3, excel in natural language processing, understanding context, and generating coherent and contextually relevant text across various applications.

The Everbridge Risk Intelligence Monitoring Center (RIMC) real-time alerting

The Everbridge Risk Intelligence Monitoring Center (RIMC) analyzes thousands of trustworthy, vetted, and hyper-local data sources – across over 100 risk categories – using machine-learning and AI technology, complemented by an experienced team of global risk analysts. The RIMC team’s real-time alerting streamlines your organization’s ability to monitor and analyze worldwide incidents and events, dramatically increasing your ability to respond to risks that threaten your people, organization, supply chain, and more.

Everbridge Signal - Open Source Threat Intelligence to Keep People Safe and Operations Running

There are billions of people online right now. Among that noise is information that could be vital to your organization’s safety and security. Everbridge Signal will help you find relevant information using Artificial Intelligence and Machine Learning. Detect incidents in real-time by gathering data from public sources including the dark web, deep web and social media. Whether your issues are cyber or physical, Signal can help.

Everbridge Flow Designer - Overview

Flow Designer is a stunningly simple, visual workflow builder that’s as easy as drag, drop, and done. Built-in steps make it easy to create virtually any workflow connecting your applications. Just drop in the steps you need to launch a critical event management process, post progress updates to a public page, and create spaces for personnel to collaborate.

Lessons in Incident Response I Learned While Waiting Tables

Before I stumbled into the tech industry (a story for another day), I spent several years in the customer service world as a server and front-of-house manager in restaurants. It was in these jobs that I first honed some critical skills that would later lead me on the path to incident response.

Getting started with IT operations automation

Tech companies face a daunting challenge: a staggering 90% of their IT teams are stuck doing mundane, repetitive tasks, leaving only 10% to focus on strategic innovation. Companies know that automation is the solution to these repetitive, low-level incident response actions; however, many need support to begin automating.

The ultimate guide to incident management KPIs and metrics

IT incident management aims to swiftly identify, address, and resolve IT disruptions to restore normal service operations. Tracking IT incident management key performance indicators (KPIs) is a vital step toward minimizing disruptions for customers and users. But there are several different KPI and metrics choices, and it’s not easy to identify the right ones that can drive meaningful improvements in incident management.

Adobe Experience Cloud Outage: The Impact of Relying on Third-party Services

On December 8, 2023, Adobe's extensive customer base was impacted by a series of outages in the Adobe Experience Cloud, starting from 8:00 AM EST and continuing until 1:45 AM EST on December 9. We haven't seen a third-party outage of this magnitude since the DoubleClick outage of 2018.

How BookMyShow Empowered SREs - Incidentally Reliable Podcast

Incidentally Reliable Episode 4 dropping this Thursday the 14th, chatting about BookMyShow's journey from inception to the entertainment behemoth it is today, their experience innovating at the forefront of the mobile and e-commerce revolutions, and their harmony with reliability in the colourful yet challenging world of movies. Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle.

What is Mean Time to Resolution - and why does it matter?

Mean Time to Resolution (MTTR) is a key performance indicator (KPI) that measures the average duration needed to restore normal operation for an application, service, or infrastructure component. Your MTTR directly impacts customer satisfaction, so you must have a keen understanding of how it influences the reliability and availability of your services and applications to make informed decisions, enable operational efficiency, and ensure a seamless customer experience.
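As a rough illustration of the definition above, MTTR can be computed as the mean of the time between an incident starting and normal operation being restored. The Python sketch below assumes a simple list of (started, resolved) timestamp pairs; the function name `mean_time_to_resolution` and the sample incidents are made up for illustration and do not come from any vendor's tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the average (resolved - started) duration as a timedelta.

    `incidents` is a list of (started_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - started for started, resolved in incidents),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(incidents)


# Hypothetical incident log: (started_at, resolved_at) pairs.
incidents = [
    (datetime(2023, 12, 1, 9, 0), datetime(2023, 12, 1, 9, 30)),    # 30 minutes
    (datetime(2023, 12, 5, 14, 0), datetime(2023, 12, 5, 15, 30)),  # 90 minutes
    (datetime(2023, 12, 9, 2, 0), datetime(2023, 12, 9, 3, 0)),     # 60 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Because the result is a plain mean, one unusually long incident can dominate it, so teams often look at the individual durations alongside the average.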

Incident vs Bug: Understanding the Key Differences

Incidents and bugs are two common occurrences that can disrupt the smooth operation of systems and applications. While these terms may seem similar, they represent distinct concepts with different implications. Understanding the nuances between incidents and bugs is crucial for effective incident management and proactive problem resolution.

What is Mean Time to Detect (MTTD) - and why does it matter for ITOps?

Have you ever wondered about your IT team’s efficiency in detecting incidents? Your Mean Time to Detect (MTTD) is an incident management Key Performance Indicator (KPI) that reveals your productivity during the first stage of incident resolution and enables investigation into opportunities for improvement. ITOps and DevOps teams that can lower their MTTD can more quickly identify issues, minimize potential downtime, and maintain system reliability too.

Understanding IT event analytics: From basics to AIOps

A wise person once said, “What’s measured is what matters.” This couldn’t be more true than in the high-stakes world of IT operations, where the ability to swiftly measure, analyze, and respond to events is crucial for improving IT operational performance. This blog delves into defining IT event analytics, guiding you on getting started, showcasing real-world examples, and introducing essential methods for transforming your incident response strategy.

Where Intention Meets Sweet Innovation.

Welcome to our latest video on system layering – where intention meets sweet innovation! Discover the delectable world of technology architecture as we unveil the secrets behind system layering, likened to the art of crafting a perfect cake. Just like each layer contributes to the overall masterpiece, each system layer plays a crucial role in creating a robust and efficient IT infrastructure.

Winter safety tips for employees in private and public sectors

Winter storms can significantly impact both private and public sectors, affecting their people, operations, and critical infrastructure. The NOAA stated that, in 2022 alone, the total cost of winter storms in the United States was 8.7 billion dollars.

Comparing Uptime Monitoring, Heartbeat Monitoring, and Synthetic Monitoring

In the quest for a high-velocity development environment, one fundamental question looms large: "How can you ensure an exceptional end-user experience when an array of engineers continually push and deploy code?" The unequivocal answer to this pivotal inquiry lies in the establishment of robust, straightforward, and well-defined monitoring practices.

Incident tracking: How it works and why it matters for IT operations

Constantly juggling IT incidents can be exhausting as you try to track and resolve them before they escalate into disruptions. With each incident demanding prompt and precise attention, keeping up takes significant work. However, you can manage these challenges more efficiently and with less stress and less risk by optimizing your incident-tracking process.

Fault Tolerance: What It Is & How To Build It

Fault incidents are inevitable. They occur in any large-scale enterprise IT environment. In fact, research indicates that more than half (50%) of the leaders in tech and business organizations consider the complexity of their data architecture a significant pain point. From an end-user perspective, businesses must overcome complex architecture in order to ensure service delivery and continuity.

Now in beta: alerting for modern DevOps teams

Although FireHydrant has spent five years focused on what happens after your team (erg, I mean service 🙄) gets paged, the topic of alerting often comes up in discussions with our community. People are tired of paying big bucks for software that’s expensive, bloated, and hasn’t seen much innovation. Clearly, there’s a problem here – and we’re tackling it head on.

Autocorrelate Alerts With Squadcast's Key-Based Deduplication

With the increasing complexity of technology stacks and monitoring tools, managing incidents can become overwhelming, leading to alert noise, alert fatigue, and delayed responses. This is where Key-Based Deduplication comes to the rescue, streamlining incident handling and enhancing the effectiveness of your Incident Management platform.
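To make the general idea of key-based deduplication concrete, here is a minimal Python sketch: alerts that share the same key are grouped into a single incident instead of each opening a new one. This is a generic illustration, not Squadcast's implementation; the field names `service` and `check` and the functions `dedup_key` and `group_alerts` are assumptions chosen for the example.

```python
def dedup_key(alert, fields=("service", "check")):
    """Build a deduplication key from selected alert fields (hypothetical names)."""
    return tuple(alert.get(field) for field in fields)


def group_alerts(alerts, fields=("service", "check")):
    """Group alerts that share a key, so each key maps to one incident."""
    incidents = {}
    for alert in alerts:
        incidents.setdefault(dedup_key(alert, fields), []).append(alert)
    return incidents


# Hypothetical alerts from a monitoring tool.
alerts = [
    {"service": "checkout", "check": "latency", "value": 900},
    {"service": "checkout", "check": "latency", "value": 1200},
    {"service": "search", "check": "errors", "value": 5},
]

incidents = group_alerts(alerts)
print(len(incidents))  # 2
```

Here three alerts collapse into two incidents because the two checkout latency alerts share a key. Which fields belong in the key is a design choice: too few and unrelated alerts merge, too many and duplicates slip through.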

How to create an on-call policy and rotation in OneUptime?

In this tutorial video, we walk you through the process of creating an on-call policy and rotation in OneUptime. We start by explaining what an on-call policy is and why it’s crucial for your organization. We then guide you step-by-step on how to set up a policy, including defining the policy name, setting the escalation rules, and adding users to the policy. Next, we delve into creating a rotation for the policy. We explain how to set the rotation length, start time, and participants. We also show you how to handle holidays and time-off requests within the rotation.
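The rotation settings mentioned above (rotation length, start time, and participants) are enough to work out who is on call at a given moment. The Python sketch below shows that arithmetic for a simple round-robin rotation. It is not OneUptime's API: the function `on_call_at` and the participant names are hypothetical, and holidays and time-off requests are deliberately left out.

```python
from datetime import datetime, timedelta


def on_call_at(when, participants, rotation_start, rotation_length):
    """Return the participant on call at `when` in a round-robin rotation."""
    if when < rotation_start:
        raise ValueError("rotation has not started yet")
    # Number of whole shifts completed since the rotation began.
    shifts_elapsed = (when - rotation_start) // rotation_length
    return participants[shifts_elapsed % len(participants)]


# Hypothetical weekly rotation starting Monday 4 December 2023 at 09:00.
participants = ["alice", "bob", "carol"]
rotation_start = datetime(2023, 12, 4, 9, 0)
rotation_length = timedelta(weeks=1)

second_week = on_call_at(
    datetime(2023, 12, 12, 10, 0), participants, rotation_start, rotation_length
)
print(second_week)  # bob
```

After three weekly shifts the rotation wraps around, so the fourth week goes back to the first participant.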

How to build workflows in OneUptime and integrate OneUptime with anything?

OneUptime is a complete open-source observability platform. It allows you to create workflows and integrate with over 5000 different services and products without writing any code. This integration capability allows OneUptime to connect with the rest of your software stack. Building workflows in OneUptime likely involves defining the sequence of operations that should occur based on certain triggers or conditions. These workflows can help automate processes, such as incident management, alerting the right people at the right time, and more.

When More Incident Commanders are Better

This article has been lightly revised and reposted with the author's permission from the original article on Medium. Leading major incident responses can be extremely stressful. You have to quickly gather an ad-hoc team, figure out what went wrong, identify a fix and make sure this doesn't make things worse, all the while with senior leadership breathing down your neck. Are we having fun yet? Many people think having a dedicated incident commander role will solve the problem.

Captain's Log: Diving into our scheduling design

On-call scheduling is tricky. Like, really tricky. It was one of the scariest parts when we decided to build a modern alerting system earlier this year. We knew we couldn't cut any corners on Day One of our release because it needed to be a fully loaded feature for someone to realistically use our product (and replace an incumbent). This meant including windowed restrictions, coverage requests, and simple to complex rotations.

On-Call Management Models

In today's fast-paced digital landscape, incident management is crucial for maintaining operational excellence. During this process, on-call management models play a critical role in promptly addressing and resolving incidents. On-call management involves the organization of teams to ensure prompt response and resolution of incidents and is necessary to streamline incident resolution, ensure 24/7 availability, and allow for fair and transparent on-call rotations.

The Unplanned Show, Ep. 22: CSOps at PagerDuty with Arturo Suarez Martin

Even with the best monitoring in the world, some customer-impacting issues still go undetected and are ultimately reported by customers. In this episode, we'll hear from PagerDuty's Senior Director of Global Support, Arturo Suarez Martin, about the journey that PagerDuty has been on to tighten feedback loops between Customer Support and Engineering and mitigate the risk of poor customer experiences.

Ping Command: A Comprehensive Guide to Network Connectivity Tests

The ping network test, a core utility since the 80s, plays a crucial role in confirming connectivity between IP-networked devices. In this guide, we'll delve into what the ping command is, how to run a ping network test, common IP addresses to ping, interpreting results, and troubleshooting errors.

Events vs. Alerts vs. Incidents

Event. Alert. Incident. These terms are bandied about, often interchangeably, in IT operations management. Broadly speaking, they all refer to situations where something is potentially amiss and needs to be investigated and resolved. Each of these three words does, however, have a distinct definition. Because they are used in scenarios where clear communication and timeliness are critical, it’s important to understand the differences and use them appropriately.

Reducing the burden of incident response on your teams

In this webinar, a panel of engineering leaders, including Chris Evans, CPO at incident.io, share how they reduce the burden of incident response for their teams. They advocate for a culture of shared responsibility across the board, offering practical strategies to educate the business about engineering practices during the chaos of an outage.

Learn the Incident Response Life Cycle - Best Practices and Strategies

No company plans for a security breach, major outage, or other cyber incident, but they happen. When an incident occurs, having a standardized, regulated method of managing the fallout is critical. This is where the incident response life cycle comes in.