March 2024

Enterprise Incident Management: Guide & Best Practices

In today's rapidly evolving technological landscape, incident management has become a critical discipline for enterprises to ensure uninterrupted operations and an optimal customer experience. Effective incident management involves a systematic approach to promptly detecting, responding to, and resolving incidents.

What are Blameless Retrospectives? How Do You Run Them?

In most engineering organizations, everyone agrees that in complex systems, failure is inevitable. It’s possible to prevent the recurrence of certain incidents, reduce their impact, or shorten the time to resolution. However, it’s impossible to avoid them altogether. In the past, we asserted failures are a result of people’s mistakes. It was all about “the bad apple theory,” focused on finding the “guilty party” and removing them to prevent future failures.

Incident Response Team | Roles & Responsibilities Defined

When your organization faces outages, errors, security breaches, and other incidents, you need to have a plan in place to take appropriate actions as needed. However, you also need a capable team of experts filling critical roles and responsibilities to execute those actions and effectively collaborate to resolve issues quickly. An incident response team, therefore, should be developed in a way that avoids gaps in skills and expertise.

Incident Management Automation - What You Should Know

Automated incident management is the process of automating incident response to ensure that critical events are detected and addressed in the most efficient and consistent manner. In incident management, time is of the essence and the primary benefit of automated incident management is speed. With automation, you can accomplish time-consuming tasks much quicker. This brings down the incident response time and allows the team to focus their attention on matters that require their expertise.

A better Grafana OnCall: Seamless workflows with the rest of Grafana Cloud

Incident response and management (IRM) doesn’t happen in a vacuum. Your ability to respond to issues in a timely manner depends greatly on how well your on-call engineers can use their IRM tooling and observability tools together to understand what changed and why.

PagerDuty Study Reveals Security Concerns Are Slowing Adoption of GenAI Among the World's Largest Companies

98% of top tech execs paused their corporate genAI initiatives to establish policies. Execs say that a trusted technology partner is key to incorporating genAI into their organizations.

Turn tickets into actionable alerts with ilert integration for HaloPSA and HaloITSM

At ilert, we are dedicated to providing an effortless, seamless connection between our incident management platform and other popular tools that empower teams to excel in operations. We're excited to introduce two new integrations from the Halo suite: HaloITSM and HaloPSA.

How to Keep Observability Alive in Microservice Landscapes through OpenTelemetry

The concept of observability has become a cornerstone for ensuring system reliability and efficiency in modern software engineering and operations. Observability, beyond its traditional scope of logging, monitoring, and tracing, can be intricately defined through the lens of incident response efficiency—specifically by examining the time it takes for teams to grasp the full context and background of a technical incident.

Giving Power Back To The Engineers: A Fireside Chat with MyFitnessPal

The real secret to mastering engineering operations is putting engineers in the driver's seat. On March 26th at 10 am, Chris Karper, Sr. Director of Engineering at MyFitnessPal, joins Chief Reliability Officer Lee Atchison to discuss how MyFitnessPal is overcoming incidents by giving power back to the engineers. They'll explore how Chris has navigated MyFitnessPal through its technological advancements, the growth of the team, and the maturity of its incident management program.

7 Key Takeaways from HIMSS 2024

The Healthcare Information and Management Systems Society (HIMSS) conference serves as a beacon for the healthcare industry, showcasing the latest innovations and trends that shape the future of healthcare. In 2024, HIMSS once again brought together industry leaders, innovators, and stakeholders to explore the transformative potential of technology in healthcare. In this blog, we will delve into the significant trends, challenges, and insights that have surfaced during our three days at HIMSS in Orlando.

Break silos: Three steps to full-context ops

Every day, operators receive mountains of alerts to sift through. Prioritizing alerts based on impact and severity can seem impossible. And constantly evolving IT environments increase complexity by orders of magnitude. Knowing which alerts to prioritize is extremely difficult, especially without the critical context to make those alerts actionable.

Finding the common ground with executives in incidents

I spotted this thread on Reddit, discussing the pains of executives dropping into incidents, and the corresponding impact it can have on the incident response process. Being an SRE community, it was a little more of a one-sided account of the situation. So let’s look a little closer, and dive into what it takes to make incidents better for responders and executives alike.

Creating an Efficient IT Incident Management Plan: A Guide to Templates and Best Practices

In today's digitally-driven landscape, businesses rely heavily on their IT infrastructure to maintain operations smoothly. However, with this reliance comes the inevitability of encountering disruptions such as server outages, security breaches, or software malfunctions. Left unchecked, these incidents can have detrimental effects on productivity and revenue. This is where a well-designed Incident Management plan becomes indispensable.

The Debrief: Meet our VP of Engineering-Norberto Lopes

Recently, we introduced our very first VP of Engineering, Norberto Lopes, to incident.io. As with all of our new joiners, we thought it would be helpful for folks to get acquainted with who exactly he is! So in this episode of The Debrief, we'll do exactly that. We sat down with Norberto to ask about his background, what he was doing before incident.io, what motivated him to join the company, and a whole lot more.

xMatters Support - Change Intelligence

Because digital services can experience thousands of changes per day, it’s critical to intelligently surface change information in a way that’s meaningful and actionable for resolvers. By presenting relevant changes within the context of an incident, resolvers can identify recently changed services, gain greater insight into potential root causes, and immediately take action to mitigate and resolve the issue. Let’s take a look at Change Management in xMatters.

SLOs and Customer Experience: Uniting Engineering Excellence with Customer Satisfaction

In the contemporary landscape of fast-paced IT and digital services, where every click, tap, or swipe represents a potential interaction with a customer, the importance of optimizing the customer experience cannot be overstated. Service Level Objectives (SLOs) stand at the intersection of engineering excellence and customer satisfaction, serving as the guiding principles that drive the delivery of exceptional digital experiences.

Replace Imprivata Cortext with OnPage

Healthcare organizations require a secure clinical communication and collaboration system that ensures care teams are well-equipped to effectively communicate, coordinate, and maximize collective knowledge to deliver high-quality patient care successfully. This system should prioritize patient privacy and data security while facilitating seamless information exchange among healthcare professionals across various departments and locations.

Use full context to unite observability and ops teams

IT teams are the invisible engines powering every modern organization. Yet they battle constantly to ensure the availability and reliability of applications and services across fragmented, hybrid-cloud infrastructures. In particular: Fragmented tools, siloed workflows, and inconsistent manual processes create an IT nightmare. Despite investing millions in observability and ITSM platforms, teams face alert fatigue, reactive incident response, and persistent outages.

Software Deployment: 5 Things that Can Go Wrong

Software deployment, a critical process in software development, refers to all the activities that make a software system available for use. It’s the stage where all the hard work of creating software culminates into something tangible that users can interact with. But before we delve into its complexities, let’s first understand the basics of software deployment.

Set up a maintenance window on ilert mobile app

ilert's maintenance windows feature allows users to schedule downtime for alert sources and services. This ensures that on-call responders won't receive alerts from alert sources during maintenance, and that status page subscribers will be informed about planned and ongoing service maintenance. In this video, you will learn how to use this feature on the ilert mobile app.

Brand new: Zenduty's On-call Schedules Revamp | Hear it from the team who built it

We're about to drop a major revamp to one of your most used Zenduty features. Get ready to experience scheduling like never before! Join our YouTube Premiere Live to see the new on-call schedules that'll make your on-call life smoother and better! P.S. Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle.

How AI will shape the future of risk management

By Eric Boger, VP Risk Intelligence. It has become increasingly evident that the complexities and challenges that defined the risk landscape of 2023 will almost certainly persist throughout 2024 and beyond. Enterprises will continue to grapple with a relentless and intricate risk landscape; rather than facing isolated threats, they are confronted with a complex web of interconnected challenges.

Build More Resilient Operations with PagerDuty Incident Management

Mitigating business risk is a key enterprise priority. To avoid unnecessary exposure to the business, technical teams need a proactive approach to managing incidents. While this is a well-known challenge, it’s also much easier said than done. Over the years, many organizations have cobbled together their own bespoke processes for managing different types of incidents.

Amplify Your Response Team's Impact: Introducing Squadcast's Additional Responders

At Squadcast, we're continually striving to empower our users with the tools they need to handle incidents swiftly and effectively. Today, we're thrilled to announce the launch of our latest feature: Additional Responders. This feature marks a significant step forward in enhancing collaboration and coordination during incident response.

10 steps to proactive IT infrastructure monitoring

You can elevate your IT infrastructure monitoring with AIOps. AIOps offers full-stack visibility, enhancing IT infrastructure monitoring efforts. This lets you transform the familiar monitoring landscape by turning the chaos of constant alerts into a proactive approach to problem-solving. IT infrastructure monitoring challenges typically relate to the complexity of backend systems, especially when it comes to cloud platforms.

FireHydrant is now AI-powered for faster, smarter incidents

Over the last five years we’ve seen our customers run 583,954 incidents more efficiently thanks to a shared workspace, powerful Runbook automations, and auto-captured data. Yet despite a great deal of progress, incident efficiency hasn’t achieved peak potential. We talk to a lot of folks that are still stuck in the muck: new responders struggle to get up to speed quickly, incident commanders wade through post-incident drudgery, and knowledge silos prevent comprehensive improvements.

Optimizing On-Call for Incident Management: Preventing Team Burnout with Rootly On-Call

Rootly On-Call streamlines incident management with automated scheduling, noise reduction, and centralized documentation. It mitigates on-call fatigue with features like flexible overrides, shift visibility, and shadow rotations, enhancing team well-being and preventing burnout.

MTTR Demystified: Mean Time to Recovery, Repair, or Respond?

You might have heard of MTTR or MTBF. Both are important metrics in incident management. Incident management refers to all the managerial processes behind bringing a site back up when it suddenly encounters an unplanned fault. And that is precisely why managing incidents is important. We must keep our site up and running so that downtime is reduced and customers can access any information with the least wait time.

Drag. Drop. Done | xMatters

Everbridge xMatters automates workflows to eliminate business-impacting digital events, leveraging analytics, automation, and AI to improve response time and resolution. We keep digital businesses running, reducing the frequency, duration, and associated cost of critical service disruptions. Build operational resilience and automate all the way to resolution with Everbridge xMatters.

Design Details: On-call

On your bedside table sits a piece of software designed to wake you up. It loves bothering you when something goes wrong, and making it your responsibility to sort it out. Meet the new incident.io On-call app. We designed it this way: to be as interruptive as possible. Whether you’re watching telly, at the gym, or as mentioned, fast asleep, it’ll get you. Got called even though you’re in silent mode? Great! We’ve done our job properly.

Strategies for Scaling Systems Reliably by Bob Lee

I was out there in sunny Austin this February, speaking at Civo Navigate 2024. The event was jam-packed with amazing talks, and it was great meeting so many people with long and fascinating careers in engineering and Site Reliability. I had the privilege of meeting Bob Lee, who currently leads DevOps at Twingate, a cloud-based service that provides secured remote access and is poised to replace VPNs.

ROI Demystified: A Deep Dive into What ROI Truly Means for Your Business

The term ROI (Return on Investment) often gets thrown around without a thorough understanding of its implications. Many see it merely as a financial metric, but in reality, ROI encompasses much more than monetary gains. In this comprehensive exploration, we delve into the true essence of ROI, its multifaceted nature, and how it impacts every aspect of your business strategy.

The Role of the SRE in the Incident Management Process

In modern businesses of all types, where IT systems play a major role, the Site Reliability Engineer (SRE) has become central to managing the effectiveness and reliability of the entire business. SREs are the bridge between the rapid deployment of software and systems and the stable operation of those systems in a production environment. They ensure that reliability and performance criteria are defined and met.

The Debrief: How to level up your incident management program with Jeff Forde of Collectors

Today, incident management is a core part of organizations, both big and small. But what if you don't have an established incident management program, where do you start? Or what if you already have a program, but you're looking to optimize it a bit? Where do you start in that case? Consider another situation: What if you're an established organization with years of incident management experience—what are some things that you can do to take things to the next level?

The engineering on-call experience: misconceptions, lessons learned, and how to prepare

The on-call experience is sometimes a dreaded one for software engineers. Those late-night alerts and frantic Slack messages, after all, don’t exactly sound pleasant. But what’s an on-call shift really like? Is that perception of constant fire-fighting and 3 AM wake-up calls actually realistic? Michael Mandrus and Owen Smallwood, both senior software engineers here at Grafana Labs, wanted to set the record straight.

From Deploy to Commit: Building the Ultimate Development Pipeline - A Comprehensive Guide

‘Manual deployment is (should be) a sin.’ Well, calling manual deployment a sin may sound strong, but consider this: building the ultimate development pipeline demands a focus on automation. Although the selection of a deployment method depends on the specific needs and requirements of a project or environment, can you really deny the power of automated deployment? There's a better way.

How AIOps improves IT service assurance and optimization

ITOps and DevOps teams face many challenges. Their responsibilities are extensive, from navigating complex IT environments at scale to quickly addressing performance issues and minimizing downtime and outages. Enhancing your organization’s IT service assurance requires you to ensure the reliability, performance, and availability of IT services.

How to deal with alert fatigue head-on

Everyone experiences stress at work—thankfully, it’s a topic folks aren’t shying away from anymore. But for on-call engineers, alert fatigue is a phenomenon closer to home. Unfortunately, like stress, it can be just as insidious and drastically impact those it affects. First discussed in the context of hospital settings, this phrase later entered engineering circles.

How Squadcast's Snooze Incidents Promotes Focussed On Call Shifts

Dealing with a flood of incidents, each with varying degrees of urgency, can be a daily struggle for Incident Response teams. Suppose a low-priority alert pings while you're tackling a critical incident. This pulls your focus away from the urgent issue. How do engineers ensure that high-severity issues take precedence? Don't they want to avoid being bothered or bombarded with notifications while addressing critical matters? They sure do.

The Debrief: How to level up your incident management program with Jeff Forde of Collectors

Today, incident management is a core part of organizations both big and small. But what if you don't have a program in place...where do you start? Or what if incident management is already a key part of your org, but you're looking to optimize it—where do you kick things off in that case? Consider another situation: What if you're an established organization with years of incident management experience—what are some things that you can do to take things to the next level?

Improving your on-call schedule with runbooks

Incidents are a stressful time for your team: your service isn't working the way you expect and your customers/stakeholders want to know what's going on. The last thing you want to do is let your team improvise everything when it comes to responding to incidents. Google's own SRE book has great overall tips for incident management, part of which involves "develop(ing) and document(ing) your incident management procedures in advance", which this article dives into.

Advice for building an incident management program

On this week's episode of The Debrief, we chatted with Jeff Forde, an Architect on the Platform Engineering team at Collectors. With a background spanning finance, healthcare, and various product-led startups, Forde has honed his expertise in DevOps, site reliability, and platform engineering. Beyond his professional life, he's also a dedicated volunteer first responder and certified fire instructor in Connecticut, offering him a unique perspective on managing incidents of all types.

Navigating IT Incidents - The Role Of The Status Page

At any moment, a small failure at any point in your complex web of IT systems can trigger an outage. As such, proactively establishing a method of clear and timely end user communication is the crux of effective incident response. For large organizations, these moments of downtime not only carry a massive opportunity cost, but also test the resilience of their operations.

How IT monitoring software and AIOps drive efficiency

Embracing digital transformation means increasing your reliance on a variety of IT systems, applications, and networks. Organizations are adopting advanced solutions like IT monitoring software and Artificial Intelligence for IT Operations (AIOps) to manage this complexity. These tools provide real-time insights into IT ecosystem health and performance, using AI and machine learning to support proactive decision-making and automation.

IT Incidents and the Role of Incident Response Teams (IRTs)

The digital world comes with advantages and inherent risks. IT incidents, which can encompass cyberattacks, system outages, and data breaches, can have a devastating impact. Beyond financial losses, IT incidents disrupt business operations, damage reputations, and erode customer trust. During an outage, having a well-prepared Incident Response Team (IRT) is essential to reduce downtime and improve response times.

Next-Gen Incident Management: Blueprints for High-Powered Incident Response

Join us for an exclusive webinar designed for IT Operations leaders, SREs, DevOps & software engineering leaders, featuring Jim Gochee, CEO of Blameless, Ken Gavranovic, COO of Blameless, and Nick Mason, Principal Sales Engineer at Blameless. Uncover the technical scaffolding essential to propel your incident management strategy forward, faster. Dive deep into the core technical components vital for a robust incident response framework, and discover firsthand how Generative AI can dramatically save hours for your team during critical incidents.

Get started with BigPanda Open Integration Manager

In today’s fast-paced digital landscape, effectively managing alerts and deriving actionable insights from data is crucial for organizational success. BigPanda’s platform stands out as a comprehensive solution designed to tackle these challenges head-on, offering a suite of features that streamline alert management and drive operational efficiency.

Recent Outage of Meta and Google Ads: How to Prevent Potential Losses

On Tuesday, March 5th, Facebook, Instagram and Google Ads experienced widespread outages that lasted for nearly two hours, affecting thousands of users worldwide. More than 550,000 reports poured in from Facebook users, and Instagram received 92,000 similar complaints, as reported by Reuters. As Meta stated on their newest platform, Threads: “Earlier today, a technical issue caused people to have difficulty accessing some of our services.

3 questions to ask of any DevOps tool in 2024

Is your DevOps tool stack out of control? I feel like every day, I talk to someone who feels this pain. The technological golden age of the past few years created a lot of niche tools, but now that CFOs and boards alike are demanding budget restraint, many of these tools are being scrutinized. The reality of the situation is that it’s not good enough for a tool to do one thing anymore.

5 Easy Ways to Reduce Work-Related Stress for SRE Professionals

It's completely normal to feel a little overwhelmed and stressed out at work these days. Technology has collaboration moving at the speed of light, and time away from screens is at an all-time low, blurring the lines between work and personal time. Plus, it's hard to ignore the multitude of tech outages that have been making headlines lately, leaving teams anxiously on edge. When you are a professional with on-call cycles, the potential of outages adds another level of complexity to the mix.

The Debrief: Introducing incident.io On-call

This is on-call as it should be. The secret's out. The world can finally know. incident.io On-call is here. Naturally, a lot of you may be wondering: why, and why now? So to help answer those questions, we sat down with Chris and Pete, two of our co-founders here at incident.io, to get a bit of background on this project. This episode will not only get you excited about this huge week, it'll get you pumped for what's ahead for on-call.

The Usual Suspects of IT Incidents

🔍 Unlock the secrets behind IT incidents with our latest video, "The Usual Suspects of IT Incidents and Why Status Pages Help"! 🚀 In the fast-paced world of technology, encountering IT incidents is inevitable. Join us on this insightful journey as we delve into the common culprits behind these disruptions and explore why having a robust status page is the key to maintaining transparency and efficiency.

The Unplanned Show, Episode 28: Cloud-native Security with Andrés Vega

What do new requirements to document and disclose security compliance mean for organizations? In this episode, we'll sit down with Andrés Vega, Technical Leader for the Security Technical Advisory Group at the Cloud Native Computing Foundation to hear about what's changed... and what's always been a good idea.

March 2024 Update - Design update, Stand-ins via mobile App, Configurable shift reminders and reports as well as customizable data retention

With our SIGNL4 March Update, we are speeding up and have once again completed some innovations for you. This time, we have further developed our design and color scheme slightly and made changes for better readability. In our mobile app, you can now also quickly and easily set up a stand-in, should a person unexpectedly be absent from duty. Furthermore, in certain SIGNL4 plans, the data retention period can now be flexibly adjusted to the respective company requirements.

We've launched incident.io On-call

It’s 3am. You wake up to a blaring alarm, the sound burned into your soul from countless sleepless nights. You reach for your phone, ‘press 4 to acknowledge’ and bleary eyed, you open your laptop, grab a coffee and get to work. The next hour is a whirlwind—bringing services back online, keeping colleagues in the loop, maintaining a list of action items, updating a status page that will be seen by millions of customers. Potentially for the fifth time this month.

Solve financial services ITOps challenges with AIOps

The financial services industry is experiencing a profound shift. Customers now demand a flawless experience across all touchpoints, including online platforms, mobile devices, ATMs, and physical branches. Any lapse in performance or reliability in these channels can lead to dissatisfaction. Moreover, the competition is intensifying as technology-focused companies, more nimble and innovative than traditional counterparts, are continuously disrupting the market.

DORA vs. DORA!

There was recently some confusion in the office that I thought was worth researching and addressing. Depending on who you are talking to, you may hear the acronym DORA in one of two contexts. (OK, three if you’re talking to a preschooler!) It might be in relation to DORA metrics–that is, a set of metrics associated with DevOps Research and Assessment.

Trade-off Between Reliability and Feature Velocity

The pressure to constantly innovate and release new features can often clash with the need for a stable and reliable product. While there might be some temporary cutbacks in testing time to achieve high feature velocity, ensuring reliability doesn't have to be an afterthought. We reached out to industry experts to gather their insights on ensuring reliability during phases that demand high feature velocity. Here's what they had to say.

The Debrief: AI can help you never forget incident follow-up actions again

Noting follow-up actions is really important at the end of the incident response process. The problem is that it can be really easy to overlook certain actions or forget to do them entirely. With Suggested Follow-ups, this is now a thing of the past. In this episode, you'll hear from Rob, the project lead for our latest Suggested Follow-ups feature, to get a peek behind the curtain.

Deliver Better Customer Experiences with PagerDuty for Customer Service

Want to deliver better customer experiences and meet your SLAs? PagerDuty for Customer Service Operations helps organizations connect the right teams at the right time, address urgent tickets, efficiently scale their 24-7 customer support model, and enhance cross-functional collaboration.

The Unplanned Show, Ep. 29: Major Incident Management with Davis and Chris

Not all incidents are created equal. How do you handle major incidents so that they don't spiral into a chaotic mess, incinerating productivity across too many teams? How do you prevent major incidents and learn from the ones you've had? "Major Incident Management" has been a practice for a long time, but as companies depend even more on digital services and revenue channels, while trying to do more with the same or less, something has to change.