
August 2023

A better Grafana OnCall: web-based scheduling, mobile app, email support

Does anyone really enjoy being on-call? That looming dread over what could go wrong? The alarms in the middle of the night when everything does in fact go wrong? Of course not! But that doesn’t mean on-call shifts need to be a giant bundle of anxiety and exhaustion. This is something near and dear to our hearts at Grafana Labs, since the majority of our engineers participate in on-call shifts.

What's New: Enhanced PagerDuty Analytics for Faster Insights and Smarter Recommendations

Data has become the lifeblood of businesses, empowering organizations to make more informed decisions, drive innovation, and gain a competitive edge. McKinsey touts the benefits of adopting data-supported capabilities, referring to the various ways data is utilized to enable and enhance the functioning of an organization.

Democratize Automation with AI-Generated Runbooks

Operational efficiency is as critical within the IT and engineering teams as any other part of the business. Automating repetitive tasks and reducing escalations within and to these teams is of immense value. While automation saves time and boosts productivity, the complexity of developing automation can be a limiting factor and bottleneck. Generative AI is a paradigm shift here, in that it brings consumer-style simplicity to assisting in the development of enterprise-grade automation.

Incident Management Today: Benefits, 6-Step Process & Best Practices

Disruptive cybersecurity incidents become more and more commonplace each day. Even if nothing is directly hacked, these incidents can harm your systems and networks. Navigating cybersecurity incidents is a constant challenge — the best way to stay ahead of the game is with effective incident management.

Reimagining Retrospectives

The Blameless retrospective is one of the most often discussed and rarely executed components of the SRE practice. Getting real value from the retrospective process takes time, focus and the right approach. This webinar features Ken Gavranovic and Lee Atchison, author of Architecting For Scale, who discuss the blueprint for high-performing engineering teams to maximize the value of retrospectives.

10 Reasons AlertOps is the Preferred PagerDuty Competitor

PagerDuty recently made changes to their pricing plans by moving rules-based noise suppression features out of their Professional Plan into the Event Intelligence add-on module. AlertOps includes rules-based noise suppression features beginning in the Premium Plan. AlertOps plans offer more competitive noise suppression features vs PagerDuty plans.

PagerDuty Pricing: 10 Reasons AlertOps Values Your Money

Customers are not sure exactly which features they need to use when they sign up. After using PagerDuty, they discover they need to upgrade to another plan to get some features that they need. When compared to PagerDuty Pricing, AlertOps pricing plans are simpler and less confusing.

Why the Blameless Mission Matters Today

Blameless was founded over 5 years ago, in a world that looked very different than the world today. We were the first mover in the incident management space, setting the standards for what these tools should achieve. These days, concerns about reliability, incidents, and toil have hit the mainstream. Why have we seen the tech world enter an era where reliability is priority #1? Why do we believe that the Blameless mission matters more today than ever before?

Latest Developments in Site Reliability Engineering, 2023

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023, (July 2023) report. OnPage was inspired by this report to share its prediction about the future of site reliability engineering. In this blog, OnPage will review evolutionary tools that can improve site reliability engineering practices.

How to configure Grafana Incident with Microsoft Teams

Grafana Incident, the powerful incident response tool that is part of the Grafana IRM suite in Grafana Cloud, comes with a range of integrations out of the box, including Zoom and Google Meet spaces, GitHub and JIRA issues, and even a Google Doc template for post-incident review documents. One of the key features in Grafana Incident is the chatbot integration, which previously only supported Slack.

The Unplanned Show, Episode 11: Donnie Berkholz on ITIL, DevOps and Platforms

In this episode, Donnie breaks down where ITIL came from and where it’s starting to go, and why that’s useful for teams that are trying to adopt DevOps practices in ITIL-oriented organizations. Donnie gives some great examples of building empathy and bringing the ITIL teams along for automating changes and decentralizing Sev 2 incident management. He also lays out his core philosophies on Platform Engineering and how to justify the effort.

Three Teams That Can Use AIOps to Work Smarter, Not Harder

There isn’t a boardroom today that isn’t asking how AI and generative AI applications can help drive efficiency and accelerate their business. For organizations looking to capitalize on ML and automation to improve their efficiency during incidents, AIOps is a tangible, proven application that presents an exciting opportunity for ITOps teams. As we’ve seen across market landscape evaluations, there are a number of ways that solutions can be implemented.

A Practical Guide to Incident Communication

Even the best software fails sometimes. How quickly those failures get addressed, and how your teammates and customers feel about you after the fact, comes down to how well you communicate with them. Users, customer success managers, Ops team members, IT, security, engineering leadership, even the executive team. Each has a vested interest in resolving engineering incidents quickly. All need to be updated with the right information at the right time.

How to use Key-Based Deduplication in Squadcast | Deduplication Rules | Squadcast

Key Based Deduplication is an efficient way to avoid duplicate entries when processing incoming Events alongside existing Incidents. It generates a Deduplication Key using a user-defined template specific to events from an Alert Source. This key helps identify and group duplicates. This video explains how Key Based Deduplication works and how to set it up effectively.
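
The idea of a template-derived deduplication key can be illustrated with a minimal Python sketch. This is not Squadcast's implementation or template syntax: the template string, the event field names (`service`, `alert_name`, `host`), and the helper functions are all hypothetical, chosen only to show how events that render to the same key end up grouped together.

```python
from collections import defaultdict
from string import Template

# Hypothetical user-defined template for one alert source.
# Events that render to the same string are treated as duplicates.
DEDUP_TEMPLATE = Template("$service:$alert_name:$host")


def dedup_key(event: dict) -> str:
    """Render the deduplication key for a single event."""
    # Extra fields in the event (such as "value") are ignored by substitute().
    return DEDUP_TEMPLATE.substitute(event)


def group_events(events: list[dict]) -> dict[str, list[dict]]:
    """Group incoming events under the key they render to."""
    groups = defaultdict(list)
    for event in events:
        groups[dedup_key(event)].append(event)
    return dict(groups)


# Illustrative events only; the field names are assumptions.
events = [
    {"service": "checkout", "alert_name": "HighLatency", "host": "web-1", "value": 910},
    {"service": "checkout", "alert_name": "HighLatency", "host": "web-1", "value": 1240},
    {"service": "checkout", "alert_name": "HighLatency", "host": "web-2", "value": 870},
]

incidents = group_events(events)
for key, grouped in incidents.items():
    print(key, len(grouped))
```

In this sketch the two `web-1` events share a key and would attach to one incident, while the `web-2` event forms its own group. Choosing which fields go into the template is the design decision: fewer fields means more aggressive grouping.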

Helm Dry Run: Guide & Best Practices

Kubernetes, the de-facto standard for container orchestration, supports two deployment options: imperative and declarative. Because they are more conducive to automation, declarative deployments are typically considered better than imperative. A declarative paradigm involves: The issue with the declarative approach is that YAML manifest files are static.
Sponsored Post

Managing On-Call Rotations: Navigating Incident Management from Chaos to Calm

Navigating On-Call rotations can often feel like taming a storm of alerts and constant disruptions, leaving teams overwhelmed and stressed. Hence there is a need to streamline On-Call rotations and leverage the relevant software to restore order and peace. In this guide, you'll explore practical tips, best practices, and smart strategies to transform your Incident Management process. Let's get to a more efficient On-Call experience.

Demo Roundup: What's new in the PagerDuty Operations Cloud, August 2023

Customer-impacting issues are detected and reported by customers anywhere from 20% to 90%+ of the time! In this episode of our quarterly demo roundup, we'll see how to quickly take action on a customer-reported issue, with the help of #GenerativeAI and more great new capabilities in the PagerDuty Operations Cloud. Six of PagerDuty’s product managers give live demos.

Everbridge Business Operations

Everbridge Business Operations helps businesses prepare for, and respond to, critical events, protecting facilities and business operations. Built on Everbridge’s industry-leading critical event management (CEM) platform, businesses can detect potential risks that might impact business operations and orchestrate a response in seconds across teams and digital/physical systems.

Everbridge People Resilience

Everbridge People Resilience solutions help businesses prepare for, and respond to, critical events, keeping people healthy, safe, and productive wherever they work or travel around the globe. Built on Everbridge’s industry-leading Critical Event Management (CEM) platform, businesses can detect potential threats that might impact their people, and orchestrate a rapid response across teams and digital/physical systems.

The Unplanned Show, Episode 10: Mitra Goswami on Generative AI

In this episode, Mitra shares a bunch of valuable insights into how to successfully adopt generative AI, from selecting use cases that deliver value and having foundational data infrastructure in place to having design and privacy guidelines. Grab a paper and pen and take some notes!

Streamlining incident response: the power of integration in engineering tools

In the ever-evolving world of software development, incidents are bound to happen. Whether it's an unexpected server crash, a critical bug impacting user experience, or a security breach, handling incidents swiftly and effectively is crucial for maintaining a seamless user experience and preserving business reputation. That's where incident response tools come in — to help you automate, document, communicate, and mitigate.

More than downtime: the opportunity costs of poor incident management

In my last blog post, I wrote about the explicit costs of incidents — the ones you can easily track based on dollars lost. But the cost of incidents goes beyond the time spent resolving them. While we’re spending our time managing incidents (that includes mitigating and retrospectives), we’re incurring a large opportunity cost in terms of releasing the next big thing.

New features summer wrap-up: Evolving ChatOps, AI-assisted Incident Comms, and Time-based alert grouping

It is time to sum up the product updates that we introduced during summer 2023. As always, our focus has been on minimizing limitations in the incident response process and accelerating the workflow from acknowledgment to resolution. We invite you to contribute to the ilert roadmap by submitting your feature and improvement ideas here.

The Iceberg of Engineering Incident Costs

I've long been fascinated with the metaphor of an iceberg to describe a problem whose true magnitude is obscured beneath the surface. If you’re not familiar with this phenomenon, when water freezes it decreases in density. This allows the solid ice to float, partially, atop the water with only a small fraction of it exposed. In fact, icebergs hold nearly 90% of their mass hidden below the water.

Advancements in Real-Time Health System Technologies, 2023

The OnPage team is pleased to inform you that we’ve been acknowledged in the Gartner® Hype Cycle™ for Real-Time Health System Technologies, 2023 report, as a Sample Vendor in the Clinical Communication and Collaboration category. As per the Gartner report, “This Hype Cycle includes technologies pivotal to the real-time health system vision.

3 New Updates to the PagerDuty Scheduling Experience

With the acceleration of cloud and digital transformation initiatives, enterprises are under pressure to adopt more agile, DevOps practices to be responsive to the business. But the increased complexity of digital systems and reliance on digital business only makes incidents more expensive.

Incident Management: A Complete Introduction

In the dynamic landscape of IT operations, incidents are bound to occur. Incident management is a structured and proactive approach to address and resolve these unexpected events promptly and effectively. It forms a crucial component of IT service management (ITSM), ensuring smooth operations and minimizing the impact of incidents on an organization’s productivity and customer experience.

10 Observability Tools in 2023: Features, Market Share and Choose the Right One for You

Understanding what's happening within your systems is a necessity. Have you ever wondered how experts keep an eye on systems to make sure everything's running smoothly? That's where observability tools come in! Observability tools are like helpers that give you a peek inside your tech. In this blog, we will talk about observability tools and how they can be used in different situations so it's easier for you to choose the right one for your organization.

PagerDuty Recognized in 12 2023 Gartner Hype Cycle Reports

While most of the world knows us for on-call management, we’ve been hard at work expanding the PagerDuty Operations Cloud to other areas like AIOps, Process Automation and Customer Service Operations (CSOps). Underscoring our commitment to redefining digital operations management for our customers, our commitment to R&D and delivering the best products and platform has resulted in PagerDuty being recognized in 12 distinct 2023 Gartner Hype Cycle reports across nine unique categories.

More than downtime: the explicit costs of poor incident management

A cold fact of SaaS Life™ is that you can’t make money when your product or website doesn’t work — and those lost dollars add up fast. Downtime, SLA breach paybacks, compliance fines, and other explicit costs are the easiest to quantify and they’re what most people think of when they think about incidents.

Reduce MTTR with Grafana, Grafana k6, and Prometheus: Inside DHL's observability stack

Each year, more than 296 million packages are shipped around the world via DHL and their premium service, Time Definite International. And at DHL Express Switzerland, a local unit of the international logistics and shipping company, the IT team provides solutions for tracking customs clearance progress, analytics, mobile and optical character recognition (OCR) scanning, and warehouse management on every package that moves through Switzerland.

CloudOps: Transforming IT Operations in the Cloud

CloudOps, or Cloud Operations, is quickly becoming the standard for managing IT operations in the cloud computing ecosystem. By transforming traditional IT operations to harness the full potential of the cloud, businesses are experiencing greater automation, collaboration, agility, and resilience. This article is a deep dive into the concept of CloudOps, its core components, the advantages it offers, and the steps necessary to implement it effectively within an organization.

Welcome To xMatters - Ep4 - Initiating Incidents

Everyone makes mistakes. So, it is important that when they do, we can act quickly, resolve the problem, and understand what went wrong to reduce the chances of it happening again. When your business is suddenly impacted by an unforeseen event, it’s important that you can efficiently report the problem and call for help as soon as possible. With xMatters, you can initiate incidents quickly and target specific groups with the vital information they need.

But It's Not Our Fault! When Third-party Incidents Affect Your Service

Very few SaaS products exist completely independently. Between cloud service providers, payment processors, content delivery networks, and more, chances are you rely on external systems to keep your product working. When these systems fail, it can leave you feeling pretty helpless. In some cases you might have fallback options, but oftentimes all you can do is wait for recovery and clean up the fallout.

Azure Monitoring Agent: Key Features & Benefits

In today's rapidly evolving digital landscape, businesses increasingly rely on cloud computing and infrastructure to support their operations. As organizations migrate their workloads to the cloud, robust monitoring and management tools are paramount to ensure optimal performance, security, and efficiency. In response to this demand, Microsoft Azure has introduced the Azure Monitoring Agent (AMA), a powerful and versatile solution designed to enhance the monitoring capabilities of Azure resources.

How To Write Incident Postmortems

Writing a public postmortem regarding an outage is essential to maintaining transparency and accountability when things go wrong in a service or system. The purpose of writing a postmortem is to analyze and document an incident or event that has occurred, usually with a focus on identifying its root causes, understanding what went wrong, and outlining steps to prevent similar issues from happening in the future.

The Unplanned Show, Episode 8: Platform Engineering with Martin Van Son

In this episode, Martin Van Son provides a simplified definition of platforms in this context: a way for internal users to request anything from environments to deployments. The platform engineering comes in because someone needs to own stitching together and automating away all the complexity involved to complete that action. In the end, both the consumers and the creators save time. Furthermore, platform engineers have an opportunity to encode best practices and cost saving measures that are often forgotten when users are left to their own devices.

New OnPage + ConnectWise Incident Alerting Workflow

OnPage has combined the power of voicemail transcription with keyword-based triggers to identify and prioritize after-hours incidents. The new OnPage + ConnectWise workflow enhances incident alert management for IT and Managed IT clients by drastically decreasing incident response times. By streamlining after-hours on-call communication, OnPage's critical alerting platform has revolutionized the on-call IT industry.

Rootly Raises $12 Million from Renegade Partners, Google Gradient Ventures, & XYZ Ventures

We are excited to announce that we have raised a $12M round of financing led by Renegade Partners with participation from Google Gradient Ventures (Google’s AI-focused venture fund) and XYZ Ventures. This brings our total funding to date to $15.2M ($20M CAD) alongside our other existing investors Y Combinator and 8VC.

July 2023 newsletter: Changelog-The Deluxe Edition

🎵 Gotta give the people, give the people what they want! 🎵 You've been asking. And we've been listening. Over the past few weeks, we've been shipping frequently requested features to help you bring your incident management to the next level. It may be the dog days of summer, but let's ignore that, yeah? Just take a look at this recent changelog. Note that this is the biggest one we've ever published.

From On-call to Non-call: Resolving Incidents Before They Even Happen

Artificial intelligence has captured the attention of the world, with tools like ChatGPT and large language models (LLMs) driving the conversation. But you don’t need to wait for the future or new features powered by LLMs to start working smarter—the tech industry has been investing in intelligent, automated tools for years and they’re ready for production now. In this talk, you’ll learn how the engineering teams at Toyota Connected use tools like Datadog Watchdog, Anomaly Detection, and Workflows to make our lives easier and keep our platform stable.

Tools and Trends in Site Reliability Engineering according to Gartner's 2023 Hype Cycle

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023, report. This blog reviews the future of site reliability engineering based on Gartner’s Hype Cycle. Additionally, the OnPage team is pleased that Gartner mentioned OnPage as a sample vendor in the Automated Incident Response category.

Exploring distributed vs centralized incident command models

Recently in our Better Incidents Slack channel, there’s been some chatter around how people structure dedicated incident commanders at their company: distributed or centralized. The way I see it, there are two types of commanders: the temporary, distributed role — a hat that an on-call engineer or an engineering manager puts on during an incident. Then there’s the centralized, full-time role, where someone is the designated incident commander (or one of a few) for all incidents.

BigPanda's Resources for Navigating Change Through the AI Revolution

AI has revolutionized the way we engage online in 2023. From ChatGPT and AI Art Generators to healthcare, finance, and business, you can hardly read the news without reading the latest proclamation of how AI is poised to change every aspect of our lives. AI has brought fundamental changes to how we live and work, and we’re still scrambling to understand the impacts of these changes. Especially where their work is concerned, change can be difficult for people to embrace.

Getting Started with PagerDuty

In this video you will achieve a baseline understanding of what PagerDuty does and how to configure your PagerDuty account. To dive deeper into the PagerDuty platform, select relevant topics in our complimentary on-demand e-learning center at university.pagerduty.com. The PagerDuty Operations Cloud is essential infrastructure that detects and diagnoses disruptive events, mobilizes the right team members to respond, and automates workflows across your digital operations - so that your business moves forward, faster. Get started now!

What's missing from your incident management workflow

The first fifteen minutes of an incident set the tone for the rest of the resolution process. But what makes the difference between a rapid response and a stressful scramble—clear ownership—hasn't always been easy to ascertain. In this article, we’ll cover how Cortex, an internal developer portal, can be your team’s source of truth to accelerate the incident management process, and reduce MTTR.

Synced for Success: OnPage & Slack for Incident Response

As the post-pandemic world finds its footing again, a resilient spirit drives the revival, propelling businesses to embrace a new era of technological innovation. Notably, IT teams are swiftly adopting the digital transformation of their processes, particularly in incident response. From virtual collaboration tools and remote IT support to automated incident management, teams have found innovative ways to ensure seamless business continuity while delivering IT services with minimal downtime.

Scaling Up to Keep Costs Down: Automation for Web Application Incident Management

Any organization that’s keeping up with today’s sharp rise in business demands (or better yet, getting ahead of the game) is doing so by getting innovative and jumping at the chance to do things differently. They’re not relying on the old ways or trying to use their existing toolbox. Instead, organizations are looking to the newest technologies and means of adding efficiency to as many day-to-day functions as possible.

Evolution of Site Reliability - Incidentally Reliable with Manoj Sebastian

Catch Manoj Sebastian (ex-Flipkart, Amazon, Atlassian, Intuit, Yahoo) talk about The Evolution of SRE through 20 years, Incident Response and Post Incident Culture at Big Tech and the Future of Reliability with AI ramping up at full speed. The freshest podcast for Site Reliability Engineers, hosted by Vishwa and Shubham from Zenduty.

incident.io: A scalable incident management solution built for enterprises

For enterprise businesses, a lot is riding on the efficiency of their incident response. These organizations have large customer bases, complex products, and many incidents. They also have loads of incident responders across various roles, making it difficult to coordinate internally.

Unveiling Squadcast's Enhanced Status Pages

Meet Kevin and Mai (again): Navigating the Troublesome Waters of Platform Downtime. Kevin is a Site Reliability Engineer (SRE), constantly on the lookout for potential downtime that could impact their platform, kryptobro.com. Mai is his adept partner, ever-ready to troubleshoot. In their journey, the previous version of Squadcast Status Pages served as a helpful tool, but they soon found room for improvements.

Discover what's driving the recognition behind BigPanda's AIOps innovations

Every day, BigPanda is transforming the way our customers operate. Our advanced AIOps technology redefines incident management, prevents service disruptions, and elevates customer satisfaction – and I couldn’t be more thrilled to see industry experts take notice. I’m particularly proud to see BigPanda mentioned in nine of the highly esteemed 2023 Gartner Hype Cycle reports.

Demo Roundup: PagerDuty Operations Cloud for Kubernetes

In this demo, Corbin Mills shows how to use the PagerDuty Operations Cloud to streamline and automate how a node failure is resolved. You’ll see how he uses event orchestration (in PagerDuty AIOps) to enrich an alert with pod names, and automatically runs a job to check the Kube API status, so that a responder has instant context. AIOps is also grouping and suppressing alerts. Then you’ll see how the responder can run more health status checks without the need to SSH into the environment or interrupt a co-worker for access.

Kubernetes Incident Management Best Practices

Creating just any infrastructure on Kubernetes is not enough. There are so many basic configurations you could apply and create the infrastructure for your application for the time being and it might work just fine. The incident responses won’t always remain 100% reliable. You will run into newer potholes, and that’s okay.

Understanding Blameless Postmortems

Progress in organizations is often accompanied by unforeseen challenges and mishaps. Traditionally, these setbacks resulted in pointing fingers, hindering progress, and creating a negative work atmosphere. However, a "Blameless Postmortems" approach transforms how organizations respond to failure. In this blog, we will delve into the importance of cultivating a blameless postmortem culture when faced with setbacks.

Introducing Squadcast's Key Based Deduplication

We are excited to share another feature update with all our valued customers! We have recently gone live with our Key Based Deduplication feature, enabling you to define dedup keys using customizable templates for configured alert sources. With this feature, you can automatically group similar incidents and effectively deduplicate alerts.
Sponsored Post

Best Practices for SaaS and Network Incident Management

Computer and network systems have (obviously) become vital to business operations. Occasionally, there are SaaS or network incidents and these systems do not operate as needed. Enterprises want to minimize the potential damage and get their systems back online ASAP. Integrated incident management and a strong End User Experience Management (EUEM) platform that provides synthetic and real-user monitoring is a foundation for meeting that objective.

Why you need an internal status page

When we launched incident.io Status Pages a few months ago, we stressed the importance of communicating clearly with your customers about ongoing issues. To help with this, we spent a lot of time carefully designing a status page that’s easy to understand for everyone - whether they come from a technical background, work in a different area, or just want to get on with their day.