September 2023

Observability Pillars: Exploring Logs, Metrics and Traces

The ability to measure the internal states of a system by examining its outputs is called Observability. A system becomes 'observable' when it is possible to estimate the current state using only information from outputs, namely sensor data. You can use the data from Observability to identify and troubleshoot problems, optimize performance, and improve security. In the next few sections, we'll take a closer look at the three pillars of Observability: Metrics, Logs, and Traces.

Alternatives to SMS alerts

While SMS alerts are handy, they also tend to be tricky. Across 120+ countries, we continuously deal with compliance requirements and regulations from vendors, governments, and phone carriers. Other alert channels similar to SMS are far less cumbersome and have higher delivery rates. Let’s take a look at the options available for switching away from SMS.

Unveiling Past Incidents: Accelerating Incident Resolution with Historical Context

Having the context of how similar issues were handled in the past can be invaluable. It can help incident responders grasp the nature of recurring problems, their causes, and the solutions that have worked before. Squadcast’s Past Incidents feature assists incident responders by presenting them with a list of similar past incidents related to the same service they are currently investigating.

Introducing Grafana OnCall shift swaps: A simpler way to exchange on-call shifts with teammates

A family member’s birthday, that concert you’ve waited all year to see, an impromptu weekend getaway with friends — there are a lot of reasons software engineers might want to switch on-call shifts. And rather than have to frantically send Slack messages to your teammates, wouldn’t it be nice to automate the process and quickly find the coverage you need?

Product Spotlight: Enhancing Incident Resolution with Blameless' Microsoft Teams Integration

In today's fast-paced digital landscape, swiftly responding to incidents is paramount for engineering teams. Downtime is not just costly; it can tarnish your organization's reputation. The pressure felt by engineering operations, DevOps, and SRE leaders to architect and run an effective incident response process is immense. Fortunately, over the last several years, effective engineering organizations have developed a standard toolkit for running a good incident response process.

The importance of testing emergency warning systems

On Oct. 4, 2023, the Federal Emergency Management Agency (FEMA) plans a nationwide mobile alert test which will send an emergency SMS to all cellphones in the United States. In coordination with the Federal Communications Commission (FCC), the national test will be administered at approximately 2:20 p.m. ET on Wednesday, Oct. 4. It will consist of two portions that will test Wireless Emergency Alerts (WEA) and Emergency Alert System (EAS) capabilities.

Better learning from incidents: A guide to incident post-mortem documents

If you’re just starting out in the world of incident response, then you’ve probably come across the phrase “post-mortem” at least once or twice. And if you’re a seasoned incident responder, the phrase probably evokes mixed feelings. Just to clarify, here, we’re talking about post-mortem documents, not meetings. It’s a distinction we have to make since lots of teams use the phrase to refer to the meeting they have after an incident.

Sponsored Post

Status Pages 101: Everything You Need to Know About Status Pages

Status Pages are critical for effective Incident Management. Just as an ill-structured On-Call Schedule can wreak havoc, ineffective Status Pages can leave customers and stakeholders adrift, underscoring the need for a meticulous approach. Two examples are Matsuri Japon, a non-profit organization, and Sport1, a premier live-stream sports content platform; both integrate Squadcast Status Pages to discreetly enhance their incident response strategies. You can read about them later. Crafting these Status Pages demands precision, offering dynamic updates and collaboration.

Why automated Root Cause Analysis matters for driving down MTTR

Finding the root causes of IT anomalies can be challenging, but the rewards are worth it. By identifying the root cause or causes of an incident or critical failure, response teams can resolve incidents faster and determine the best steps to avoid having them recur. This can drive down both the frequency of service interruptions and their duration.

Clouds, caches and connection conundrums

We recently moved our infrastructure fully into Google Cloud. Most things went very smoothly, but there was one issue we came across last week that just wouldn’t stop cropping up. What follows is a tale of rabbit holes, red herrings, table flips and (eventually) a very satisfying smoking gun. Grab a cuppa, and strap in. Our journey starts, fittingly, with an incident getting declared... 💥🚨

Accelerate change alert discovery and incident resolution with Root Cause Changes

Today, the majority of organizations operate under a hybrid cloud structure. Due to this, operations are consistently met with daily infrastructure and software changes and updates, which are also the primary cause of incidents and outages. Long gone are the days when a tech stack could be represented by a single dependency model. Microservices, CI/CD, and containers across multi-cloud make it extremely difficult to track all the changes and connect them to incidents.

The Ultimate Guide to DORA Metrics for DevOps

In the world of software delivery, organizations are under constant pressure to improve their performance and deliver high-quality software to their customers. One effective way to measure and optimize software delivery performance is to use the DORA (DevOps Research and Assessment) metrics. DORA metrics, developed by a renowned research team at DORA, provide valuable insights into the effectiveness of an organization's software delivery processes.

How we've made Status Pages better over the last three months

A few months ago we announced Status Pages – the most delightful way to keep customers up-to-date about ongoing incidents. We built them because we realized that there was a disconnect between what customers needed to know about incidents, and how easily accessible this information was. As we built them, we focused on designing a solution that powered crystal-clear communication, without the overhead, all beautifully integrated into incident.io.

Extend Incident Alert Management to ServiceNow ITSM (Two-way integration)

Discover how OnPage's incident alert management solution can be seamlessly extended to ServiceNow's ITSM solution to provide a more efficient and streamlined service delivery experience. The two-way integration ensures that high-priority alerts are given top priority and reach the right team member in a timely manner. And, that's not all -- IT teams gain synchronization across audit trails, alert statuses, and notes, eliminating the need for app hopping and providing all the necessary information in one location.

Enhance emergency alerts with Device-Based Geo-Fencing

In today’s fast-paced and interconnected world, the importance of efficient and effective public warning systems cannot be overstated. As we face a multitude of natural disasters, civil unrest, and health crises, the ability to swiftly communicate impending threats to the right individuals at the right time has become a matter of paramount importance.

Top 5 Resiliency Trends of 2023

In today’s world, resilience is no longer a conditioned desire or methodology to try but has become a necessity for sustained success in software development and IT operations. As DevOps and Agile teams keep moving forward to cross boundaries, come up with new methodologies, and drive innovation, it is now important to have the ability to quickly recover from failures, adapt to changing conditions, and maintain high performance under pressure.

Twelve Key Learnings from PagerDuty People Team's Generative AI HackWeek

Sometimes innovation requires ideas unconstrained by traditional structures and removed from day-to-day responsibilities. It was in this spirit that PagerDuty’s People HackWeek–a friendly competition to explore how generative AI might impact the future of HR–was born.

The balancing act of reliability and availability

As consumers, we expect the products and software we buy to work 100% of the time. Unfortunately, that’s impossible. Even the most reliable products and services experience some disruption in service. Crashes, bugs, timeouts. There are a ton of contributing factors, so it's impossible to distill disruptions down to a single cause. That said, technology is becoming more and more sophisticated, and so is the infrastructure that supports it.

The Unplanned Show, Episode 13: Jake Cohen and Generative AI for Automation

On the heels of the public beta opening for AI-generated runbooks in Runbook Automation, we asked Jake Cohen from product management about how this is different from generating code with something like chatGPT or various AI-powered code completion tools available. We get into prompt engineering, managing output quality, and privacy and security concerns.

A better Grafana OnCall: Delivering on features for users at scale

Enterprise IT is just a different animal. Whether it’s operating at scale, undertaking massive migrations, working across scores of teams, or addressing tight security requirements, engineers at these organizations can face different obstacles than their counterparts at smaller organizations and startups.

Transformation in Travel: Our Q&A with TUI's Head of Technology

The travel industry is experiencing an unprecedented surge in demand from people seeking adventure and eager to explore new destinations. Given an abundance of choice and the desire to have a personalized experience, customers are turning to tour operators to remove complexity from planning so they can focus on the holiday and not on the process of planning it.

TUI Powers Outstanding Digital Experience for Customers with the PagerDuty Operations Cloud

PagerDuty Operations Cloud is essential infrastructure for TUI, enabling agility and cost efficiency to deliver outstanding digital experiences for customers. With PagerDuty’s AI and automation capabilities, TUI has streamlined incident management—reducing downtime and boosting customer bookings. Hear more in this video from Yasin Quareshy, Head of Technology at TUI.

Implementing Zero Trust: A Practical Guide

According to the Harvard Business Review, 2022 saw more than 83% of businesses experiencing multiple data breaches. Ransomware attacks, in particular, were up 13%. With cyber security being such a hot topic for business owners, it’s no surprise implementing a zero trust policy has become so important. In this guide, we’ll cover how to implement zero trust and why it’s important for your business to do so. Let’s get started.

Mastering Incident Resolution: Process and Best Practices

For DevOps and IT teams, incident resolution is an important aspect of predicting, resolving, and documenting service disruptions. It refers to the part of the incident management process where responders restore the service to functioning. Modern technology has come a long way, but it’s not without flaws. When businesses suffer from cyber-attacks, system crashes, and network outages, it impacts the organization on many levels.

The connection between incident management and problem management

Sometimes, two concepts overlap so much that it’s hard to view them in isolation. Today, incident management and problem management fit this description to a tee. This wasn’t always the case. For a long time, these two ITIL concepts were seen as distinct—with specialized roles overseeing each. Incident management existed in one corner and problem management in the other. Then came the DevOps movement and the lines suddenly became blurred. So where do they stand today?

What Is GitOps and Will It Eliminate Incident Management?

Incident management is a critical aspect of IT service management (ITSM) that revolves around restoring normal service operations as swiftly as possible after an unplanned interruption or reduction in quality. Also referred to as “incidents,” these interruptions could range from a minor issue like a single user being unable to access a service to a significant problem such as a server crash or network outage affecting many users.

Inside Prezi's cost-saving switch to Grafana Alerting, Grafana OnCall, and Grafana Incident from PagerDuty

Alexander is Senior SRE at Prezi, a video and visual communications software company. As a team, the Prezi SREs provide multiple services within the company. One of those is the observability stack where Prezi heavily relies on Grafana. Companies are always evolving to run more smoothly, serve their customers better, and operate in a way that is cost-effective.

Streamlining Incident Management with our latest feature update: Merge Incidents

Hey folks! We’re back with another nifty feature for your Incident Management tool arsenal. You now have the ability to merge incidents with a few clicks! With this latest update, you can reduce the noise while dealing with a complex incident by merging incidents across services under a parent incident. Typically, this is useful when multiple incidents stem from the same underlying issue or root cause.

Journey from Junior to Senior SRE: Key Insights and Strategies

As Site Reliability Engineering (SRE) continues to grow in popularity, many professionals are looking for ways to advance from junior to senior roles. While there is no one-size-fits-all approach, the transition from junior to senior SRE is marked by a gradual increase in experience and a set of key skills. In this blog, we will explore the valuable insights and strategies shared by experienced SREs.

10 Benefits of Effective Incident Communication

In today's digital landscape, most people understand that no system is perfect and data is never 100% safe. Incidents are bound to happen. How people learn about those incidents often influences their reactions. Mishandled incident communication can have drastic consequences for your company. For starters, it can drag out the incident response and harm your bottom line.

What's the Difference Between an Agile Retrospective and an Incident Retrospective?

Blameless Chief Operating Officer Ken Gavranovic recently sat down with Lee Atchison, a renowned expert in system reliability, to discuss the topic of conducting effective incident retrospectives. You can watch their engaging, informative discussion below, or read on for our overview of the greatest hits from their talk. Agile development and incident management are the backbones of any tech-driven development cycle. At the heart of these practices lies the art of retrospectives.

Empowering Hyper Local Resilience - Everbridge + Samdesk Podcast

Organizations today face a myriad of threats in the form of civil unrest, cybersecurity, severe weather events, and more. Visibility into emerging events and potential threats extremely early in the crisis lifecycle enables security teams to take proactive measures to protect lives, reputation, and reduce liability. Proactive intelligence is critical to optimize preparation, response efficacy, and speed recovery.

Seven Models of Cloud Native Applications

In today's cloud-driven landscape, organizations are transitioning from legacy monolithic systems to agile, scalable, and secure cloud-native solutions. Some are even forging new cloud-native applications. However, the concept of cloud-native design remains subjective, lacking a universal blueprint. This blog aims to provide clarity and guidance for designing precise cloud-native applications and container deployment.

More than downtime: the cultural drain caused by poor incident management

The costs of lackluster incident management are truly far-reaching. We’ve learned they go beyond explicit costs, like lost revenue and labor expenses. And that they go beyond the opportunity cost of engineers being diverted from building revenue-building features. The final area of incident cost that’s often overlooked is cultural drain.

OnPage's Automation in I&O Optimization Predictions (Inspired by Gartner Hype Cycle for I&O Automation, 2023)

The release of the Gartner® Hype Cycle™ for I&O Automation, 2023 has inspired us here at OnPage to provide our insights on the latest trends in I&O optimization. In this blog, OnPage will predict the widespread adoption of technologies that can further automation efforts and thus contribute to I&O optimization.

Sponsored Post

The Future of ITSM: Exploring the Potential of AI-Powered Service Management

IT Service Management (ITSM) constantly evolves, introducing new technologies and tools. But as you may have noticed recently, there have been some constants, and one of the most promising developments is leveraging Artificial Intelligence (AI) to power IT service management. The fact that AI has the potential to revolutionize ITSM is not exactly breaking news. What continues to slip under the radar of many ITOps teams is how to unlock AI's true potential. Doing so requires understanding the already critical and soon-to-be popular use cases.

The power of Everbridge 360

Everbridge 360™ represents our relentless dedication to provide customers with the most comprehensive and unified interface to manage critical events across one single platform so they can know earlier, respond faster, and improve continuously. More effectively manage critical events, minimize communication delays, and improve overall organizational resilience through the industry’s most advanced and unified dashboard.

How to Set Up an IT War Room

IT issues can happen at any time and significantly impact an organization. Hence, it's essential to have a plan to handle these issues quickly and efficiently. And one way to do this is to create an IT war room. An IT war room is a dedicated space for teams to collaborate and resolve issues. Establishing an IT war room enhances an organization's capacity to swiftly and efficiently address IT problems, ultimately reducing their impact on the business.

Enhancing Incident Management: Seven Integrations to Complete Your Ticketing Systems

Squadcast offers some powerful integrations to simplify Incident Management processes and make your work easy. These integrations enhance Incident Management processes and complete your ticketing systems, ensuring seamless collaboration and timely issue resolution.

Practical guidance for getting started as a site reliability engineer

At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move. With only some high-level knowledge of what the company and its systems looked like prior to this point, it’s fair to say that I didn’t have much certainty in what exactly I’d be working on or how I’d deliver it.

Incident Priority Matrix: From Chaos to Clarity

IT leaders often find themselves under pressure to support business outcomes while also trying to manage help requests. An incident priority matrix makes the incident management process much more seamless. It helps companies handle priority incidents within reasonable resolution times while ensuring other concerns are met. In this blog post, we delve deep into the concept of the Incident Priority Matrix, its significance, and how it can transform your incident management processes.
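As a rough illustration of the idea, a priority matrix is commonly modelled as a lookup from an impact level and an urgency level to a priority. The sketch below is a minimal example under stated assumptions: the three levels, the P1 to P5 labels, and the `incident_priority` helper are all hypothetical and are not taken from the linked post; real matrices vary by organisation.

```python
# Hypothetical impact/urgency levels and priority labels (assumptions for illustration).
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}


def incident_priority(impact: str, urgency: str) -> str:
    """Look up a priority label for a given impact and urgency (case-insensitive)."""
    key = (impact.strip().lower(), urgency.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]


if __name__ == "__main__":
    # A widespread outage that needs immediate attention maps to the top priority.
    print(incident_priority("High", "High"))
    # A single-user annoyance with no deadline maps to the lowest priority.
    print(incident_priority("low", "low"))
```

Keeping the matrix as plain data rather than nested conditionals makes it easy to review and adjust alongside the team's escalation policy.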

Multi-Org takes FireHydrant for enterprise to the next level

Too often, complexity means confusion — and confusion is your worst enemy when it comes to efficient incident response. We recently found that poor incident management practices (like confusion about what to do or how to escalate an incident) can cost companies as much as $18 million a year.

Hospital Discharge Best Practices

Establishing an effective hospital discharge process is a crucial part of a patient’s stay and can significantly impact the success of their recovery. Patients, families, and subsequent care providers require a detailed education on continued treatment, aftercare processes, and required medications, to avoid any complications that may surface during recovery.

Failure Metrics & KPIs for IT Systems

The game in enterprise IT is this: delivering amazing services to your customers while also reducing costs. That means the time it takes to respond to an incident is critical. Incidents can ruin service delivery and destroy your budget. Certain incidents almost surely deliver a poor customer experience. Response times, you hear? Yep, we’re talking about MTTR, but that’s not all.
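For readers new to the metric, MTTR is usually computed as an average of per-incident durations. The sketch below assumes the "mean time to resolve" reading of the acronym (it is also expanded as repair, recovery, or respond) and uses a hypothetical `mean_time_to_resolve` helper with made-up timestamps; it is an illustration, not the method from the linked post.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average the (resolved - opened) duration over a list of (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
    sample = [
        (datetime(2023, 9, 1, 10, 0), datetime(2023, 9, 1, 10, 30)),
        (datetime(2023, 9, 2, 14, 0), datetime(2023, 9, 2, 15, 30)),
    ]
    print(mean_time_to_resolve(sample))
```

Because a plain mean is sensitive to a single very long incident, teams often track the median or a percentile next to it.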

How generative AI is increasing cyber risk & what to do to make sure you're ready

Generative AI is all the buzz these days with the popularity of platforms and tools such as ChatGPT, Bard, Scribe, Jasper, and others experiencing exponential growth. This is a technology that has come to the fore with the force of a runaway train that’s bringing us headlong into the future at the speed of light. It is transforming everything we do from writing code to making travel plans. And cybersecurity is no exception.

How to Ace Your Services with PagerDuty

It’s finals week for the US Open, one of the most celebrated sports events in the world. Tennis is my favorite sport to watch as I’m fascinated by the strength, composure and endurance each player displays while standing by themselves on the court, sometimes during incredibly long matches – the current record is 11h05.

Reliably receive a call when an organ donor is matched

Within the broader context of organ transplantation, time is of the essence. Lives hang in the balance, waiting for that life-changing call announcing a matched donor organ. For organ transplant recipients, the waiting game is often a test of patience and resilience. However, with the advent of modern technology, a solution has emerged to alleviate this uncertainty – OnPage.

Streamlining Incident Investigation

Honeycomb Customer Success Manager Josh Levin explains how to troubleshoot production incidents using Honeycomb's telemetry data: metrics, traces, and logs. While these data forms have separate interfaces, you can investigate seamlessly within Honeycomb. Josh highlights the key role of the "retriever" service in data ingestion and querying and demonstrates cross-validating tracing data with metrics to spot anomalies in pod deployments and resource usage, presented in a separate dataset. He also uses effective log filtering and searching for keywords like "update status".

OnPage-ServiceNow Bi-Directional Integration

Discover how OnPage's incident alert management solution can be seamlessly extended to ServiceNow's ITSM solution to provide a more efficient and streamlined service delivery experience. The two-way integration ensures that high-priority alerts are given top priority and reach the right team member in a timely manner. And, that's not all -- IT teams gain synchronization across audit trails, alert statuses, and notes, eliminating the need for app hopping and providing all the necessary information in one location.

Enhancing Code Blue Workflow for Improved Survival Rates

In critical healthcare scenarios, swift response is the linchpin to saving lives. Enter code blue workflows – a set of protocols that guide healthcare teams in high-stress scenarios. When a patient’s life is at stake due to cardiac arrest, respiratory failures, or other life-threatening conditions, these workflows ensure a rapid, synchronized response.

6 Best Practices for Seamless Notifications with International SMS

There’s no denying it: in today’s interconnected world, Application-to-Person (A2P) SMS notifications have become an integral part of our daily lives. Whether it’s receiving crucial banking alerts, getting updates from our favorite retailers, or even surfacing a notification from PagerDuty when your service is down–SMS keeps us informed and connected. But have you ever wondered about the intricacies behind this seemingly straightforward technology?

Starting with Incident management career

As businesses and organisations become increasingly reliant on technology for their operations, the significance of alerting platforms has become paramount. Alerting platforms encompass the processes that enable organisations to acknowledge, respond to, and reduce various types of incidents that can impact their services. Incident alerts enable prompt responses at the right time and minimise potential damage.

Building Trust with our Customers with PagerDuty for PagerDuty: Crisis Response Management Operations

A critical partner in your supply chain just went down. An earthquake just hit your main operations hub. Breaking news about your organization just hit social media. Bad news first—there’s always another crisis or existential threat to your organization on the horizon. If you don’t have an established Crisis Response process and team in place, you’re running a high risk of failure.

SLO Driven Incident Response: Service Level Objectives for Effective Incident Management | Squadcast

In today's tech-driven landscape, effective Incident Management is vital for seamless service and customer satisfaction. This webinar explores ways to uncover the role of Service Level Objectives (SLOs) in structuring incident response processes while acting as a compass, guiding incident prioritization and resolution to minimize customer impact and downtime. The webinar will help you demystify SLOs, their data-driven role in incident decision-making, and how to prioritize incidents to lessen customer impact by identifying critical incidents.

Grafana Incident auto-summary: AI in Grafana Cloud

Check out a fun demo of Grafana Incident auto-summary, which uses generative AI to suggest a helpful synopsis that captures key details from your incident timeline with a single click. Grafana Incident auto-summary marks the first feature enabled by the new OpenAI integration in Grafana Incident. Simply bring your own OpenAI API key to get started in Grafana Cloud.

Manage incidents, real-time alerts, and oncall from Microsoft Teams

Welcome to Spike.sh’s Microsoft Teams bot! At the heart of every successful team lies efficient communication and swift problem resolution. That’s precisely what our bot brings to the table – a dynamic toolset that empowers you to tackle incidents seamlessly. Features: Our new Microsoft Teams bot alerts are not only prompt but also smartly updated as the situation develops. It achieves this by seamlessly integrating incident management into Microsoft Teams, providing you with real-time alerts the moment an incident surfaces.