May 2024

Driving Technical Delivery: Balancing Speed and Quality in Enterprise Platforms

Enterprises face a constant challenge: how to deliver technical solutions quickly without compromising on quality. In the race to innovate and stay ahead of the competition, the pressure to accelerate delivery can sometimes overshadow the importance of maintaining high standards of quality and reliability. However, striking the right balance between speed and quality is crucial for the long-term success and sustainability of enterprise platforms.

PagerTree Team Admin QuickStart Guide

In this quick start guide, we will cover the basics of getting started as a team admin within PagerTree. Transcript: In this Team Admin QuickStart guide, we will explore the basics of team management in PagerTree. Team admins are responsible for managing teams within PagerTree. In the Team Page, admins can edit current teams, on-call schedules, and escalation policies. When editing teams, they can assign and remove members as well as assign team admins.

Accelerate incident investigations with Bits AI, Datadog's generative AI co-pilot

Learn how Datadog’s generative AI assistant, Bits AI, can help organizations accelerate incident investigations: it auto-generates summaries to get you up to speed quickly, fetches information about related past events, and updates teams and statuses, all through Slack.

How to consolidate your incident response stack with PagerDuty

PagerDuty helps organizations manage the entire incident lifecycle to respond faster and more effectively while reducing costs. Move from manual, reactive incident management to an automated, proactive approach, making the incident response process more efficient and resilient.

What's New at OnPage: Enhanced Phone App and Security

Welcome to the latest OnPage phone app update! Our dedication to enhancing our product and streamlining customer workflows remains unwavering. In our continuous quest for improvement, we’re thrilled to unveil the latest enhancements to our application. We’ve listened intently to your feedback and are excited to announce a significant modernization of our phone application, showing our commitment to meeting your evolving needs.

Exoskeletons not robots

In this clip, Pete explains why we've taken the approach of "exoskeletons, not robots" when building with AI. It’s fair to say that AI is here to stay. So, as companies grapple with this reality, they’re putting their best foot forward to build AI features that really make a difference for their customers. But should you be building these features if there’s no obvious fit in your product? And even if there is, are you making sure to stay true to your product principles?

PagerTree Account Admin QuickStart Guide

In this quick start guide, we will cover the basics of getting started as an account admin within PagerTree. Transcript: In this quickstart guide, we will show you the basics of an account admin in PagerTree. Before watching this video, it is suggested to read and watch the Architecture Guide to build a strong foundation for your understanding of PagerTree and how it works. Here is a brief overview of the alert workflow.

Accelerate root-cause analysis with AIOps

The digital landscape is evolving constantly — as is its complexity. Organizations need more efficient and effective ways to sort through high volumes of IT noise to identify the root cause of incidents. In a recent webinar with BigPanda CIO Jason Walker and Waste Management Principal Architect Udo Strick, Joe Connelly — director of monitoring, observability, and service reliability at Chipotle Mexican Grill — shared his perspective on.

Maximizing Uptime: Four Essential System Monitoring Best Practices

System uptime is a fundamental necessity for every organization that values customer experience and satisfaction. A single minute of downtime can trigger a cascade of negative consequences, impacting everything from revenue streams to customer loyalty. So, why exactly is system uptime important? Downtime translates to lost revenue, frustrated users, and operational disruption.

Building AI features? Don't forget your product principles

It’s fair to say that AI is here to stay. So, as companies grapple with this reality, they’re putting their best foot forward to build AI features that really make a difference for their customers. But should you be building these features if there’s no obvious fit in your product? And even if there is, are you making sure to stay true to your product principles? The reality is that deciding to build AI into your product isn’t a decision you make on a whim.

Install OneUptime with Docker Compose

Welcome to our step-by-step tutorial on how to install OneUptime using Docker Compose! In this video, we'll guide you through the entire process of setting up OneUptime on your system using Docker Compose. OneUptime is a powerful tool that helps you monitor your websites and services, ensuring they're always up and running.

The importance of psychological safety in incident management

When an incident strikes, it often brings a whirlwind of stress for everyone involved—from the teams directly handling the issue to the stakeholders making crucial decisions. Imagine support teams on high alert, customers anxiously awaiting resolutions, and executives probing for answers to steer the company through turbulent times. This mounting pressure can make a challenging situation nearly unmanageable, especially when faced with problems that are new or unexpected.

Post-Incident Reviews: Turning Failures into Learning Opportunities

Incidents are inevitable. From software failures to service disruptions, unexpected events can disrupt the smooth functioning of systems and processes, causing frustration for users and impacting business operations. However, what separates successful organizations from the rest is not the absence of incidents, but rather their approach to handling and learning from them.

Reliability for the Books - Incidentally Reliable with Niall Murphy

Catch Niall Murphy (Co-Founder of Stanza Systems) talk about graceful degradation, what startups are getting wrong about reliability and how well-thought user-experiences can communicate credibility to current and potential customers. Exclusively on The Incidentally Reliable podcast — made by SREs for SREs, hosted by Zenduty.

First PagerDuty Plugin for Backstage Community Meetup

Watch the first virtual meetup for the PagerDuty plugin for Backstage. This informal gathering is for plugin users and contributors. Learn why PagerDuty continues to invest in this open-source project, which aims to solve significant challenges for software development and engineering teams. Developer Advocate and project maintainer Tiago Barbosa presents success metrics, reviews the work accomplished so far, and discusses the future feature roadmap openly.

PagerDuty Community Live Demo Webinar: Mastering Change Events for Proactive Incident Management

Developer Advocate Mandi Walls and Solutions Consultant Taz Ishraque explore the power of Change Events in the PagerDuty Operations Cloud. Watch and learn how PagerDuty's Change Events API and integrations streamline the transmission of critical updates; how Change Correlations enhance incident triage; and how to accelerate incident resolution, reduce context switching, and help teams focus on innovative work instead of firefighting.

Clinical troubleshooting with Dan Slimmon

It’s no secret that teamwork is one of those things that, when done right, can make a world of difference. So sometimes, when responding to a particularly complicated incident, it can be best to bring a team together to figure out what’s going on and work towards a fix. But it’s not enough to just jam a bunch of folks into a room and hope for the best. You need a framework in place to ensure that everyone stays focused, diagnoses the issue and resolves it as quickly as possible.

Navigating the Complexity of IT Operations: A Guide for Startups

Startups are the pioneers forging new paths and disrupting industries. At the heart of every startup's success lies its ability to navigate the complexities of IT operations effectively. In this blog, we delve into the intricacies of IT operations for startups, offering insights, strategies, and best practices to steer through the maze of technology with finesse.

The Importance of Rapid Incident Response

An Incident Response Plan prepares an organization to deal with a security breach or cyber-attack. It defines the procedures an organization should follow if it discovers a possible cyber-attack, enabling it to detect, contain, and resolve problems promptly. Organizations need an IR Plan to safeguard their data, networks, and services from harmful activity and equip their staff to behave strategically.

The Ultimate Guide To Incident Communication in 2024

In the digital realm, incidents such as service disruptions and security breaches are inevitable. They affect your customers and stakeholders, and they pose significant challenges to IT, Ops, DevOps, and customer support teams. As we increasingly depend on digital tools and services, the demand for seamless performance escalates, highlighting the importance of effective incident communication.

What is clinical troubleshooting? #incidentmanagement #incidentresponse #sitereliabilityengineering

In this clip, Dan Slimmon explains what the clinical troubleshooting framework entails. It’s no secret that teamwork is one of those things that, when done right, can make a world of difference. So sometimes, when responding to a particularly complicated incident, it can be best to bring a team together to figure out what’s going on and work towards a fix. But it’s not enough to just jam a bunch of folks into a room and hope for the best. You need a framework in place to ensure that everyone stays focused, diagnoses the issue and resolves it as quickly as possible.

Learning is an iterative process #incidentmanagement #incidentresponse #sitereliabilityengineering

In this clip, Viktor Stanchev explains why it's important to remember that learning is an iterative process. Whether you’re a seasoned vet when it comes to incident response, or just getting started, it can be easy to fall into the trap of doing too much all at once. And it just makes sense. Incident response is one of those things that doesn’t have a single, perfect formula, so teams can be left doing a little bit of everything in an effort to get it right.

It's better to declare incidents early #incidentmanagement #sitereliabilityengineering

In this clip, Viktor Stanchev explains why it's better to declare incidents early rather than too late. Whether you’re a seasoned vet when it comes to incident response, or just getting started, it can be easy to fall into the trap of doing too much all at once. And it just makes sense. Incident response is one of those things that doesn’t have a single, perfect formula, so teams can be left doing a little bit of everything in an effort to get it right.

Automatically update your status page when an alert is received

There are several ways to update ilert status pages. In this video, you'll learn how to do it using alert actions. We'll create a new alert action so that your status page automatically updates with a new status whenever an alert is received. Haven't tried ilert status pages yet? Get a public status page integrated with the ilert alerting system for free.

Advanced Incident Management Strategies for Engineers

The business world is in constant flux, and the way we handle Incident Management (IM) needs to evolve alongside it. Incidents come in all priorities and urgencies, and while some can be addressed with planning, others are simply unpredictable. That's why businesses can't afford to be caught off guard. The potential consequences of such incidents for businesses have never been greater. A single event can disrupt operations, damage reputations, and result in significant financial losses.

How generative AI facilitates ITOps modernization

IT teams need immediate and automatic access to machine data and institutional knowledge to move faster and make the right decisions. And they need context to identify incidents and understand how to resolve them. AIOps enables this by transforming noisy and fragmented operations data into actionable insights. This is the foundation of full-context operations. Full-context operations combines observability and other machine-generated data with historical, expert, and institutional knowledge.

Manage incidents seamlessly with the Datadog Slack integration

Modern, distributed application architectures pose particular challenges when it comes to coordinating incident management. DevOps, SREs, and security teams—often spread out across separate locations and time zones, and equipped with limited knowledge of each other’s services—must work quickly to collaboratively triage, troubleshoot, and mitigate customer impact.

Setup SSO with Azure Entra ID and OneUptime

In this informative and easy-to-follow tutorial, we walk you through the process of setting up Single Sign-On (SSO) with Azure Entra ID and OneUptime. We guide you step-by-step on how to enable SSO for an enterprise application that you’ve added to your Microsoft Entra tenant. We cover everything from signing in to the Microsoft Entra admin center as a Cloud Application Administrator, to configuring SSO in the tenant and the application.

What are some startups Solomon Hykes is rooting for?

What are some startups Solomon Hykes is rooting for? What's his most controversial opinion? Who are some community members that more people should follow? Discover the answers to these questions, and a lot more in the Incidentally Reliable Podcast with Solomon Hykes, live on all major platforms! Tune in as Solomon shares stories from the early days of Docker, Inc, the rollercoaster journey leading to 20 million active developers worldwide, the heavy crown of a tech leader and his vision to revolutionize CI/CD with Dagger today.

Speedrun to Signals: automated migrations are here

When we launched Signals to the world, we were excited to hear how our product resonated with many teams. But with that excitement came an understandable concern: how much time and effort will I have to put in to move from my existing provider to Signals? We hear you — that’s why we built the Signals Migrator tool. And we’re open sourcing it.

Practical lessons for AI-enabled companies

We went live with our first set of AI-enabled features a few months ago. Needless to say, we learned a lot along the way, as this was the first time we had experimented with generative AI. Here, I'll share some of what we've learned as we’ve grappled with using LLMs to power new products at incident.io. This will be most applicable to the application layer, AI-enabled but not AI companies.

Grafana Incident: new tools for faster, simpler incident response

At Grafana Labs, we’re committed to helping teams dramatically improve how they manage and respond to incidents. Through Grafana Incident Response & Management (IRM), we provide tools to empower teams, streamline processes, and enhance the effectiveness of incident management strategies—and we’re constantly looking for ways to make our solution even better.

Unveiling the power of AI in incident management

The emergence of AI opens new and innovative possibilities, simplifies operations, and boosts overall success. With AIOps, your technical organization can achieve unparalleled efficiency, productivity, and profitability. This cutting-edge technology leads us toward a brighter, more prosperous future with exciting opportunities to grow and thrive.

Understanding DORA: How to operationalize digital resilience

In an interconnected world, digital resilience is crucial for navigating crises and safeguarding financial and security assets. The European Union (EU), comprising 27 countries and 450 million people, recognizes the significance of digital resilience and has introduced regulatory mandates to fortify and align the digital ecosystem.

PagerDuty Appoints Eduardo Crespo, Vice President of EMEA

PagerDuty, Inc. announces the appointment of Eduardo Crespo as vice president of EMEA. Crespo will lead PagerDuty's next phase of growth in the EMEA region, bringing the PagerDuty Operations Cloud to enterprise customers to solve their biggest digital challenges.

Why more low severity incidents can be a good thing #incidentmanagement

In this clip, Dennis Henry of Okta explains why having more low-severity incidents can be a good thing. In last week’s episode of The Debrief, we had on Colette Alexander, Director of Engineering at HashiCorp, to discuss some of the myths around incident response. In that conversation, one of the myths we spoke about was the idea that asking “why” is better than asking “how.” And how, in reality, asking "how" allows you to focus more on the contributing factors that led to an incident happening, whereas “why” tends to single out a person, which can lead to a lot of blame.

Mistakes happen for many reasons #incidentmanagement

In this clip, Dennis Henry of Okta explains why it's important to remember that mistakes happen for several reasons and don't have a single cause. In last week’s episode of The Debrief, we had on Colette Alexander, Director of Engineering at HashiCorp, to discuss some of the myths around incident response.

IRL to IAC: Your Environment to PagerDuty via Terraform

Figuring out how to represent your as-built environment in PagerDuty can be confusing for new users. There are a lot of components to PagerDuty that will help your team be successful managing incidents, integrating with other systems in your environment, running workflows, and using automation. Your organization might have a lot of these components – users, teams, services, integrations, orchestrations, etc.

Live event recap: Humanizing the on-call experience

There’s no two ways about it: on-call is stressful. But with humans at the center, it’s especially important to find ways to make it as manageable and empathetic as possible. In this webinar with our friends at ELC, incident.io VP of Engineering, Norberto Lopes, and Intercom Staff Product Engineer, Andrej Blagojević, discuss their own experiences with on-call, and how the process can be better.

Incident Management: 5 Best Practices for Seamless Operations

Website incidents can happen at any time, for any reason. Your website might stop responding to customers. Performance may slow down. Main pages may start returning client or server errors. And when incidents do strike, they bring frustration and confusion to your customers, leading to lower trust and engagement.

Improve incident triage with AIOps to reduce downtime

Downtime is expensive, both to your budget and your brand reputation. As IT outage costs increase, it’s critical to identify and prioritize incidents quickly to minimize the impact on your organization. In a recent survey of more than 400 global IT professionals, Enterprise Management Associates found that unplanned downtime costs average $14,056 per minute. That’s an increase of nearly 10% from 2022.

Upskilling your Network Operations Center

Many organizations are heavily investing in AI and automation to remove the burden of manual work and improve operational efficiency. However, to drive wide-scale adoption, they also need employees who can collaborate effectively with the technology. To bridge that gap, companies can use upskilling to retain talent, mitigate risks to the business, and allow employees to grow their careers.

Why "why" is the wrong question to be asking after incidents with Dennis Henry of Okta

In last week’s episode of The Debrief, we had on Colette Alexander, Director of Engineering at HashiCorp, to discuss some of the myths around incident response. In that conversation, one of the myths we spoke about was the idea that asking “why” is better than asking “how.” And how, in reality, asking "how" allows you to focus more on the contributing factors that led to an incident happening, whereas “why” tends to single out a person, which can lead to a lot of blame.