SRE

The latest news and information on Site Reliability Engineering and related technologies.

Observability vs Monitoring: What's the Difference?

Observability and monitoring: these terms are often used interchangeably, but they represent different approaches to understanding and managing IT infrastructure. If you are new to these terms or often confuse the two, this blog is for you! In this blog, we'll explore the key concepts of observability and monitoring, their evolution in IT operations, their differences and similarities, and their importance in modern infrastructure.

SLO Driven Incident Response: Service Level Objectives for Effective Incident Management | Squadcast

In today's tech-driven landscape, effective incident management is vital for seamless service and customer satisfaction. This webinar explores the role of Service Level Objectives (SLOs) in structuring incident response processes, where they act as a compass that guides incident prioritization and resolution to minimize customer impact and downtime. It will help you demystify SLOs, understand their data-driven role in incident decision-making, and identify critical incidents so you can prioritize them and lessen customer impact.

Reimagining Retrospectives

The blameless retrospective is one of the most often discussed and most rarely executed components of SRE practice. Getting real value from the retrospective process takes time, focus and the right approach. This webinar features Ken Gavranovic and Lee Atchison, author of Architecting for Scale, who discuss the blueprint high-performing engineering teams use to maximize the value of retrospectives.

Why the Blameless Mission Matters Today

Blameless was founded over five years ago, in a world that looked very different from today's. We were the first mover in the incident management space, setting the standards for what these tools should achieve. These days, concerns about reliability, incidents, and toil have hit the mainstream. Why has the tech world entered an era where reliability is priority #1? Why do we believe that the Blameless mission matters more today than ever before?

Why Resilience Engineering Needs To Be A C-Level Strategy & How To Get There

The consequences of downtime and data breaches can be devastating to organizations, leading to substantial financial losses and irreparable damage to a business's reputation. If last week's outage at the Bank of England, with trillions of pounds per day lost to downtime, is anything to go by, resilience shouldn't be an afterthought for organizations.

Latest Developments in Site Reliability Engineering, 2023

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023 report (July 2023). Inspired by this report, OnPage shares its predictions about the future of site reliability engineering. In this blog, OnPage reviews evolutionary tools that can improve site reliability engineering practices.

A Practical Guide to Incident Communication

Even the best software fails sometimes. How quickly those failures get addressed, and how your teammates and customers feel about you after the fact, comes down to how well you communicate with them. Users, customer success managers, Ops team members, IT, security, engineering leadership, even the executive team: each has a vested interest in resolving engineering incidents quickly, and all need to be updated with the right information at the right time.