Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

From Amazon to Apple: Key Strategies for Operational Excellence in Tech

Jim Gochee, CEO of Blameless with a history at New Relic and Apple, Ken Gavranovic, COO of Blameless and an Amazon Best Selling Author with experiences at Cox, Web.Com, and Unqork, and Lee Atchison, Chief Reliability Officer at Blameless, noted for his work on Amazon BeanStalk and as the author of "Architecting for Scale," with roles at AWS, HP, and New Relic, will guide this session.

8 Strategies for Reducing Alert Fatigue

Site Reliability Engineers (SREs) and DevOps teams often deal with alert fatigue. It's like when you get too alert that it's hard to keep up, making it tougher to respond quickly and adding extra stress to the current responsibilities. According to a study, 62% of participants noted that alert fatigue played a role in employee turnover, while 60% reported that it resulted in internal conflicts within their organization.

The Catchpoint 2024 SRE Report - Five Key Takeaways

Only emerging into the mainstream in the 2010s, SRE is a relatively new discipline in tech. It’s been rapidly adopted by a widening variety of organizations, implementing constantly evolving practices. For the last six years, Catchpoint has been running a survey to take the temperature of the latest developments and trends. Check out the full report here, and read on to see our analysis on five key takeaways.

Non-Abstract Large System Design (NALSD): The Ultimate Guide

Non-Abstract Large System Design (NALSD) is an approach where intricate systems are crafted with precision and purpose. It holds particular importance for Site Reliability Engineers (SREs) due to its inherent alignment with the core principles and goals of SRE practices. It improves the reliability of systems, allows for scalable architectures, optimizes performance, encourages fault tolerance, streamlines the processes of monitoring and debugging, and enables efficient incident response.

SLOs with Prometheus done wrong, wrong, wrong, wrong, then right

We have Carson Anderson, Sr. DevOps Engineer at Weave HQ, talking about how they implemented SLOs using Prometheus, what went wrong, and how they fixed it. This talk was given at "Last9 of Reliability" Discord community on 13th December. Talk Description: First thing's first: Yes, it really did take us 5 tries to implement our SLOs with Prometheus. While that may seem embarrassing, we are very happy to be able to share our SLO journey so that we can hopefully help you avoid the same mistakes.

Introducing Squadcast's Intelligent Alert Grouping and Snooze Notifications

Maintaining system reliability amidst a deluge of alerts remains a formidable challenge for complex infrastructure environments. To address this critical need, Squadcast is happy to introduce Intelligent Alert Grouping - designed and developed based on in-depth discussions and feedback from our enterprise customers. This innovative solution is designed to streamline Incident Management, ensuring that Incident Response teams can focus on what truly matters.

How Squadcast's Workflows Enhance Incident Management Automation?

One of the daily challenges for Incident Response teams is the pressure to resolve incidents swiftly and effectively. However, manual processes often hinder this objective, leading to delays, oversight, and potential miscommunication. In this blog, we’ll learn the practical aspects of workflow automation in Incident Management using Squadcast, exploring how it streamlines processes, eliminates manual tasks, and enhances overall efficiency.