Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Lessons in Incident Response I Learned While Waiting Tables

Before I stumbled into the tech industry (a story for another day), I spent several years in the customer service world as a server and front-of-house manager in restaurants. It was in these jobs that I first honed some critical skills that would later lead me on the path to incident response.

Incident vs Bug: Understanding the Key Differences

Incidents and bugs are two common occurrences that can disrupt the smooth operation of systems and applications. While these terms may seem similar, they represent distinct concepts with different implications. Understanding the nuances between incidents and bugs is crucial for effective incident management and proactive problem resolution.

Comparing Uptime Monitoring, Heartbeat Monitoring, and Synthetic Monitoring

In the quest for a high-velocity development environment, one fundamental question looms large: "How can you ensure an exceptional end-user experience when an array of engineers continually push and deploy code?" The unequivocal answer to this pivotal inquiry lies in the establishment of robust, straightforward, and well-defined monitoring practices.

IT Automation Powers SRE Practices as System Complexity, Consumer Demands Grow

Site Reliability Engineers (SREs) use automation and orchestration capabilities to scale security and performance, ensuring sites are reliable and efficient. Site Reliability Engineering (SRE) can be applied to a wide range of use cases and industries, where software systems and services are critical to business operations.

Autocorrelate Alerts With Squadcast's Key-Based Deduplication

With the increasing complexity of technology stacks and monitoring tools, managing incidents can become overwhelming, leading to alert noise, alert fatigue, and delayed responses. This is where Key-Based Deduplication comes to the rescue, streamlining incident handling and enhancing the effectiveness of your Incident Management platform.

When More Incident Commanders are Better

It has been lightly revised and reposted with his permission from the original article on Medium. Leading major incident responses can be extremely stressful. You have to quickly gather an ad-hoc team, figure out what went wrong, identify a fix and make sure this doesn't make things worse, all the while with senior leadership breathing down your neck. Are we having fun yet? Many people think having a dedicated incident commander role will solve the problem.