Operations | Monitoring | ITSM | DevOps | Cloud

The latest news and information on Site Reliability Engineering and related technologies.

Squadcast + Amazon EventBridge: Routing Alerts Made Easy

Amazon EventBridge is a serverless event bus service from AWS that makes it easier to build event-driven applications. It ingests events generated by your own applications, integrated Software-as-a-Service (SaaS) applications, and other AWS services, and delivers them as a real-time stream to target services such as AWS Lambda. You can also set up routing rules that determine where each event is sent, which lets you build decoupled application architectures.
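To make the idea of routing rules concrete, here is a minimal sketch in plain Python. It is a simplified model and not the EventBridge matching engine: it assumes a pattern is a nested dictionary whose leaf values are lists of allowed values, and it supports exact matches only. The names `matches`, `route`, `RULES`, the `my.app.payments` source, and the target labels are all hypothetical and chosen for illustration.

```python
from typing import Any


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Return True if the event satisfies every field named in the pattern.

    Simplified, hypothetical model: leaf values in the pattern are lists of
    allowed values, nested dicts describe nested event fields, and only
    exact-value matching is supported.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend into the corresponding nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must equal one of the allowed values.
            return False
    return True


def route(event: dict[str, Any], rules: list[dict[str, Any]]) -> list[str]:
    """Return the target of every rule whose pattern matches the event."""
    return [rule["target"] for rule in rules if matches(rule["pattern"], event)]


# Hypothetical rules: each one pairs a pattern with a target label.
RULES = [
    {
        "pattern": {"source": ["my.app.payments"]},
        "target": "lambda:notify-on-call",
    },
    {
        "pattern": {
            "source": ["my.app.payments"],
            "detail": {"severity": ["critical"]},
        },
        "target": "lambda:page-sre",
    },
]

critical_event = {
    "source": "my.app.payments",
    "detail-type": "PaymentFailed",
    "detail": {"severity": "critical", "orderId": "123"},
}

print(route(critical_event, RULES))
```

The event producer in this sketch never names a destination. It only emits an event, and the rules decide which targets receive it, which is the decoupling the paragraph above describes.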

What is the Purpose of Observability? In a Word, Innovation

Asking an IT engineer or SRE to define the purpose of observability is kind of like asking someone to explain the purpose of life: There are lots of different opinions out there, and no way of proving any of them right or wrong. You could argue that observability is just a buzzword that refers to what used to be called monitoring.

Anomaly Detection

IT Operations spans a wide spectrum of roles and responsibilities, from level 1 (L1) operators to Site Reliability Engineers (SREs) and everything in between. L1 operators, for example, are often almost exclusively reactive: they feed off the constant stream of incidents reported by clients and events reported by monitoring and alerting systems. SREs, by contrast, work at the other end of the spectrum.

SRE Incident Management: Overview, Techniques, and Tools

In the world of a site reliability engineer (SRE), failure is not only an option, but also expected. Systems, web applications, servers, devices, and the like are all prone to performance issues and unexpected outages at some point. It is an unavoidable fact. These unexpected failures can lead to huge revenue losses, loss of customer trust, and, depending on the industry, fines. Fortunately, SRE incident management is one of the core practices used to limit the disruption caused by unexpected issues.