SRE

The latest news and information on Site Reliability Engineering and related technologies.

Managing On-Call Rotations: Navigating Incident Management from Chaos to Calm

Aug 25, 2023 By Chitra Bisht In Squadcast

Navigating On-Call rotations can often feel like taming a storm of alerts and constant disruptions, leaving teams overwhelmed and stressed. Streamlining On-Call rotations and using the right software can restore order and peace. In this guide, you'll explore practical tips, best practices, and smart strategies to transform your Incident Management process. Let's get to a more efficient On-Call experience.


PromQL Macros in Levitate

Aug 25, 2023 By Prathamesh Sonpatki In Last9

Define PromQL Macros to standardize complex PromQL queries in Levitate.


GCP Managed Service For Prometheus vs. Levitate

Aug 24, 2023 By Prathamesh Sonpatki In Last9

A detailed comparison of Levitate and Google Managed Prometheus - Cost, Scale and Ease of Use.


A case for Observability outside engineering teams

Aug 23, 2023 By Aniket Rao In Last9

Observability is being built by engineers for engineers. In reality, o11y is for all.


We Need to Talk About the Hero Pattern Among SREs

Aug 22, 2023 By Hans Chung In Rootly

Let’s be honest. When you see an alert pop up on your phone, you aren’t thinking “according to section 12 of our most recent SRE handbook used at training 6 months ago I need to keep in mind who should be Incident Commander and who should be Ops Lead”. You’re an engineer at heart.


The Iceberg of Engineering Incident Costs

Aug 22, 2023 By Aaron Lober In Blameless

I've long been fascinated with the metaphor of an iceberg to describe a problem whose true magnitude is obscured beneath the surface. If you’re not familiar with this phenomenon, when water freezes, it decreases in density. This allows the solid ice to float, partially, atop the water with only a small fraction of it exposed. In fact, icebergs hold nearly 90% of their mass hidden below the water.


Understanding the Rasmussen model for failures

Aug 18, 2023 By Nishant Modak In Last9

What does the Rasmussen model teach us about Site Reliability Engineering?


10 Observability Tools in 2023: Features, Market Share and Choose the Right One for You

Aug 17, 2023 By Anjali Udasi In Zenduty

Understanding what's happening within your systems is a necessity. Have you ever wondered how experts keep an eye on systems to make sure everything's running smoothly? That's where observability tools come in! Observability tools are like helpers that give you a peek inside your tech. In this blog, we will talk about observability tools and how they can be used in different situations so it's easier for you to choose the right one for your organization.


Impact of Kubernetes cluster maintenance on application availability

Aug 17, 2023 By Reliably In Reliably

#kubernetes #eks #chaosengineering
In this video, we will explore an interesting scenario that might happen in real life. Let's imagine we have an application running in a Kubernetes cluster inside EKS. If, for any reason, two of our three nodes are cordoned and can no longer have pods scheduled onto them, what would happen to our users should the last node be cordoned as well? And what if we need to reschedule something?
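As a rough illustration of the scenario (not taken from the video), the toy model below assumes only the documented behavior of cordoning: a cordoned node is marked unschedulable, but pods already running on it are not evicted. The `Node` class, the `schedule` helper, and the node and pod names are hypothetical; real scheduling also weighs resources, taints, and affinity, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node."""
    name: str
    schedulable: bool = True            # becomes False once the node is cordoned
    pods: List[str] = field(default_factory=list)


def cordon(node: Node) -> None:
    """Mark the node unschedulable; pods already on it keep running."""
    node.schedulable = False


def schedule(pod: str, nodes: List[Node]) -> Optional[str]:
    """Place the pod on the first schedulable node, or return None (Pending)."""
    for node in nodes:
        if node.schedulable:
            node.pods.append(pod)
            return node.name
    return None


# Three nodes, each already running one replica of the application.
nodes = [
    Node("node-a", pods=["web-1"]),
    Node("node-b", pods=["web-2"]),
    Node("node-c", pods=["web-3"]),
]

# Two of the three nodes are cordoned; new pods can only land on node-c.
cordon(nodes[0])
cordon(nodes[1])
first = schedule("web-4", nodes)

# The last node is cordoned too; anything that needs rescheduling stays Pending.
cordon(nodes[2])
second = schedule("web-5", nodes)

# Existing pods were never evicted by cordoning alone.
running = sum(len(node.pods) for node in nodes)
```

In this simplified model, existing replicas keep serving traffic after every node is cordoned, and the risk appears only when a pod has to be recreated and finds no schedulable node.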


Checking your observability and communication platforms with Reliably

Aug 17, 2023 By Reliably In Reliably

#reliably #chaosengineering #honeycomb #slack #resilience
In this video, we will run a chaos engineering experiment that we expect to fail, in order to verify that our open tracing and communication platforms are correctly set up. Using the Honeycomb and Slack integrations provided by Reliably, we will send traces and messages and observe whether they are triggered as expected.
