Chaos Engineering

Setting better SLOs using Google's Golden Signals

To many engineers, the idea that you can accurately and comprehensively track your application's user experience using just a few simple metrics might sound far-fetched. Believe it or not, there are four metrics that aim to do just that. They're called the four Golden Signals and should be a core part of your observability and reliability practices.

Chaos testing: Reliability for cloud-native apps

Reliability is a critical concern for software delivery teams. Every second of lackluster performance or service interruption comes with high costs. The consequences can extend beyond just monetary expenses and have a huge impact on a company’s reputation. In a survey conducted in 2022, participants reported that over 60% of digital infrastructure failures resulted in losses of $100,000 or more.

How to Break Stuff with Chaos Engineering and Chaos Mesh

In 2011, a Netflix engineering team introduced the concept of chaos engineering with its release of Chaos Monkey. Chaos Monkey began as an in-house tool for orchestrating fault injection, and Netflix eventually made it open source. However, Chaos Monkey's reliance on Spinnaker, another Netflix engineering project, imposes some limitations.

SRE vs DevOps: Can they coexist or do they compete?

Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.

What is a "service" in a microservices architecture?

The past ten years marked a significant change in how software teams build and deploy applications. We moved away from bulky, slow, monolithic applications toward lightweight, scalable, distributed service-based applications. Meanwhile, container tools such as Docker and Kubernetes helped accelerate this shift. Despite this rapid growth, a fundamental question remains: what exactly is a service, and how does it fit into a microservice architecture?

What are the four Golden Signals?

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Its Site Reliability Engineering book, first published in 2016, details the practices it used to maintain reliability as it scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?

Four tests to measure and improve reliability: what matters and how it works

Legendary race car driver Carroll Smith once said, "until we have established reliability, there is no sense at all in wasting time trying to make the thing go faster." Even though he was referring to cars, the same goes for technology: no amount of code optimization or new features can replace stable systems. Unfortunately, much like race cars, it's hard to know that a system is unreliable until it blows a tire, the brakes stop working, or the steering wheel comes off the column.

How to add a Golden Signal to a service in Gremlin RM

In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.