Chaos Engineering

What is Reliability Management?

Oct 20, 2022 By Andre Newman In Gremlin

Measuring and improving the reliability of technical systems has always been challenging. As an industry, we've developed several practices to address reliability concerns, such as incident response, observability, and Chaos Engineering. This led SREs and service owners to measure reliability in a handful of ways.

Setting better SLOs using Google's Golden Signals

Oct 11, 2022 By Andre Newman In Gremlin

To many engineers, the idea that you can accurately and comprehensively track your application's user experience using just a few simple metrics might sound far-fetched. Believe it or not, there are four metrics that aim to do just that. They're called the four Golden Signals and should be a core part of your observability and reliability practices.

Chaos testing: Reliability for cloud-native apps

Sep 30, 2022 By Jacob Schmitt In CircleCI

Reliability is a critical concern for software delivery teams. Every second of lackluster performance or service interruption comes with high costs. The consequences can extend beyond just monetary expenses and have a huge impact on a company’s reputation. In a survey conducted in 2022, participants reported that over 60% of digital infrastructure failures resulted in losses of $100,000 or more.

How to Break Stuff with Chaos Engineering and Chaos Mesh

Sep 15, 2022 By Okoth Pius In Mattermost

In 2011, a Netflix engineering team introduced the concept of chaos engineering with its release of Chaos Monkey. Initially an in-house tool for orchestrating fault injection, it was eventually made open source by Netflix. However, Chaos Monkey's reliance on Spinnaker, another Netflix engineering innovation, imposes some limitations.

SRE vs DevOps: Can they coexist or do they compete?

Sep 9, 2022 By Gremlin

Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.

Get EBook

What is a "service" in a microservices architecture?

Sep 2, 2022 By Andre Newman In Gremlin

The past ten years marked a significant change in how software teams build and deploy applications. We moved away from bulky, slow, monolithic applications toward lightweight, scalable, distributed service-based applications. Meanwhile, tools like Docker, Kubernetes, and other container platforms helped accelerate this process. Despite this rapid growth, a fundamental question remains: what exactly is a service, and how does it fit into a microservices architecture?

What are the four Golden Signals?

Sep 2, 2022 By Andre Newman In Gremlin

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Its Site Reliability Engineering Handbook, first published in 2016, details the practices the company used to maintain reliability as it scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?

Four tests to measure and improve reliability: what matters and how it works

Sep 2, 2022 By Andre Newman In Gremlin

Legendary race car driver Carroll Smith once said, "until we have established reliability, there is no sense at all in wasting time trying to make the thing go faster." Even though he was referring to cars, the same goes for technology: no amount of code optimization or new features can replace stable systems. Unfortunately, much like race cars, it's hard to know that a system is unreliable until it blows a tire, the brakes stop working, or the steering wheel comes off the column.

How to add a Golden Signal to a service in Gremlin RM

Sep 2, 2022 By Gremlin In Gremlin

In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.

How to add a Service to Gremlin Reliability Management (RM)

Sep 2, 2022 By Gremlin In Gremlin

This short demo video shows you how to add a Kubernetes service to Gremlin Reliability Management (RM). We'll walk you through selecting the parts of your infrastructure that make up your service, identifying processes for dependency detection, and adding your Golden Signals.
