Operations | Monitoring | ITSM | DevOps | Cloud

Chaos Engineering

The case for Fault Injection testing in Production

Many organizations who are looking to introduce Fault Injection as a testing technique start with non-production environments, but don't always go back and reconsider that choice as they mature beyond initial assessment. However, there's a strong case for running these tests in your live systems. It's important to consider the trade-offs when choosing to test in production or non-production environments, as it can have far-reaching impacts on the efficacy and cost of improving the resilience of software.

How to find and test critical dependencies with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz—what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

How to use host redundancy to improve service reliability and availability

Cloud computing has made provisioning new servers easy, fast, and relatively cheap. Almost anyone can log into a cloud console, spin up a new server, and deploy an application. And if they need greater uptime, major cloud providers include all kinds of settings, services, and configurations to add fault tolerance and failover. So why is it that many services fail when a single server instance fails?

10 Most Common Kubernetes Reliability Risks

Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen. In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more. And they’re more prevalent than you might think.

How dependency discovery works in Gremlin

Modern applications are rarely created entirely from scratch. Instead, they rely on a framework of pre-existing applications and services, each adding specific features and functionality. These dependencies empower teams to build and deploy applications more efficiently, but they bring their own set of challenges. Tracking, managing, and updating these dependencies is difficult, especially in large, complex applications where dependencies are likely managed by different teams.

Chaos engineering in an Azure environment: Confident enough to try it?

What could go wrong with your Azure environment? Netflix gave the world two beautiful gifts: a media streaming platform for the general public and a wonderful monkey for the tech community. Enough has been said about the media streaming part, so let's play (or work) with the monkey now. When Netflix let the world know about Chaos Monkey, the tech community took a minute to stand and applaud. Since then, it has been a standard to unleash intentional chaos just to see how robust our tech stacks really are.

How to make your services zone redundant

In January of 2020, an entire availability zone (AZ) in AWS’ Sydney region suddenly went dark. Multiple facilities lost power, preventing customers from accessing EC2 instances and Elastic Block Storage (EBS) volumes. Customers who didn’t have backup infrastructure in another zone had to wait nearly 8 hours before service was restored, and even then, some EBS volumes couldn’t be recovered. Major cloud provider outages are rare, but they happen nonetheless.

Measuring the impact of your reliability work with reports

Improving reliability is important, but how do you prove that your efforts are having an impact? A critical part of reliability work is having the tools to measure and track your progress. Gremlin supports this by providing several built-in reports, which update automatically and are available on-demand. This blog post is a quick introduction to Gremlin’s reporting capabilities.

Reducing cloud reliability risks with the AWS Well-Architected Framework

Designing and deploying applications in the cloud can be a labyrinthian exercise. There are dozens of cloud providers, each offering dozens of services, and each of those services has any number of configurations. How are you supposed to architect your systems in a way that gives your customers the best possible experience? AWS recognized this, and in response, they created the AWS Well-Architected Framework (WAF) to guide customers.