
Chaos Engineering

How to ensure your Kubernetes Pods and containers can restart automatically

As complex as Kubernetes is, much of it can be distilled to one simple question: how do we keep containers available for as long as possible? All of the various utilities, features, platform integrations, and observability tools surrounding Kubernetes tend to serve this one goal. Unfortunately, this also means there’s a lot of complexity and confusion surrounding this topic. After all, most people would agree that availability is important, but how exactly do you go about achieving it?
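
As a starting point, automatic restarts are largely governed by a Pod's restartPolicy and its probes. The sketch below, written against the official Kubernetes Python client, creates a Pod whose containers the kubelet restarts whenever they exit or fail a liveness probe; the Pod name, image, and probe path are illustrative placeholders, not taken from any particular setup.

    # Minimal sketch using the official Kubernetes Python client; the Pod name,
    # image, and probe path are illustrative placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="web", namespace="default"),
        spec=client.V1PodSpec(
            # "Always" tells the kubelet to restart containers whenever they exit.
            restart_policy="Always",
            containers=[
                client.V1Container(
                    name="web",
                    image="nginx:1.25",
                    # A liveness probe lets the kubelet restart a container that is
                    # still running but no longer responding.
                    liveness_probe=client.V1Probe(
                        http_get=client.V1HTTPGetAction(path="/", port=80),
                        initial_delay_seconds=5,
                        period_seconds=10,
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)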

How to ensure your Kubernetes cluster can tolerate lost nodes

Redundancy is a core strength of Kubernetes. Whenever a component such as a Pod or Deployment fails, Kubernetes can usually detect and replace it automatically, without any human intervention. This saves DevOps teams a ton of time and lets them focus on developing and deploying applications rather than managing infrastructure.
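
One way to check that redundancy in practice is to mimic a lost node and watch Kubernetes recover. The rough sketch below uses the Kubernetes Python client to cordon a node and delete a Deployment's Pods running there, then confirms the replicas come back elsewhere; the node name, namespace, label selector, and Deployment name are placeholders.

    # Rough sketch: cordon a node to mimic losing it, evict a Deployment's Pods
    # from it, and confirm the ReplicaSet controller reschedules them elsewhere.
    # Node, namespace, and Deployment names are placeholders.
    import time
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    node_name = "worker-1"

    # Mark the node unschedulable (equivalent to `kubectl cordon worker-1`).
    core.patch_node(node_name, {"spec": {"unschedulable": True}})

    # Delete the Deployment's Pods on that node; the controller recreates them
    # on nodes that are still schedulable.
    pods = core.list_namespaced_pod("default", label_selector="app=web")
    for pod in pods.items:
        if pod.spec.node_name == node_name:
            core.delete_namespaced_pod(pod.metadata.name, "default")

    # Wait, then verify the Deployment reports its desired replica count again.
    time.sleep(60)
    deploy = apps.read_namespaced_deployment("web", "default")
    assert deploy.status.ready_replicas == deploy.spec.replicas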

How to test your systems for scalability and redundancy with Fault Injection

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region? Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is Fault Injection.
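
A useful first check, before injecting any zone failure, is whether a service's Pods are even spread across more than one availability zone. The sketch below uses the Kubernetes Python client and the standard topology.kubernetes.io/zone node label; the namespace and label selector are placeholders.

    # Sketch: before injecting an AZ failure, check whether a service's Pods are
    # actually spread across more than one zone. Zones come from the standard
    # topology.kubernetes.io/zone node label; selector values are placeholders.
    from collections import Counter
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Map node name -> availability zone.
    zones = {
        node.metadata.name: node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
        for node in core.list_node().items
    }

    pods = core.list_namespaced_pod("default", label_selector="app=checkout")
    spread = Counter(zones.get(pod.spec.node_name, "unscheduled") for pod in pods.items)

    print(spread)  # e.g. Counter({'us-east-1a': 2, 'us-east-1b': 2, 'us-east-1c': 1})
    if len(spread) < 2:
        print("Single-zone deployment: an AZ outage would take this service down.")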

How to standardize resiliency on Kubernetes

There’s more pressure than ever to deliver high-availability Kubernetes systems, but a combination of organizational and technological hurdles makes this easier said than done. Technologically, Kubernetes is complex and ephemeral, with deployments that span infrastructure, cluster, node, and Pod layers. And as with any complex, ephemeral system, the large number of constantly changing parts opens the door to sudden, unexpected failures.

Where to automate resilience testing in your SDLC

When organizations begin to deploy resilience testing or Chaos Engineering, there’s a natural question: can we integrate this with our CI/CD pipeline or release automation tools? After all, you’re likely running unit, performance, and integration tests already, so is resilience testing any different? The short answer to both questions is yes: integration is possible, but resiliency is different, so automating it is a nuanced conversation.

Resiliency is different on AWS: Here's how to manage it

There’s a common misconception about running workloads in the cloud: the cloud provider is responsible for reliability. After all, they’re hosting the infrastructure, services, and APIs. That leaves little else for their customers to manage, other than the workloads themselves…right?

Fault Injection in your release automation

One of the real successes of the Agile software development movement has been the push for regular, frequent deployments. This has manifested as build and deployment automation and the widespread adoption of CI/CD. As engineers automate more of their software release lifecycle, an important question is how to automate quality assurance, which includes resilience testing and, more specifically, Fault Injection.
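
In practice, that automation often looks like a pipeline step that triggers a fault injection experiment after a staging deploy and gates the release on the result. The short Python sketch below illustrates that shape only; the endpoint, payload, and token are hypothetical placeholders, not a real, documented API.

    # Illustrative pipeline step: trigger a fault injection experiment after a
    # staging deploy and gate the release on the result. The endpoint, payload,
    # and token below are hypothetical placeholders, not a documented API.
    import os
    import sys
    import time
    import requests

    BASE = "https://faults.example.com/api"          # hypothetical service
    HEADERS = {"Authorization": f"Bearer {os.environ['FAULT_API_TOKEN']}"}

    # Kick off a CPU-pressure experiment against the newly deployed service.
    run = requests.post(
        f"{BASE}/experiments",
        json={"target": "checkout-staging", "fault": "cpu", "percent": 80, "duration_s": 120},
        headers=HEADERS,
    ).json()

    # Poll until the experiment finishes, then fail the pipeline if it failed.
    while True:
        status = requests.get(f"{BASE}/experiments/{run['id']}", headers=HEADERS).json()
        if status["state"] in ("passed", "failed"):
            break
        time.sleep(10)

    sys.exit(0 if status["state"] == "passed" else 1)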

How to find Kubernetes reliability risks with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.
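
To give a feel for the kinds of risks involved, the sketch below uses the Kubernetes Python client to flag a few of them by hand: failed Pods, single-replica Deployments, and containers missing resource requests and limits. It is only an illustration of the risk categories, not how Detected Risks works internally.

    # Rough illustration of a few of the risk categories described above, written
    # against the Kubernetes Python client; not how Detected Risks works internally.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # Failed Pods.
    for pod in core.list_pod_for_all_namespaces().items:
        if pod.status.phase == "Failed":
            print(f"Failed Pod: {pod.metadata.namespace}/{pod.metadata.name}")

    for deploy in apps.list_deployment_for_all_namespaces().items:
        name = f"{deploy.metadata.namespace}/{deploy.metadata.name}"

        # Single point of failure: only one replica requested.
        if (deploy.spec.replicas or 1) < 2:
            print(f"Single replica: {name}")

        # Missing resource definitions: containers without requests or limits.
        for c in deploy.spec.template.spec.containers:
            if not c.resources or not (c.resources.requests and c.resources.limits):
                print(f"Missing resource requests/limits: {name}/{c.name}")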

How to scale your systems based on CPU utilization

CPU usage is one of the most common metrics in observability and cloud computing, and for good reason: it represents how much work a system is performing, and if it’s near 100% capacity, adding more work could make the system unstable. The solution is to scale out: add more hosts with more CPU capacity, migrate some of your workloads to the new hosts, and split the traffic between them using a load balancer.
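
The arithmetic behind CPU-based scaling is straightforward whether you're adding hosts or replicas. Kubernetes' Horizontal Pod Autoscaler, for example, computes the desired replica count as ceil(currentReplicas * currentUtilization / targetUtilization). The short Python sketch below works through that formula with illustrative numbers.

    # Sketch of the proportional scaling arithmetic used by Kubernetes'
    # Horizontal Pod Autoscaler; the utilization numbers are illustrative.
    import math

    def desired_replicas(current_replicas: int, current_cpu_pct: float, target_cpu_pct: float) -> int:
        """Return how many replicas are needed to bring average CPU back to target."""
        return math.ceil(current_replicas * (current_cpu_pct / target_cpu_pct))

    # 4 replicas averaging 90% CPU with a 60% target -> scale out to 6 replicas.
    print(desired_replicas(4, 90, 60))   # 6
    # 6 replicas averaging 30% CPU with a 60% target -> scale in to 3 replicas.
    print(desired_replicas(6, 30, 60))   # 3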