Chaos Testing a PostgreSQL Cluster: How Kubernetes Can Restore Replica Failures (in 30 Seconds)

Jun 19, 2025 By Coroot In Coroot

🐧🐝 Use open source, automatic eBPF observability to quickly fix Patroni failures in your Kubernetes cluster: https://t.ly/qBH9f

#devops #opensource #observability #kubernetes #postgresql

How to test your systems for scalability and redundancy with fault injection

Jun 13, 2025 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region? Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is with Fault Injection.

How to be prepared for cloud provider outages

Jun 13, 2025 By Gavin Cahill In Gremlin

GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour, 28-minute outage affected dozens of companies and spanned 80+ Google services and products. But what was really illuminating was just how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because they depended, directly or through their vendors, on services that do use GCP.

How to set up chaos engineering in your CI/CD pipeline with CircleCI and Chaos Toolkit

May 16, 2025 By Kevin Kimani In CircleCI

Distributed architecture is increasingly being adopted in current software systems because it brings great scalability and flexibility, keeping them resilient under real-world conditions. Unfortunately, this distribution also introduces new points of failure in the systems. Traditional testing methods are no longer enough; they focus only on whether a system works, not on whether it keeps working under stress or failure. That is where chaos engineering comes in.

How to test Istio and other service meshes

May 8, 2025 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Service meshes bring applications together, but not always reliably. Even the most well-configured Istio deployment can have unexpected reliability risks that aren’t apparent until you’re already in production. Latency, single points of failure, poorly defined APIs—these problems can grow beyond a single service and impact the user experience for your entire application.

How to find Kubernetes reliability risks with Gremlin

Apr 21, 2025 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.

Three key facts about serverless reliability

Apr 8, 2025 By Andre Newman In Gremlin

Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability. In this blog, we’ll look at three important facts about serverless reliability that teams often overlook.

Ensuring your AI systems can scale to meet demand

Apr 1, 2025 By Andre Newman In Gremlin

The amount of traffic handled by AI systems can’t be overstated. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources jumped by 1,200% since July 2024. While demand for AI-powered workloads is steadily increasing overall, traffic to individual AI providers is much more unpredictable. User demand spikes and wanes unexpectedly, but like any service, users expect you to always be available and responsive.

How to keep track of what's running in your Gremlin team

Mar 13, 2025 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Reliability testing is ongoing, and tracking that work can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on, unless you use Gremlin.

Test serverless and application-level reliability with Failure Flags

Mar 13, 2025 By Gavin Cahill In Gremlin

It’s been a year and a half since Failure Flags was released. Since then, customers have used Failure Flags to run thousands of tests for applications running on serverless platforms, containers, and service meshes.
