
Reliability lessons from the 2025 AWS DynamoDB outage

Nov 7, 2025 By Gavin Cahill In Gremlin

On October 19th and 20th, 2025, the AWS region US-EAST-1 suffered a massive outage. What started as a 3-hour Amazon DynamoDB outage caused by a DNS issue led to an Amazon EC2 outage that lasted an additional 12 hours before normal service was restored. Over the course of the outage, there were over 17 million outage reports as companies like Snapchat, Roblox, Amazon, Reddit, Venmo, and more were impacted.

Running Chaos Engineering on GKE Autopilot Just Got Easier

Nov 5, 2025 By Ashutosh Bhadauriya In Harness

Harness Chaos Engineering now runs natively on GKE Autopilot. A simple allowlist configuration enables you to test resilience on Google's managed Kubernetes without sacrificing security or requiring workarounds. Google's GKE Autopilot provides fully managed Kubernetes without the operational overhead of node management, security patches, or capacity planning. However, running chaos engineering experiments on Autopilot has been challenging due to its security restrictions. We've solved that problem.

Systems need to be failure proof

Nov 4, 2025 By Gremlin In Gremlin

The best advice from Anish Behanan at Capgemini about reliability? Every system needs to be failure proof.

If you don't test in production, you're missing risks

Oct 31, 2025 By Gremlin In Gremlin

Testing in production can be scary, but it’s necessary to improve reliability. Check out this clip from when Gremlin Co-founder and CEO Kolton Andrus sat down with Stephen Townshend on the Slight Reliability podcast!

Validating chaos experiments with GCP Cloud Monitoring probes

Oct 31, 2025 By Ashutosh Bhadauriya In Harness

GCP Cloud Monitoring probes let you transform your existing GCP metrics into automated pass/fail validation for chaos experiments, replacing subjective observation with objective measurement. With flexible authentication options (workload identity or service account keys) and PromQL query support, you can validate infrastructure performance against defined thresholds during controlled failure scenarios.

Field of Dreams DevOps doesn't scale

Oct 29, 2025 By Gremlin In Gremlin

Having trouble scaling reliability? Gremlin CEO Kolton Andrus talks about why just building a great tool isn’t enough.

Monitoring Chaos Experiments with New Relic Probe in Harness

Oct 28, 2025 By Ashutosh Bhadauriya In Harness

New Relic probes in Harness Chaos Engineering let you automatically validate system performance against defined SLOs during chaos experiments, transforming subjective testing into objective, metrics-driven resilience validation. By querying New Relic metrics in real-time and comparing results against your success criteria, you can programmatically verify that your systems maintain acceptable performance levels even under failure conditions.

Change engineering culture with Chaos Engineering

Oct 23, 2025 By Gremlin In Gremlin

How do you spur an engineering cultural shift with Chaos Engineering? Gremlin founder and CEO Kolton Andrus explains how, and how it changed the Gremlin platform.

Scale Chaos Engineering with Automation and AI

Oct 23, 2025 By Gremlin In Gremlin

Chaos Engineering and Fault Injection testing have been proven to prevent outages, increase availability, and help companies avoid costly downtime. But without the right processes or tools, they require specialized knowledge, a deep understanding of systems, and manual effort for every test. To fully realize the benefits of Chaos Engineering, testing needs to be adopted across all engineering teams without causing a lift or investment that takes away from roadmap progress.

How to test the reliability of a Point of Sale (POS) system

Oct 20, 2025 By Gavin Cahill In Gremlin

Point of Sale (POS) systems are the backbone of any retail store. A single outage can cost retail companies thousands of dollars each minute in lost sales, and more still if it happens during peak hours. If the outage drags on, customers abandon carts and turn to competitors. In an industry where customer loyalty is worth its weight in gold, that brand damage can end up costing more than the initial lost sales.
