Reliability lessons from the 2025 AWS DynamoDB outage

On October 19 and 20, 2025, the AWS US-EAST-1 region suffered a massive outage. A 3-hour Amazon DynamoDB outage caused by a DNS issue led to an Amazon EC2 outage that lasted an additional 12 hours before normal service was restored. Over the course of the incident, there were more than 17 million outage reports as companies including Snapchat, Roblox, Amazon, Reddit, and Venmo were impacted.

Running Chaos Engineering on GKE Autopilot Just Got Easier

Harness Chaos Engineering now runs natively on GKE Autopilot. A simple allowlist configuration enables you to test resilience on Google's managed Kubernetes without sacrificing security or requiring workarounds. Google's GKE Autopilot provides fully managed Kubernetes without the operational overhead of node management, security patches, or capacity planning. However, running chaos engineering experiments on Autopilot has been challenging due to its security restrictions. We've solved that problem.

Validating chaos experiments with GCP Cloud Monitoring probes

GCP Cloud Monitoring probes let you transform your existing GCP metrics into automated pass/fail validation for chaos experiments, replacing subjective observation with objective measurement. With flexible authentication options (workload identity or service account keys) and PromQL query support, you can validate infrastructure performance against defined thresholds during controlled failure scenarios.

Monitoring Chaos Experiments with New Relic Probe in Harness

New Relic probes in Harness Chaos Engineering let you automatically validate system performance against defined SLOs during chaos experiments, transforming subjective testing into objective, metrics-driven resilience validation. By querying New Relic metrics in real time and comparing results against your success criteria, you can programmatically verify that your systems maintain acceptable performance levels even under failure conditions.
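
To make the pass/fail idea concrete, here is a minimal sketch of how metric samples might be checked against a success criterion. It is not the Harness or New Relic API: the `Criterion` class, the `evaluate_probe` function, the comparator names, and the sample latency values are all hypothetical, and the choice to treat missing data as a failure is an assumption made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical comparator names; not taken from Harness or New Relic docs.
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Criterion:
    """A success criterion: each sample must satisfy `comparator threshold`."""
    comparator: str
    threshold: float


def evaluate_probe(samples: Sequence[float], criterion: Criterion) -> bool:
    """Return True (pass) only if every sample satisfies the criterion."""
    if not samples:
        # Assumption: no data during the experiment counts as a failure.
        return False
    compare = COMPARATORS[criterion.comparator]
    return all(compare(value, criterion.threshold) for value in samples)


if __name__ == "__main__":
    # Made-up latency samples (ms) collected while a fault is injected.
    latency_ms = [120.0, 180.0, 240.0]
    slo = Criterion(comparator="<=", threshold=250.0)
    print("pass" if evaluate_probe(latency_ms, slo) else "fail")
```

In a real probe the samples would come from a metrics query rather than a hard-coded list; the comparison step is the part this sketch aims to show.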

Scale Chaos Engineering with Automation and AI

Chaos Engineering and Fault Injection testing have been proven to prevent outages, increase availability, and help companies avoid costly downtime. But without the right processes or tools, they require specialized knowledge, a deep understanding of systems, and manual effort for every test. To fully realize the benefits of Chaos Engineering, testing needs to be adopted across all engineering teams without demanding effort or investment that takes away from roadmap progress.

How to test the reliability of a Point of Sale (POS) system

Point of Sale (POS) systems are the backbone of any retail store. A single outage can cost retail companies thousands of dollars each minute in lost sales, and even more if it happens during peak hours. If the outage goes on too long, the damage grows as customers abandon carts and turn to competitors. In an industry where customer loyalty is worth its weight in gold, that brand damage can end up costing more than the initial lost sales.