Is your online gaming platform "Chaos Monkey"-proof?

Try to imagine a bunch of monkeys running around your data center, pulling cables, trashing routers and wreaking havoc on your applications and infrastructure. In these days of heated competition between online gaming operators, player experience is ever more crucial. Continuity of operations is “Uber-Alles”, and avoiding churn due to service disruption is the organizational mantra.

Grubhub and JPMC Shift Reliability Testing Left at Chaos Conf 2020

Gremlin’s Chaos Conf is always an exciting event, bringing together leaders at the forefront of Chaos Engineering practices. This year was no exception, moving beyond defining Chaos Engineering to more advanced discussions of adoption and best practices.

ObservabilityCON Day 4 recap: a panel discussion on observability (and its future), the benefits of Chaos Engineering, and an observability demo showcase

Over the past four days, Grafana Labs' ObservabilityCON 2020 brought together the Grafana community for talks dedicated to observability. We hope you enjoyed all of the sessions, which are now available on demand and linked from the schedule on the event page. The conference wrapped up with predictions and advice from observability experts, lessons in failure, and Grafana Labs team members showcasing ways Grafana and other tools fit into an observability workflow.

Chaos Engineering: How to create an automated Chaos Gauntlet with Gremlin and Jenkins on AWS

In this video, we will demonstrate how to use Gremlin and Jenkins to create an automated Chaos Gauntlet. This will be done using Jenkins Pipelines and Stages to inject a controlled amount of failure with the Gremlin API. We then add a final stage that allows you to optionally halt the attack from the pipeline, rather than having to wait for the full duration of the attack.
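
The shape of such a gauntlet can be sketched in a few lines of Python. This is only an illustration of the staged flow described above, not the pipeline from the video: the `AttackClient` interface, the `FakeClient` stand-in and the `run_gauntlet` function are all hypothetical names, and a real client would wrap calls to the Gremlin API instead of recording them in memory.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Set, Tuple


class AttackClient(Protocol):
    """Hypothetical interface; a real client would wrap the Gremlin API."""

    def start(self, kind: str, length_s: int) -> str: ...

    def halt(self, attack_id: str) -> None: ...


@dataclass
class FakeClient:
    """In-memory stand-in (an assumption) so the sketch runs offline."""

    active: Set[str] = field(default_factory=set)
    started: int = 0

    def start(self, kind: str, length_s: int) -> str:
        # Give each attack a unique id and remember that it is running.
        self.started += 1
        attack_id = f"{kind}-{self.started}"
        self.active.add(attack_id)
        return attack_id

    def halt(self, attack_id: str) -> None:
        # Stop the attack early instead of waiting out its full duration.
        self.active.discard(attack_id)


def run_gauntlet(
    client: AttackClient,
    stages: List[Tuple[str, int]],
    halt_at_end: bool = True,
) -> List[str]:
    """Start one attack per stage, then optionally halt them all."""
    attack_ids = [client.start(kind, length_s) for kind, length_s in stages]
    if halt_at_end:
        # Final stage: halt every attack that the earlier stages started.
        for attack_id in attack_ids:
            client.halt(attack_id)
    return attack_ids


client = FakeClient()
ids = run_gauntlet(client, [("cpu", 60), ("latency", 120)])
print(ids, client.active)
```

Passing `halt_at_end=False` leaves the attacks running for their configured length, which mirrors the optional nature of the final stage.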

Chaos Engineering: The Path to Reliability - Kolton Andrus

We’re all here for the same purpose: to ensure the systems we build operate reliably. This is a difficult task, one that must balance people, process and technology during difficult conditions. We operate with incomplete information, assessing risks and dealing with emerging issues. We’ve found Chaos Engineering to be a valuable tool in addressing these concerns. Learn from real world examples what works, what doesn’t, and what the future holds.

Identifying Hidden Dependencies - Liz Fong Jones

You don't need to write automation or deploy on Kubernetes to gain benefits from resilience engineering! Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose. We'll discuss the initial manual experiments we ran, the bugs in our automatic replacement tools we uncovered, and what steps we needed to progress towards continuously running the experiments. Today, no node at Honeycomb lives longer than 12 months, and we automatically recycle nodes every week.
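
As a rough illustration of age-based node recycling, the selection step might look like the sketch below. It is not Honeycomb's tooling: the function name `nodes_to_recycle`, the node names, the seven-day threshold and the one-node batch size are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def nodes_to_recycle(
    launch_times: Dict[str, datetime],
    now: datetime,
    max_age: timedelta,
    batch_size: int = 1,
) -> List[str]:
    """Return up to batch_size nodes older than max_age, oldest first."""
    expired = [
        (launched, name)
        for name, launched in launch_times.items()
        if now - launched > max_age
    ]
    # Sorting by launch time puts the oldest node first; a small batch
    # keeps most of the cluster serving while replacements come up.
    return [name for _, name in sorted(expired)[:batch_size]]


# Hypothetical fleet, for illustration only.
fleet = {
    "kafka-1": datetime(2020, 10, 1),
    "kafka-2": datetime(2020, 10, 18),
    "zk-1": datetime(2020, 9, 1),
}
print(nodes_to_recycle(fleet, datetime(2020, 10, 20), timedelta(days=7)))
```

Terminating only a small batch per run is what makes it possible to notice a broken replacement tool before the whole fleet has been touched.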

Looking back on Chaos Conf 2020

It’s already been a week since we closed our third annual Chaos Conf! While we were forced to take the conference online, this meant that more of you could join us. Over 3,500 people signed up to help make this the world’s largest Chaos Engineering conference. That’s 5x more than 2019, and nearly 10x more than 2018! This is a testament to the growth of Chaos Engineering as a practice across many different industries and around the world.