
Chaos Engineering

Validating the resilience of your API gateway with Chaos Engineering

Get started with Gremlin's Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. API gateways are a critical component of distributed systems and cloud-native deployments. They perform many important functions including request routing, caching, user authentication, rate limiting, and metrics collection. However, this means that any failures in your API gateway can put your entire deployment at risk.

What is fault injection?

When reading about Chaos Engineering, you’ll likely hear the terms “fault injection” or “failure injection.” As the name suggests, fault injection is a technique for deliberately introducing stress or failure into a system in order to see how the system responds. But what exactly does this mean, and how does this relate to Chaos Engineering?

Chaos Engineering, Explained

Chaos engineering has definitely become more popular in the decade or so since Netflix introduced it to the world via its Chaos Monkey service, but it’s far from ubiquitous. However, that will almost certainly change over time as more organizations become familiar with its core concepts, adopt application patterns and infrastructure that can tolerate failure, and understand that an investment in reliability today could save millions of dollars tomorrow.

What is Chaos Engineering and How to Implement It

Chaos Engineering is one of the hottest new approaches in DevOps. Netflix pioneered it back in 2008, and since then it’s been adopted by thousands of companies, from the biggest names in tech to small software companies. In our age of highly distributed cloud-based systems, Chaos Engineering promotes resilient system architectures by applying scientific principles. In this article, I’ll explain exactly what Chaos Engineering is and how you can make it work for your team.

Tyler Wells on building a culture of reliability at Twilio

What does reliability look like at a company that has thousands of employees and provides critical communication services to over 150,000 customers? We talked with Tyler Wells, Senior Director of Engineering at Twilio, to learn how he and his team created a culture of reliability at Twilio. He talked in depth about his experiences developing reliability goals, building reliability practices, and aligning engineering teams on these objectives.

Improve M&A success rates by testing for system reliability

Coming out of recessions, merger and acquisition volume typically picks up as lower interest rates drop the cost of capital and Corporate Development teams begin executing on the strategies they’ve developed during the holding periods. This year has been no exception, with $350 billion spent on tech acquisitions to date.

Podcast: Break Things on Purpose | Ep. 11: Ryan Kitchens, Senior Site Reliability Engineer at Netflix

We’re excited to kick off Season 2 of Break Things on Purpose next month. In anticipation of our next season, here’s a bonus show from our archives! Subscribe to Break Things on Purpose wherever you get your podcasts. Find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

What is Chaos Engineering and Why is it Important?

So, why would you deliberately try to break your services? Chaos engineering does just that, for example by deliberately terminating instances in your production environment. Online video streaming service Netflix was one of the first organizations to popularize the concept with its Chaos Monkey engine.