Gremlin

San Jose, CA, USA
2016
  |  By Andre Newman
Encryption is a fundamental part of nearly every modern application, whether you’re storing data, sending data to customers, or sharing data between backend services. Most organizations have a data encryption strategy, and nearly every web page is using HTTPS, thanks to initiatives like Let’s Encrypt. But setting up encryption isn’t a one-time initiative. Over time, the certificates backing modern encryption expire and need to be replaced.
  |  By Andre Newman
Load balancers are some of the most important load-bearing (pun intended) components in cloud environments. They perform multiple critical tasks: network switching, packet inspection, and of course, routing. Most cloud-based load balancers focus on load balancing within a single zone, but what if you have resources spread across multiple zones?
  |  By Andre Newman
Reliability testing and observability are similar in one important way: engineering teams know they should be doing it, but they’re not sure how to start, or they don’t have the right resources, or they need to focus on competing priorities like feature development and incident response. In an ideal world, reliability and observability would be automated processes that configure, monitor, and run themselves.
  |  By Gavin Cahill
When it comes to reliability, cloud providers use a Shared Responsibility Model. In essence, they’ll keep the infrastructure reliable, while you’re responsible for architecting reliability into your systems. To help make this easier, they’ve published a variety of best practice guides, such as the AWS Well-Architected Framework. These lengthy documents are filled with recommendations to help you architect a more secure, more reliable system.
  |  By Andre Newman
The worst thing you could do after successfully deploying to a new environment is to accidentally delete critical infrastructure. Unfortunately, that happened to one Google Cloud customer when their private cloud subscription was accidentally deleted, resulting in nearly two weeks of downtime. This isn’t an isolated problem either: Microsoft Azure had a similar problem when a typo inadvertently deleted an entire SQL Server instance rather than a specific database.
  |  By Gavin Cahill
There’s a reason why observability and incident response practices have become standard across modern software development. Anyone wanting to minimize downtime and deliver reliable, available applications needs to have fully instrumented systems and playbooks so they can respond quickly and effectively to outages or incidents. But there’s another piece to the reliability puzzle: resilience testing.
  |  By Ryan Detwiller
Today, Gremlin is introducing Gremlin for AWS, a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure.
  |  By Andre Newman
Migrating to a new platform can often feel like navigating a maze of technical challenges, especially when the platform is as complex as Kubernetes. Kubernetes has a vast number of features designed to help with deploying and managing large applications, but learning how to use it effectively can be just as challenging as‌ moving your workloads over. This doesn’t mean it’s impossible, of course, and there are several strategies for easing this process.
  |  By Andre Newman
Microservices have forever changed the way we build applications. Tools like Docker and Kubernetes made microservice-based architectures widely accessible to software developers, and cloud platforms like Amazon EKS made deploying containers fast and inexpensive. They've also enabled even small engineering teams to deploy code faster, leverage fault tolerance and redundancy, scale more efficiently, and take full ownership of their services from development all the way into production.
  |  By Andre Newman
Redundancy is a core tenet of cloud computing. While major cloud platforms have high targets for reliability, they can still fail, and it’s important for teams to have a plan for when they do. But how can you build services that can withstand something as disruptive as a datacenter outage? In this blog, we’ll show you how to prepare for availability zone outages by proactively detecting services operating in a single zone.
  |  By Gremlin
You're going to spend time fixing reliability—but it's your choice whether it's during an outage or ahead of time on your schedule and for less costs. Which will you choose? "We all know when things go wrong, it cost us a million dollars and it was really bad. Let's have that never happen again. But when we say, I need every engineering team to spend one hour, one day a week on reliability, does everyone lose their mind, or is that a reasonable request? Can we amortize out the cost of that?
  |  By Gremlin
Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Fully-managed SaaS services offer incredible scalability and accessibility, but at a cost: they’re also single points of failure. If your application depends on a SaaS service and the service fails, guess who your customers will blame? We need to design applications to anticipate and work around managed service failures, but how do we do that without having to wait for the service to fail?
  |  By Gremlin
Thinking about integrating Gremlin into your existing pipeline? Look no farther than the Gremlin API. "The next step then was to build the right tooling such that the resiliency tests can be run from a pipeline. Gremlin's API first approach made it possible to do this in a very easy manner because everything that we could do from the UI and manually, we could replicate all of that through the API as well.
  |  By Gremlin
Are you recognizing the good work engineers do to prevent outages? "The people that are out there doing good work to prevent fires from ever occurring, we're not often recognizing them. We're not often rewarding them. And once things go wrong, someone comes in and fixes it. That's great. That's needed. But we're rewarding that behavior. And so it becomes a bit of people are motivated by what behavior you reward.
  |  By Gremlin
Did you know you can use the Gremlin API to integrate resiliency tests into your CI/CD pipeline? Our partner Nagarro has even made it part of their shift left package. "What we do is shift left and add a chaos stage to the pipeline. We have created the shift left accelerator package. It integrates with load tests and Gremlin APIs to set up the test scenario.
  |  By Gremlin
Gremlin for AWS is a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure. That’s where Gremlin for AWS can help.
  |  By Gremlin
Introducing Gremlin for AWS, a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. Gremlin for AWS helps teams prevent incidents, monitor and test systems for known causes of failure, and gain visibility into the reliability posture of their applications—with 90% less effort.
  |  By Gremlin
If you want to improve reliability, it has to be important from the top down. "As part of the CTO or leadership owning it, they need to tell folks that it's important in the product roadmap, in some of the development schedule, that we spend time on it, that the CEO is the person that holds people accountable, that they review the metrics, that they sit in the outages, that they understand the quality of the software.
  |  By Gremlin
Your reliability measurement can't just be a lagging indicator. "How do you know your company is doing well at reliability? A lot of people will just look at how many outages have you had in the last year and how much customer pain have you caused? I think that's one side of the coin. That's the reactive lagging indicator of the health of our system. To really be good at this, we need a way to understand the risks and the sharp points so that we have an idea of what we're getting into.
  |  By Gremlin
•Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Zone failures are rare, but they still happen. When an entire zone fails, many of the most common redundancy techniques fail. How do you avoid outages like these, especially if they affect an entire datacenter?
  |  By Gremlin
Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.
  |  By Gremlin
Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.
  |  By Gremlin
Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.
  |  By Gremlin
By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.
  |  By Gremlin
Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. In this guide, we cover: Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.
  |  By Gremlin
MongoDB is designed for performance, scale, and high-availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably: In order to ensure you get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.
  |  By Gremlin
Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

  • Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
  • State Gremlins: Reboot hosts, kill processes, travel in time.
  • Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

  • Test for failure in your code.
  • Fail or delay serverless functions.
  • Narrow the impact to a single user, device, or percentage of traffic.

Avoid downtime. Use Gremlin to turn failure into resilience.