Gremlin

https://www.gremlin.com/

San Jose, CA, USA

2016

Three serverless reliability risks you can solve today using Failure Flags

Oct 16, 2024 | By Andre Newman

Serverless platforms make it incredibly easy to deploy applications. You can take raw code, push it up to a service like AWS Lambda, and have a running application in just a few seconds. The serverless platform provider assumes responsibility for hosting and operating the platform, freeing you up to focus on your application. Naturally, this raises a question: if something goes wrong, who’s responsible?

Read Post

Best Practices for Testing Zone Redundancy

Oct 16, 2024 | By Sam Rossoff

As the story goes, Amazon used to cut power to its data centers to see whether its services were actually redundant across them, and abandoned the practice only when EC2 customers started to complain (no matter how many times they were warned their instances might disappear without notice). This story may be apocryphal, but power loss is not the only way a data center can go down.

Read Post

Interpreting your reliability test results

Sep 19, 2024 | By Andre Newman

Gremlin’s default suite of reliability tests analyzes critical functions of modern services: scalability, redundancy, and resilience to dependency failures. Services that pass this suite of tests can be trusted to remain available during unexpected incidents. But what happens when a service fails a test? How do you take failed test results and turn them into actionable insights? This blog aims to answer that question.

Read Post

Release Roundup August 2024

Sep 9, 2024 | By Andre Newman

Over the past year, the Gremlin team has focused on giving you more tools to adapt Gremlin to your organization’s reliability needs. We started with customizable reliability tests, and now, we’ve released customizable role-based access controls (RBAC). We’ve also made it easier to target specific availability zones when running Failure Flags experiments, and to run experiments behind a proxy. Keep reading to learn more!

Read Post

Reliability recommendations when adopting Kubernetes

Sep 3, 2024 | By Andre Newman

Kubernetes just celebrated its tenth birthday. That’s 10 years of microservices, containers, service meshes, and many other paradigms that are now common to many developers’ toolkits.

Read Post

How to verify, document, and prove compliance with Gremlin

Aug 29, 2024 | By Gavin Cahill

Resilient and reliable IT systems have become a minimum requirement for modern businesses, a fact driven home by any number of high-profile outages over the past few years. Unfortunately, when those outages hit the financial sector, they can have far-reaching and incredibly damaging results.

Read Post

How to test AWS managed services with Gremlin

Aug 1, 2024 | By Andre Newman

Note: In this blog, we use “managed service providers” to refer to companies that provide hosted computing services, not managed IT service providers (MSPs).

When was the last time you thought about the reliability of your cloud dependencies? The biggest challenge with using cloud platforms and SaaS services is also their biggest strength: the provider controls everything.

Read Post

How role-based access control (RBAC) works in Gremlin

Jul 25, 2024 | By Andre Newman

Reliability testing and Chaos Engineering are essential for finding reliability risks and improving the resiliency of systems. Gremlin makes it easy to do so, but not every engineer needs access to the same experiments, systems, or services. That’s why we released customizable role-based access controls (RBAC), which let you control which actions your users can perform in Gremlin.

Read Post

Testing for expiring TLS and SSL certificates using Gremlin

Jul 16, 2024 | By Andre Newman

Encryption is a fundamental part of nearly every modern application, whether you’re storing data, sending data to customers, or sharing data between backend services. Most organizations have a data encryption strategy, and nearly every web page is using HTTPS, thanks to initiatives like Let’s Encrypt. But setting up encryption isn’t a one-time initiative. Over time, the certificates backing modern encryption expire and need to be replaced.
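One way to catch an expiring certificate before it causes an outage is to compare its expiry date against a renewal threshold. The sketch below is an illustration only, not Gremlin's test: the function names and the 30-day threshold are assumptions, and the date string follows the `notAfter` format that Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's expiry.

    `not_after` uses the certificate time format handled by the ssl module,
    e.g. "Jan  1 00:00:00 2025 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, threshold_days: float = 30) -> bool:
    """True when the certificate expires within the (assumed) renewal window."""
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    today = datetime(2024, 12, 22, tzinfo=timezone.utc)
    print(needs_renewal("Jan  1 00:00:00 2025 GMT", today))
```

For a live endpoint, the `notAfter` value could be read from the dictionary returned by `ssl.SSLSocket.getpeercert()` after a TLS handshake.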

Read Post

How to load-balance across multiple availability zones for improved redundancy

Jul 11, 2024 | By Andre Newman

Load balancers are some of the most important load-bearing (pun intended) components in cloud environments. They perform multiple critical tasks: network switching, packet inspection, and of course, routing. Most cloud-based load balancers focus on load balancing within a single zone, but what if you have resources spread across multiple zones?

Read Post

Office Hours: How to test serverless applications using Failure Flags

Oct 10, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Serverless applications are ideal for deploying scalable applications without having to manage infrastructure. However, this also makes it difficult to test their reliability. It’s easy to simulate a network outage or latency when you have direct access to the host that your software’s running on. What do you do when you only have control over the code?

View Video

Office Hours: Get better reliability on AWS with our new release

Sep 12, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Cloud platforms make it easier than ever to deploy massively scalable, distributed workloads, but this is a double-edged sword. There are reliability challenges unique to the cloud that didn’t exist before. Failed migrations, recurring incidents, and reliability toil take their toll.

View Video

Achieving SLO Success with Golden Signals and Reliability Testing

Aug 28, 2024 | By Gremlin

The four Golden Signals are an easy and effective way to measure the most important aspects of a system, and when paired with a reliability management platform like Gremlin, they help you proactively meet your SLOs so you can meet your legal obligations and deliver the perfect customer experience.
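The four Golden Signals are commonly listed as latency, traffic, errors, and saturation. As a rough sketch of how they might be summarized for one time window and checked against an SLO, assuming a hypothetical `Request` record and made-up thresholds (none of these names come from Gremlin):

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    ok: bool


def golden_signals(requests, window_s, used_capacity, total_capacity):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    count = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer math.
    p95 = durations[(95 * count + 99) // 100 - 1] if count else 0.0
    failures = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": count / window_s,
        "error_rate": failures / count if count else 0.0,
        "saturation": used_capacity / total_capacity,
    }


def meets_slo(signals, max_p95_ms, max_error_rate):
    """Check the latency and error signals against assumed SLO targets."""
    return (
        signals["latency_p95_ms"] <= max_p95_ms
        and signals["error_rate"] <= max_error_rate
    )
```

Running a reliability test while watching these numbers shows whether the service stays inside its targets under failure, which is the pairing the description above refers to.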

View Video

5 essential resilience tests for a successful cloud migration

Aug 8, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Migrating to the cloud usually means faster deployments and easier scalability, but it also means latency. Cloud applications communicate over distributed networks, and while these networks are fast, little bits of latency can quickly add up.

View Video

Are you testing for known reliability vulnerabilities?

Aug 1, 2024 | By Gremlin

"Risks have different priorities, but ultimately we want to be aware of those risks. Just like we want our security team to go scan for known vulnerabilities, our reliability team should be scanning for known vulnerabilities. And those are easy things we should go address. There's a second part of it, which is kind of just good engineering testing, which is: Hey, we have a set of test cases that we know need to pass.

View Video

Build reliability efforts into your regular engineering schedule

Jul 25, 2024 | By Gremlin

Improving reliability might seem daunting, but you'd be surprised how much impact you can have with a relatively light lift. "Reliability doesn't need to be everybody stopped the world for a month, kind of a tech debt thing. If we spent 20 minutes a week, we could actually save ourselves a ton of time over the course of the year. The business needs to be efficient and agile, but it's important that the reliability is there. And so we really need people to be able to react quickly, adapt, and do a little bit along the way.

View Video

How to balance reliability with other DevOps priorities

Jul 23, 2024 | By Gremlin

Reliability efforts do take up some bandwidth, but in the end it's worth it—as our customers find out when their outage costs go down. "Everyone has their own priorities that they're dealing with. Given unlimited time and money, absolutely everyone would want to build the best possible system that is the most secure, performant, resilient, and everything.

View Video

How to Build Resilience Throughout Your SDLC: Lessons from a Top 10 Bank

Jul 19, 2024 | By Gremlin

Are your applications as reliable as you planned? How do you know? The only way to ensure systems are resilient to common failure conditions is to test them, yet many large enterprises struggle with the effort and expense to do so. In this webinar, Anantha Movva, a former head of SRE and Performance Engineering at one of the top 10 North American banks, will share how he drove Chaos Engineering and resilience testing adoption throughout his organization.

View Video

Software reliability and availability are the whole team's problem, not just a few engineers'

Jul 18, 2024 | By Gremlin

Reliability is everyone's problem—not just the SRE team's. "It's not just the SRE's problem. It's everybody's problem. So the SREs, they can run point and they can help report and help us understand, but we also have to hold the teams accountable. Are the teams investing time in reliability? Are they finding and fixing issues? Are we giving them space? And I think that comes back to, does the business see the benefit and do we have a good way of quantifying the benefit to the business?"—Kolton Andrus, Gremlin CTO.

View Video

Spend a little time on software reliability now instead of a lot of time later

Jul 11, 2024 | By Gremlin

You're going to spend time fixing reliability, but it's your choice whether that happens during an outage or ahead of time, on your schedule and at lower cost. Which will you choose? "We all know when things go wrong, it cost us a million dollars and it was really bad. Let's have that never happen again. But when we say, I need every engineering team to spend one hour, one day a week on reliability, does everyone lose their mind, or is that a reasonable request? Can we amortize out the cost of that?

View Video


SRE vs DevOps: Can they coexist or do they compete?

Sep 9, 2022 | By Gremlin

Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.

Get EBook

Getting Started with Gremlin Attacks

Mar 29, 2022 | By Gremlin

Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.

Get EBook

Chaos Engineering: Finding Failures Before They Become Outages

Jul 25, 2020 | By Gremlin

Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.

Get White Paper

How to Implement Chaos Engineering at Your Company

Jul 25, 2020 | By Gremlin

By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.

Get White Paper

Chaos Engineering for DynamoDB

Jul 25, 2020 | By Gremlin

Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. In this guide, we cover: Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.

Get White Paper

How to Convince Your Organization to Adopt Chaos Engineering

Jul 1, 2020 | By Gremlin

Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.

Get White Paper

Chaos Engineering for MongoDB

Jul 1, 2020 | By Gremlin

MongoDB is designed for performance, scale, and high availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably: In order to ensure you get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Get White Paper


Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
State Gremlins: Reboot hosts, kill processes, travel in time.
Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

Test for failure in your code.
Fail or delay serverless functions.
Narrow the impact to a single user, device, or percentage of traffic.
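The idea of failing or delaying a function from inside its own code, and limiting that to specific users or a percentage of traffic, can be sketched as follows. This is a generic illustration under stated assumptions, not the Gremlin Failure Flags SDK: `Experiment`, `maybe_inject`, and `InjectedFailure` are hypothetical names invented for this example.

```python
import random
import time
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised in place of a real error while an experiment is active."""


@dataclass
class Experiment:
    """Hypothetical experiment definition; not Gremlin's data model."""
    selector: dict              # labels a call must carry to be targeted
    percentage: float = 100.0   # share of matching calls to impact
    latency_s: float = 0.0      # delay to add before the call proceeds
    fail: bool = False          # raise an error instead of proceeding


def maybe_inject(labels, experiment, rng=random.random, sleep=time.sleep):
    """Apply the experiment to this call if targeted; return True if impacted."""
    targeted = all(labels.get(k) == v for k, v in experiment.selector.items())
    if not targeted or rng() * 100 >= experiment.percentage:
        return False
    if experiment.latency_s > 0:
        sleep(experiment.latency_s)
    if experiment.fail:
        raise InjectedFailure("injected by experiment")
    return True


def handler(event, experiment):
    """Example serverless-style handler with an injection point at the top."""
    maybe_inject({"user": event["user"]}, experiment)
    return {"status": 200, "user": event["user"]}
```

Passing `rng` and `sleep` as parameters is a design choice for this sketch: it keeps the sampling and the delay easy to replace in tests, so the injection logic can be verified without waiting or relying on randomness.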

Avoid downtime. Use Gremlin to turn failure into resilience.
