Gremlin

https://www.gremlin.com/

San Jose, CA, USA

2016

How reliability differs between monolithic and microservice-based architectures

May 14, 2024 | By Andre Newman

Microservices have forever changed the way we build applications. Tools like Docker and Kubernetes made microservice-based architectures widely accessible to software developers, and cloud platforms like Amazon EKS made deploying containers fast and inexpensive. They've also enabled even small engineering teams to deploy code faster, leverage fault tolerance and redundancy, scale more efficiently, and take full ownership of their services from development all the way into production.

How to build zone-redundant cloud instances and clusters

May 9, 2024 | By Andre Newman

Redundancy is a core tenet of cloud computing. While major cloud platforms have high targets for reliability, they can still fail, and it’s important for teams to have a plan for when they do. But how can you build services that can withstand something as disruptive as a datacenter outage? In this blog, we’ll show you how to prepare for availability zone outages by proactively detecting services operating in a single zone.

Five ways Gremlin helps organizations meet DORA requirements

May 7, 2024 | By Ryan Detwiller

Enacted by the European Union, the Digital Operational Resilience Act (DORA) establishes new standards for digital operational resilience in the financial sector. DORA changes the financial sector's approach to digital security and resilience by imposing stringent requirements for Information and Communication Technology (ICT) risk management, incident reporting, third-party risk management, and regular testing.

Three roles you need for reliability success

May 7, 2024 | By Gavin Cahill

It’s one thing to say that reliability is a priority for your organization, and a whole other thing to make actual, demonstrable improvements in the availability of your applications. Sadly, it’s common for organizations to invest time, money, and effort into improving reliability only to barely nudge the needle on incidents and downtime. But there are hundreds of companies successfully improving their reliability posture—and doing it at enterprise scale.

How to build reliable services with unreliable dependencies

May 2, 2024 | By Andre Newman

In an earlier blog, we looked at slow dependencies and how they can impact the reliability of other services. While we explored what happens when dependencies are degraded, what happens when dependencies outright fail? What can you do when your application or service sends a request to another service, and nothing comes back? We’ll answer this question by using Gremlin to proactively test a service with multiple dependencies.

How to make your services resilient to slow dependencies

Apr 24, 2024 | By Andre Newman

When discussing reliability, we tend to focus on the things that we have control over: applications, virtual machine instances, deployment patterns, etc. But this ignores a significant and ever-growing part of nearly all modern software: dependencies. Dependencies are services that provide extra functionality for other services and applications. For instance, many websites depend on databases, caches, payment processors, and similar services in order to function.

Hitting reliability goals in the face of layoffs

Apr 23, 2024 | By Jeff Nickoloff

It’s never easy when layoffs hit your organization. In addition to the personal impact of losing friends and coworkers from your team, those who remain are left trying to achieve the same business goals with fewer people and resources. Unfortunately, layoffs and restructuring have become a common part of business. But you’re not alone. Your partners (including Gremlin) are here to help you navigate your new reality.

How to ensure your Kubernetes Pods and containers can restart automatically

Apr 16, 2024 | By Andre Newman

As complex as Kubernetes is, much of it can be distilled to one simple question: how do we keep containers available for as long as possible? All of the various utilities, features, platform integrations, and observability tools surrounding Kubernetes tend to serve this one goal. Unfortunately, this also means there’s a lot of complexity and confusion surrounding this topic. After all, most people would agree that availability is important, but how exactly do you go about achieving it?

How to ensure your Kubernetes cluster can tolerate lost nodes

Apr 12, 2024 | By Andre Newman

Redundancy is a core strength of Kubernetes. Whenever a component fails, such as a Pod or deployment, Kubernetes can usually automatically detect and replace it without any human intervention. This saves DevOps teams a ton of time and lets them focus on developing and deploying applications, rather than managing infrastructure.

How to standardize resiliency on Kubernetes

Apr 10, 2024 | By Gavin Cahill

There’s more pressure than ever to deliver high-availability Kubernetes systems, but a combination of organizational and technological hurdles makes this easier said than done. Technologically, Kubernetes is complex and ephemeral, with deployments that span infrastructure, cluster, node, and pod layers. As with any complex and ephemeral system, the large number of constantly changing parts opens the possibility of sudden, unexpected failures.

How to run Chaos Engineering experiments in your CI/CD pipeline

May 10, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Ad-hoc Chaos Engineering experiments are great for learning more about how your systems work, but they don’t tell you how your systems behave over time. As new features get deployed, environments change, and regressions get introduced, even the most resilient systems can gain reliability risks. QA and performance testing are already built into CI/CD - why not reliability?

Confident Cloud Migrations: How a Top 5 Bank Ensured Reliability With AWS and Gremlin

Apr 29, 2024 | By Gremlin

In today's competitive landscape, migrating to the cloud brings substantial benefits, but the cloud’s new architectures and tools also bring new reliability risks and considerations. The challenge: Enterprises have to figure out how to capitalize on the benefits of the cloud while ensuring a seamless, reliable transition. This webinar offers a look at how to provide application reliability before, during, and after migrations with AWS and Gremlin.

Building Resilience in the Cloud With the AWS Well-Architected Framework and Gremlin

Apr 29, 2024 | By Gremlin

Reliability and resilience in the cloud require a different approach. Thankfully, the AWS Well-Architected Framework is a proven blueprint for cloud architects and engineering leaders seeking to design and operate resilient systems on AWS.

How to test your systems for scalability and redundancy with Fault Injection

Apr 11, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region? Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is with Fault Injection.

How to find Kubernetes reliability risks with Gremlin

Mar 15, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.

How to find and test critical dependencies with Gremlin

Feb 22, 2024 | By Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz—what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

Kubernetes Reliability Risks: How to monitor for critical issues at scale

Dec 18, 2023 | By Gremlin

Learn how to automatically find and fix the most critical Kubernetes reliability risks in enterprise organizations. Recent research shows that nearly every organization has reliability risks in their Kubernetes clusters. Many of them are caused by simple misconfiguration, but they can have devastating consequences—including taking critical services offline. And while you could manually review every Kubernetes deployment, the speed and scale at which most organizations deploy to Kubernetes makes that impractical.

Building a Culture of Reliability: Why SREs Can't Do It Alone

Nov 3, 2023 | By Gremlin

Join Gremlin CTO and Founder Kolton Andrus to hear practical strategies for building a collaborative culture of reliability. High-velocity DevOps orgs and complex cloud-native architectures have made reliability harder than ever. Organizations are turning to SREs to make sure systems are reliable, but with so many stakeholders and competing priorities, many companies are still struggling to get ahead of the outages and incidents—SREs simply can't do it all by themselves.

What is Gremlin?

Oct 17, 2023 | By Gremlin

Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with DevOps and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience before it’s too late. To close the reliability gap, we need a reliability strategy: one that’s proactive, measurable, built-in, and automated. We need a reliability platform.

Enterprise Chaos Engineering Certification Prep Session

Oct 3, 2023 | By Gremlin

Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.

SRE vs DevOps: Can they coexist or do they compete?

Sep 9, 2022 | By Gremlin

Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.

Getting Started with Gremlin Attacks

Mar 29, 2022 | By Gremlin

Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.

Chaos Engineering: Finding Failures Before They Become Outages

Jul 25, 2020 | By Gremlin

Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.

How to Implement Chaos Engineering at Your Company

Jul 25, 2020 | By Gremlin

By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.

Chaos Engineering for DynamoDB

Jul 25, 2020 | By Gremlin

Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.

How to Convince Your Organization to Adopt Chaos Engineering

Jul 1, 2020 | By Gremlin

Win over your coworkers and management, and convince them to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.

Chaos Engineering for MongoDB

Jul 1, 2020 | By Gremlin

MongoDB is designed for performance, scale, and high availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably. To get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
State Gremlins: Reboot hosts, kill processes, travel in time.
Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

Test for failure in your code.
Fail or delay serverless functions.
Narrow the impact to a single user, device, or percentage of traffic.

Avoid downtime. Use Gremlin to turn failure into resilience.
