Introducing Process Exhaustion: How to scale your services without overwhelming your systems

We rarely think about how many processes are running on our systems. Modern CPUs are powerful enough to run thousands of processes concurrently, but at what point do our systems become oversaturated? When you’re running large-scale distributed applications, you might reach this limit sooner than you'd expect. How can you determine what that limit is, and how does that affect the number and complexity of the workloads you deploy?

How to validate memory-intensive workloads scale in the cloud

Memory is a surprisingly difficult thing to get right in cloud environments. The amount of memory (also called RAM, or random-access memory) in a system indirectly determines how many processes it can run and how large those processes can grow. You might be able to run a dozen database instances on a single host, but that same host may struggle to run a single large language model.

Your reliability scorecard: How to measure and track service reliability

If your organization asked you to report on the reliability improvements you’ve made over the past 90 days, would you be able to pull one up? If you’re like many engineers, this question might make you anxious. Reliability is a difficult metric to quantify in a meaningful way, let alone measure and track over time.

The case for Fault Injection testing in Production

Many organizations that are looking to introduce Fault Injection as a testing technique start with non-production environments, but they don’t always revisit that choice as they mature beyond their initial assessments. However, there’s a strong case for running these tests in your live systems. The trade-offs between testing in production and non-production environments are worth weighing carefully, because the choice has far-reaching impacts on both the efficacy and the cost of improving your software’s resilience.

How to find and test critical dependencies with Gremlin

Part of the Gremlin Office Hours series: a monthly deep dive with Gremlin experts. Pop quiz: what are all of the dependencies your services rely on? If you’re like most engineers, you’ll probably struggle to come up with an answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams track them in spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

How to use host redundancy to improve service reliability and availability

Cloud computing has made provisioning new servers easy, fast, and relatively cheap. Almost anyone can log into a cloud console, spin up a new server, and deploy an application. And for teams that need greater uptime, major cloud providers offer all kinds of settings, services, and configurations for adding fault tolerance and failover. So why do so many services still fail when a single server instance goes down?

10 Most Common Kubernetes Reliability Risks

Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen. In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more. And they’re more prevalent than you might think.
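To make the idea of finding these risks concrete, here is a minimal sketch of how you might scan a cluster for just two of them: Deployments whose containers have no resource limits, and Deployments with no HorizontalPodAutoscaler targeting them. It uses the official Kubernetes Python client and assumes you have kubeconfig access to a cluster; it is an illustration of the general approach, not how Gremlin itself detects risks.

# Minimal sketch: flag Deployments that lack resource limits or an HPA.
# Assumes a reachable cluster and the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod

apps = client.AppsV1Api()
autoscaling = client.AutoscalingV1Api()

# Collect (namespace, name) pairs that already have an HPA targeting them.
hpa_targets = {
    (hpa.metadata.namespace, hpa.spec.scale_target_ref.name)
    for hpa in autoscaling.list_horizontal_pod_autoscaler_for_all_namespaces().items
}

for deploy in apps.list_deployment_for_all_namespaces().items:
    ns, name = deploy.metadata.namespace, deploy.metadata.name

    # Risk: containers without CPU/memory limits can starve their neighbors.
    for container in deploy.spec.template.spec.containers:
        if container.resources is None or not container.resources.limits:
            print(f"{ns}/{name}: container '{container.name}' has no resource limits")

    # Risk: no autoscaling rule means the Deployment can't scale under load.
    if (ns, name) not in hpa_targets:
        print(f"{ns}/{name}: no HorizontalPodAutoscaler targets this Deployment")

A scan like this only covers a narrow slice of the risks listed above; node failures, crash loops, and misconfigured load balancers need their own checks or fault injection experiments to surface.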

How dependency discovery works in Gremlin

Modern applications are rarely created entirely from scratch. Instead, they rely on a foundation of pre-existing applications and services, each adding specific features and functionality. These dependencies empower teams to build and deploy applications more efficiently, but they bring their own set of challenges. Tracking, managing, and updating these dependencies is difficult, especially in large, complex applications where dependencies are likely managed by different teams.

How to make your services zone redundant

In January of 2020, an entire availability zone (AZ) in AWS’ Sydney region suddenly went dark. Multiple facilities lost power, preventing customers from accessing EC2 instances and Elastic Block Storage (EBS) volumes. Customers who didn’t have backup infrastructure in another zone had to wait nearly 8 hours before service was restored, and even then, some EBS volumes couldn’t be recovered. Major cloud provider outages are rare, but they happen nonetheless.