Gremlin

The KPIs of improved reliability

For many businesses, prioritizing reliability is an ongoing challenge. Building reliable systems and services is critical for growing revenue and customer trust, but other initiatives—like building new products and features—often take precedence since they provide a clearer and more immediate return. That's not to say reliability doesn't have clear value, but proving this value to business leaders can be tricky.

How to test for expired TLS/SSL certificates using Gremlin

Transport Layer Security (TLS), and its preceding protocol, Secure Sockets Layer (SSL), are essential to the modern Internet. Encrypting network communications using TLS protects users and organizations from publicly exposing in-transit data to third parties. This is especially important for the web, where TLS secures HTTP traffic (HTTPS) between backend servers and customers’ browsers.
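As a rough illustration of what an expiry check involves, the Python sketch below computes how many days remain before a certificate's `notAfter` date, using only the standard library. The helper names (`days_until_expiry`, `fetch_not_after`) and the 30-day warning threshold are assumptions made for this example; this is not Gremlin's own method for testing expiry.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of a server's certificate (hypothetical helper).

    The default context verifies the certificate, so an already-expired
    certificate makes the handshake raise ssl.SSLCertVerificationError
    instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Offline example with a fixed "now", so the result is reproducible.
    now = datetime(2029, 12, 22, tzinfo=timezone.utc)
    remaining = days_until_expiry("Jan  1 00:00:00 2030 GMT", now)
    print(f"{remaining:.0f} days left")
    if remaining < 30:  # assumed warning threshold
        print("certificate expires soon")
```

In practice, `fetch_not_after("example.com")` could feed `days_until_expiry` with `datetime.now(timezone.utc)`. A check like this only reports the date; it does not show how the service behaves once the certificate has actually expired.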

How reliability testing and load testing are complementary

How can you tell if your systems are reliable when under load? A common answer is to open your observability dashboards, wait for a high-traffic event (like Black Friday), and cross your fingers. While this approach will eventually give you an answer, it's far from ideal. Without proactive reliability and load testing, we have no idea if a system will hold up to real-world usage patterns, which could mean a production outage at the worst possible time.

How to identify and map service dependencies

Modern applications are a web of interdependent services. As applications grow in size and complexity, and as more engineering teams adopt service-based architectures like microservices, this web becomes deeper and denser. Eventually, keeping track of the interdependencies between services becomes a complex and time-consuming task in and of itself. In addition, if any of these dependencies fails, it can have cascading impacts on the rest of your services and on the application as a whole.
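To make the idea of cascading impact concrete, here is a minimal Python sketch that treats dependencies as a directed graph and finds every service that directly or transitively depends on a failed one. The service names and the `impacted_services` function are hypothetical, and a real dependency map would usually be discovered from traffic or configuration rather than written by hand.

```python
from collections import deque


def impacted_services(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return all services that directly or transitively depend on `failed`."""
    # Invert the graph: for each dependency, record which services rely on it.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first search outward from the failed service.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)

    impacted.discard(failed)  # only relevant if the graph contains a cycle
    return impacted


# Hypothetical example architecture: service -> services it calls.
EXAMPLE_GRAPH = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "database"],
    "catalog": ["database"],
    "payments": [],
    "database": [],
}

if __name__ == "__main__":
    print(sorted(impacted_services(EXAMPLE_GRAPH, "database")))
```

In this example graph, a `database` failure reaches `catalog`, `checkout`, and `frontend`, while a `payments` failure reaches only `checkout` and `frontend`.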

Managing and improving reliability using Gremlin's Reliability Dashboard

Part of a successful reliability program is being able to monitor and review your progress toward improving reliability. Being able to run tests on services is a big part of it, but how can you tell you're making progress if you can only see your latest test results? There should be a way to track improvements or regressions in your reliability testing practice across your organization in a way that's easy to digest. That's where the Reliability Dashboard comes in.

Setting better SLOs using Google's Golden Signals

To many engineers, the idea that you can accurately and comprehensively track your application's user experience using just a few simple metrics might sound far-fetched. Believe it or not, there are four metrics that aim to do just that. They're called the four Golden Signals and should be a core part of your observability and reliability practices.
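The four Golden Signals described in Google's SRE guidance are latency, traffic, errors, and saturation. The Python sketch below shows one way they might be computed from a window of request records. The `Request` type, the worker-based saturation measure, and the nearest-rank percentile are all assumptions for illustration; real systems typically read these values from a metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(
    requests: list[Request],
    window_seconds: float,
    busy_workers: int,
    total_workers: int,
) -> dict[str, float]:
    """Summarize one observation window as the four Golden Signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = [r.duration_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),     # latency
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,   # traffic
        "error_rate": server_errors / len(requests),     # errors
        "saturation": busy_workers / total_workers,      # saturation (assumed proxy)
    }


if __name__ == "__main__":
    sample = [
        Request(100, 200),
        Request(200, 200),
        Request(300, 500),
        Request(400, 200),
    ]
    print(golden_signals(sample, window_seconds=2, busy_workers=3, total_workers=4))
```

Values like these can then be compared against SLO targets, for example an assumed target that the error rate stays below 1% over a rolling window.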

SRE vs DevOps: Can they coexist or do they compete?

Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.