1979, a nuclear accident and SRE
Deep diving into the 'Normal accident' theory by Charles Perrow, and what it means for SREs.
Native support for OpenTelemetry metrics in Prometheus.
Kubernetes can be installed with open-source tools, third-party vendor tools, or a public cloud service. In most cases, default installations have limited monitoring capabilities, so once a cluster is running, administrators must implement monitoring solutions that meet their requirements. Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.
We’ve all seen it: a company experiences a major incident and goes radio silent, leaving its customers to wonder, “Are they doing something about this?!” If you’ve ever been on the inside of something like this, you know the answer is most likely yes: there are people working hard to put out the fire as quickly as possible. But when it comes to incidents, perception is reality for customers.
What is OpenTelemetry? Why is it important? Do SREs need to adopt OTel? An Explain It Like I'm 5.
OpenTelemetry vs. Prometheus - Differences in architecture and metrics.
Visibility into the upstream and downstream dependencies of your services is key to maintaining a performant microservices environment. Application developers and SREs rely on this visibility to quickly trace issues back to the source, which is essential during incidents—when time is of the essence—throughout day-to-day operations, and as systems evolve and scale.
What is the OpenTelemetry Collector? Architecture, deployment, and getting started.
How JCB improves team structure, risk management, and application and platform development.
InfluxDB vs Thanos: Overview, Pros and Cons, and Differences.
Site reliability engineers manage a lot, and often in incredibly high-stakes environments. Remember that scene from "The Matrix" where Neo dodges bullets in slow motion? Of course you do. As an SRE, it can feel like you're the person getting hit by those bullets, frantically trying to investigate performance issues, automate away toil, and support the engineers around you, all before the next wave of attacks.
As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear if the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to real incident (or false alarm).
If all companies are software companies, all companies need better observability to understand how performant their software is.
Comparing Prometheus vs. VictoriaMetrics (VM) - Scalability, Performance, Integrations.
Comparing Prometheus vs. Cortex - Scalability, Cost, Performance, Known Weaknesses.