Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

SRE Report Retrospectives - Have AIOps Predictions Held Up?

Welcome to a new blog series where we take a candid look at the predictions, insights, and bold claims we've made in previous SRE Reports and ask the uncomfortable question: How did we do? For the uninitiated, Catchpoint's SRE Report is our annual, practitioner-driven effort to capture the pulse of the global reliability community.

Pastries with SREs: Leveling up observability and donut dunkability

In this episode of Pastries with SREs, we explore what it really means to shift left with observability, moving from reactive firefighting to proactive performance. And yes, it starts with donuts. We unpack how SREs and IT Ops teams are often stuck reacting to incidents, battling alert fatigue and swivel-chair triaging. But what if you could pull in developers earlier, and give everyone a unified view of observability data?

Observability vs. Visibility: What's the Difference?

In modern IT systems—distributed services, cloud-native platforms, and dynamic networks—just knowing that something is “up” isn’t enough. Green checkmarks on dashboards don’t tell you why performance shifted, why latency crept in, or why a perfectly healthy-looking service suddenly failed. This is where the conversation around visibility and observability begins. They sound similar, but they solve very different problems.

OTel Naming Best Practices for Spans, Attributes, and Metrics

An incident’s in progress. Services are slow, customers are frustrated, and your dashboards… look fine. At least, until you search for payment metrics and get 47 different names for the same signal. Suddenly, the real issue isn’t latency — it’s inconsistency. The OpenTelemetry project recently published a three-part series on naming conventions to solve exactly this problem.

Docker Daemon Logs: How to Find, Read, and Use Them

Sometimes Docker behaves in ways that catch you off guard—containers don’t start as expected, images pause during pull, or networking takes longer than usual to respond. In those moments, the Docker daemon logs are your best reference point. These logs capture exactly what the Docker engine is doing at any given time. They give you a running account of system state, performance signals, and events that help you understand what’s happening beneath the surface.

Top 11 Java APM Tools: A Comprehensive Comparison

Are your Java applications running at their optimal performance, or is there room for improvement to make them faster and more efficient? With so many services depending on Java, keeping applications responsive and reliable is a core part of modern software engineering. This blog walks you through the leading Java Application Performance Monitoring (APM) tools, with a clear comparison to help you choose the right option for your needs.

How to Become an SRE Engineer

Site Reliability Engineering has emerged as one of the most sought-after careers in tech, combining software engineering expertise with operational excellence. SRE engineers ensure that critical systems remain reliable, scalable, and performant while enabling rapid feature development. With the global SRE job market projected to grow by over 25% in 2025, skilled professionals in this field command competitive salaries and enjoy diverse career opportunities across industries.

Monitor Kubernetes Hosts with OpenTelemetry

It’s 3 AM. API latency just spiked from 200ms to 2s. Alerts are firing, and users are frustrated. You SSH into the first server: top, free -h, iostat — nothing unusual. On to the next host. And the next. That’s how most of us learned to debug. The tools worked, and we got good at using them. But as infrastructure became distributed and dynamic, this approach started to break down. Modern monitoring needs more than SSH and top. It needs unified telemetry.

Key APM Metrics You Must Track

Application Performance Monitoring (APM) helps you understand how your software runs in production. When you track the right metrics, you see how requests move through your system, where slowdowns happen, and how resources are being used. With this knowledge, you can spot issues early and keep your applications reliable for your users. In this blog, we discuss the key APM metrics to monitor, grouped into categories, and why each one matters for performance and user experience.