The latest news and information on Site Reliability Engineering and related technologies.

SRE Report Retrospectives - Have AIOps Predictions Held Up?

Oct 7, 2025 By Leo Vasiliou In Catchpoint

Welcome to a new blog series where we take a candid look at the predictions, insights, and bold claims we've made in previous SRE Reports and ask the uncomfortable question: How did we do? For the uninitiated, Catchpoint's SRE Report is our annual, practitioner-driven effort to capture the pulse of the global reliability community.

Pastries with SREs: Leveling up observability and donut dunkability

Oct 6, 2025 By Elastic In Elastic

In this episode of Pastries with SREs, we explore what it really means to shift left with observability, moving from reactive firefighting to proactive performance. And yes, it starts with donuts. We unpack how SREs and IT Ops teams are often stuck reacting to incidents, battling alert fatigue and swivel-chair triaging. But what if you could pull in developers earlier, and give everyone a unified view of observability data?

Observability vs. Visibility: What's the Difference?

Oct 3, 2025 By Faiz Shaikh In Last9

In modern IT systems—distributed services, cloud-native platforms, and dynamic networks—just knowing that something is “up” isn’t enough. Green checkmarks on dashboards don’t tell you why performance shifted, why latency crept in, or why a perfectly healthy-looking service suddenly failed. This is where the conversation around visibility and observability begins. They sound similar, but they solve very different problems.

OTel Naming Best Practices for Spans, Attributes, and Metrics

Oct 1, 2025 By Anjali Udasi In Last9

An incident’s in progress. Services are slow, customers are frustrated, and your dashboards… look fine. At least, until you search for payment metrics and get 47 different names for the same signal. Suddenly, the real issue isn’t latency — it’s inconsistency. The OpenTelemetry project recently published a three-part series on naming conventions to solve exactly this problem.

Docker Daemon Logs: How to Find, Read, and Use Them

Sep 30, 2025 By Faiz Shaikh In Last9

Sometimes Docker behaves in ways that catch you off guard—containers don’t start as expected, images pause during pull, or networking takes longer than usual to respond. In those moments, the Docker daemon logs are your best reference point. These logs capture exactly what the Docker engine is doing at any given time. They give you a running account of system state, performance signals, and events that help you understand what’s happening beneath the surface.

Top 11 Java APM Tools: A Comprehensive Comparison

Sep 29, 2025 By Anjali Udasi In Last9

Are your Java applications running at their optimal performance, or is there room for improvement to make them faster and more efficient? With so many services depending on Java, keeping applications responsive and reliable is a core part of modern software engineering. This blog walks you through the leading Java Application Performance Monitoring (APM) tools, with a clear comparison to help you choose the right option for your needs.

How to Become an SRE Engineer

Sep 27, 2025 By Alexandr Bandurchin In Uptrace

Site Reliability Engineering has emerged as one of the most sought-after careers in tech, combining software engineering expertise with operational excellence. SRE engineers ensure that critical systems remain reliable, scalable, and performant while enabling rapid feature development. With the global SRE job market projected to grow by over 25% in 2025, skilled professionals in this field command competitive salaries and enjoy diverse career opportunities across industries.

{unscripted} AI SRE

Sep 27, 2025 By Harness In Harness

Harness AI SRE is a comprehensive incident management system that uses AI to enable teams to detect, respond to, and resolve incidents efficiently. It integrates with various monitoring, alerting, and collaboration tools to provide a seamless incident resolution workflow.

Monitor Kubernetes Hosts with OpenTelemetry

Sep 26, 2025 By Anjali Udasi In Last9

It’s 3 AM. API latency just spiked from 200ms to 2s. Alerts are firing, and users are frustrated. You SSH into the first server: top, free -h, iostat — nothing unusual. On to the next host. And the next. That’s how most of us learned to debug. The tools worked, and we got good at using them. But as infrastructure became distributed and dynamic, this approach started to break down. Modern monitoring needs more than SSH and top. It needs unified telemetry.

Key APM Metrics You Must Track

Sep 23, 2025 By Anjali Udasi In Last9

Application Performance Monitoring (APM) helps you understand how your software runs in production. When you track the right metrics, you see how requests move through your system, where slowdowns happen, and how resources are being used. With this knowledge, you can spot issues early and keep your applications reliable for your users. In this blog, we discuss the key APM metrics to monitor, grouped into categories, and why each one matters for performance and user experience.
