The latest News and Information on Service Reliability Engineering and related technologies.

Hiring SREs in the AI era w/ Weights & Biases

Oct 14, 2025 By Rootly In Rootly

How OpenTelemetry Auto-Instrumentation Works

Oct 10, 2025 By Anjali Udasi In Last9

Most developers use auto-instrumentation as it’s meant to be used — run the Java agent, add NODE_OPTIONS, and telemetry starts flowing. When it stops, though, figuring out why can be tricky. Maybe the agent didn’t load, maybe there’s a framework version mismatch, or something else entirely. Understanding how auto-instrumentation works makes it easier to spot and fix these issues.

15 PHP APM Tools Worth Using in 2025

Oct 10, 2025 By Faiz Shaikh In Last9

PHP powers a large swath of the web, from blogs to storefronts to APIs. But with microservices, third-party dependencies, and scaling complexity, performance can slip in subtle ways. Your app might mostly work, but small unnoticed delays, occasional spikes, or hidden bottlenecks build up. An APM tool helps you see inside the black box: which functions are slow, which DB queries are hogging time, which external calls are failing or stalling.

How to Scale Prometheus APM for Modern Applications

Oct 9, 2025 By Anjali Udasi In Last9

When developers monitor application performance, they pick one of two paths: traditional APM tools with distributed tracing and code profilers, or metrics-driven monitoring with Prometheus. The second approach — Prometheus APM — tracks the signals that matter most: request rates, error rates, latency, and resource utilization. No agents to install, no per-host pricing, just exporters and PromQL. For most teams, Prometheus APM is where monitoring starts.

SRE Report Retrospectives - Have AIOps Predictions Held Up?

Oct 7, 2025 By Leo Vasiliou In Catchpoint

Welcome to a new blog series where we take a candid look at the predictions, insights, and bold claims we've made in previous SRE Reports and ask the uncomfortable question: How did we do? For the uninitiated, Catchpoint's SRE Report is our annual, practitioner-driven effort to capture the pulse of the global reliability community.

Pastries with SREs: Leveling up observability and donut dunkability

Oct 6, 2025 By Elastic In Elastic

In this episode of Pastries with SREs, we explore what it really means to shift left with observability, moving from reactive firefighting to proactive performance. And yes, it starts with donuts. We unpack how SREs and IT Ops teams are often stuck reacting to incidents, battling alert fatigue and swivel-chair triaging. But what if you could pull in developers earlier, and give everyone a unified view of observability data?

Observability vs. Visibility: What's the Difference?

Oct 3, 2025 By Faiz Shaikh In Last9

In modern IT systems—distributed services, cloud-native platforms, and dynamic networks—just knowing that something is “up” isn’t enough. Green checkmarks on dashboards don’t tell you why performance shifted, why latency crept in, or why a perfectly healthy-looking service suddenly failed. This is where the conversation around visibility and observability begins. They sound similar, but they solve very different problems.

OTel Naming Best Practices for Spans, Attributes, and Metrics

Oct 1, 2025 By Anjali Udasi In Last9

An incident’s in progress. Services are slow, customers are frustrated, and your dashboards… look fine. At least, until you search for payment metrics and get 47 different names for the same signal. Suddenly, the real issue isn’t latency — it’s inconsistency. The OpenTelemetry project recently published a three-part series on naming conventions to solve exactly this problem.

Docker Daemon Logs: How to Find, Read, and Use Them

Sep 30, 2025 By Faiz Shaikh In Last9

Sometimes Docker behaves in ways that catch you off guard—containers don’t start as expected, images pause during pull, or networking takes longer than usual to respond. In those moments, the Docker daemon logs are your best reference point. These logs capture exactly what the Docker engine is doing at any given time. They give you a running account of system state, performance signals, and events that help you understand what’s happening beneath the surface.

Top 11 Java APM Tools: A Comprehensive Comparison

Sep 29, 2025 By Anjali Udasi In Last9

Are your Java applications running at their optimal performance, or is there room for improvement to make them faster and more efficient? With so many services depending on Java, keeping applications responsive and reliable is a core part of modern software engineering. This blog walks you through the leading Java Application Performance Monitoring (APM) tools, with a clear comparison to help you choose the right option for your needs.
