
The latest News and Information on Service Reliability Engineering and related technologies.

You Can't Fix What You Don't Measure: Observability in the Age of AI with Conor Bronsdon

Nov 5, 2025 By Rootly In Rootly

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Bronsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to making large language models (LLMs) reliable in production.


Grafana Tempo: Setup, Configuration, and Best Practices

Nov 4, 2025 By Anjali Udasi In Last9

As systems grow, understanding how a request moves across multiple services becomes harder. Traces help bring this picture together by showing the exact path a request takes, along with the timings that matter. Grafana Tempo is built for this kind of workload. It stores traces efficiently, works well with OpenTelemetry, and keeps the operational overhead low.


SRE vs DevOps vs Platform Engineering: What Are the Key Differences

Nov 4, 2025 By Randhir Kumar In Spike

Software delivery is more complex than ever. Teams need speed, reliability, and scalability to stay competitive. Site Reliability Engineering (SRE), DevOps, and Platform Engineering are three key disciplines that address these challenges. Though these terms are often used together, they are not the same and differ in important ways. In this blog, we’ll discuss each term individually, compare SRE vs. DevOps vs. Platform Engineering, and also show how they work together.


OTel Updates: Declarative Config - A Steadier Way to Configure OpenTelemetry SDKs

Nov 3, 2025 By Anjali Udasi In Last9

Application configs change over time, often in small ways that are easy to miss. They may start simple — a few environment variables, one exporter, nothing unexpected. As your instrumentation grows, you add rules for filtering health check spans, adjust sampling based on attributes, or introduce environment-specific resource settings. Each change makes sense on its own. But months later, the picture can look different across dev, staging, and production.


Embracing failure and chaos to improve system reliability and SRE team performance

Nov 3, 2025 By Elastic In Elastic

In this interview with Alex Hidalgo, Field CTO at Nobl9 and author of Implementing Service Level Objectives (O’Reilly Media), we explore how traditional metrics like MTTR and MTTx can give a false sense of reliability. Alex shares how SRE teams can embrace failure, build psychological safety, and design systems that reflect the human factor behind uptime, outages, and real-world reliability.


We Built an SRE Agent With Memory And It's Transforming Incident Response

Oct 30, 2025 By Julia Nasser In PagerDuty

If you feel like your incidents are multiplying while your stack gets more complex by the week, you’re not alone. Event volumes keep climbing, signals live in a dozen tools, and human responders are stretched thin. That’s exactly why we built the PagerDuty SRE Agent—a vendor‑agnostic AI teammate that improves with every response to make the next one faster, smarter, and more reliable.


Same code, same infra but your model is now broken #ai #devops

Oct 30, 2025 By Rootly In Rootly


Sidecar or Agent for OpenTelemetry: How to Decide

Oct 29, 2025 By Anjali Udasi In Last9

Getting telemetry out of a distributed system isn’t the hard part. Getting it out cleanly, without noise, drop-offs, or odd performance side-effects — that’s where things get interesting. Before you worry about processors or storage costs, you need a clear plan for where the OTel Collector should run. Most teams narrow this down to two options: a sidecar that sits next to each service, or a node-level agent that handles data for everything running on the node. Both patterns are solid.


OTel Updates: Consistent Probability Sampling Fixes Fragmented Traces

Oct 28, 2025 By Anjali Udasi In Last9

You're sampling 1% of traces in production. A payment request fails at 3 AM. Logs show an error in order-service, but the full picture isn't there because different services made different sampling decisions. order-service kept the trace; payment-service didn't. So you end up checking logs and timestamps across a few services to piece things together. This happens because the usual probability sampling approach makes a separate choice at each service boundary.
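
The fragmentation described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the OpenTelemetry specification's actual mechanism: the function names, the SHA-256 hashing scheme, and the trace ID format are all hypothetical. It compares a per-service coin flip with a decision derived deterministically from the trace ID, so that every service reaches the same answer.

```python
import hashlib
import random

SAMPLE_RATE = 0.01  # 1% of traces, as in the scenario above


def independent_decision(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    """Each service flips its own coin and ignores the trace ID."""
    return rng.random() < rate


def consistent_decision(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Derive the decision from the trace ID so every service agrees.

    Hypothetical scheme: hash the ID and map the first 8 bytes to [0, 1).
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < rate


def count_fragmented(num_traces: int, services: list[str], seed: int = 0) -> tuple[int, int]:
    """Count traces kept by some services but not all, under each strategy."""
    rng = random.Random(seed)
    fragmented_independent = 0
    fragmented_consistent = 0
    for i in range(num_traces):
        trace_id = f"trace-{i}"  # illustrative ID format, not a real trace ID
        independent = [independent_decision(rng) for _ in services]
        consistent = [consistent_decision(trace_id) for _ in services]
        if any(independent) and not all(independent):
            fragmented_independent += 1
        if any(consistent) and not all(consistent):
            fragmented_consistent += 1
    return fragmented_independent, fragmented_consistent


if __name__ == "__main__":
    ind, con = count_fragmented(10_000, ["order-service", "payment-service"])
    print(f"fragmented with independent decisions: {ind}")
    print(f"fragmented with consistent decisions:  {con}")
```

Because the consistent decision depends only on the trace ID and the rate, two services sampling at the same rate either both keep a trace or both drop it, so the second count is always zero in this sketch.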


OpenTelemetry Spans Explained: Deconstructing Distributed Tracing

Oct 24, 2025 By Anjali Udasi In Last9

In a microservices architecture, a single user request can pass through multiple services before completing. When performance drops or an error occurs, tracing that journey is the only way to locate the source. Distributed tracing provides that visibility. At its core are OpenTelemetry Spans — units of work that capture what each service does during a request.
