The latest News and Information on Service Reliability Engineering and related technologies.

Background Job Observability Beyond the Queue

Sep 15, 2025 By Anjali Udasi In Last9

Background jobs handle the critical work that happens outside the request path: processing payments, sending emails, generating reports, syncing data. They keep applications running smoothly, but the signals they produce look different from API endpoints. Most teams start with queue metrics—how many jobs are waiting and how quickly they complete. These metrics provide the foundation, but job health extends beyond throughput.

What is Service Catalog Observability and How Does It Work?

Sep 12, 2025 By Faiz Shaikh In Last9

A service catalog gives teams a shared view of their systems—what services exist, who owns them, how dependencies are structured, and the SLAs that guide expectations. It’s an important part of development infrastructure because it helps everyone speak the same language about services. Service catalog observability builds on that foundation.

APM for Kubernetes: Monitor Distributed Applications at Scale

Sep 10, 2025 By Anjali Udasi In Last9

When a payment service runs across 12 pods — each serving different customer segments — and an authentication layer spans three namespaces, performance issues can originate in both the application code and the orchestration layer. The challenge is linking request-level performance data with what’s happening inside the cluster: container CPU limits, pod scheduling decisions, and node-level events.

The End of "Good Code"? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber

Sep 10, 2025 By Rootly In Rootly

Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of *Humans of Reliability*, Rob Zuber, CircleCI CTO, joins Sylvain to explore how coding assistants are reshaping developer workflows and changing what teams value. Rob shares what he’s seeing across CircleCI’s customer base: a clear boost in throughput, new bottlenecks shifting from code creation to code review, and the rise of “vibe coding,” where engineers trust AI-generated code they may not fully understand.

The Art of Incident Management #sre

Sep 9, 2025 By Rootly In Rootly

Read our post: https://rootly.com/blog/the-art-of-incident-management-part-i

The Answer to SRE Agent Failures: Context Engineering

Sep 9, 2025 By Mezmo In Mezmo

AI agents for SREs were supposed to slash mean time to resolution and eliminate alert fatigue. Instead, most teams got expensive, unreliable tools that burn through tokens without delivering insights. But what if the problem isn't the AI models themselves? Recent benchmarking reveals the real bottleneck: context engineering. When we tested our context engineering approach against conventional methods, the results were dramatic. The benchmark results in the post give the full comparison.

Connectivity Layer in Agentic AI w/ Alloy Automation #ai

Sep 8, 2025 By Rootly In Rootly

Kubernetes Monitoring Metrics That Improve Cluster Reliability

Sep 5, 2025 By Anjali Udasi In Last9

A Kubernetes cluster can generate more than 1,400 metrics out of the box. That’s a lot of numbers to sift through, especially when you’re troubleshooting a production slowdown in the middle of the night. The key is knowing which metrics tell you the most, with the least noise. These are the signals worth paying attention to when you need answers fast.

What companies get wrong about LLM evals w/ Groq

Sep 4, 2025 By Rootly In Rootly

What is APM Tracing?

Sep 3, 2025 By Faiz Shaikh In Last9

APM tracing records the complete execution path of a request as it travels through your system, including database queries, external API calls, cache lookups, message queue events, and inter-service requests. Each step is captured with precise start and end timestamps, duration, and context such as service name, operation name, and relevant attributes. This lets you pinpoint where latency or errors originate without piecing together metrics and logs manually.
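
As a rough illustration of the record described above, the sketch below models one step of a trace as a span with start and end timestamps, a derived duration, and context attributes, then picks out the slowest step. The `Span` class, the `slowest_span` helper, and the sample service and operation names are all hypothetical; they are not taken from the post or from any particular APM product.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step of a request (hypothetical model, not a vendor API)."""
    service: str
    operation: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        # Duration is derived from the recorded start and end timestamps.
        return self.end_ms - self.start_ms


def slowest_span(spans):
    """Return the span with the largest duration."""
    return max(spans, key=lambda s: s.duration_ms)


# Made-up trace for a single request: a database query, a cache lookup,
# and an external API call.
trace = [
    Span("checkout", "db.query", 0.0, 12.0, {"db.system": "postgresql"}),
    Span("checkout", "cache.get", 12.0, 13.0),
    Span("payments", "http.post", 13.0, 95.0, {"http.status_code": 200}),
]

worst = slowest_span(trace)
print(worst.service, worst.operation, worst.duration_ms)
# prints: payments http.post 82.0
```

With every step recorded this way, finding where latency originates reduces to comparing span durations rather than correlating separate metrics and logs.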
