The latest News and Information on Service Reliability Engineering and related technologies.

The Answer to SRE Agent Failures: Context Engineering

Sep 9, 2025 By Mezmo In Mezmo

AI agents for SREs were supposed to slash mean time to resolution and eliminate alert fatigue. Instead, most teams got expensive, unreliable tools that burn through tokens without delivering insights. But what if the problem isn't the AI models themselves? Recent benchmarking reveals the real bottleneck: context engineering. When we tested our context engineering approach against conventional methods, the results were dramatic. Scroll down for our benchmark results to see the full comparison.

The Art of Incident Management #sre

Sep 9, 2025 By Rootly In Rootly

Read our post: https://rootly.com/blog/the-art-of-incident-management-part-i

Connectivity Layer in Agentic AI w/ Alloy Automation #ai

Sep 8, 2025 By Rootly In Rootly

Kubernetes Monitoring Metrics That Improve Cluster Reliability

Sep 5, 2025 By Anjali Udasi In Last9

A Kubernetes cluster can generate more than 1,400 metrics out of the box. That’s a lot of numbers to sift through, especially when you’re troubleshooting a production slowdown in the middle of the night. The key is knowing which metrics tell you the most, with the least noise. These are the signals worth paying attention to when you need answers fast.

What companies get wrong about LLM evals w/ Groq

Sep 4, 2025 By Rootly In Rootly

What is APM Tracing?

Sep 3, 2025 By Faiz Shaikh In Last9

APM tracing records the complete execution path of a request as it travels through your system, including database queries, external API calls, cache lookups, message queue events, and inter-service requests. Each step is captured with precise start and end timestamps, duration, and context such as service name, operation name, and relevant attributes. This lets you pinpoint where latency or errors originate without piecing together metrics and logs manually.
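
To make the idea concrete, here is a minimal sketch of what a trace might look like as data, and how the recorded timings can point at the slowest step. The `Span` record, its field names, and the sample trace are hypothetical illustrations, not the schema of any particular APM product. The self-time calculation also assumes that child spans do not overlap in time.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step of a request (hypothetical record, not a vendor schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        # Duration is derived from the recorded start and end timestamps.
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes the children run one after another (no overlap).
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_step(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Made-up trace: one request passing through four steps.
TRACE = [
    Span("t1", "a", None, "api-gateway", "GET /orders", 0.0, 120.0),
    Span("t1", "b", "a", "orders-service", "list_orders", 5.0, 115.0),
    Span("t1", "c", "b", "postgres", "SELECT orders", 10.0, 95.0, {"db.rows": 42}),
    Span("t1", "d", "b", "redis", "GET user:7", 96.0, 99.0),
]

worst = slowest_step(TRACE)
print(worst.service, worst.operation, worst.duration_ms)
```

In this made-up trace the database query accounts for most of the request time, which is the kind of answer a trace gives directly without cross-referencing metrics and logs.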

A Single Hub for Telemetry: OpenTelemetry Gateway

Sep 1, 2025 By Anjali Udasi In Last9

The OpenTelemetry Gateway (OTel Gateway) is a centralized service that collects, processes, and routes telemetry data—metrics, traces, and logs—across your infrastructure. In a typical setup, each service pushes telemetry directly to an observability backend. While this approach works well for small environments, it becomes increasingly difficult to manage as systems grow.
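
The routing role described above can be pictured with a toy model. The sketch below is not the OpenTelemetry Collector API or its configuration format. The `Gateway` class, the backend names, and the record layout are all assumptions made for illustration. It only shows the shape of the idea: services send everything to one place, and that one place decides where each signal type goes.

```python
from collections import defaultdict


class Gateway:
    """Toy stand-in for a central telemetry gateway (hypothetical, not the OTel Collector)."""

    def __init__(self, routes):
        # routes maps a signal type ("metrics", "traces", "logs")
        # to the list of backend names that should receive it.
        self.routes = routes
        self.delivered = defaultdict(list)

    def receive(self, signal, service, payload):
        # Processing step: tag each record with its signal type and origin.
        record = {"signal": signal, "service": service, **payload}
        # Routing step: fan the record out to every backend for this signal.
        for backend in self.routes.get(signal, []):
            self.delivered[backend].append(record)
        return record


# Hypothetical backends; changing a destination means editing this one table,
# not reconfiguring every service.
gateway = Gateway({
    "metrics": ["metrics-store"],
    "traces": ["trace-store"],
    "logs": ["log-store", "archive"],
})

gateway.receive("metrics", "checkout", {"name": "http_requests", "value": 3})
gateway.receive("logs", "checkout", {"message": "payment failed"})
```

The design point is that services only need to know the gateway's address, while backend choices live in a single routing table.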

How to Choose the Right Incident Management Tool for Your Team

Aug 29, 2025 By Vishal Padghan In Squadcast

IT disruptions are inevitable. What separates a resilient organization from the rest is its ability to respond quickly, efficiently, and collaboratively to incidents. The cornerstone of such responsiveness? The right incident management tool. But with a market flooded with tools, each promising to revolutionize your workflows, how do you pick the one that truly fits your team's needs? In this blog, we'll break down the key factors to consider when selecting an incident management tool, ensuring you make an informed decision that enhances your team's effectiveness and reliability.

A Practical Guide to Python Application Performance Monitoring (APM)

Aug 29, 2025 By Anjali Udasi In Last9

When your Python app starts slowing down (maybe queries are taking longer, memory keeps creeping up, or API calls are lagging), basic server metrics won’t tell you why. You need to see what’s happening inside the application itself. That’s the role of Application Performance Monitoring (APM). It gives you a breakdown of database queries, external API calls, memory usage, error rates, and more, so you can connect the dots between code and performance.

What is Database Monitoring

Aug 28, 2025 By Anjali Udasi In Last9

Database monitoring transforms from a reactive troubleshooting exercise into a proactive optimization strategy when you have the right tools and approaches in place. This blog shares practical ways to choose monitoring solutions, set up observability for different database platforms, and design workflows that scale in modern distributed systems.
