The latest news and information on site reliability engineering (SRE) and related technologies.

How AI Agents Are Redefining the SRE Role

Nov 25, 2025 By PagerDuty In PagerDuty

Even the best site reliability engineers (SREs) spend too much time doing reactive work—triaging incidents, gathering context, escalating to the right teams, and documenting what happened. That work is essential, but it’s not where an SRE’s highest value lies. These engineers are hired to build and maintain resilient systems, not play air-traffic control with every alert that hits their queue.

Lessons from KubeCon: What "Best-of-Breed" AI SRE Really Requires

Nov 24, 2025 By Ilan Adler In Komodor

This year’s KubeCon underscored a real shift: AI SRE has gone mainstream. That is no surprise. Teams from high-growth startups to Fortune 500s are running more complex, cloud-native systems, shipping more AI-generated code, and facing rising expectations. Downtime is not an option, and the workload for on-call SREs has become unsustainable. The question isn’t whether AI SRE helps. It’s which one you can trust in production.

7 Observability Solutions for Full-Fidelity Telemetry

Nov 24, 2025 By Anjali Udasi In Last9

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives). This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.
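As a rough illustration of the tail-based sampling mentioned above, the sketch below decides whether to keep a trace only after all of its spans have arrived. The `Span` type, the 500 ms threshold, and the 10% baseline rate are assumptions made for this example; they do not come from the post or from any particular vendor.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical minimal span record used only for this sketch.
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.1,
               rng=random.random):
    """Decide, once a trace is complete, whether to store it."""
    # Always keep traces that contain an error.
    if any(s.is_error for s in spans):
        return True
    # Always keep traces with at least one slow span.
    if any(s.duration_ms >= latency_threshold_ms for s in spans):
        return True
    # Otherwise keep a random baseline fraction of healthy traces.
    return rng() < baseline_rate
```

Because the decision is made at the end of the trace, error and slow-request traces can be kept in full while routine traffic is sampled down.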

Mezmo + Catchpoint deliver observability SREs can rely on

Nov 24, 2025 By Mezmo In Mezmo

For SREs juggling multiple services, third-party dependencies, and constant alerts, a critical service slowdown can quickly turn into chaos. APM dashboards may show everything is fine, yet users are still experiencing problems. That gap between application telemetry and real-world performance can turn a five-minute fix into a two-hour war room.

Introducing Bits AI SRE, your AI on-call teammate

Nov 24, 2025 By Datadog In Datadog

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

Top 7 Observability Platforms That Auto-Discover Services

Nov 21, 2025 By Anjali Udasi In Last9

You can use an observability platform that automatically discovers your services and provides ready-to-use dashboards with minimal setup. If you're running a system where microservices come and go, containers shift around, or serverless functions scale up quickly, this kind of experience saves you a lot of time. You gain visibility as soon as something goes live, with no additional steps on your part. This post covers the top seven platforms that offer these capabilities.

How to Reduce Log Data Costs Without Losing Important Signals

Nov 20, 2025 By Anjali Udasi In Last9

You can cut your log costs by removing repetitive, low-value logs early and keeping only the parts that genuinely help you understand issues. Modern systems generate logs far faster than you expect. Even when your workload stays stable, infrastructure components, retries, and background workers continue producing a steady stream of repeated entries.
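One way to picture "removing repetitive, low-value logs early" is to fingerprint each line and cap how many copies of a pattern are forwarded. This is a minimal sketch, not the approach described in the post; the function names and the digits-only normalization are assumptions chosen for illustration.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")


def fingerprint(line):
    """Replace runs of digits so lines differing only in numbers match."""
    return _NUMBER.sub("<n>", line.strip())


def collapse_repeats(lines, max_per_pattern=1):
    """Keep at most `max_per_pattern` lines per fingerprint.

    Returns the kept lines and a Counter of how often each pattern occurred,
    so the dropped volume is still visible as a count.
    """
    seen = Counter()
    kept = []
    for line in lines:
        key = fingerprint(line)
        seen[key] += 1
        if seen[key] <= max_per_pattern:
            kept.append(line)
    return kept, seen
```

Keeping the per-pattern counts means a spike in retries still shows up as a number even though the individual lines are not stored.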

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

Nov 19, 2025 By Rootly In Rootly

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

OTel Updates: Complex Attributes Now Supported Across All Signals

Nov 19, 2025 By Anjali Udasi In Last9

OpenTelemetry now supports maps, heterogeneous arrays, and byte arrays across all signals. Here’s where these new types shine — and where simple primitives still fit naturally. If you’ve been working with OpenTelemetry for a while, you’re likely familiar with the straightforward key-value approach to attributes. It’s simple, fast, and works well with how most telemetry backends store, index, and query data.
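To make the contrast between map-valued attributes and plain key-value attributes concrete, the sketch below flattens a nested map into dotted primitive keys, which is a common workaround when a backend only handles flat attributes. It uses plain Python dictionaries and does not call the OpenTelemetry SDK; the `flatten` helper and the attribute names are illustrative assumptions.

```python
def flatten(attrs, prefix=""):
    """Flatten nested dicts into dotted keys with primitive values."""
    flat = {}
    for key, value in attrs.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested maps, extending the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# A nested, map-style attribute and its flat equivalent.
nested = {"http": {"method": "GET", "status": 200}, "retry": False}
flat = flatten(nested)
```

The flat form is easy to index and query, while the nested form keeps related fields grouped together.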

What is AWS Fargate for Amazon ECS?

Nov 19, 2025 By Anjali Udasi In Last9

As cloud applications moved from VMs to containers and then to microservices, the amount of background work needed to keep everything running grew just as quickly. You gain speed and flexibility, but you also end up managing clusters, scaling rules, and capacity choices that don’t really add to the product you’re building. AWS Fargate steps in right there. It lets you run your ECS tasks without looking after any servers at all.
