The latest news and information on Site Reliability Engineering and related technologies.

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Not Diamond

Dec 22, 2025 By Rootly In Rootly

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software never did, all while teams are still expected to hit reliability targets that look a lot like traditional SLAs.

Last9 integration with TrueFoundry AI Gateway

Dec 18, 2025 By Sahil Khan In Last9

If you're using TrueFoundry to manage your LLM traffic, you can now send those traces directly to Last9 and view them alongside your existing infrastructure telemetry.

How agentic IT operations lay the foundations for SRE success at scale

Dec 15, 2025 By Manish Agarwal In BigPanda

When something breaks in a modern digital service, customers feel it instantly. Pages stall, requests time out, and carts are abandoned, while frustration grows long before a root cause is identified. What the world never sees is the engineering effort required to keep these systems healthy in the first place. Site Reliability Engineers (SREs) carry that responsibility every day.

How to Handle Cloud Monitoring Overload?

Dec 12, 2025 By Anjali Udasi In Last9

Reduce alert noise by 70% through intelligent aggregation, clear ownership boundaries, and filtering metrics that don't map to user-facing issues. Monitoring starts with a straightforward goal: understand your system's health and identify issues before users notice them. You set up metrics, create dashboards, and configure some alerts. At first, it works well. Over time, your stack gets bigger and more complicated. New services get added.

The Reality of GenAI in Production with Eduardo Ordax (AWS)

Dec 12, 2025 By Rootly In Rootly

GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually stops companies from shipping reliable AI systems, and why the real blockers have little to do with technology.

OTel Updates: OpenTelemetry Proposes Changes to Stability, Releases, and Semantic Conventions

Dec 12, 2025 By Anjali Udasi In Last9

Over the past year, the Governance Committee ran user interviews and surveys with organizations deploying OpenTelemetry at scale. A few patterns came up consistently: Stability levels aren't always obvious. When you install an OTel distribution, some components might be experimental or alpha without clear markers. This makes it harder to evaluate what's production-ready. Instrumentation libraries sometimes wait on semantic conventions.

The War Room of AI Agents: Why the Future of AI SRE is Multi-Agent Orchestration

Dec 11, 2025 By Itiel Shwartz In Komodor

We’ve all been there. It’s 2 AM, your phone is buzzing with alerts, and you’re suddenly thrust into an incident war room with a dozen other bleary-eyed engineers. The production environment is on fire, customers are affected, and everyone’s trying to piece together what went wrong. But here’s what makes these moments fascinating from a systems perspective – it’s rarely just one person silently fixing the issue in isolation.

How to Track Down the Real Cause of Sudden Latency Spikes

Dec 9, 2025 By Anjali Udasi In Last9

Start with distributed tracing to find which service is slow, then use continuous profiling to see why the code is slow, and finally apply high-cardinality analysis to identify which users or conditions trigger the problem. It's 2 AM. Your phone buzzes. Users are reporting timeouts. The metrics dashboard shows p99 latency spiking from 200ms to 4 seconds, but everything looks normal—CPU at 60%, memory stable, no error spikes. A quick pod restart helps briefly, then latency climbs right back up.

New features: AI SRE, Merge alerts, and Status pages for thousands of services

Dec 8, 2025 By Daria Yankevich In iLert

As we head into the holiday season, the ilert team is doing the opposite of slowing down; we’re ramping up. Over the past weeks, we’ve shipped a wave of impactful improvements across alerting, AI-powered automation, mobile app, and status pages. From major upgrades that reshape how teams triage incidents to smaller refinements that remove daily friction, this release is packed with updates designed to make on-call and operations smoother, smarter, and faster. Let’s dive in.

Komodor - The Autonomous AI SRE Platform

Dec 8, 2025 By Komodor In Komodor

Komodor is the leading Autonomous AI SRE Platform for cloud native infrastructure and operations. Powered by Klaudia Agentic AI, Komodor automatically visualizes, troubleshoots, and optimizes Kubernetes-based platforms at scale.
