Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective, it can be dangerous.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

Gartner predicts that AI agents will be implemented in 60% of all IT operations tools by 2028, up from fewer than 5% at the end of 2024. This acceleration has sparked an explosion of AI SRE solutions, from enterprise platforms to open-source alternatives, all promising faster root cause analysis and reduced MTTR.

Top 15 Application Performance Metrics for Developers and SREs in 2026

Every application tells a story of user intent, system behavior, and business impact. To truly understand how your application performs, you need to go beyond logs and errors. You need metrics that provide actionable visibility across your stack. Application performance metrics are the foundation for delivering high-quality digital experiences, and they empower DevOps teams, developers, engineers, and site reliability engineers (SREs) to respond faster, scale smarter, and continuously improve.

AI SRE in Practice: Resolving Node Termination Events at Scale

When a node terminates unexpectedly in a Kubernetes cluster, the immediate symptoms are obvious. Workloads restart elsewhere, services experience partial outages, and alerts fire across multiple systems. The harder question is why it happened and how to prevent it from recurring. This scenario walks through a node termination event where the entire node pool was affected, requiring investigation across infrastructure layers to identify root cause and implement lasting remediation.

Stop Flying Blind: Synthetic Monitoring, Host heat-maps, and Process-Level Visibility

January 2026 Release Here's a dirty secret about observability: most teams find out about outages from their customers. Not from their dashboards. Not from their alerts. From angry tweets and support tickets. The excuse is always the same: "We have metrics! We have dashboards! We even have that AI thing now!" And yet, somehow, your checkout endpoint has been returning 502s for forty-five minutes and you're learning about it from the VP of Sales who just got off a call with your biggest customer.

The SRE Report 2026: Defensible Ns

You shouldn’t have to understand the care behind this report, unless it’s missing. For the past eight years, this research has focused on all things related to reliability and resilience. How systems behave under stress. How teams respond when things break. And how the practices continue to evolve. Reaching the eighth edition of The SRE Report attests to that and gives me pause. You can read the full report here and you can find a summary of the key findings here.

SRE Report 2026: What surprised us, what didn't, and why the gaps matter most

This is the eighth edition of the SRE Report. Eight years of tracing reliability's arc, from uptime obsession to experience, from toil to intelligence, from systems to people. This year's report is also the first since Catchpoint joined LogicMonitor. We want to acknowledge their support in keeping this work going. They get what this report means to the reliability community, and that matters. We made a deliberate choice this year to say less.

AI SRE Update: Your Feedback Shaped Our Latest Release

A note from Lauren Nagel, Mezmo's VP of Product: At Mezmo, we believe the best observability tools aren't just built for users, they're built with them. Since the launch of Mezmo's AI SRE agent, we've listened and learned from our customers. The feedback and insights have been invaluable in helping our teams refine and enhance the experience. Today, we're excited to share our latest release, packed with improvements and powerful new capabilities that make our AI SRE even faster and more intuitive.

High Cardinality Metrics: How Prometheus and ClickHouse Handle Scale

TL;DR: Prometheus pays cardinality costs at write time (memory, index). ClickHouse pays at query time (aggregation memory). Neither is "better":they fail differently. Design your pipeline knowing which failure mode you're accepting. -- Every month, someone posts "just use ClickHouse for metrics" or "Prometheus can't handle scale." Both statements contain a kernel of truth wrapped in dangerous oversimplification.