The latest news and information on Site Reliability Engineering and related technologies.

Best Log Management Software for DevOps and SRE Teams in 2026: Feature and Cost Breakdown

Jun 3, 2026 By Libi Michelson In logz.io

TL;DR: Picking the right log management platform in 2026 comes down to three things: how much operational overhead you can absorb, how much AI automation you need, and what you’re willing to spend.

Running AI at Enterprise Scale w/ Anthropic, Descope, Port, Rootly and Twingate

Jun 3, 2026 By Rootly In Rootly

The debate about whether AI can write production code is over. Companies are handing work to fleets of agents, and for many, they write most of the code that ships to production. The next challenge is everything that happens once an entire engineering organization runs this way, at full speed. Teams that generate code 10x faster still review it at human speed, and that mismatch is now the constraint. Code ownership is also becoming an issue, as developers learn to trust agentic processes a little too much. When an agent breaks production, who is responsible?

AI in SRE: Where and how Google is deploying agentic AI to improve operations

May 29, 2026 By Stevan Malesevic In Google Operations

With SRE AI, Google plans to fully adopt AI and agentic technologies, leveraging AI as a force multiplier while also maintaining control.

Every pilot is ready for engine failure: are your engineers? w/ Hamed Silatani (Uptime Labs)

May 28, 2026 By Rootly In Rootly

Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident. Hamed Silatani, co-founder and CEO of Uptime Labs, and former Head of Reliability Engineering at IG Group, has spent two decades watching engineers learn incident response the hard way: alone, under pressure, with no training.

AI SRE Agent: How Autonomous Incident Investigation Is Eliminating Manual Root Cause Analysis

May 27, 2026 By Mohana Ayeswariya J In Atatus

A critical production alert wakes you up: p99 latency just hit 4 seconds. You drag yourself to a terminal, open five dashboards, start correlating log timestamps with trace IDs, and dig through 47,000 log lines across eight services. Ninety minutes later, you finally find the culprit: an N+1 database query introduced in a deployment that shipped four minutes before the spike started. An Atatus AI SRE Agent would have identified that root cause and drafted a remediation plan in 28 seconds. Not an approximation.

Error Budget in SRE: The Complete Guide (2026)

May 20, 2026 By Nuno Tomas In isDown

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response: feature freezes, postmortems, or infrastructure investment. The formula is blunt:

Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration

For a 30-day window: That last number should make you uncomfortable.
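
As a rough illustration of the two formulas above, the following Python sketch computes the time budget for a 30-day window. The function names and the three SLO targets are assumptions chosen for illustration; they are not taken from the original post, whose own figures are not reproduced in this excerpt.

```python
from datetime import timedelta


def error_budget_fraction(slo_target: float) -> float:
    """Error Budget = 1 - SLO Target, with the target given as a fraction (0.999 = 99.9%)."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return 1.0 - slo_target


def error_budget_time(slo_target: float, window: timedelta) -> timedelta:
    """Error Budget (time) = (1 - SLO Target) x Window Duration."""
    budget_seconds = error_budget_fraction(slo_target) * window.total_seconds()
    return timedelta(seconds=budget_seconds)


if __name__ == "__main__":
    window = timedelta(days=30)
    # Hypothetical targets, picked only to show how quickly the budget shrinks.
    for target in (0.99, 0.999, 0.9999):
        budget = error_budget_time(target, window)
        minutes = budget.total_seconds() / 60
        print(f"SLO {target:.2%}: {minutes:.1f} minutes of budget per 30 days")
```

Each additional nine in the target divides the allowed downtime by ten, which is why the tightest targets leave only a few minutes per month.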

Why SRE agents need orchestration, not just more tools

May 19, 2026 By Mezmo In Mezmo

Single agents are a useful starting point for SRE workflows. They are not where the architecture should end. The first version is simple enough: connect an LLM to a few tools, give it a system prompt, and point it at your infrastructure. It can summarize an alert, pull logs, answer questions, and draft a useful next step. Then the workflow gets real. You add GitHub for runbooks, Kubernetes for cluster state, PagerDuty for incident context, Prometheus for metrics, and Mezmo for telemetry.

What broke when engineering went fully agent-based

May 15, 2026 By Rootly In Rootly

Last year, we went fully agent-based at Rootly. Cursor, Claude Code, Codex, all of it. The productivity gains were real. However, Rigel, senior engineering manager at Rootly, started noticing a pattern emerging in his team.

LLM Observability: Lessons From MLOps w/ Maria Vechtomova (Cauchy)

May 14, 2026 By Rootly In Rootly

For nine years, Maria Vechtomova was shouting about monitoring. Nobody cared, until LLMs arrived. As co-founder of Cauchy, Databricks MVP, and one of the most followed voices in MLOps, Maria has watched the field evolve from hand-built experiment trackers to today's flood of observability tools, and her central claim might surprise you: globally, nothing has changed. The fundamentals are the same: track your code, data, and models so you can roll back when something breaks.

Zero-Code OpenTelemetry for Vert.x

May 8, 2026 By Prathamesh Sonpatki In Last9

Drop a JAR on the JVM. Get distributed tracing, RxJava context propagation, log-trace correlation, and Vert.x internal metrics. No code changes. No Maven dependency. Java 8–21. Inside the design of last9/vertx-opentelemetry v2.3.4. Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their experiences, and maintains o11y.wiki, a glossary of terms related to observability.
