The latest news and information on Site Reliability Engineering and related technologies.

We built an SRE bot on AURA. Here's what we learned.

Jul 14, 2026 By Mezmo In Mezmo

PagerDuty fires. You open the incident. Title, timestamp, nothing else. Whatever context exists is in someone's head, in a Slack thread from two weeks ago, or in a runbook nobody has touched since the last reorg. We got tired of that. So we put an AURA agent behind a Slack bot and pointed it at our own production environment.

Building AI SRE Agents, Part 1: Start Local, Break Things, Learn Fast

Jul 9, 2026 By Nir Adler In Komodor

The first stage of AI SRE maturity is a laptop, a throwaway cluster, and zero production access. Here’s how to set it up, and what to watch for. AI SRE (Site Reliability Engineering) agents are AI-powered systems that automate the most time-consuming parts of incident response: triaging alerts, correlating logs and metrics, generating root-cause hypotheses, and proposing remediation steps.

Build an SRE Agent Harness for AIOps Without Context Blowout

Jul 9, 2026 By Mezmo In Mezmo

An agent harness for AIOps is the runtime layer that coding agents like Claude Code were never built to provide: context isolation, decision traceability, and gated execution for tools that touch production. Aura is Mezmo's open-source (Apache 2.0) agent harness, purpose-built for operations work rather than software development.

They stopped shipping features for half a year, now they're thriving

Jul 6, 2026 By Rootly In Rootly

When incidents pile up fast enough, every part of the company bleeds: support is fielding angry customers, AEs are on apology calls, and engineering is burning cycles on retrospectives instead of shipping. At Twingate, where the product is the network, that was the moment Eran Kampf (VP of Engineering at Twingate, co-founder of Monday.com) made a call most engineering leaders won't: stop all feature work for half a year and fix reliability.

How SRE Practices Improve Trust in Digital Finance and Healthcare Platforms

Jul 3, 2026 By OpsMatters In OpsMatters

Trust used to be a brand problem. Now it's an uptime problem, a latency problem, a data integrity problem, and sometimes a "why is the payment button spinning again?" problem. For digital finance and healthcare platforms, users don't separate the service from the system behind it. If the app fails, the business feels careless. If records lag, confidence drops. If a transaction disappears for even a few seconds, panic arrives fast.

Could vs. Should: The First Year Managing an SRE Team

Jul 2, 2026 By Reid Savage In Honeycomb

As of today, I’ve drafted this post upwards of 10 times – it’s old enough that the version I first started working on was called “Reflections on 1 Year of SRE Management” (I’m currently at 2.5 years). But everything I learned during that first year became critical for the next.

How QA engineers use AI to keep up with agentic development

Jun 26, 2026 By Rootly In Rootly

Rootly's QA Lead explains how she has embraced AI to keep up with the pace of AI-driven feature development.

It's always DNS, even at Cisco: behind a weeks-long incident

Jun 26, 2026 By Rootly In Rootly

SRE Lead Ricard Bejarano (Cisco) and Jorge Lainfiesta (Rootly) sit down to talk about a recent intermittent incident that had the team scratching their heads.

High Cardinality in ClickHouse at Scale: What Actually Breaks

Jun 25, 2026 By Prathamesh Sonpatki In Last9

ClickHouse swallows high-cardinality telemetry at ingest, then breaks at query time weeks later. Here is what fails, and how we keep it fast in production. Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of all terms related to observability.

Klaudia Under the Hood: How We Built an AI SRE That Actually Earns Trust

Jun 18, 2026 By Asaf Savich In Komodor

In reliability engineering, being ‘mostly right’ is a liability. An AI SRE that sometimes misses the root cause or gives a confident, wrong answer at 2:17 AM has no place in an enterprise cloud environment. In this context, silence is better than noise. That’s the bar Klaudia is built to clear: genuine reliability that you can trust in production. The kind of reliability that earns a place alongside your best engineers. Getting there requires more than just a capable model.
