The latest news and information on Site Reliability Engineering (SRE) and related technologies.

The Journey to Production AI: Five Steps for SRE and Platform Teams

May 8, 2026 By Mezmo In Mezmo

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

New enhancements to PagerDuty's SRE Agent: triage faster without waking a human

May 6, 2026 By Ariel Russo In PagerDuty

AI promise and AI capabilities often diverge: developers report much faster code production but little change in how incidents are handled. When the rate of change is faster than ever but the rate of recovery from incidents isn’t moving, developers wind up stuck in firefighting mode. And when these systems fail, it’s costly. According to PagerDuty’s State of AI-First Operations, over a third of surveyed companies report losing $500K per hour of downtime.

SRE Agent Enhancements for Autonomous Triage

May 5, 2026 By PagerDuty Inc. In PagerDuty

Triage just got turbocharged with our latest PagerDuty SRE Agent enhancements!

Stop ECS Containers From Collapsing Into One Service in OpenTelemetry

May 2, 2026 By Prathamesh Sonpatki In Last9

Why ECS containers collapse under service.name = aws_ecs and how to fix it for both EC2 launch type and Fargate, including the resource-vs-log-record pitfall that quietly breaks log filtering. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Kubernetes Monitoring Tools: What Actually Works at Scale

May 2, 2026 By Faiz Shaikh In Last9

What actually works for Kubernetes monitoring at scale — not what looks good in a vendor demo with a five-pod cluster.

How to Test SQS Workflows Locally with LocalStack and OpenTelemetry

Apr 30, 2026 By Prathamesh Sonpatki In Last9

LocalStack lets you run SQS, Lambda, and S3 locally in Docker, but there's a hidden trap: OpenTelemetry's default AWS propagator doesn't work with free LocalStack. Here's how to set up end-to-end local testing with working trace propagation.

How to use an SRE agent to reduce downtime

Apr 30, 2026 By Sam Chun In PagerDuty

An alert in the middle of the night warns of a potential business failure. Manual incident response grows more complex as distributed, dynamic digital services produce overwhelming amounts of data. With an SRE agent, your engineering team can cut through alert clutter and sort through signals more quickly, reducing burnout and reaching faster, more affordable resolutions. Operational resilience will see its next evolution with agentic AI.

End-to-End Trace Propagation Across SQS and Lambda with OpenTelemetry

Apr 29, 2026 By Prathamesh Sonpatki In Last9

SQS doesn't propagate trace context automatically. You instrument both sides, deploy, and get two disconnected traces. This post shows how to wire them into one waterfall, and covers the ESM format gotcha that silently breaks it every time.

last9-genai: Closing the Conversation Gap in LLM Observability

Apr 28, 2026 By Prathamesh Sonpatki In Last9

OpenTelemetry's GenAI instrumentation gives you spans and token counts. It does not give you conversations, workflow cost rollups, or prompts visible in your dashboard. last9-genai is an OTel extension that fills those three gaps without replacing your existing observability stack.

How to Exclude Health Check Endpoints from Python OTel Traces

Apr 28, 2026 By Prathamesh Sonpatki In Last9

Health check endpoints generate thousands of identical, useless spans per day. Here are two production-ready approaches to filter them from your Python OTel traces, and the correctness trap most implementations miss.
