Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

The Journey to Production AI: Five Steps for SRE and Platform Teams

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. ‍ ‍ Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

New enhancements to PagerDuty's SRE Agent: triage faster without waking a human

AI promise and AI capabilities often diverge, with developers often reporting much faster code production, but not enough change in how incidents are handled. When the rate of change is faster than ever, but the rate of recovery from incidents isn’t moving, developers wind up stuck in firefighting mode. And, when these systems fail, it’s costly. According to PagerDuty’s State of AI-First Operations, over a third of surveyed companies report losing $500K per hour of downtime.

Stop ECS Containers From Collapsing Into One Service in OpenTelemetry

Why ECS containers collapse under service.name = aws_ecs and how to fix it for both EC2 launch type and Fargate, including the resource-vs-log-record pitfall that quietly breaks log filtering. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

How to Test SQS Workflows Locally with LocalStack and OpenTelemetry

LocalStack lets you run SQS, Lambda, and S3 locally in Docker — but there's a hidden trap: OpenTelemetry's default AWS propagator doesn't work with free LocalStack. Here's how to set up end-to-end local testing with working trace propagation. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

How to use an SRE agent to reduce downtime

An alert in the middle of the night warns of a potential business failure. Manual incident response becomes more complex due to the overwhelming data from distributed and dynamic digital services. With an SRE agent, your engineering team can cut through alert clutter. They can sort through various signals quicker, decreasing burnout and achieving faster, more affordable resolutions. Operational resilience will see its next evolution with Agentic AI.

End-to-End Trace Propagation Across SQS and Lambda with OpenTelemetry

SQS doesn't propagate trace context automatically. You instrument both sides, deploy, and get two disconnected traces. This post shows how to wire them into one waterfall — and the ESM format gotcha that silently breaks it every time. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

last9-genai: Closing the Conversation Gap in LLM Observability

OpenTelemetry's GenAI instrumentation gives you spans and token counts. It does not give you conversations, workflow cost rollups, or prompts visible in your dashboard. last9-genai is an OTel extension that fills those three gaps — without replacing your existing observability stack. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

How to Exclude Health Check Endpoints from Python OTel Traces

Health check endpoints generate thousands of identical, useless spans per day. Here are two production-ready approaches to filter them from your Python OTel traces — and the correctness trap most implementations miss. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.