The latest News and Information on Service Reliability Engineering and related technologies.

How to use an SRE agent to reduce downtime

Apr 30, 2026 By Sam Chun In PagerDuty

An alert in the middle of the night warns of a potential business failure. Manual incident response grows more complex as distributed, dynamic digital services produce overwhelming amounts of data. With an SRE agent, your engineering team can cut through alert clutter and sort through signals more quickly, reducing burnout and reaching faster, more affordable resolutions. Agentic AI marks the next evolution in operational resilience.

End-to-End Trace Propagation Across SQS and Lambda with OpenTelemetry

Apr 29, 2026 By Prathamesh Sonpatki In Last9

SQS doesn't propagate trace context automatically. You instrument both sides, deploy, and get two disconnected traces. This post shows how to wire them into one waterfall — and the ESM format gotcha that silently breaks it every time. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

last9-genai: Closing the Conversation Gap in LLM Observability

Apr 28, 2026 By Prathamesh Sonpatki In Last9

OpenTelemetry's GenAI instrumentation gives you spans and token counts. It does not give you conversations, workflow cost rollups, or prompts visible in your dashboard. last9-genai is an OTel extension that fills those three gaps — without replacing your existing observability stack.

How to Exclude Health Check Endpoints from Python OTel Traces

Apr 28, 2026 By Prathamesh Sonpatki In Last9

Health check endpoints generate thousands of identical, useless spans per day. Here are two production-ready approaches to filter them from your Python OTel traces — and the correctness trap most implementations miss.

Argo Rollouts Canary Monitoring: Metrics, Gotchas, and Automated Gates with Last9

Apr 27, 2026 By Prathamesh Sonpatki In Last9

Argo Rollouts exposes Prometheus metrics on port 8090 — but the docs lie about which labels exist. Here's how to scrape them into Last9, build a canary dashboard, and use Last9 as an automated AnalysisTemplate gate, including the auth and base64 gotchas.

SRE agent vs. traditional engineer: 7 key differences

Apr 27, 2026 By Sam Chun In PagerDuty

The role of a Site Reliability Engineer (SRE) is evolving. The focus has shifted away from simply working harder during an outage, and a new kind of teammate is here to help: the SRE Agent. But what are the key differences when you compare an SRE agent versus a traditional site reliability engineer? This isn’t just a superficial change. It signifies a fundamental shift in how teams build and sustain dependable services.

What is AI SRE? The Complete Guide to AI-Assisted Site Reliability Engineering

Apr 26, 2026 By Prathamesh Sonpatki In Last9

It's 2:47 AM. PagerDuty fires. You open a Slack alert and see: p99 latency spike on checkout-service. You SSH into the host, check dashboards in four tabs, grep logs for the last 20 minutes, and eventually find a slow query introduced in a deploy six hours ago. It took 34 minutes. You resolved it, w

Capturing HTTP Request and Response Bodies in .NET Traces with PHI Redaction

Apr 25, 2026 By Prathamesh Sonpatki In Last9

Standard OTel.NET instrumentation captures headers, status codes, and timing — not request or response bodies. Here's how to add body capture to your traces while keeping PHI out of your observability backend.

Fixing Broken Traces in GCP Cloud Run: A Custom OpenTelemetry Propagator

Apr 24, 2026 By Prathamesh Sonpatki In Last9

GCP's load balancer silently rewrites your traceparent header, orphaning spans in any OTLP backend. Here's the custom propagator that fixes it.

How it feels to run an incident with Investigations

Apr 23, 2026 By Article In Incident.io

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while — every second of the incident response lifecycle counts. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment.
