
The latest news and information on Service Reliability Engineering and related technologies.

Blameless Postmortem: Foundation of Site Reliability

Dec 23, 2025 By Nuno Tomas In isDown

When systems fail, the instinct to find someone to blame runs deep. But what if assigning fault actually makes your systems less reliable? A blameless postmortem culture transforms how teams learn from incidents, creating stronger systems and more effective incident response processes.

Platform Engineering: Error Budgets Explained Simply #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

Platform engineering provides powerful tools that handle a lot under the hood. Learn how to calculate your remaining error budget with a simple formula using real numbers and objective statements.
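The video's own formula is not reproduced in this listing, so the following is only a minimal sketch of one common request-based way to compute a remaining error budget. The function name, parameters, and example numbers are all assumptions chosen for illustration.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for a request-based SLO.

    slo_target is the objective as a fraction (for example 0.999 for "99.9% of
    requests succeed"). The result is 1.0 when nothing has been spent, 0.0 when
    the budget is exactly exhausted, and negative when the SLO has been breached.
    All names and numbers here are hypothetical.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the objective tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Objective statement: 99.9% of 1,000,000 requests succeed in the window.
    # That allows about 1,000 failures; 250 have occurred so far.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Remaining error budget: {remaining:.1%}")
```

A time-based SLO (allowed minutes of downtime per window) follows the same shape, with minutes in place of request counts.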

Implementing SLOs: Our Scale Mistakes and Successes #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

30 minutes of eating crow! Learn from our SLO mistakes at Weave. Discover the pitfalls and the shortcuts to doing it right the first time, and avoid everything we got wrong.

OpenTelemetry Metrics: Traces, Logs & Prometheus Integration #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

OpenTelemetry aims to link metrics to traces and logs, offering OpenCensus users a seamless migration path. Work with existing protocols like Prometheus. Leverage existing tooling without learning something completely new.

OpenTelemetry: Components, SDKs, and Middleware Explained #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

OpenTelemetry explained: standards, SDKs for various languages (Ruby, Python, Go), and middleware tools. Deploy these to pre-process data and send it to your destination.

OTel Updates: OpenTelemetry Deprecates Zipkin Exporters

Dec 22, 2025 By Anjali Udasi In Last9

OpenTelemetry is deprecating the Zipkin exporter specification. Zipkin now supports OTLP ingestion natively, so the custom exporter logic in OTel SDKs is no longer necessary.

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Not Diamond

Dec 22, 2025 By Rootly In Rootly

Shipping systems powered by LLMs would be hard enough if the models stayed the same, but in reality they don't. Models get updated and deprecated at a pace rarely seen in traditional software, all while teams are still expected to hit reliability targets that look a lot like traditional SLAs.

Last9 integration with TrueFoundry AI Gateway

Dec 18, 2025 By Sahil Khan In Last9

If you're using TrueFoundry to manage your LLM traffic, you can now send those traces directly to Last9 and view them alongside your existing infrastructure telemetry.

How agentic IT operations lay the foundations for SRE success at scale

Dec 15, 2025 By Manish Agarwal In BigPanda

When something breaks in a modern digital service, customers feel it instantly. Pages stall, requests time out, and carts are abandoned, while frustration grows long before a root cause is identified. What the world never sees is the engineering effort required to keep these systems healthy in the first place. Site Reliability Engineers (SREs) carry that responsibility every day.

How to Handle Cloud Monitoring Overload?

Dec 12, 2025 By Anjali Udasi In Last9

Reduce alert noise by 70% through intelligent aggregation, clear ownership boundaries, and filtering metrics that don't map to user-facing issues. Monitoring starts with a straightforward goal: understand your system's health and identify issues before users notice them. You set up metrics, create dashboards, and configure some alerts. At first, it works well. Over time, your stack gets bigger and more complicated. New services get added.
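The post's actual method is not shown in this excerpt. As a rough sketch of what aggregation plus filtering could look like, the snippet below drops alerts that are not user-facing and collapses repeats of the same alert inside a time window. The `Alert` fields, the `aggregate_alerts` helper, and the 300-second window are hypothetical, and nothing here is tied to the 70% figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str       # owning service, used as part of the grouping key
    name: str          # alert rule name
    timestamp: float   # seconds since some epoch
    user_facing: bool  # whether the signal maps to a user-visible issue


def aggregate_alerts(alerts, window_seconds=300.0):
    """Return the alerts that would still notify after filtering and grouping.

    Assumed policy: ignore alerts that are not user-facing, and suppress any
    alert whose (service, name) pair already notified less than
    window_seconds ago.
    """
    notifications = []
    last_sent = {}  # (service, name) -> timestamp of the last notification
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if not alert.user_facing:
            continue
        key = (alert.service, alert.name)
        previous = last_sent.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # repeat inside the window, fold into the earlier one
        last_sent[key] = alert.timestamp
        notifications.append(alert)
    return notifications


if __name__ == "__main__":
    raw = [
        Alert("checkout", "HighLatency", 0, True),
        Alert("checkout", "HighLatency", 60, True),
        Alert("batch", "DiskWarning", 10, False),
        Alert("checkout", "HighLatency", 400, True),
    ]
    for alert in aggregate_alerts(raw):
        print(alert.service, alert.name, alert.timestamp)
```

Keying on the service keeps ownership boundaries visible: each notification maps to exactly one owning team.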
