Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Why High-Cardinality Metrics Break Everything

High-cardinality metrics are one of those ideas that sound obviously right - until you try to use them in production. In theory, they promise precision. Instead of averages and rollups, you get specificity: per-request, per-userid, per-container, per-feature insights. The kind of detail we all immediately want when something is on fire. And then things start breaking. Not immediately. Not loudly.But quietly.

7 Kubernetes Predictions for 2026 - AI Will Push SRE to its Limit

As AI workloads shift from training to massive-scale inference, SRE teams are about to feel even more pressure. GPU-heavy computing is breaking the assumptions today’s clusters were built on, while enterprises are beginning to trust autonomous operations and cost pressure is pushing consolidation across the cloud-infrastructure stack.

Blameless Postmortem: Foundation of Site Reliability

When systems fail, the instinct to find someone to blame runs deep. But what if assigning fault actually makes your systems less reliable? A blameless postmortem culture transforms how teams learn from incidents, creating stronger systems and more effective incident response processes.

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Not Diamond

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software wouldn’t. All while teams are still expected to hit reliability targets that look a lot like traditional SLAs.