Operations | Monitoring | ITSM | DevOps | Cloud

Humans aren't fast enough for 4 9's

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers. It’s easy to lose track of what’s meant when saying “99.95%” availability, and even more is lost when thinking how much harder it is to achieve 99.99% compared to 99.95%. On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime.

AURA in Practice: Mezmo's SRE bot, demo walkthrough

A walkthrough of the Slack-based SRE bot Mezmo's engineering team built on AURA, the open-source agent harness, running against Mezmo's own production tooling. Adrian Furlong shows the bot answering questions in a DM with tool calls visible inline, then in a shared channel where it reads the conversation before responding. He opens a fresh PagerDuty incident on camera. The webhook fires AURA, and within seconds, the agent posts a triage note back on the incident and a structured analysis in the dedicated incident channel.

Why Mandating AI Tools Backfires on Engineering Teams

Responsible AI adoption for engineering teams starts with culture, not compliance. In this GitKon talk, Rizel Scarlett (Tech Lead of Open Source DevRel at Block) shares how Block helped thousands of engineers actually want to use AI tools, including Goose, Cursor, Claude Code, and more, without mandates, vibe coding disasters, or security gaps.

"It works on my machine": why environment parity is still a platform problem in 2026

How many hours did your team spend last quarter debugging issues that only appeared in one environment? What would change if every environment were guaranteed to be identical? In 2026, environment inconsistency remains one of the most expensive bottlenecks in software development. Developers frequently spend more time debugging differences between infrastructure setups than they do on their own code.

Best Elixir APM Tools in 2026: A Developer's Guide

Last updated: May 2026 Elixir applications have performance characteristics that are genuinely different from Ruby or Python. The BEAM virtual machine handles concurrency through lightweight processes, supervision trees restart failed processes automatically, and Phoenix channels can hold tens of thousands of persistent connections on a single node. These are strengths, but they also mean that the performance problems you encounter are different from what most APM tools were built to detect.

The Best Kubernetes Monitoring Tools of 2026

Effective Kubernetes monitoring in 2026 is critical due to increased cluster scale and microservices complexity, demanding a shift toward unified observability (logs, metrics, and traces). The core focus is leveraging AI-driven features to automate anomaly detection, correlate diverse data, and significantly reduce Mean Time to Recovery (MTTR).

Why Alert Fatigue Solutions Still Miss the Root Cause

Alert fatigue solutions have never been better, but on-call engineers are still burning out. Threshold tuning, AI triage, and alert correlation reduce the noise, but every alert that clears filtering lands with the same incomplete telemetry and triggers the same manual investigation cycle. This post explains why the evidence gap survives every fix, and how runtime context changes that.

From vibe code to production-ready: observability for Next.js and Supabase apps

The way we build software has drastically changed over the past few years. What hasn’t changed is that this software ends up in front of real people: you, me, my mom. And when those users inevitably run into something broken, you as the application’s developer need to be equipped with the right tools, context and understanding of what broke, where it broke, and how to fix it as quickly as possible. Every day we’re inching closer to self-healing software.

New dcTrack Connector for Dell OpenManage Enterprise

Dell OpenManage Enterprise (OME) is a management and monitoring application that provides information about Dell servers, chassis, storage, and network switches across the enterprise network. Sunbird’s new Dell OME connector programmatically pulls data from Dell OME into dcTrack, driving automation and giving you a single pane of glass with complete information across all your Dell assets.