Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Log Management, Log Analytics and related technologies.

When your agents hallucinate at 2 am, it is not a model problem

The first time an AI assistant suggests "restart the service" during a live incident and nobody on the bridge can tell whether that suggestion came from a current runbook, a stale wiki page, or thin air, you stop caring about model benchmarks. You start caring about what the agent actually knew, where that knowledge came from, and whether you can trust the chain of reasoning behind it.

3 things you need to know about headless observability

If you're building agents trying to figure out the best way to actually make them successful in production, you're going to want to know about headless observability. Headless observability means an agent can access information about the health of your system through a CLI instead of clicking around dashboards. It's the data layer that going to unlock serious autonomy and allow you to scale with agentic workloads.

Builder in the loop: Henry Andrews on building AURA like production software

An interview series with the people building Mezmo’s open-source agentic harness for production operations. Builder in the loop is a Mezmo interview series focused on the engineers, product leaders, and operators shaping AURA, our open-source, MCP-native agentic harness for production operations. The goal is to get past the polished product layer and talk through the decisions that matter when AI starts interacting with real systems. What should agents be allowed to do?

What Is APM? A Guide to Application Performance Monitoring

A well-instrumented service tells your on-call engineer which deploy broke checkout, which span ate the latency budget, and which line to revert before the support queue fills up. Getting there depends on how cleanly your application performance monitoring layer turns telemetry into answers. The sections ahead walk through how APM works, the metrics and components worth tracking, the cloud-native challenges at scale, and how to evaluate APM tooling against your real workload.

What Is an Incident Commander? Role, Skills, and Best Practices

The fastest incident response teams treat coordination as a craft. Someone owns the call, drives the decisions, and keeps everyone moving in the same direction while the team puts the system back together. That person is the incident commander (IC), and getting the role right is what separates your 15-minute fix from a four-hour war room where nobody’s sure who’s making the call.

What Is Log Monitoring? Pipeline, Pitfalls, and Practices for 2026

Catching a cascading failure in the first 90 seconds is one of the better feelings in production engineering, and it almost always comes back to your log monitoring pipeline doing its job upstream of the alert. The teams that land there consistently treat log monitoring as a real-time detection layer in its own right, and the choices you make in that pipeline shape how every incident plays out for years.

Contributing Distributed Partition Ownership to the Azure Event Hub Receiver

If you're running OpenTelemetry collectors against Azure Event Hubs, distributed partition ownership and checkpointing just got significantly better. Your fleet now self-organizes. Failover is automatic. Restarts don't lose data. Here's how we got here.