Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observabilty for complex systems and related technologies.

Building reliable dashboard agents with Datadog LLM Observability

This article is part of our series on how Datadog’s engineering teams use LLM Observability to iterate, evaluate, and ship AI-powered agents. In this first story, the Graphing AI team shares how they instrumented their widget- and dashboard-generation agents with LLM Observability to detect regressions and debug failures faster. Visibility into how large language model (LLM) applications behave in real time is essential for building reliable AI-driven systems at Datadog.

What is Runtime Context? A Practical Definition for the AI Era

TLDR: Runtime Context is live, execution-level access to a running production system. It lets engineers and AI agents ask precise questions of running code and get answers immediately, without redeploying or interrupting users. This is the new baseline for reliability.

"You Had One Job": Why Twenty Years of DevOps Has Failed to Do it

Let’s start with a question. What is DevOps all about? I’ll tell you my answer. In retrospect, I think the entire DevOps movement was a mighty, twenty year battle to achieve one thing: a single feedback loop connecting devs with prod. On those grounds, it failed. Not because software engineers weren’t good at their jobs, or didn’t care enough. It failed because the technology wasn’t good enough.

Cribl Search Pack for Outlook Email Activity

Email is still mission-critical, but most teams have very little visibility into what’s actually happening behind the scenes. In this video, I give a quick walkthrough of an inbox intelligence dashboard built on Cribl Search. It shows email volume, delivery health, and unusual activity at a glance, without digging through raw logs unless of course you like doing that.

OpAMP Explained: Why OpenTelemetry Needed an Agent Management Protocol (and How We Use It)

OpenTelemetry makes it easy to produce and transmit any type of telemetry. In production environments, this often means deploying the OpenTelemetry Collector as an intermediary to process, enrich, and route telemetry data. As systems scale, so does this infrastructure—sometimes to hundreds or thousands of Collectors spread across environments.

Why Observability Budgets Keep Growing Even When IT Is Asked to Cut Costs

Observability is the surprising budget line that isn’t shrinking. 96% of IT leaders expect observability budgets to hold steady or grow over the next 12 months. And 62% expect those budgets to increase regardless of broader IT budget cuts. Why? Because as infrastructure becomes more distributed and harder to manage, observability has shifted from a “nice to have” to a control point for cost, performance, and risk.