
The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Practical AI-Enabled Observability for Agents and LLMs

You’re told to “go build agents” without clear guidance on what that actually means, how to do it well, or how to know if it is working. You are not a data scientist. You are a software engineer. In this talk, Datadog AI product leader Shri Subramanian breaks down what changes when you move from building applications to building AI agents, and why familiar approaches like traditional testing and linear delivery fall short. We will explore how agent development shifts the focus from code alone to data, prompts, and evaluation, and why functional reliability matters just as much as operational reliability.

The Cost of Operating Without Truth

Enterprises have reached a point where the pace of modernization no longer depends on the number of tools they deploy or the volume of telemetry they collect. Progress depends on whether teams can form a consistent and verifiable understanding of what is happening inside the environment. Many organizations do not realize that the single greatest barrier to modernization is the absence of operational truth.

The Next Phase of Agentic AI

The Enterprise AI Survey conducted by Digitate in collaboration with Sapio Research states that the journey of enterprise automation and AI adoption has evolved significantly. The initial waves focused primarily on improving accuracy and efficiency and on reducing costs. Now the next phase, Agentic AI, is shifting the focus from mere automation to dynamic collaboration.

New Plugins, Faster Writes, and Easier Configuration: What's New with the InfluxDB 3 Processing Engine

The Processing Engine is one of the most powerful features in InfluxDB 3. It lets you run Python code at the database—transforming data on ingest, running scheduled jobs, or serving HTTP requests—without spinning up external services or building middleware. You define the logic, attach it to a trigger, and the database handles the rest. Since launching the Processing Engine, we’ve been building out both the engine itself and the ecosystem of plugins that run on it.

Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

This guest blog post is by Tohn Furutani, SRE Engineer at NTT DATA. Over the past year, the conversation around generative AI has shifted from single-shot use cases—such as summarization, Q&A, and chat interfaces—to agentic AI systems that can make decisions based on context, plan multistep actions, invoke tools, and adapt as conditions change.

AI agent observability: The developer's guide to agent monitoring

Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

How to Set Up Your Monitoring System Alerts

You could have the most detailed metrics displayed on your dashboard, but if no one gets notified when things break, you’re just collecting data. Alerts turn passive monitoring into an active response. They tell you, “Hey, your error rate just spiked!” or “Your memory usage is through the roof,” before your users start filing support tickets or, worse, give up on your tool entirely.
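The idea of turning collected metrics into notifications can be sketched as a simple threshold check. This is a minimal illustration rather than any particular monitoring product's API: the `AlertRule` class, the `evaluate_rules` function, the metric names, and the thresholds are all hypothetical, and `notify` stands in for whatever channel (email, chat, pager) a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    """Hypothetical rule: fire when a metric rises above a threshold."""
    name: str
    metric: str
    threshold: float
    message: str


def evaluate_rules(
    metrics: Dict[str, float],
    rules: List[AlertRule],
    notify: Callable[[str], None],
) -> List[str]:
    """Check each rule against the latest metric values.

    Calls notify() once per breached rule and returns the names of the
    rules that fired. Metrics missing from the snapshot are skipped.
    """
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            notify(f"{rule.name}: {rule.message} ({rule.metric}={value})")
            fired.append(rule.name)
    return fired


# Example rules with assumed thresholds (5% error rate, 90% memory).
RULES = [
    AlertRule("HighErrorRate", "error_rate", 0.05, "Your error rate just spiked!"),
    AlertRule("HighMemory", "memory_pct", 90.0, "Your memory usage is through the roof"),
]
```

In use, a scheduler would call `evaluate_rules(latest_metrics, RULES, send_to_chat)` on each scrape interval, where `send_to_chat` is any function that accepts a message string. Real alerting systems usually add a "for" duration and deduplication so that a single noisy sample does not page anyone.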

Query fair usage in Grafana Cloud: What it is and how it affects your logs observability practice

In Grafana Cloud, we use a simple yet generous formula that lets you query up to 100x your monthly ingested log volume in gigabytes for free. This works for the vast majority of our customers, but if you aren’t careful and strategic with your usage, you could find yourself with an overage bill.
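The arithmetic behind that formula can be sketched in a few lines. Only the 100x multiplier comes from the text above; the function names and the example volumes are hypothetical, and how any overage is actually priced is not covered here.

```python
# From the text: queries up to 100x the monthly ingested log volume are free.
FAIR_USE_MULTIPLIER = 100


def free_query_allowance_gb(ingested_gb: float) -> float:
    """Gigabytes that can be queried for free in a month."""
    return FAIR_USE_MULTIPLIER * ingested_gb


def query_overage_gb(ingested_gb: float, queried_gb: float) -> float:
    """Gigabytes queried beyond the free allowance (0.0 if within it)."""
    return max(0.0, queried_gb - free_query_allowance_gb(ingested_gb))
```

For example, a team ingesting 50 GB of logs in a month would have a 5,000 GB query allowance under this formula; querying 6,000 GB would put 1,000 GB over it.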

Traditional Automation vs. AIOps vs. Self-Healing Ops vs. Autonomous IT Explained

Autonomous IT becomes real when teams move from insight to governed action. Most IT teams still operate on an alert-first, human-coordinated model. When something breaks, alerts fire across multiple tools, engineers get pulled in, and the first part of the response goes to figuring out who owns the problem, which signals matter, and how far the impact has spread. Containment comes after that. That sequence made sense in slower, more isolated environments.