Operations | Monitoring | ITSM | DevOps | Cloud

Introducing Datadog Agent Builder: Build agentic workflows for alert response and remediation

Building automated workflows that adapt to real-world complexity can be a challenge. As systems scale and scenarios multiply, teams often end up hardcoding endless logic branches just to handle every potential outcome. That’s why we’re introducing Datadog Agent Builder, a powerful new tool that lets you create custom AI agents that are fully hosted by Datadog.

Datadog GPU Monitoring: Optimize and troubleshoot AI infrastructure

With Datadog GPU Monitoring, engineering and ML teams can monitor GPU fleet health across cloud, on-prem, and GPU-as-a-Service platforms like Coreweave and Lambda Labs. Real-time insights into allocation, utilization, and failure patterns make it easy to spot bottlenecks, eliminate idle GPU spend, and resolve provisioning gaps. By tying usage metrics directly to cost and surfacing hardware and networking issues impacting performance, Datadog helps teams make fast, cost-efficient decisions to keep AI workloads running reliably at scale.

Bringing Observability to Data

While observability practices have evolved in recent years, they have largely focused on application services and infrastructure. Yet it is data what powers our applications, businesses, and AI models. When data issues occur, the consequences can be far reaching, from poor product experiences to billing errors to misinformed AI outcomes. In this session, Jonathan Morin, Group Product Manager at Datadog, shares real-world examples of incidents and explains how data observability can address them, helping teams detect issues earlier, reduce costly downtime, and restore trust in their data.

The Hidden Bottleneck in Latency: GetYourGuide's Database Performance Journey

Fast front-end and back-end code alone won’t guarantee low end-to-end latency as hidden bottlenecks in the database can undermine even the best engineering efforts. In this session, Oleksii Serhiienko, Senior Site Reliability Engineer at GetYourGuide, will share how his team put database performance at the center of their monitoring strategy. He will highlight how they identified and fixed slow queries, uncovered load balancing issues that drove significant cost savings, and built monitoring practices that improved both reliability and investigation workflows.

Use Grok parsing to extract fields from logs | Datadog Tips & Tricks

When your logs don’t follow a standard format, it can be difficult to extract valuable information, like key-value pairs and nested JSON objects. Grok parsing lets you define flexible patterns that match unstructured log data so you can extract specific fields to query, filter, and visualize. In this video, you’ll learn how to: By refining your Grok parsers, you can make your logs more useful for analytics, dashboards, or alerts, and get even more value from your logs.

Sync your Backstage catalog with Datadog IDP

Backstage is a popular open source framework for building internal developer portals (IDPs) used by organizations to aggregate service metadata and create a single source of truth for their software developers. However, data stored in the Backstage Software Catalog can quickly become siloed and inaccessible from monitoring tools such as Datadog.

Eliminate unnecessary costs in your Amazon S3 buckets with Datadog Storage Management

Cloud object storage powers a wide range of workloads, from AI training datasets to customer-facing media libraries. As your data grows into the petabyte scale, managing storage costs and ensuring reliability requires fine-grained visibility. You need answers to questions like: Which specific teams, services, workloads, or datasets are driving spend? Which data is cold and should be archived? What fixes will have the biggest impact on cost and performance?

Observability and FedRAMP in Action: The VA's Mission to Deliver Reliable Digital Service

Ensuring digital services remain accessible, reliable, and secure is a high priority for any organization operating at scale. For the Department of Veterans Affairs (VA), this focus is central to its mission of providing quality care to veterans, their families, and caregivers. Often described as “the largest IT shop in the United States,” the VA manages 2.7 million pieces of equipment across a vast network of interconnected systems.

How feedback loops power progressive software delivery

Modern engineering teams face competing priorities. Developers are expected to deliver new features faster than ever, but users expect rock-solid reliability with every release. Shipping quickly can feel like you’re gambling with user trust. If you move too fast, you risk outages, but if you move too slowly, innovation stalls.

Import Snowflake, Salesforce, ServiceNow, and Databricks metadata into Datadog with Reference Tables

Engineering, operations, and security teams can struggle to make sense of their telemetry data in isolation. Logs, metrics, and events tell what is happening but are often missing critical metadata like who owns what, where it's coming from, or indicators of attack. These gaps in visibility slow down incident response, complicate cost control, and make business or security analytics much harder.