
Automate flaky test fixes with the Bits AI Dev Agent and Test Optimization

Flaky tests are a significant source of inefficiency that impacts many engineering teams. Along with failing your build, they interrupt your entire development flow, generate excessive CI/CD noise, and, critically, compromise developer trust in the test suite itself. Datadog Test Optimization enables you to manage test suites at scale by pinpointing the flakiest tests, analyzing their history across hundreds of runs, and automatically surfacing the root cause.
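Test Optimization does this analysis for you, but the core signal behind flakiness detection is simple: a test that both passes and fails on the same commit changed no code, so the variance is the test's. A minimal sketch of that idea (the data shape here is illustrative, not Datadog's API):

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Identify tests with mixed pass/fail outcomes on an identical commit.

    `runs` is a list of (test_name, commit_sha, passed) tuples, e.g.
    scraped from CI history. Same commit + different outcomes = flaky.
    """
    outcomes = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})

runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different outcome
    ("test_login", "abc123", True),
    ("test_login", "def456", True),
]
print(find_flaky_tests(runs))  # ['test_checkout']
```

A production-grade detector would additionally weight by run count and recency across hundreds of runs, which is what makes doing this at scale a product feature rather than a script.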

How To Calculate Your OpenAI Cost Per API Call (And Why It Matters Now)

OpenAI doesn’t bill per feature, per customer, or per transaction. It bills per token, across multiple models, with usage patterns that can change by the hour. As a result, two API calls that support the same feature can have very different costs. Without a clear way to translate token-level pricing into something product, engineering, and finance teams can reason about, AI spend becomes difficult to forecast and harder to control.
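The basic translation from tokens to dollars is straightforward once you pin down per-model rates. A sketch of the arithmetic (the prices below are illustrative placeholders; real rates vary by model and change over time, so always check OpenAI's current pricing page):

```python
# Illustrative per-1M-token prices in USD -- NOT current OpenAI rates.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def cost_per_call(model, input_tokens, output_tokens):
    """Dollar cost of one API call, given its token counts."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Two calls behind the same feature can differ by orders of magnitude:
short_prompt = cost_per_call("gpt-4o-mini", 300, 150)    # 0.000135
long_context = cost_per_call("gpt-4o", 12_000, 800)      # 0.038
print(f"${short_prompt:.6f} vs ${long_context:.6f}")
```

The hard part is not this formula but attributing token counts back to features, customers, or transactions, which is where the forecasting difficulty comes from.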

Supercharge your LLM Using Production Data Context

Are your LLM coding agents (like Cursor or Claude Code) hallucinating fixes because they don't know what's actually happening in production? In this video, Matt from Speedscale shows you how to bridge the gap between your local IDE and live production traffic using the Model Context Protocol (MCP). Most observability tools just give you telemetry. Speedscale’s MCP server gives your agent the "inner workings" of actual API calls and payloads, so it can check its assumptions against reality. No more "vibe-coding" and hoping it works; let your agent find the 500 errors and rate limits for you.

The 54% Improvement Playbook: How Top Performers Integrate GenAI into ITSM

Don't just read the report—learn how to replicate its most impressive results. In our 2025 State of ITSM Report, a select group of top-performing organizations achieved a staggering 54.3% reduction in resolution time by strategically integrating GenAI. This live session moves beyond the data to share their playbook. We'll provide a step-by-step guide on how to pair GenAI with foundational ITSM practices and demonstrate how to weave these tools into your team's daily workflows to achieve maximum efficiency.

Agentic AI Essentials: Examining the Hype Around Agentic AI

In the first article of our Agentic AI Essentials series, we’ll establish what makes agentic AI distinct. We’ll look at the process of tool calling and examine how agentic systems convert intelligence into action. We’ll also explore the human fears, pressures, and ambitions that fuel the hype around agentic systems. By sorting the signal from the noise, IT decision-makers can take the first step toward making sound decisions around agentic AI adoption.
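At its core, tool calling is a dispatch loop: the model proposes a call in a structured format, the runtime executes it, and the result is fed back. A minimal sketch under assumed names (the message schema and the `get_disk_usage` tool are hypothetical; real provider APIs each define their own shape):

```python
import json

# Hypothetical tool registry: the model proposes calls, the runtime executes them.
TOOLS = {
    "get_disk_usage": lambda path: {"path": path, "used_pct": 87},
}

def handle_tool_call(message):
    """Execute a model-proposed tool call and return the result as a JSON string.

    `message` mimics the structured tool-call output many LLM APIs emit;
    exact field names vary by provider.
    """
    name = message["tool"]
    args = message["arguments"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(TOOLS[name](**args))

proposal = {"tool": "get_disk_usage", "arguments": {"path": "/var"}}
print(handle_tool_call(proposal))  # {"path": "/var", "used_pct": 87}
```

This is the point where "intelligence becomes action": everything outside the registry is text, and everything inside it has real-world effects, which is also where the governance questions for IT decision-makers begin.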

Operational Risk Management in High-Stakes Decision Environments

In high-stakes environments, every choice carries weight. Whether it is a complex financial process, a real-time cybersecurity response, or a tightly regulated operational workflow, small missteps can rapidly evolve into major failures. Organizations increasingly rely on integrated risk management strategies that blend human judgment with technology. The goal is simple: reduce uncertainty before it becomes costly. But the path to that goal is rarely straightforward.

Let Your LLM Debug Using Production Recordings

Modern LLM coding agents are great at reading code, but they still make assumptions. When something breaks in production, those assumptions can slow you down—especially when the real issue lives in live traffic, API responses, or database behavior. In this post, I’ll walk through how to connect an MCP server to your LLM coding assistant so it can pull real production data on demand, validate its assumptions, and help you debug faster.
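Mechanically, the connection usually amounts to registering the MCP server in the coding assistant's configuration so the agent can invoke its tools. The fragment below is an illustrative sketch only: the server key, command name, and environment variable are hypothetical, and the exact file location and schema depend on your assistant and the MCP server's own documentation.

```json
{
  "mcpServers": {
    "production-traffic": {
      "command": "production-mcp-server",
      "args": ["--read-only"],
      "env": { "API_KEY": "${PROD_MCP_API_KEY}" }
    }
  }
}
```

Once registered, the agent can call the server's tools mid-conversation to fetch real request/response pairs instead of guessing at payload shapes.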

AI SRE in Practice: Resolving GPU Hardware Failures in Seconds

When a pod fails during a TensorFlow training job, the investigation usually starts with the obvious questions. The answers rarely come quickly, especially when the failure involves GPU hardware that most engineers don’t troubleshoot regularly. This scenario walks through an actual GPU hardware failure and shows how AI-augmented investigation changes both the time to resolution and the expertise required to handle it.