The hard part of AI root cause analysis is no longer the model

Every few weeks someone tells me root cause analysis is a solved problem now: pipe your telemetry into an LLM, let it tell you what broke. I wish it were that easy. After years on this, I think "can AI do RCA?" is the wrong question, because doing RCA with an LLM is really two separate jobs, and the answer is different for each. They break in completely different ways, so it's worth pulling them apart.

New Feature: Automatic Snapshots When Latency Spikes

We’ve released an exciting new Lightrun capability: set a duration threshold on your Tic & Toc or Method Duration metrics, and Lightrun will automatically capture a snapshot whenever execution exceeds it. It takes moments to configure, and gives engineers the runtime context they need to understand why unexpected slow executions are occurring.

5 Reasons OnPage Tops the Best HIPAA Messaging Apps List

Choosing a HIPAA-compliant messaging app is rarely about security alone. Healthcare teams need messages that get read, on-call schedules that route to the right provider, and reliability that holds up at 3 a.m. Most apps clear the encryption bar. Fewer guarantee that a page is never missed, or that critical alerts from medical systems and urgent after-hours calls from discharged patients reach the right on-call staff.

What's New in InfluxDB 3 Explorer 1.9: Flux-to-SQL Conversion, InfluxQL Support, and More

InfluxDB 3 Explorer 1.9 makes it easier to work with your existing queries. Whether you’re migrating Flux queries to SQL or you’ve been writing in InfluxQL for years, this release helps bring your existing queries forward instead of starting from scratch. For teams moving to v3 from earlier versions of InfluxDB, query migration is often one of the last major hurdles.

Debug and evaluate your AI app from your coding agent with Datadog Agent Observability

Coding agents like Claude Code, Cursor, and Codex CLI handle the coding parts of building an AI application well. The harder work comes after: understanding why a response went wrong, building eval sets that reflect real production behavior, and keeping up with an application that changes faster than any one-off script can. Teams spend 60–80% of their time on evaluation and error analysis, and much of that work needs to be redone every time the stack shifts.

5 pitfalls to avoid when measuring DevEx in the AI era

Developer experience, commonly known as DevEx, describes how an organization’s systems, workflows, tools, and culture affect developer productivity. A positive DevEx leads to tangible organizational benefits, including faster releases, increased innovation, and reduced technical debt. Measuring DevEx enables engineering management to quantify their team’s impact and understand where to direct improvement efforts.

Difference Between Elasticity and Scalability in Cloud Computing

In cloud computing, teams use elasticity and scalability as if they mean the same thing. In reality, the two describe different ways a system handles load, and they solve different problems. Mixing them up can be very expensive. You either pay for capacity that sits idle, or your app buckles the moment traffic spikes, and the bill and the incident report both feel it.

What Customers Are Doing With AI and Honeycomb

At O11yCon, we talked to engineering teams across the industry, and the numbers are starting to get genuinely wild: Mixpanel DevOps Engineer Eddie Bracho told us their engineering team is generating 50% more PRs than before AI came into the mix (sorry). That kind of velocity is exciting, but it's also a pressure test for every part of your stack that isn't writing code, including your observability practice. Here's what we're hearing from customers about how that's playing out.