Full-stack observability in Grafana Cloud: How to investigate issues across services and infrastructure

Many times, the hardest part of troubleshooting isn’t fixing the actual problem. It’s figuring out where to start. As engineers, it’s easy to lose count of how many times we’ve opened logs, then 10 metrics tabs, and another 10 tabs with trace queries, only to end up back in the logs trying to find a root cause.

What Customers Are Doing With AI and Honeycomb

At O11yCon, we talked to engineering teams across the industry, and the numbers are starting to get genuinely wild: Mixpanel DevOps Engineer Eddie Bracho told us their engineering team is generating 50% more PRs than before AI came into the mix (sorry). That kind of velocity is exciting, but it's also a pressure test for every part of your stack that isn't writing code, including your observability practice. Here's what we're hearing from customers about how that's playing out.

Debug and evaluate your AI app from your coding agent with Datadog Agent Observability

Coding agents like Claude Code, Cursor, and Codex CLI handle the coding parts of building an AI application well. The harder work comes after: understanding why a response went wrong, building eval sets that reflect real production behavior, and keeping up with an application that changes faster than any one-off script can. Teams spend 60–80% of their time on evaluation and error analysis, and much of that work needs to be redone every time the stack shifts.

New Feature: Automatic Snapshots When Latency Spikes

We’ve released an exciting new Lightrun capability: set a duration threshold on your Tic & Toc or Method Duration metrics, and Lightrun will automatically capture a snapshot whenever execution exceeds it. It takes moments to configure, and gives engineers the runtime context they need to understand why unexpected slow executions are occurring.
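The threshold-triggered capture described above can be illustrated with a minimal sketch. This is not Lightrun's API or implementation. The decorator `snapshot_if_slow`, the in-memory `SNAPSHOTS` list, and the `handle_request` function are hypothetical names chosen only to show the general idea: time a call, and record its context only when it runs longer than a configured limit.

```python
import functools
import time

# Hypothetical in-memory store for this sketch; a real tool would ship
# snapshots to a backend instead of keeping them in a list.
SNAPSHOTS = []


def snapshot_if_slow(threshold_seconds):
    """Record a call's arguments when it runs longer than the threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                elapsed = time.perf_counter() - start
                if elapsed > threshold_seconds:
                    SNAPSHOTS.append({
                        "function": func.__name__,
                        "elapsed_seconds": elapsed,
                        "args": args,
                        "kwargs": kwargs,
                    })
        return wrapper
    return decorator


@snapshot_if_slow(threshold_seconds=0.05)
def handle_request(delay_seconds):
    time.sleep(delay_seconds)  # stand-in for real work
    return "done"


handle_request(0.0)  # under the threshold: nothing is captured
handle_request(0.1)  # over the threshold: one snapshot is captured
```

Capturing only on a threshold breach keeps the overhead close to a single timer read on the fast path, which is the appeal of this kind of design.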

The hard part of AI root cause analysis is no longer the model

Every few weeks someone tells me root cause analysis is a solved problem now: pipe your telemetry into an LLM, let it tell you what broke. I wish it were that easy. After years on this, I think "can AI do RCA?" is the wrong question, because doing RCA with an LLM is really two separate jobs, and the answer is different for each. They break in completely different ways, so it's worth pulling them apart.

Instrumenting AI Agents for the Agent Timeline: A Practical OpenTelemetry Guide

AI agents are nondeterministic, multi-step, and opaque. When one fails in production, "the model said something weird" is the cheapest, most useless line in your incident postmortem. To debug agents the way they actually run, you need telemetry that captures all of it, in order, with enough context to reconstruct what happened. The OpenTelemetry GenAI Semantic Conventions give you a vendor-neutral way to do exactly that.
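As a rough illustration of "all of it, in order, with enough context," the sketch below records each agent step as a plain Python object carrying GenAI-style attributes. It deliberately avoids the OpenTelemetry SDK so that it runs on its own. The `StepSpan` class, the model and tool names, and the token counts are made up for illustration, and the attribute keys reflect one reading of the GenAI semantic conventions, so they should be checked against the current specification before use.

```python
from dataclasses import dataclass, field


@dataclass
class StepSpan:
    """Hypothetical stand-in for a span: a name plus key/value attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


def record_agent_run():
    """Return the steps of one illustrative agent run, in execution order."""
    spans = []
    # Step 1: a model call. The model name and token counts are invented.
    spans.append(StepSpan("chat example-model", {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 120,
        "gen_ai.usage.output_tokens": 30,
    }))
    # Step 2: a tool call made by the agent. The tool name is invented.
    spans.append(StepSpan("execute_tool lookup_order", {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "lookup_order",
    }))
    return spans


def timeline(spans):
    """Summarize the ordered steps as numbered operation names."""
    return [
        f"{index}: {span.attributes['gen_ai.operation.name']}"
        for index, span in enumerate(spans, start=1)
    ]


print(timeline(record_agent_run()))
```

In a real setup the same attributes would be set on spans created by a tracer, which adds the timing and parent-child links this sketch leaves out.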

Why Observability Isn't Enough for AI Coding Agents

Observability platforms collect pre-instrumented logs, metrics, and distributed traces to monitor production systems and surface failures to human engineers. The adoption of AI into engineering has led observability providers to offer those same signals to agents. This is often packaged as AI observability, but the signals themselves were designed around a human investigation loop. AI coding agents work faster, consume data differently, and need feedback as they work rather than after deployment.

What Is Agentic Observability? The Complete Guide for Enterprise Engineering Teams

TL;DR: Agentic observability uses AI agents to autonomously investigate incidents, identify root causes, and take action in production environments. Unlike traditional monitoring (which alerts and waits) or AIOps (which assists human analysis), agentic platforms conduct the investigation themselves. Key capabilities include autonomous incident triage, evidence-backed root cause analysis, alert noise reduction, and governed remediation.

Runtime Aware PR Review: Validate Changes in Live Production

Runtime PR review means validating a code change against live variable state, real execution paths, and downstream service behavior before the merge decision. Not after a checkout regression exposes what the diff missed. As AI coding agents ship PRs faster than any reviewer can mentally simulate execution, static analysis and CI leave a structural gap that only runtime evidence can close. This article explains what that gap looks like, why it recurs, and how to close it with runtime context code review.

Full Stack Observability vs Monitoring: Key Differences

Traditional monitoring tracks system health by collecting data such as metrics and logs. This data is checked to see whether a system is behaving as expected, and alerts are raised if errors or anomalous values are found. This works well in stable, predictable environments, but modern IT systems are far more complex and dynamic. In distributed architectures like microservices and cloud-native platforms, predefined alerts usually aren’t enough to explain why a failure is happening.

What's New in Network Observability for Summer 2026

As a network engineer, you likely face two persistent operational challenges every day: When you have to manually track device lifecycles on spreadsheets or spend your scheduled maintenance periods troubleshooting software upgrades, you lose the time you need to proactively ensure network performance. Over the past six months, we have continued to enhance Network Observability by Broadcom. These latest enhancements directly address the operational challenges outlined above.

The Four Pillars of AI Observability in 90 Seconds

AI applications can behave unpredictably, potentially leading to errors such as hallucinations or data leaks, even when classic monitoring indicates a successful response. To monitor AI systems effectively, teams should focus on four key areas. Implementing these pillars can enhance trust in AI deployments, help manage costs, and identify safety issues before they impact users.

Observability on Windows, before eBPF is production-ready

No large enterprise runs a single stack. A shiny new Kubernetes cluster sits right next to a Windows Server box that has quietly run the billing system for a decade without missing a beat. Both keep the business running. Both deserve the same visibility. Linux runs most server workloads, and Coroot grew up there. Our open-source node-agent uses eBPF to collect metrics, logs, traces, and profiles, with no code changes. But "most" is not "all".

Using Evaluation Frameworks with Agent Observability

AI teams have invested heavily in evaluation frameworks, yet getting those frameworks beyond local experimentation remains challenging. Teams using open source libraries like DeepEval and Pydantic Evals gain flexibility and research-grounded metrics, but operationalizing those evaluations still requires brittle custom integration code that doesn’t scale.

Monitoring vs. observability: The future of IT operations in 2026

For years, monitoring was the gold standard of infrastructure management. Dashboards. Thresholds. Alerts. If everything on the dashboard was green, you didn't need to worry. If something turned red, you responded. It was a model built on predictability, and for a long time, it worked. But modern infrastructure is no longer predictable.

The Second Edition of Observability Engineering Is Here

IT’S HERE it’s here it’s here it’s here!!!! The second edition of Observability Engineering is available for download, and since Honeycomb is the sponsor, you can now download it from our website (the dead tree version will take another month). This is a strange time to be writing a book.

Agent Timeline Is Now Generally Available

A few weeks ago I wrote about a customer’s refund request that stopped halfway through at 11:47 p.m. on a Tuesday night. That post walked through the 40 minutes it took to work out what happened when an agentic application had a problem: a tool retried against a rate-limited payments API, the error responses filled up the context window, and the agent gave up. The whole reason we built Agent Timeline was to turn that 40 minutes into five. To reduce MTTR. To solve the problem and get back to sleep.

Working as a remote engineer at Cribl | Building the AI Platform for Telemetry

Learn what it’s like to work as an engineer at Cribl, a remote-first company building the AI platform for IT and security data. In this recruiting video, Cribl’s engineering and support leaders share how fully distributed teams collaborate, solve hard data problems, and grow their careers while working from around the world. You’ll hear from managers and leaders in site reliability engineering, security incubation, and technical support about.

Observability for a Privacy-first AI Wearable | Grafana Everywhere

Trust is everything when AI gets personal. Golden Grot Award winner and NeoSapien co-founder and CEO Dhananjay Yadav shares how his team uses Grafana Assistant to ensure the privacy-first AI wearable delivers a seamless, reliable experience without compromising its mission. Because when AI moves closer to our everyday lives, teams need to know what’s happening — and users need to trust that it’s working as intended.

From event correlation to autonomous IT: Why observability isn't enough anymore

Most IT war rooms have plenty of data, but not enough time or clarity to find the real answer. Dashboards are crowded, alerts keep piling up, and the real issue gets lost in all the noise. Ever dealt with this situation? You’re not alone, and there’s a simpler way to deal with it. OpManager Nexus closes this gap by moving beyond visibility to help teams actually diagnose and fix problems faster.

Why AI observability is a critical ITOps priority

AI has shifted from experimental pilots to everyday business operations. Customers are interacting with AI-powered applications. Engineering teams are building with LLMs, GPUs, APIs, and automation at a much faster pace. That adds to the visibility strain on already overburdened ITOps teams.

Datadog Data Observability: Be the first to know when data fails

Bad data doesn't announce itself. Datadog Data Observability gives you unified visibility across your entire data stack—from source systems and pipelines to dashboards and AI applications—so you catch silent failures before they cascade. Detect data quality and pipeline issues before stakeholders do, pinpoint root causes with end-to-end lineage, and reduce pipeline costs with job, cluster, and query recommendations.

Un-observable AI is Un-trustworthy AI

Recently, someone talked Chipotle’s customer support agent into reversing a linked list – a task completely unrelated to burritos in any way. Screenshots circulated, people laughed, but underneath the joke sat a sharper question. If a production support agent will do that on a public channel, what else will it do that nobody is screenshotting? The bug is funny. The trust gap behind it is not.

Why CI/CD Pipelines Miss Runtime Failures

A CI/CD pipeline does four things: it builds code, runs tests against mocked dependencies, lints for style violations, and scans for known vulnerability patterns. What it cannot do is validate how that code behaves under real users, real service responses, and real runtime constraints that staging was never configured to reproduce. That entire class of failure clears every gate cleanly and surfaces only in production.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.

Observability: Are You Measuring What Actually Matters?

Observability has always been important, and much like any core capability in your business, its value needs to be understood. For years, the value of observability was predictable. It was uptime, error rates, MTTR, and likely tool consolidation. That was enough to show progress. These are foundational, table-stakes metrics, and they still matter, but they aren’t enough.

Why Your Agentic Workflow Succeeds and Still Gets It Wrong

Agentic workflows are reshaping how engineering teams operate, fetching context, synthesizing decisions, and shipping results across systems without human intervention. But the same design that makes them powerful adds risk in production. Agents do not crash when they hit bad data; they synthesize around it, substituting a stale value, an empty page, or a missing field for the result they were supposed to capture.

The Next Evolution of Infrastructure Observability

Operational visibility is becoming increasingly important as infrastructure teams are asked to support AI initiatives, automation goals, cost accountability, modernization efforts, and growing operational complexity at the same time. Most are expected to do it without expanding headcount, introducing additional risk, or rebuilding the environment from scratch. Those expectations are changing the role of infrastructure operations.

Monitoring Protocols Compared - Which Standard for What

Modern applications are distributed, ephemeral and built from a dozen moving parts. To keep them reliable, you need real visibility: not just “is the server up?”, but “how is this request behaving, right now, across every component it touches?”. The good news is that the observability world has converged on a handful of open standards.

Graviton5 in Production at Honeycomb: Per-service Results From the m8g to m9g Migration

This is the fourth installment in the Graviton retrospective series we've been writing since 2021. The methodology is the same one I always reach for: hold the workload constant, run both generations on the same Kubernetes namespace concurrently, and let the per-pod numbers speak.

What is SRE Observability and Key Pillars You Should Know?

What happens when a critical service slows down, but nothing is technically “broken”? Most teams have monitoring in place. They know when something goes down. But when performance drops or issues spread across services, finding the real cause becomes slow and unclear. Engineering teams end up switching between dashboards, logs, and alerts just to understand what changed. This delays response and increases pressure on on-call teams. This is where SRE observability becomes essential.

It Can Only Goodhart Happen

“When a measure becomes a target, it ceases to be a good measure.” (Charles Goodhart, 1975) You’ve probably read this quote in relation to any number of things over the years. People complain about arbitrary metrics like PRs merged, lines of code produced, and now, token usage. But is the era of tokenmaxxing over before it even began? The whole arc, from the rise of token leaderboards to their death at companies like Amazon, seems to have taken place in less than three months!

Running the OpenTelemetry Collector as a Lambda

The OpenTelemetry Collector is usually deployed as a long-running process: a sidecar, a DaemonSet, an EC2 instance, a Docker container on my computer. It sits there listening for telemetry. That's fine when I want to send telemetry all day, but not when telemetry is rare. Like right now, when I have an agent defined on AgentCore, and it runs maybe a few times a week. Or my website that hardly sees any traffic. Can I run the OpenTelemetry Collector as a Lambda function?

MCP Servers Are Becoming a Core Interface Layer in Data Observability and Data Quality

Data observability has traditionally been built around human workflows. When data breaks, engineers are alerted, open dashboards, inspect lineage graphs, and manually trace the issue across pipelines. The system is designed for human investigation and interpretation. That model is now being challenged by the rise of AI agents in data operations. As organizations begin embedding AI into analytics, engineering, and decision-making workflows, observability is no longer just about explaining what happened - it must also enable systems to understand and act on it.

Why Engineers Don't Trust Autonomous AI - 4th Annual Observability Survey | Grafana Labs

The 2026 Observability Survey from Grafana Labs heard from over 1,300 engineers and leaders across 76 countries on the real-world role of AI in observability. The data reveals a sharp distinction between intelligence and autonomy — and a critical blind spot most teams have.
Sponsored Post

How APM fits into the modern observability stack

Most engineering teams don't have a data problem. They have an interpretation problem. Prometheus is running, logs are shipping to the aggregator, dashboards are green, and then a latency spike hits and the root cause takes 45 minutes to isolate. The data was there, but the answer wasn't. That gap is where application performance monitoring (APM) operates. This article explores what APM adds to a modern observability stack, why relying on standalone tools leaves critical blind spots, and how teams can unify infrastructure data with application context for a complete operational picture.

Why Observability Is Essential for Platform Engineers?

Observability is how platform teams stop being the answer to every question and start building platforms that answer those questions themselves. This article explains specifically how observability enables platform engineers to support development teams better by reducing ticket volume, cutting MTTR, enabling SLO ownership, and making microservice debugging something devs can do without escalating to you.

AI Observability Deep Dive Demo | Grafana Cloud

Grafana AI Observability is our new database and platform for observing AI Agents. Over the past year at Grafana Labs, we built Agents, and we needed a way to understand how they are performing: what they cost, what their error rate and time to first token are, and how they are behaving. Grafana Staff Engineer Ivana Hučková provides a deep dive demo on how Grafana AI Observability connects our experience building Agents with our experience building observability systems.

Observability for Healthcare Systems | Grafana Everywhere

Grafana Assistant is going places you might not expect — including healthcare. Golden Grot winner Oren Lion from TeleTracking reveals how Grafana Cloud supports their systems that help keep patient care moving — and how Assistant enables teams to get from “what happened?” to “here’s why” faster. From moon landings to patient care, Grafana is everywhere. Congratulations to Oren, Chris Johnson, Mark Munson, and the entire TeleTracking team on winning this year's Golden Grot Award for Pioneering AI in Observability!

How to debug REST Collector APIs with Cribl REST Collector Diagnostics

This video introduces the new REST Collector Diagnostics feature in Cribl, which helps you troubleshoot API collection issues faster. It’s designed for observability and data engineers who use REST Collector to pull data from external APIs and need deeper visibility into HTTP requests, responses, and errors.

Claude Code Observability at Scale: How We Did It With Bindplane

At Bindplane, we iterate fast. One of the most important tools we've adopted across our organization is Claude Code. It helps every team here build solutions to complex problems with both speed and precision. But speed without visibility is a liability. We needed a reliable way to monitor and audit how Claude Code was being used across our team. Luckily, we build the best platform on the market for data in motion.

Cribl Search Pack for Zscaler: Setup & security dashboard walkthrough

Learn how to install and configure the Cribl Search Pack for Zscaler, then walk through prebuilt dashboards for your Zscaler security logs. This video is for security engineers, Zscaler administrators, and SOC/observability teams using Cribl Search to monitor and investigate Zscaler activity. In this walkthrough, you’ll see: If you need a reminder or want to share feedback on the pack, you can always refer to the README bundled with the pack or reach out to the Cribl team.

How Support Uses Honeycomb to Debug Honeycomb

You'd think that working at an observability company means everyone knows exactly where to find everything in the data. It doesn't. Especially not on the support team. We're the ones who get the tickets. We're in the telemetry every day trying to figure out what went wrong for a customer, and we do that by pointing Honeycomb at itself. Here's how that actually works, and how it's changed.

Splunk Observability at Cisco Live: Agentic Observability for the AI Era

Observability has always been about seeing clearly under pressure. But the pressure has changed. Applications are more distributed. Kubernetes environments keep expanding. Digital experiences depend on services, APIs, networks, third-party providers, and now AI models and agents that can make decisions faster than a human team can review every signal.

The Observability Journey: Getty Images and Cribl

I recently sat down with Simon Overbey and Lovepreet Singh, the engineering manager and systems engineer (respectively) at Getty Images, to talk about their experiences implementing Cribl. After getting a rundown of the pre-Cribl environment (described above), I asked to jump straight to the end: the net benefits. If the "before" was a terrifying tidal wave of cost and complexity, what did the "after" look like?

How to Build Real-Time Supply Chain Observability

"One missing pallet." That's how a warehouse supervisor in New Jersey described the start of a week-long supply chain mess back in 2024. One pallet. Then came delayed trucks, angry retailers, overtime pay, and a customer threatening to walk. In logistics, small gaps don't stay small for long. And the uncomfortable part is that most teams are already working hard. The issue isn't effort. It's alignment. The data exists in most organizations; it just doesn't show the same reality at the same time. Which leaves a basic question surprisingly hard to answer: what's actually happening right now?