The Guide to Kubernetes Debugging

Kubernetes is widely used for deploying, scaling, and managing systems and applications and is an industry standard for container orchestration. Google engineers originally developed Kubernetes as an open-source project. Its first release was in September 2014, and since then, it has matured into a graduated project maintained by the Cloud Native Computing Foundation (CNCF). With the complexities of scale and distributed systems, debugging in Kubernetes environments can be difficult.

Simplifying Container Observability for DevOps Teams

In modern microservices architectures, container observability is crucial for maintaining reliability and performance. It helps teams detect issues early and optimize distributed systems. This guide will walk you through the essentials of container observability, including advanced techniques and troubleshooting strategies to ensure your containerized applications run smoothly.

Accelerating Observability Adoption: Why Self-Service Isn't Optional Anymore

For observability adoption to scale, you must eliminate the bottlenecks. A self-service approach is the only sustainable model, enabling all teams, not just a select few, to access, implement, and scale observability easily. But making the shift requires more than access: you have to design for it.

12 OpenTelemetry-Compatible Platforms You Should Know in 2025

OpenTelemetry has transformed how engineering teams implement observability. This vendor-neutral framework for collecting metrics, traces, and logs has become indispensable for several reasons. Elimination of vendor lock-in: organizations can switch observability providers without changing instrumentation code, enabling greater flexibility and negotiating power with vendors.

Building a Simple Synthetic Monitor With OpenTelemetry

Using server-side telemetry to understand what’s going on inside your system is incredibly valuable, but what about the responsiveness the user actually sees? In this post, I’ll cover what synthetic monitoring is and show an example of how you can create a simple monitor using OpenTelemetry, .NET, and an Azure function. If you only want to see how it’s built, skip ahead to building a synthetic monitor.

Why Observability is Getting Expensive and OpenTelemetry is Becoming More Popular | Grafana Labs

Grafana Labs' Jen Villa shares the latest insights into how organizations are rethinking their observability strategies, with cost now taking center stage. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more. We also have plans for every use case.

How does observability enhance operations in cloud native voice networks?

The 17th century saw the onset of the Scientific Revolution. It was a time of knowledge explosion. During this time, scientific practices were still evolving, with no universally accepted protocols for data collection and analysis. Individuals documented any observation, resulting in the collection of massive amounts of information, but much of it was hard to leverage to draw useful conclusions. The development of the scientific method provided a common approach to capturing and analyzing data.

The Role of Observability in Modern DevOps Pipelines

DevOps has radically transformed how organizations build and deploy software, enabling faster delivery with greater reliability. Within this transformation, observability has emerged as a critical foundation for success. Unlike traditional monitoring that simply tracks known metrics, observability provides deep visibility into complex systems, allowing teams to understand and troubleshoot issues they couldn't anticipate. This shift represents much more than a technical evolution - it's a fundamental change in how organizations approach system health and performance.

How Much Should I Be Spending On Observability?

I recently wrote an update to my old piece on the cost of observability, on how much you should spend on observability tooling. The answer, of course, is “it’s complicated.” Really, really complicated. Some observability platforms are approaching AWS levels of pricing complexity these days.

Traces & Spans: Observability Basics You Should Know

In modern software architecture, applications aren't just getting bigger—they're getting more distributed. With microservices, serverless functions, and containers running across multiple environments, understanding what's happening inside your systems can feel like trying to track a single raindrop in a storm. That's where traces and spans come in. These observability tools aren't just buzzwords—they're your secret weapon for making sense of complex distributed systems.
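
To make the trace and span relationship concrete, here is a minimal sketch in plain Python. It is a simplified model, not the OpenTelemetry API: the `Span` class and the `start_trace` and `start_child` helpers are hypothetical names introduced only for illustration. The one idea it shows is that every span in a trace shares a trace ID, and each child span records the ID of its parent. The ID lengths (32 and 16 hex characters) follow the sizes used by the W3C Trace Context format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    name: str
    trace_id: str                      # shared by every span in the same trace
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def start_trace(name: str) -> Span:
    """Create a root span with a fresh 32-hex-character trace ID."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


def start_child(parent: Span, name: str) -> Span:
    """Create a child span that inherits the trace ID and points at its parent."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


if __name__ == "__main__":
    root = start_trace("GET /checkout")
    db_call = start_child(root, "db.query")
    print(root)
    print(db_call)
```

A tracing backend can rebuild the call tree for a request by grouping spans on `trace_id` and linking each one to its `parent_id`.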

How to get started with frontend observability: A quick Grafana Faro example

Modern cloud-native applications and web browsers are highly complex, making it challenging to gain visibility into their performance. Without an effective way to track and measure frontend performance, it becomes difficult to monitor real user experiences, detect critical issues, assess website health, and ensure optimal functionality. But what if you could see exactly what your users are experiencing in real time?

New Feature: Manage Your session.id in Honeycomb's Web SDK

The session.id field is special in Honeycomb for Frontend Observability. It’s a default option for filtering and grouping, and it’s the basis for session timeline analysis (in Early Access). Now you can control how session.id is set. In prior releases (< 0.15.0) of the Honeycomb Web SDK, we used our own UUID generator for session.id, and it was not accessible outside of the Web SDK itself. As of version 0.15.0, we give you full control.

App crash panic?

This video walks you through the first steps when your application goes down: check monitoring, validate alerts, rule out cache issues with incognito mode, and dive into your observability data to find the fix!

Troubleshooting Java Applications with Coroot

Java applications run on top of the JVM — a powerful but complex runtime environment that re-implements many OS features. It has its own memory management, garbage collector, and dynamic code compiler (JIT). While these features help with performance and portability, they often make troubleshooting a real challenge. At Coroot, we recently improved our support for continuous profiling in JVM-based applications.

Data Strategy for SREs and Observability Teams

In Honeycomb’s Customer Architects team, we work with the full spectrum of team, scope, and budget sizes. “The data isn’t valuable enough” is something we’re always dismayed to hear, but we hear it often enough. The thing is, as much as we want it to not be true, no product or tool can magically maximize the value of your telemetry data—at least not without gobs of human input, oversight, and review.

The Power of Over 3000 Intelligent Observability Agents

Catchpoint has officially crossed a major milestone: over 3,000 intelligent agents now power our Global Agent Network. This isn’t just a big number. It underscores our commitment to helping our users monitor what matters, from where it matters most: the end user. With agents deployed across 105 countries, 346 cities, and every layer of the Internet stack, Catchpoint now offers the broadest and deepest visibility into user experience available today.

Team-Oriented Observability with Coroot

Modern apps are built by many teams, each owning a different set of services: APIs, background jobs, databases, platform components, and more. As the system grows, it gets harder for each team to focus on what actually matters to them. When everything is mixed together, dashboards get messy, service maps are too large to be useful, and alerts end up reaching the wrong people. Instead of helping, your observability stack turns into a distraction. It has lots of data, but no clear context.

Advanced Python Logging: Mastering Configuration & Best Practices for Production

Python's logging system provides powerful tools for application monitoring, debugging, and maintenance. This comprehensive guide covers everything from basic setup to advanced implementation strategies, helping you build robust logging solutions for your Python applications.
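
As a rough illustration of the basic setup such a guide might start from, the sketch below configures a logger with the standard library's `logging.config.dictConfig`. The logger name `myapp`, the format string, and the `get_app_logger` helper are assumptions chosen for the example, not taken from the article.

```python
import logging
import logging.config

# Hypothetical configuration: the "myapp" logger name and the format string
# are placeholders to adapt to your own application.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "standard": {
            "format": "%(asctime)s %(levelname)s %(name)s: %(message)s",
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "standard",
            "level": "INFO",
        },
    },
    "loggers": {
        "myapp": {
            "handlers": ["console"],
            "level": "INFO",
            "propagate": False,  # avoid duplicate output via the root logger
        },
    },
}


def get_app_logger() -> logging.Logger:
    """Apply the dictionary configuration and return the application logger."""
    logging.config.dictConfig(LOGGING_CONFIG)
    return logging.getLogger("myapp")


if __name__ == "__main__":
    logger = get_app_logger()
    logger.info("service started")          # emitted to the console handler
    logger.debug("not shown at INFO level")  # filtered out by the logger level
```

Keeping the configuration in a single dictionary makes it straightforward to load from a file or adjust per environment without touching the call sites that log.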

AI Agent Observability Explained: Key Concepts and Standards

AI agent observability has become a critical discipline for organizations deploying autonomous AI systems at scale. This guide explores the emerging standards and best practices for monitoring, analyzing, and improving AI agent performance in enterprise environments.

How Much Should I Be Spending On Observability?

In 2018, I dashed off a punchy little blog post in which I observed that teams with good observability seemed to spend roughly 20-30% of their infra bill to get it. I also noted this was based on absolutely no data, only my own experiences and a bunch of anecdotes, heavily weighted towards startups and the mid-market tech sector. This post should have ridden off into the sunset years ago. To my horror, I have seen it referenced more in the past year than in all preceding years combined.

How to get started with Calico Observability features

Kubernetes, by default, adopts a permissive networking model where all pods can freely communicate unless explicitly restricted using network policies. While this simplifies application deployment, it introduces significant security risks. Unrestricted network traffic allows workloads to interact with unauthorized destinations, increasing the potential for cyberattacks such as Remote Code Execution (RCE), DNS spoofing, and privilege escalation.

AWS Lambda, OpenTelemetry, and Grafana Cloud: a guide to serverless observability considerations

In our increasingly serverless world, observability isn’t just a “nice to have”—it’s essential. Serverless functions such as AWS Lambda bring incredible benefits, but they also introduce complexities, especially around monitoring and debugging. In a previous article, I provided a quick, practical guide for sending AWS Lambda traces to Grafana Cloud using OpenTelemetry.

OpenTelemetry for AI Systems: Implementation Guide

AI systems, from machine learning models to Large Language Models (LLMs) and autonomous AI agents, introduce unique observability challenges. Their non-deterministic nature, complex dependencies, and specialized performance characteristics require thoughtful instrumentation approaches. OpenTelemetry has emerged as the leading standard for implementing observability across these systems.

6 Silent Traps Inside CloudWatch That Can Hurt Your Observability

One of the most common things we hear from our users is how AWS costs keep increasing, with CloudWatch often playing a big role. CloudWatch has long been the default observability solution for AWS users. While it’s great for some use cases, it’s also important to weigh alternatives that could be better suited to modern observability demands. Let’s examine some areas where modern observability platforms outperform CloudWatch.

Elastic Observability 9.0/8.18: Elastic Distributions of OpenTelemetry (EDOT) now GA, LLM observability, and more

Elastic Observability 9.0/8.18 introduces several key capabilities. Elastic Observability 8.18 and 9.0 are available now on Elastic Cloud, the only Elasticsearch offering to include all of the new features in this latest release. You can also download the Elastic Stack and our cloud orchestration products, Elastic Cloud Enterprise and Elastic Cloud for Kubernetes, for a self-managed experience. What else is new in Elastic 9.0/8.18? Check out the 9.0/8.18 announcement post to learn more.

Observability Trends for 2025

Evolving digital technologies and artificial intelligence (AI) are fundamentally reshaping business dynamics. Having seen the growth and impact of online businesses, organizations across industries have started adopting this modern approach to create revenue streams and enhance their customer experience. On one hand, it turned out to be a brilliant strategy; on the other, managing the complex business data and systems became a big challenge.

MCP, Easy as 1-2-3?

Seems like you can’t throw a rock without hitting an announcement about a Model Context Protocol server release from your favorite application or developer tool. While I could just write a couple hundred words about the Honeycomb MCP server, I’d rather walk you through the experience of building it, some of the challenges and successes we’ve seen while building and using it, and talk through what’s next. It should be pretty exciting, so strap in!

From Traditional Monitoring to AI-Enhanced Observability

Traditional monitoring approaches have served IT operations for decades, providing basic visibility into system health through predefined metrics and thresholds. However, these conventional methods face significant limitations when confronted with modern, complex environments. Static thresholds and rules: traditional monitoring relies heavily on manually defined thresholds and rules.

The hidden costs of tool sprawl: An SRE's guide to observability consolidation

An overview of the benefits, challenges, and philosophy behind consolidating your observability tools. Picture this: It's 3:00 a.m., and your phone is buzzing with alerts from what seems like a dozen different monitoring tools. As you blearily scroll through the notifications, you can't help but wonder, "How did we end up with so many tools, and why can't they just talk to each other?"

Observability vs APM: What's the Real Difference?

Remember when monitoring your apps meant checking if they were up or down? Yeah, those days are long gone. As systems have gotten more complex—microservices talking to other microservices, containers spinning up and down, serverless functions doing their thing—the approach to understanding system health has had to level up too. APM tools have been the bread and butter for DevOps teams for years, but now everyone's talking about observability.

Cross-domain integration: Combining DEM and observability

Effective monitoring and optimization of an interdependent environment require a coordinated strategy. Through the integration of observability and digital experience monitoring (DEM) platforms, businesses can dismantle silos and obtain a real-time, comprehensive view of their whole digital infrastructure. This comprehensive strategy empowers enterprises to proactively handle problems and optimize the end-to-end digital experience, which also improves performance and risk management.

How SpotOn overhauled its observability strategy with standardized tagging and Grafana Cloud

Many engineers would agree: migrating to a new observability platform can be a serious undertaking. But it’s also the perfect opportunity to step back, revisit some of the foundational practices that drive your observability strategy, and reap some major benefits as a result. This was the case at SpotOn, a provider of restaurant point-of-sale systems and business software, which recently migrated from four disparate observability tools and consolidated on Grafana Cloud.

The importance of proactive event handling in modern IT observability

Events are the heartbeat of modern IT observability. Events are the threads that weave across distributed IT systems to create a fabric of cohesion. They empower teams to shift from reactive firefighting to proactive management, fostering resilience, actionable insights, and superior user experiences (UXs) with platforms like ManageEngine Site24x7. This blog explores the pivotal role of events in observability and how to harness them effectively.

Honeycomb Acquires Grit: A Strategic Investment in Pragmatic AI and Customer Value

We’re excited to share that Honeycomb has completed our first-ever acquisition: we’re joining forces with Grit, bringing on board not only a strong team, but also compelling technology that supercharges our ability to deliver on our mission: to bring observability to every software engineer. This is a strategic move that will help us deepen the value we deliver to customers and accelerate our vision for what modern observability can and should be.

The Critical Role of Observability in Healthcare IT

Healthcare organizations are increasingly leading the charge in technology adoption, rapidly deploying advanced applications and digital tools to improve patient outcomes and operational efficiency. However, this acceleration is placing unprecedented pressure on existing IT infrastructure. Teams are being asked to support next-generation workloads, such as AI-powered diagnostics and real-time data platforms, on legacy systems, often without the benefit of increased budget or headcount.

Comparing ELK, Grafana, and Prometheus for Observability

Monitoring and observability are cornerstones of modern infrastructure management. Three popular solutions that often come up in this space are the ELK Stack, Grafana, and Prometheus. This comparison breaks down the key differences, use cases, and integration capabilities to help you determine which tool or combination better suits your operational needs.

Calico Open Source 3.30: Exploring the Goldmane API for custom Kubernetes Network Observability

Kubernetes is built on the foundation of APIs and abstraction, and Calico leverages its extensibility to deliver network security and observability in both its commercial and open source versions. APIs are the special sauce that helps automate and operationalize your Kubernetes platforms as part of a CI/CD pipeline and other GitOps workflows. Calico OSS 3.30 introduces numerous battle-tested observability and security tools from our commercial editions. This includes the following key features.

Observability: It's Every Engineer's Job, Not Just Ops' Problem

For years, organizations have used the term “observability” as an evolution of monitoring, a discipline practiced by operations teams to understand whether production software was working. I’ve been annoyed by this—not because it’s philosophically wrong, but because it diminishes the importance of observability as a generalized software engineering practice.

Building a Self-Service and Scalable Observability Practice

Join us in this session and learn how Splunk can help you build a standardized observability practice. From implementing an observability-as-code service to role-based access controls (RBAC), Token Management, Metrics Pipeline Management, and OpenTelemetry, learn how to create an Observability platform to optimize your metrics usage and costs while managing workloads efficiently.

A privacy-first, data-driven approach to optimize the user experience: Introducing Geolocation Insights in Frontend Observability

Grafana Cloud Frontend Observability is a real user monitoring (RUM) solution that provides immediate, clear, and actionable insights into the end-user experience of web applications. Understanding where those end users are located can provide valuable insights into frontend performance, error patterns, and overall user experience.

How Does 'Vibe Coding' Work With Observability?

You can’t throw a rock without hitting an online discussion about ‘vibe coding,’ so I figured I’d add some signal to the noise and discuss how I’ve been using AI-driven coding tools with observability platforms like Honeycomb over the past six months. This isn’t an exhaustive guide, and not everything I say is going to be useful to everyone—but hopefully it will clear up some common misconceptions and help folks out.

How to Set Up Geolocation Insights | Grafana Cloud's Frontend Observability | Grafana Labs

Want to set up geolocation insights in Grafana Cloud's Frontend Observability? In this step-by-step tutorial, we'll show you how to configure geolocation tracking, use MaxMind's offline database for geocoding, and apply filters for precise location-based insights.

Why network observability is a boardroom priority for CEOs

Finances, strategy, and market expansion are all common CEO concerns. However, CEOs also need to focus on automatic advanced observability across highly dynamic environments. Network observability has become a boardroom discussion point because downtime directly impacts business performance. Observability helps reduce costs and enhance service quality. But what is network observability? Is observability truly necessary if you have a monitoring solution in place?

Observability Costs: Tips for More Efficient Data Management

Can you ever get too much data? With modern architectures growing increasingly complex, with hundreds of microservices and containers, data volume grows at an exponential rate, and there’s no pause in sight. In this era of ever-expanding telemetry volume, it’s nearly impossible to separate valuable data from noise, which needlessly complicates root cause analysis and alerting while putting pressure on the performance of your stack, your scalability, and your budget.

Executive Buy-In is Driving Observability Maturity: 2025 Observability Survey Results | Grafana Labs

In this video, CTO Tom Wilkie from Grafana Labs breaks down some of the most compelling findings from our third annual Observability Survey, based on over 1,200 industry responses. The big takeaway? Executive involvement is on the rise—and it’s accelerating adoption of advanced observability practices like distributed tracing, profiling, and SLOs. He also explores how SaaS adoption, the maturation of central observability teams, and new instrumentation methods like eBPF and Beyla are reshaping the observability landscape.

Observability and IT Monitoring for Federal, State, and Local Government | LogicMonitor

If you work in public sector IT—whether at the federal, state, or local level—you know how complex things have gotten. Keeping everything running smoothly is a daily challenge between aging infrastructure, hybrid cloud environments, and growing cybersecurity demands. LogicMonitor's hybrid observability platform powered by AI helps government IT teams simplify monitoring, reduce alert noise, and avoid issues with AI-powered insights. You’ll see how observability helps agencies.

Calico Whisker, Your New Ally in Network Observability

With the release of Calico v3.30 on the horizon, we are excited to introduce Calico Whisker, a simple yet powerful User Interface (UI) designed to enhance network observability and policy debugging. If you’ve ever struggled to make sense of network flow logs or troubleshoot policies in a complex Kubernetes cluster, Whisker is your friend!

Prometheus Monitoring in 5 Minutes: Set Up Your First Alert

Prometheus is an open-source toolkit for systems monitoring and alerting, designed to collect and store metrics as time-series data. It was initially created at SoundCloud, and has since become essential in the cloud-native ecosystem, benefiting from a powerful query language, dependable alerting functionality, and a pull-based architecture. Prometheus effectively monitors rapidly changing container environments, microservices, and cloud infrastructure. Its main benefits include.

Using eBPF for modern IT observability: challenges and opportunities

Modern IT demands modern observability that flows with its dynamism and all-encompassing approach. Modern observability must overcome the constraints that traditional monitoring suffers because of its custom-built, agent-based architectures. Monitoring tools combine poll-based methods with log analysis and application performance monitoring (APM), a process that can be slow and lack the granularity that today's complex environments demand.