
Datadog Feature Flags, track Claude costs, migrate historical logs, and more | This Month in Datadog

See how you can reduce risk during feature rollouts in September’s This Month in Datadog. This episode, we spotlight Datadog Feature Flags, which combines advanced targeting with built-in observability and guardrails to make rollouts safer and more controlled. Plus, we cover tracking Claude costs, migrating historical logs, and more. This Month in Datadog brings you the latest updates on our newest product features, announcements, resources, and events.

From Logs to Insights: Accelerate Customer-Impact Analysis with Datadog Sheets

Datadog Sheets helps you move from log exploration to actionable insights quickly and with no code required. In this demo, see how to enrich logs with Salesforce data, build pivot tables, uncover customer impact trends, and build shareable reporting, all within Datadog.

Top 11 Java APM Tools: A Comprehensive Comparison

Are your Java applications running at their optimal performance, or is there room for improvement to make them faster and more efficient? With so many services depending on Java, keeping applications responsive and reliable is a core part of modern software engineering. This blog walks you through the leading Java Application Performance Monitoring (APM) tools, with a clear comparison to help you choose the right option for your needs.

OpenMetrics vs OpenTelemetry - A guide on understanding these two specifications

OpenMetrics and OpenTelemetry are popular standards for instrumenting cloud-native applications. Both projects are part of the Cloud Native Computing Foundation (CNCF) and aim to simplify how we generate and collect telemetry and monitor services in a modern cloud-native distributed application environment. Let's have a look at how both standards aim to help solve the observability conundrum.

LLM Observability in the Wild - Why OpenTelemetry should be the Standard

A few days ago I hosted a live conversation with Pranav, co-founder of Chatwoot, about issues his team was running into with LLM observability. The short version: building, debugging, and improving AI agents in production gets messy fast. There are multiple competing standards and default libraries for LLM observability. And many such libraries, like OpenInference, which claim to be based on OpenTelemetry, don't strictly adhere to its conventions.

An overview of Context Propagation in OpenTelemetry

To effectively manage modern applications, you need to understand how they work on the inside. Distributed tracing is the key to this, providing a detailed picture of a request's journey across every service. OpenTelemetry has emerged as the industry-standard framework for implementing tracing and achieving true observability in complex, distributed systems. In this article, we embark on a journey to explore the core concept of context propagation within OpenTelemetry.
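
As a rough illustration of what context propagation involves, the sketch below parses and re-emits a W3C Trace Context `traceparent` header, the format OpenTelemetry propagates by default. It is a simplified stand-in and not the OpenTelemetry SDK: the helper names (`parse_traceparent`, `inject`, `handle_incoming`) are hypothetical, only version `00` headers of the exact standard length are accepted, and `tracestate` is ignored.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Simplification: accept only version 00; all-zero ids are invalid.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def inject(headers, trace_id, span_id, sampled):
    """Write the current span's context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def handle_incoming(headers):
    """Hypothetical service hop: continue the caller's trace (or start a
    new one) and return the headers to send on the next outgoing call."""
    ctx = parse_traceparent(headers.get("traceparent", ""))
    if ctx is None:
        trace_id, sampled = secrets.token_hex(16), True  # new trace
    else:
        trace_id, _parent_id, sampled = ctx  # same trace, new span below
    span_id = secrets.token_hex(8)  # this service's own span id
    return inject({}, trace_id, span_id, sampled)
```

The trace id stays constant across hops while each service contributes a fresh span id, which is what lets a backend stitch the spans into one request journey.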

OpenTelemetry and Jaeger | Key Features & Differences [2025]

OpenTelemetry is a broader, vendor-neutral framework for generating and collecting telemetry data (logs, metrics, traces), offering flexible backend integration. Jaeger, on the other hand, is focused on distributed tracing in microservices. Earlier, Jaeger had its own SDKs based on OpenTracing APIs for instrumenting applications, but now Jaeger recommends using OpenTelemetry instrumentation and SDKs. Warning: The original Jaeger client SDKs (based on OpenTracing) are archived and no longer maintained.

Key APM Metrics You Must Track

Application Performance Monitoring (APM) helps you understand how your software runs in production. When you track the right metrics, you see how requests move through your system, where slowdowns happen, and how resources are being used. With this knowledge, you can spot issues early and keep your applications reliable for your users. In this blog, we discuss the key APM metrics to monitor, grouped into categories, and why each one matters for performance and user experience.

New Relic's CCU-based pricing is creating unpredictable costs, pushing teams to sample heavily

We talked to 7 companies in August 2025 who were looking to switch from New Relic. One engineering director said they're paying $1,000 a month and only ingesting 10% of their traces. Teams are defaulting to aggressive sampling, some at 1%, others at 10%, to manage costs.
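
To make the 1% and 10% figures concrete, here is a minimal sketch of deterministic head sampling keyed on the trace ID, in the spirit of trace-ID ratio sampling. It is an assumption-laden illustration, not New Relic's or any vendor's actual algorithm, and the function name `keep_trace` is hypothetical.

```python
def keep_trace(trace_id_hex, ratio):
    """Decide whether to keep a trace, given a sampling ratio in [0, 1].

    The decision depends only on the trace id, so every service that sees
    the same id makes the same choice and traces are kept or dropped whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the id as a uniform number in [0, 2**64).
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound
```

At a ratio of 0.01, roughly 99 of every 100 traces are never stored, which is the visibility teams give up to keep the bill down.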

OpenTelemetry Exporters - Types and Configuration Steps

In this post, we will talk about OpenTelemetry exporters. OpenTelemetry exporters help in exporting the telemetry data collected by OpenTelemetry. OpenTelemetry frees you from any kind of vendor lock-in by letting you export the collected telemetry data to any backend of your choice. In modern distributed systems, efficiently collecting, transmitting, and analyzing telemetry data from diverse sources poses a significant challenge.

How to Connect Jaeger with Your APM

Microservices make it tough to understand how applications behave end-to-end. Most teams already rely on an Application Performance Monitoring (APM) tool to track system health. But as requests move across many services, you also need distributed tracing. Jaeger gives you that visibility. The real value comes from connecting the two. Instead of running APM and Jaeger in silos, you can combine their strengths (metrics from your APM and traces from Jaeger) to get a clearer view of performance.

Reddit to Reality: Top 7 Omnissa Horizon Performance Issues and Fixes

Slow logons, laggy VDI sessions, and poor Horizon performance are common pain points IT admins face frequently. When Omnissa Horizon environments slow down, both end-users and IT teams feel the pressure—users grow frustrated while admins struggle to troubleshoot without complete visibility. To uncover the real-world Horizon issues admins face, we turned to Reddit forums like r/VMwareHorizon, r/sysadmin, and r/Citrix, where IT professionals openly share their Horizon troubleshooting struggles.

OpenTelemetry Logs - A Complete Introduction & Implementation

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) incubating project aimed at standardizing the way we instrument applications for generating telemetry data (logs, metrics, and traces). OpenTelemetry aims to provide a vendor-agnostic observability framework that provides a set of tools, APIs, and SDKs to instrument applications.

LLM app Observability: OpenTelemetry as a standard

LLM observability is broken. There are too many new libraries floating around, but they don't accurately follow the OpenTelemetry conventions. OTel isn’t perfect for LLMs yet, but extending a proven standard beats inventing another one. Why not use the same standard (OTel) that works so well for the rest of the apps, and just build on top of it? This is what I was ranting about with Pranav Raj S, co-founder at Chatwoot, and we thought there must be other folks facing similar issues.

Automated BSoD (Blue Screen of Death) Monitoring and Troubleshooting

Yes, BSoDs are still cropping up in high-impact ways in 2025, from flawed Windows updates (especially 24H2 patches) to driver rollouts and heavily-threaded server environments. It remains essential for IT admins to track event reports, test updates in staging, enable rollback strategies, and be prepared with recovery mechanisms.

Datadog in the Era of AI

AI is changing everything. At Datadog, our approach is two-fold: empower you with complete observability across your entire stack, including AI as you incorporate it, and harness emergent technologies to make Datadog even more powerful. Join VP of Product Michael Whetten to see how Datadog is accomplishing these two approaches. He'll share the latest feature updates and new products designed to help you thrive in an AI-powered world. Plus, get a look at our long-term vision for the future of AI and its impact on your work.

OpenTelemetry Operator Complete Guide [OTel Collector + Auto-Instrumentation Demo]

Manually deploying and managing OpenTelemetry components in a Kubernetes environment can be a complex and time-consuming task. It involves creating various Kubernetes resources, setting up configurations, and ensuring the components are properly integrated with the applications.

The Personalization Paradox: When Tailored UX Turns "Creepy"

“Stop watching me.” That’s an actual message a user typed into a search bar, captured during session monitoring. They weren’t talking to customer support. They were talking to the algorithm. It sounds absurd until you realize how common this is. When users believe a human is behind your personalization system, attributing consciousness to your automated algorithms, everything changes. Their behavior becomes erratic. Your conversions tank. And nobody talks about it.

Introducing Cost Meter - Proactive Observability Cost Control with Per-Hour Granularity

The irony isn't lost on us - observability platforms are built to be proactive about system health, yet when it comes to managing observability costs themselves, teams are forced to be reactive. Today, that changes with Cost Meter, now live in our platform. Cost Meter transforms observability spend management from a monthly billing surprise into a proactive, data-driven process with hourly aggregated metrics that give you complete visibility into your telemetry ingestion patterns.

Understanding OpenTelemetry Spans in Detail

Debugging errors in distributed systems can be a challenging task, as it involves tracing the flow of operations across numerous microservices. This complexity often leads to difficulties in pinpointing the root cause of performance issues or errors. OpenTelemetry provides instrumentation libraries in most programming languages for tracing.

APM vs Observability: Observing beyond APM

In my previous post I made a bold, sweeping statement that APM is not - in the most specific sense - a subset of observability. I stand by that because words matter and - like many "monitoring engineers" (IT folks who make monitoring and observability their specialty) - I, too, bear scars from the flame-wars on Twitter back in the 2020s where we fought internecine battles over the proper definition of (and number of pillars in) “observability”.

Detect Email Delays Before They Hit Users - Monitor O365 with eG Enterprise

Email downtime or email delays can significantly disrupt business operations, making proactive monitoring essential to avoid problems. In today’s hybrid work environments, email remains a critical communication channel for customer interactions, internal collaboration, and workflow approvals. Even brief outages or delays in email delivery can lead to missed opportunities, poor customer experience, SLA (Service Level Agreement) breaches and reputational damage.

Breaking Free from SQLite - Why We Added PostgreSQL Support to SigNoz

"Let us support different relational databases apart from SQLite. Nobody likes to run SQLite in production." This was one of the most requested features from our community. Your requests have been heard, and we've added support for different relational databases, starting with PostgreSQL. If you're self-hosting SigNoz, you no longer need to worry about SQLite's limitations. Let's dive into what we've built and why it matters for your production deployments.

The Real ROI of Using an APM Tool for SaaS Businesses

For every SaaS leader, engineer, and operations professional, growth is always the main goal. You’re expected to release features quickly, keep user experiences smooth, and manage everything within a limited budget. But behind the scenes, your application may have hidden issues such as slow performance, unnoticed errors, and laggy transactions that quietly eat away at revenue, reduce customer trust, and exhaust your engineering team.

Query Builder v5 - Two Years of Technical Debt, 80 Closed Issues, and a Fundamental Rethinking

In 2022, we had three different query interfaces. Logs had a custom search syntax with no autocomplete. Traces only had predefined filters - no query builder at all. Metrics had a raw PromQL input box where you'd paste queries from somewhere else and hope they worked. Each system spoke a different language. An engineer debugging a production issue had to context-switch not just between data types, but between entirely different mental models of how to query data.

Monitor Cloud-Native & Hybrid Apps and Business Transactions With Observability Cloud APM

As organizations modernize, most applications don’t fit neatly into one category—they span both traditional three-tier architectures and cloud-native microservices. To monitor these hybrid environments effectively, teams need APM tools that can seamlessly connect the two worlds.

Interactive Dashboards - Click Any Panel to Start Debugging

Your dashboard shows a latency spike. To investigate it, you copy the query, open logs in a new tab, paste and modify the query, lose your dashboard filters, and repeat for traces. By the time you find the issue, you have 15 tabs open. Starting today, you can click any panel and investigate right there. All your filters and variables carry over. No more tab juggling.

Why it's time to move beyond APM: Monitoring from the user's perspective

For years, organizations have relied on Application Performance Monitoring (APM) as the backbone of their observability strategy. The idea was simple: collect as many logs, metrics, and traces as possible, then sift through the data to uncover insights. But as applications have shifted to the cloud and become increasingly API-driven, that model has broken down.

Interactive Dashboards | SigNoz Launch Week 5.0 | Day 1

Interactive Dashboards eliminate the current workflow of opening new tabs and manually recreating queries every time you need to investigate a spike or anomaly. Click directly on any data point to drill down and explore. It is built for developers who need to debug production issues efficiently, not juggle multiple tabs.

Monitoring Claude Code Usage with OpenTelemetry and SigNoz

In this video, we’ll walk you through how to monitor Claude Code activity using OpenTelemetry and SigNoz. You’ll learn how to instrument your usage, capture telemetry data, and visualize it with SigNoz to get better insights into your system performance. Whether you’re exploring observability for AI workloads or looking for an open-source solution to monitor your LLM activity, this guide will help you get started.

Full Session Simulation - Simulate Anything, Everything, Anywhere

Full Session Simulation is a powerful troubleshooting strategy. Have you ever been in a situation where everything on your dashboards looks green, but users are still encountering issues and raising support tickets? That clichéd “everything is fine on our side” moment is not just frustrating for everyone. It’s risky! Because when you can’t replicate what the user is experiencing, you’re flying blind.

What is APM Tracing?

APM tracing records the complete execution path of a request as it travels through your system, including database queries, external API calls, cache lookups, message queue events, and inter-service requests. Each step is captured with precise start and end timestamps, duration, and context such as service name, operation name, and relevant attributes. This lets you pinpoint where latency or errors originate without piecing together metrics and logs manually.
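
The fields described above can be pictured with a small, self-contained sketch. This is not a real APM agent or the OpenTelemetry API: the `Span` class, the `span` context manager, and the `slowest` helper are hypothetical names used only to show how start and end timestamps, duration, and context attributes fit together.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    operation: str
    start_ns: int
    end_ns: int = 0
    attributes: dict = field(default_factory=dict)
    error: bool = False

    @property
    def duration_ms(self):
        return (self.end_ns - self.start_ns) / 1e6


recorded = []  # stand-in for an exporter that ships spans to a backend


@contextmanager
def span(service, operation, **attributes):
    """Time one step of a request and record it, flagging errors."""
    # A monotonic clock keeps durations safe from wall-clock adjustments.
    current = Span(service, operation, time.monotonic_ns(),
                   attributes=dict(attributes))
    try:
        yield current
    except Exception:
        current.error = True
        raise
    finally:
        current.end_ns = time.monotonic_ns()
        recorded.append(current)


def slowest(spans):
    """Return the span that contributed the most latency."""
    return max(spans, key=lambda s: s.duration_ms)
```

Wrapping a database call in `with span("checkout", "db.query", table="orders"):` and a cache lookup in another `span(...)` block, then calling `slowest(recorded)`, points at the step where the latency originated.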

Cost Controls and so Much More: Issue Detection Through Usage Analysis

Keeping tabs on cloud spending across multiple organizations and vendors, including Datadog, can be tough and costly. If you're not tracking expenses, you're also missing other critical insights. The Flight Centre Travel Group (FCTG) faced this when moving to Datadog, needing to monitor costs across numerous organizations and over 180 Azure subscriptions. After a rapid migration, new cost reports quickly revealed more than just financial benefits. Unusual spending patterns often highlighted incidents, bugs, or security issues, offering early warnings about internal system problems.

Bridging the Gap: Legacy Systems and Modern Observability

Technology moves quickly, and while the spotlight has shifted to dynamic, cloud-based systems, many organizations have legacy applications and infrastructure that they must maintain. In this fireside chat, Datadog’s Matt Moore (Principal Observability Strategist) will host James Flores (Enterprise Systems Engineer) at Australian Community Media to discuss their journey of modernization and bridging legacy systems with the cloud using a bit of ingenuity and observability.

Bringing Observability to Claude Code: OpenTelemetry in Action

AI coding assistants like Claude Code are becoming core parts of modern development workflows. But as with any powerful tool, the question quickly arises: how do we measure and monitor its usage? Without proper visibility, it’s hard to understand adoption, performance, and the real value Claude brings to engineering teams. For leaders and platform engineers, that lack of observability can mean flying blind when it comes to understanding ROI, productivity gains, or system reliability.

Azure Data Factory Monitoring Integration

Microsoft Azure Data Factory is a cloud-based data integration service provided by Microsoft Azure. It enables you to create, manage, and automate data workflows that move and transform data from different sources to various destinations. Essentially, ADF allows you to design, orchestrate, and manage data pipelines, making it easier to work with large volumes of data across on-premises and cloud environments.

kubectl logs: How to View & Tail Kubernetes Pod Logs

When debugging containerized applications in Kubernetes, kubectl logs serves as your primary command-line tool for accessing container logs directly. Understanding how to effectively retrieve, filter, and analyze logs becomes essential for maintaining application health and resolving issues quickly, especially in multi-container environments where correlation across services can make or break your troubleshooting efforts.

How to Reduce Errors and Improve Reliability in High-Traffic Node.js Applications with APM?

Node.js has become the go-to runtime for building modern, high-performance applications. Its event-driven, non-blocking I/O model makes it particularly well-suited for apps that demand speed and scalability, such as real-time chats, gaming backends, streaming platforms, fintech dashboards, and e-commerce systems. It’s no surprise that some of the world’s largest companies, such as Netflix, PayPal, LinkedIn, and Walmart, rely on Node.js to deliver services at scale.