Operations | Monitoring | ITSM | DevOps | Cloud

Key metrics for monitoring Karpenter

In Part 1 of this series, we explored how Karpenter’s architecture enables just-in-time provisioning and active node consolidation. Because Karpenter is constantly making infrastructure decisions based on real-time scheduling pressure, its metrics can give you early warning of provisioning slowdowns, cloud API throttling, and misconfigurations that prevent it from scaling the way you expect.

Tools for collecting metrics and logs from Karpenter

In the first two parts of this series, we explored how Karpenter’s architecture enables just-in-time provisioning and active node consolidation, and we identified the key Karpenter metrics you should track to keep your cluster performant and cost-efficient. In this post, we’ll look at vendor-agnostic tools you can use to capture these signals.

Monitor Karpenter with Datadog

In this series, we’ve explored Karpenter’s architecture, the key metrics that reflect its health and performance, and the vendor-agnostic tools for collecting and analyzing its telemetry data. In this final post, we’ll show you how Datadog helps you monitor and alert on Karpenter alongside your Kubernetes cluster and the infrastructure that runs it.

What your product data is actually saying

As tools such as AI agents become more integrated with the instrumentation, governance, and centralization of product analytics data, product managers (PMs) still own the meaning of those events and the connected outcomes. Knowing when to trust the data, forming strong hypotheses, and being able to act on the insights requires an expert in the loop.

Approaching your observability migration with the right mindset

This guest blog post is authored by Nick Vecellio, Principal Engineer and Co-founder of NoBS, a Premier Datadog Partner specializing in hands-on Datadog migrations and optimizations. At NoBS, we help enterprises migrate their observability stack to Datadog. Teams often come to us after a migration has technically “worked,” but the new setup requires optimization tweaks to provide the clarity, reliability, or operational benefits they’re looking for.

Four ways engineering teams use the Datadog MCP Server to power AI agents

Since the Datadog Model Context Protocol (MCP) Server first launched in Preview, Datadog has experienced an overwhelming amount of interest and feedback from customers. We appreciate those who requested access to test our product, provided feedback, and shared their stories of how the MCP Server helped them overcome engineering challenges.

Meet the new Bits AI SRE: Deeper reasoning, twice as fast

When we announced Bits AI SRE at DASH 2025, we introduced an autonomous SRE agent that investigates alerts the moment they trigger. Bits AI SRE reads the same telemetry data as your team, understands your architecture, and follows your runbooks to identify likely root causes before you even open your laptop. It’s your AI teammate that’s always on call.

Use plain English to query your multi-cloud infrastructure in Resource Catalog

Modern cloud environments include thousands of resources across providers, teams, and accounts. Organizations need the ability to quickly locate the right resources so that they can manage resource compliance and troubleshoot issues. When engineers need to answer questions such as which databases are still on extended support or which storage buckets lack encryption, they often have to switch consoles, use provider-specific query languages, and know obscure version strings or configuration flags.

Simplifying troubleshooting across the user journey with Datadog Synthetic Monitoring

Every digital experience is a chain reaction. A customer logs in to an application, an API authenticates the request, a backend call retrieves data, a page loads, and somewhere along the way, something might break. When it does, teams often chase symptoms while the root cause remains hard to find. The more distributed the system, the more difficult it becomes to see how one small failure can cascade into a visible outage.