Customers over control: how we measure On-call reliability

Our On-call product has a lot of great features: configuring escalation paths, viewing rotas and schedules, requesting cover, etc. However, when framing its reliability, we reduce it to two critical pieces of functionality. It’s not that we’re happy if only these parts are working, but they are the most important parts. In this post, I’ll go into more detail on how we think about their reliability.

Your Path to Autonomous OT Communication Networks: From Reactive Operations to Self Optimising OT Networks

Power networks (DSOs, TSOs and generation) are under pressure from every direction. They need to improve reliability and sustainability, deliver real-time customer insight, and meet increasingly stringent regulations. In response, power generation has evolved from a simple centralized model to a decentralized one that draws on a mix of diverse sources: centralized carbon-based, nuclear and renewable generation plants, as well as distributed energy resources (DERs), some of them located on customers' premises.

Policy as Code Beyond the Pipeline: What Actually Breaks, Drifts, and Gets Audited

Most teams first adopt policy as code (PaC) in their delivery pipelines. If something breaks a rule, the system stops it before it goes live. That is useful because it catches problems early, but in real-world environments the hardest issues to resolve do not come from changes that fail validation. They come from changes that happen later, elsewhere, or outside the pipeline entirely.

IaaS cost control: how private cloud reduces enterprise cloud spend

Over the past five years, one of the most consistently tracked figures in the UK business technology sector has been the flight from public cloud. Barclays' 2021 CIO survey revealed that 43% of enterprises planned to shift workloads away from public cloud. By 2024, that figure had grown to 83%. Research for Pulsant in 2025 found that 87% of UK businesses planned to repatriate data away from the public cloud within the next two years.

Unified observability for Alibaba Cloud with Datadog

Alibaba Cloud is a major cloud provider in APAC, offering industry-leading foundational AI models in addition to compute, managed databases, object storage, and Kubernetes through its Container Service for Kubernetes (ACK). Teams choose Alibaba Cloud for its infrastructure availability across Asia Pacific and its managed services. For SREs and platform engineers, that often means running Alibaba Cloud alongside AWS, Google Cloud, or Microsoft Azure.

Deploy Datadog Kubernetes Autoscaling at scale

Every Kubernetes environment accumulates waste over time. Teams overprovision CPU and memory requests to avoid performance risk, run idle replicas to preserve headroom, and leave Horizontal Pod Autoscalers (HPAs) untouched long after workload behavior has changed. Some of this waste can be addressed at the node level, where Datadog Cluster Autoscaling helps teams rightsize capacity.

Monitor Azure Managed Redis with Datadog

Azure Managed Redis is Microsoft’s fully managed, enterprise-tier in-memory data store. It is designed for the low-latency caching, session storage, and real-time data needs of modern applications, including AI workloads that depend on fast vector and embedding lookups. Because user-facing applications often query Redis directly, even small regressions in latency, hit rate, or memory pressure can degrade the user experience.

Monitor JavaScript framework routing with Datadog RUM

Modern web applications rely on frameworks like Next.js, Vue, and Angular to handle routing and rendering. In these architectures, navigation happens within the application rather than through full page loads, which makes it difficult for traditional browser instrumentation to capture what users actually experience. As a result, teams often see misleading view names, missing navigations, and errors that are either misattributed or not captured at all, especially during hydration or lazy loading.

Instrument LangGraph agents with Datadog: a practical guide

AI agents tend to function as black boxes, and it can be difficult to trace and understand agent workflows end-to-end in order to characterize performance. Particularly, you need visibility into the following: By tracing full agent runs with LLM Observability, Datadog AI Agent Monitoring enables you to visualize workflows with flame graphs and quickly spot sources of failures and latency.

Where to find lost engineering time in your delivery pipeline

If your infrastructure is configured outside version control, through dashboards, scripts, or manual steps, environment drift is the expected outcome. Most teams have lived this scenario. A feature works in staging but breaks in production. Two hours later, someone finds a configuration setting that was changed in staging three weeks ago and never documented.