
IaaS cost control: how private cloud reduces enterprise cloud spend

Over the past five years, one of the most consistently tracked figures in the UK business technology sector has been the flight from public cloud. Barclays' 2021 CIO survey found that 43% of enterprises planned to shift workloads away from public cloud. By 2024, that figure had grown to 83%. Research for Pulsant in 2025 found that 87% of UK businesses planned to repatriate data from the public cloud within the next two years.

Unified observability for Alibaba Cloud with Datadog

Alibaba Cloud is a major cloud provider in APAC, offering industry-leading foundational AI models in addition to compute, managed databases, object storage, and Kubernetes through its Container Service for Kubernetes (ACK). Teams choose Alibaba Cloud for its infrastructure availability across Asia Pacific and its managed services. For SREs and platform engineers, that often means running Alibaba Cloud alongside AWS, Google Cloud, or Microsoft Azure.

Deploy Datadog Kubernetes Autoscaling at scale

Every Kubernetes environment accumulates waste over time. Teams overprovision CPU and memory requests to avoid performance risk, run idle replicas to preserve headroom, and leave Horizontal Pod Autoscalers (HPAs) untouched long after workload behavior has changed. Some of this waste can be addressed at the node level, where Datadog Cluster Autoscaling helps teams rightsize capacity.

Monitor Azure Managed Redis with Datadog

Azure Managed Redis is Microsoft’s fully managed, enterprise-tier in-memory data store. It is designed for the low-latency caching, session storage, and real-time data needs of modern applications, including AI workloads that depend on fast vector and embedding lookups. Because user-facing applications often query Redis directly, even small regressions in latency, hit rate, or memory pressure can degrade the user experience.

Monitor JavaScript framework routing with Datadog RUM

Modern web applications rely on frameworks like Next.js, Vue, and Angular to handle routing and rendering. In these architectures, navigation happens within the application rather than through full page loads, which makes it difficult for traditional browser instrumentation to capture what users actually experience. As a result, teams often see misleading view names, missing navigations, and errors that are either misattributed or not captured at all, especially during hydration or lazy loading.

Instrument LangGraph agents with Datadog: a practical guide

AI agents tend to function as black boxes, and it can be difficult to trace and understand agent workflows end-to-end in order to characterize performance. By tracing full agent runs with LLM Observability, Datadog AI Agent Monitoring enables you to visualize workflows with flame graphs and quickly spot sources of failures and latency.

Where to find lost engineering time in your delivery pipeline

If your infrastructure is configured outside version control through dashboards, scripts, or manual steps, environment drift is the expected outcome. Most teams have lived this scenario. A feature works in staging but breaks in production. Two hours later, someone finds a configuration setting that was changed in staging three weeks ago and never documented.

Root Cause Analysis: How Engineering Teams Fix Production Issues Faster

When a production incident strikes (a sudden latency spike, a cascading API failure, a service returning 500s at scale), every minute of downtime has a cost. Root cause analysis (RCA) is the process that turns that chaos into a clear answer: what actually broke, and why. Not the symptom that triggered the alert, but the underlying cause.

Top 9 Network Performance Metrics You Should Measure in 2026

How do you know if your network is actually healthy right now? For most IT teams, answering that question means jumping between multiple tools, dashboards, and alerts, only to end up with more uncertainty than clarity. The problem is not missing data. It is knowing which signals matter, what normal really looks like, and when performance issues start affecting users and business operations. Modern networks generate thousands of metrics every minute, but not every spike or alert deserves attention.