
Amazon Isn't Eating Its Own DNS Dog Food

On October 19-20, 2025, Amazon Web Services (AWS) experienced a significant outage (AWS status) affecting its US-EAST-1 region in northern Virginia. The root cause was DNS resolution failures for DynamoDB’s API endpoints, which cascaded across AWS’s interconnected services, disrupting major platforms including Snapchat, McDonald’s, Disney+, Roblox, Coinbase, Reddit, and Amazon’s own services.

Build Vs. Buy? Why Creating Your Own Cost Management Platform Is Futile

The siren song of building a custom, internal cloud cost management platform is enticing. Many brilliant engineering teams are convinced they can come up with a bespoke solution that perfectly fits their needs. They look at their company’s unique infrastructure and decide they can DIY cost management without having to rely on an external vendor. Believe me, I get the temptation.

4 Everyday IT Headaches You Can Eliminate with Enterprise IT Automation

Every IT operator anywhere on the team ladder dreads this feeling: another day, another flood of service desk tickets. Like cockroaches, they come in waves and they’re repetitive. Worse still, they distract your teams from higher-value work. Ironically, given how much disruption they can cause, most of these tickets are not complex incidents or novel challenges. They’re the same everyday IT headaches your enterprise has been dealing with for years.

The Hidden Risk of DNS - Lessons from the AWS Outage & Why You Need DNS Spy Monitoring NOW

On October 20, 2025, much of the internet came to a halt. Apps wouldn’t load. Payments failed. Cloud dashboards went dark. From Fortnite to Alexa, Snapchat, and countless business platforms, users across the world were suddenly offline — all because DNS broke inside Amazon Web Services’ (AWS) US-East-1 region.

Building Intelligent Search: A Tutorial on Aiven for OpenSearch and Vertex AI

Aiven for OpenSearch is a fully-managed service that provides an ideal way to run OpenSearch on Google Cloud. It is designed for companies looking to operate search applications without taking on the burden and complexity of self-managing the infrastructure in the cloud. Running on Google Cloud, the service is built upon core infrastructure like Google Compute Engine, Google Cloud Storage, and Private Service Connect.

Detect and map third-party outages with Datadog External Provider Status

Modern applications depend on dozens of external cloud platforms, APIs, and SaaS services to function. But when those providers experience issues, engineers often spend valuable time asking a basic question: Is the problem with us or with them? Provider-maintained status pages are often slow to update, leaving teams waiting for confirmation while incidents escalate. This delay wastes valuable time, prolongs investigations, and risks customer trust.

Optimize HPC jobs and cluster utilization with Datadog

High-performance computing (HPC) environments support some of the most critical workloads in the world—from asset pricing models in financial institutions to molecular simulations in drug discovery. These workloads often span hundreds of thousands of cores, depend on specialized infrastructure such as GPUs, and run for extended periods. As a result, performance and efficiency are critical.

Introducing Updog.ai: Real-time provider status from Datadog

When external SaaS providers or cloud services degrade or go down, engineers often find themselves wondering if the issue they’re encountering is local or more widespread. The answers they find are usually slow to surface, limited in detail, or entirely dependent on the provider’s updates. Vendor-controlled status pages and third-party aggregators don’t provide the timely, independent visibility that’s necessary to quickly and accurately identify the root cause of slowdowns.