Accelerate root cause analysis with Watchdog and Faulty Kubernetes Deployment

Understanding and managing the impact of Kubernetes changes is one of the biggest challenges for modern DevOps teams. Every modification to a manifest, whether it’s adjusting memory limits, tweaking CPU allocations, or updating container images, has the potential to destabilize services or degrade performance.

Monitor Cloud Run with Datadog

In Part 1 of this series, we introduced the key Cloud Run metrics you should be monitoring to ensure that your serverless containerized applications are reliable and can maintain optimal performance. In Part 2, we walked through a couple of Google Cloud’s built-in monitoring tools that you can use to view those key metrics and check on the health, status, and performance of your serverless containers.

How to collect Google Cloud Run metrics

In Part 1 of this series, we looked at key Cloud Run metrics you can monitor to ensure the reliability and performance of your serverless containerized workloads. We’ll now explore how you can access those metrics within Cloud Run and Google’s dedicated observability tool, Cloud Monitoring. We’ll also look at several ways you can view and explore logs and traces in the Cloud Run UI and Google Cloud CLI.

Key metrics for monitoring Google Cloud Run

Google Cloud Run is a fully managed platform that enables you to deploy and scale container-based serverless workloads. Cloud Run is built on top of Knative, an open source platform that extends Kubernetes with serverless capabilities like dynamic auto-scaling, routing, and event-driven functions. By using Cloud Run, developers can simply write and package their code as container images and deploy to Cloud Run—all without worrying about managing or maintaining any underlying infrastructure.

Investigate memory leaks and OOMs with Datadog's guided workflow

When a containerized application crashes because it exceeded its memory limit, the cause is often tricky to investigate because several different underlying issues can produce the same symptom. A program might not be freeing memory properly, or it might just not be configured with appropriate memory limits. Investigation methods also differ based on the language and runtime your program uses.

Unlock advanced query functionality with distribution metrics

As organizations break down monolithic applications in favor of a more distributed, microservices-based architecture, they need to collect increasing amounts of metric data. But how do you summarize this data to provide insights at scale? Averages are simple to calculate but can be misleading, especially for increasingly complex and distributed environments that contain outlier values that skew the average.
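To make the point about misleading averages concrete, here is a minimal sketch in plain Python. It is not Datadog's implementation of distribution metrics. The `summarize` helper and the latency values are hypothetical, chosen only to show how two outliers pull the mean away from what most requests experience, while the median and 99th percentile describe the typical case and the tail separately.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean, median, and 99th percentile of latency samples."""
    ordered = sorted(latencies_ms)
    # With n=100, quantiles() returns 99 cut points; index 98 is the p99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": cuts[98],
    }


# Hypothetical data: 98 fast requests (20 ms) and 2 slow outliers (5000 ms).
samples = [20.0] * 98 + [5000.0] * 2
summary = summarize(samples)

for name, value in summary.items():
    print(f"{name}: {value:.1f} ms")
```

With this sample data the mean comes out near 119.6 ms even though 98 percent of requests took 20 ms. The median stays at 20 ms, and the p99 exposes the slow tail that the average blurs.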

Datadog acquires Quickwit

Organizations in financial services, insurance, healthcare, and other regulated industries must meet stringent data residency, privacy, and regulatory requirements while maintaining full visibility into their systems. This becomes challenging when logs need to remain at rest in customers’ environments or specific regions, hindering teams’ ability to attain seamless observability and insight.

Kickstart your investigations and reduce alert noise with Doctor Droid's offering in the Datadog Marketplace

Being an on-call engineer is often overwhelming, requiring you to pivot between tickets, dashboards, runbooks, and different data sources as you try to separate legitimate incidents from unnecessary noise. Not only does the process of investigating irrelevant alerts take time away from remediating important issues, but it also compounds alert fatigue.

How to monitor Snowflake performance and data quality with Datadog

In Part 2 of this series, we looked at Snowflake’s built-in monitoring services for compute, query, and storage. In this post, we’ll demonstrate how Datadog complements and extends Snowflake’s existing monitoring and data visualization capabilities, enabling teams to get deeper visibility and extract more valuable insights from their Snowflake data.