New York City, NY, USA
  |  By Addie Beach
Software development and delivery requires cross-team collaboration and cross-service orchestration—all while ensuring that organizational standards for quality, security, and compliance are consistently met. Without careful monitoring, you risk a lack of visibility into delivery workflows, making it difficult to evaluate how they impact release velocity and stability, developer experience, and application performance.
  |  By Bowen Chen
As your organization adopts modern technologies and scales its workloads, it’s critical that your CI/CD environment follows suit to maintain smooth development and testing workflows. Adopting modern CI/CD tools (e.g., pipeline runners and testing frameworks) and best practices can increase the agility and resilience of your CI/CD environment as well as enable your teams to configure new jobs, stages, and tests to meet changing business requirements.
  |  By Trammell Saltzgaber
Over 30,000 people flocked to Las Vegas to see the latest and greatest from Google Cloud and its partners at Google Cloud Next 2024. As a long-time Google Cloud partner and recipient of two Google Cloud Technology Partner of the Year awards this year, we were there in full force to showcase our unified observability and security solutions and engage with the Google Cloud community.
  |  By Felix Geisendörfer
We are excited to release our tooling for continuous profile-guided optimization (PGO) for Go. You can now reduce the CPU usage of your Go services by up to 14 percent by adding the following one line before the go build step in your CI pipeline: You will also need to supply a DD_API_KEY and a DD_APP_KEY in your environment. Please check our documentation for more details on setting this up securely.
  |  By Shah Ahmed
Modern, distributed application architectures pose particular challenges when it comes to coordinating incident management. DevOps, SREs, and security teams—often spread out across separate locations and time zones, and equipped with limited knowledge of each other’s services—must work quickly to collaboratively triage, troubleshoot, and mitigate customer impact.
  |  By Zara Boddula
Maintaining service availability is a challenge in today’s complex cloud environments. When a critical incident arises, the underlying cause can be buried in a sea of alerts from interconnected services and applications. Central operations teams often face an overload of disparate alerts, causing confusion, delayed incident response, alert fatigue, and redundant resolution efforts. These issues can negatively impact revenue and customer experience, especially during an outage.
  |  By Ivan Ilichev
Datadog’s Container Images view provides key insights into every container image used in your environment, helping you quickly detect and remediate security and performance problems that can affect multiple containers in your distributed system. In addition to having a snapshot of the performance of your container fleet, it’s also critical to understand large-scale trends in security posture and resource utilization over time.
  |  By Jordan Obey
Machine learning (ML) platforms such as Amazon Sagemaker, Azure Machine Learning, and Google Vertex AI are fully managed services that enable data scientists and engineers to easily build, train, and deploy ML models. Common use cases for ML platforms include natural language processing (NLP) models for text analysis and chatbots, personalized recommendation systems for e-commerce web applications and streaming services, and predictive business analytics.
  |  By Thomas Sobolik
Regardless of how much effort teams put into developing, training, and evaluating ML models before they deploy, their functionality inevitably degrades over time due to several factors. Unlike with conventional applications, even subtle trends in the production environment a model operates in can radically alter its behavior. This is especially true of more advanced models that use deep learning and other non-deterministic techniques.
  |  By Nicholas Thomson
Datadog’s infrastructure comprises hundreds of distributed services, which are constantly discovering other services to network with, exchanging data, streaming events, triggering actions, coordinating distributed transactions involving multiple services, and more. Implementing a networking solution for such a large, complex application comes with its own set of challenges, including scalability, load balancing, fault tolerance, compatibility, and latency.
  |  By Datadog
Life360 is a family safety app driven by the mission to protect and connect people, pets, and things. Naveen Puvvula, Director of Cloud Operations, and Jesse Gonzalez, Senior Staff Site Reliability Engineer, discuss why observability is critical to achieving reliability and how they continue to deliver real-time location updates for their users even during high-traffic events. Finally, they share their advice for other tech leaders in the industry to choose partners that align closely to solve problems together and technologies that reduce friction and improve developer joy.
  |  By Datadog
Accelerate time to market, reduce IT costs, improve uptime, and onboard new team members faster with real-time, full-stack observability and security.
  |  By Datadog
Learn how Datadog’s generative AI assistant, Bits AI, can help organizations accelerate incident investigations with auto-generated summarization to get you up to speed quickly, fetch information about past related events, update teams and statuses all through Slack.
  |  By Datadog
Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we put the Spotlight on Bits AI for Incident Management.
  |  By Datadog
Monitoring backend signals has been standard practice for years, and tech companies have been alerting their SRE and software engineers when API endpoints are failing. But when you’re alerted about a backend issue, it’s often your end users who are directly affected. Shouldn’t we observe and alert on this user experience issues early on? As frontend monitoring is a newer practice, companies often struggle to identify signals that can help them pinpoint user frustrations or performance problems.
  |  By Datadog
In 2018 Datadog released Watchdog to proactively detect anomalies on your observability data. But what defines an anomaly? How do you avoid false positives? At Datadog Summit London 2024, Nils Bunge, product manager at Datadog, shared the story of the creation of the first Datadog AI feature (Watchdog Alert), what we learned from it and how we applied those lessons to all the added AI functionalities across the years.
  |  By Datadog
On This Month in Datadog, we’re covering Datadog Security for Google Cloud, our integration with NVIDIA Triton Inference Server, and Sankey visualizations, which offer overviews of common paths users take across your app.
  |  By Datadog
There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no golden path for how to organize things. As Datadog has shifted from a startup to a quickly-growing public company, we’ve seen our own SRE practice evolve. With over 22,000 customers sending trillions of data points each day, keeping Datadog reliable is critical to our business.
  |  By Datadog
In this episode we'll visit the world of predictive analytics and machine learning and uncover how these cutting-edge technologies are transforming the way Datadog monitors and improves its services. We’ll focus our conversation on two key aspects: using advanced statistical methods for proactive monitoring and the strategic implementation of machine learning for algorithm enhancement.
  |  By Datadog
As Docker adoption continues to rise, many organizations have turned to orchestration platforms like ECS and Kubernetes to manage large numbers of ephemeral containers. Thousands of companies use Datadog to monitor millions of containers, which enables us to identify trends in real-world orchestration usage. We're excited to share 8 key findings of our research.
  |  By Datadog
The elasticity and nearly infinite scalability of the cloud have transformed IT infrastructure. Modern infrastructure is now made up of constantly changing, often short-lived VMs or containers. This has elevated the need for new methods and new tools for monitoring. In this eBook, we outline an effective framework for monitoring modern infrastructure and applications, however large or dynamic they may be.
  |  By Datadog
Where does Docker adoption currently stand and how has it changed? With thousands of companies using Datadog to track their infrastructure, we can see software trends emerging in real time. We're excited to share what we can see about true Docker adoption.
  |  By Datadog
Build an effective framework for monitoring AWS infrastructure and applications, however large or dynamic they may be. The elasticity and nearly infinite scalability of the AWS cloud have transformed IT infrastructure. Modern infrastructure is now made up of constantly changing, often short-lived components. This has elevated the need for new methods and new tools for monitoring.
  |  By Datadog
Like a car, Elasticsearch was designed to allow you to get up and running quickly, without having to understand all of its inner workings. However, it's only a matter of time before you run into engine trouble here or there. This guide explains how to address five common Elasticsearch challenges.
  |  By Datadog
Monitoring Kubernetes requires you to rethink your monitoring strategies, especially if you are used to monitoring traditional hosts such as VMs or physical machines. This guide prepares you to effectively approach Kubernetes monitoring in light of its significant operational differences.

Datadog is the essential monitoring platform for cloud applications. We bring together data from servers, containers, databases, and third-party services to make your stack entirely observable. These capabilities help DevOps teams avoid downtime, resolve performance issues, and ensure customers are getting the best user experience.

See it all in one place:

  • See across systems, apps, and services: With turn-key integrations, Datadog seamlessly aggregates metrics and events across the full devops stack.
  • Get full visibility into modern applications: Monitor, troubleshoot, and optimize application performance.
  • Analyze and explore log data in context: Quickly search, filter, and analyze your logs for troubleshooting and open-ended exploration of your data.
  • Build real-time interactive dashboards: More than summary dashboards, Datadog offers all high-resolution metrics and events for manipulation and graphing.
  • Get alerted on critical issues: Datadog notifies you of performance problems, whether they affect a single host or a massive cluster.

Modern monitoring & analytics. See inside any stack, any app, at any scale, anywhere.