Datadog

Monitor Ray applications and clusters with Datadog

Ray is an open source compute framework that simplifies the scaling of AI and Python workloads for on-premise and cloud clusters. Ray integrates with popular libraries, data stores, and tools within the machine learning (ML) ecosystem, including Scikit-learn, PyTorch, and TensorFlow. This gives developers the flexibility to scale complex AI applications without making changes to their existing workflows or AI stack.

Track service provider outages with IsDown and Datadog

When your apps and infrastructure rely on dozens of third-party providers for key functionality, it’s important to closely track their outages. If a service you rely on goes down, you need to move quickly to limit the outage’s impact on your users. IsDown provides a detailed status page aggregator and uptime monitoring for all your third-party dependencies.

Building an Internal Development Platform (IDP): A Journey of Innovation and Growth

As your organization grows, the increased number of engineers and services can put a strain on your infrastructure and ops teams. As Latin America’s largest online commerce and payments ecosystem, MercadoLibre needed to solve this scaling challenge. So we embarked on a mission to build an Internal Development Platform (IDP). We’ll highlight our transformative journey and how the IDP grew to manage over 26,000 microservices, while delivering a highly productive environment to MercadoLibre’s 12,000+ developers. In this session, you’ll learn about the challenges and solutions required to successfully build your own IDP.

Monitor your chaos engineering experiments with Steadybit's offering in the Datadog Marketplace

Steadybit is a software reliability platform that uses chaos engineering and fault injection to help organizations improve the stability and performance of their applications. By letting you simulate turbulent scenarios in a controlled environment, Steadybit helps you identify and mitigate potential system issues to reduce downtime and improve resilience.

FinOps and Cloud Cost Optimization

As companies scale, it’s become increasingly important to keep cloud cost management and optimization top of mind. In this talk, Yuval Yogev from Sygnia walks you through Sygnia’s optimization journey of cutting their total cloud costs in half. Yogev also shares insights into how you can optimize your own organization’s cloud usage and spend.

A deep dive into CPU requests and limits in Kubernetes

In a previous blog post, we explained how containers’ CPU and memory requests can affect how they are scheduled. We also introduced some of the effects CPU and memory limits can have on applications, assuming that CPU limits were enforced by the Completely Fair Scheduler (CFS) quota. In this post, we are going to dive a bit deeper into CPU and share some general recommendations for specifying CPU requests and limits.

CTO Fireside Chat

Building large-scale technical systems is hard, but building and scaling high-performing technical organizations is even more difficult. In this session, Datadog Co-founder and CTO Alexis Lê-Quôc will sit down with Prashant Pandey, Head of Engineering at Asana, to discuss their approach to engineering leadership. They’ll share the hard-learned lessons from their long careers to help you cultivate better technical teams, covering topics such as staying in tune with new technologies, enabling innovation, shipping modern ML and AI-based features, and scaling teams.

Highlights from AWS re:Invent 2023

Whether or not you made the journey to this year’s re:Invent, there’s always a variety of great announcements lost amid an action-packed week of keynotes, breakouts, expo hall demos, and networking sessions. No need to worry—we’re always happy to be a big part of the re:Invent experience and share our observations with you.