The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Getting Started Guide with Netdata

New to Netdata? Start here. In this quick and practical guide, we’ll help you get set up and confident with Netdata in just a few minutes. You’ll learn how to: access your Netdata Space; connect your nodes (servers, VMs, containers, network devices, and more); organize your infrastructure with Spaces and Rooms; collaborate with your team in real time; explore alerting and integrations; and customize notifications so you’re only alerted when it truly matters.

Overcoming the common networking challenges when connecting the "big three" clouds

Adopting a multi-cloud strategy is common for many businesses because it promises flexibility in managing their data. However, this isn’t without complexity, especially when it comes to networking. In a recent podcast episode, Trent Blakely and Jay Turner unpacked some of the most frequently missed challenges that come with multi-cloud deployments.

Reliability means being there right when your customer needs you

When your systems are reliable, it means your customers can count on your applications to be there for them. Full transcript: To me, reliability means a good night's sleep, and being able to confidently go to bed and wake up the next day feeling ready to get out there and do my best work and not worry about the experience that our customers might have had through the night.

The Rise of Tech Events in India: A New Era for Cloud-Native Computing

As India emerges as a significant player in the global public cloud landscape, with its public cloud services market projected to reach $25.5 billion by 2028 at a CAGR of 24.3% for 2023-28, the country is witnessing a surge in tech events. This growth is mirrored in the live events market, which is experiencing a 15% YoY growth, fostering a stronger community and facilitating the exchange of ideas and innovation in the public cloud sector.

FinOps For AI: How Crawl, Walk, Run Works For Managing AI Costs

“It started as an experiment.” That’s how it begins at most companies. A small team spins up a few GPU instances to train a proof-of-concept model. Maybe it’s a fraud detection algorithm. Maybe it’s GenAI for support tickets. Either way, it’s just a test. Then the results come in, and they’re promising. Suddenly, that model is powering new features. Teams are fine-tuning LLMs in parallel.

How to Build Resilient Networks for AI Production Workloads

Production AI needs a network that can keep up. Learn why private, scalable connectivity is the key in our webinar recap with Vultr. AI is no longer a proof-of-concept hiding in a developer lab. It’s a full-fledged production workload, and it’s hungry for data. But as enterprises move their AI strategies from theory to reality, they’re hitting a wall that isn’t about algorithms or processing power – it’s about the network.

Lost Your Work? This Git Trick Saves The Day!

Ever reset too far? Deleted a branch you needed? Thought you lost a commit forever? In this episode of Wait… Git Can Do That?, we explore git reflog, Git’s local time machine. You’ll learn how to: view every local Git action, even the messy ones; recover unreachable commits; and navigate using HEAD@{n}. Just remember: it’s local, it’s time-limited, and it’s seriously underrated. Subscribe for more Git features you didn’t know you needed.
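To make the reflog idea concrete, here is a minimal sketch in Python that parses text in the shape `git reflog` prints (short hash, `HEAD@{n}`, action, message) and finds the commit HEAD pointed to just before the most recent reset. The sample entries, hashes, and the helper names `parse_reflog` and `commit_before_last_reset` are made up for illustration; this is not taken from the episode.

```python
import re

# Illustrative sample only: these hashes and messages are invented,
# but the line shape matches what `git reflog` prints by default.
SAMPLE_REFLOG = """\
1a2b3c4 HEAD@{0}: reset: moving to HEAD~2
9f8e7d6 HEAD@{1}: commit: Add payment retry
5d6e7f8 HEAD@{2}: commit: Add login form
1a2b3c4 HEAD@{3}: commit: Initial layout
"""

# Short hash, optional "(decoration)", selector such as HEAD@{1}, then the message.
LINE_RE = re.compile(
    r"^(?P<sha>[0-9a-f]+) (?:\(.*?\) )?(?P<ref>\S+@\{\d+\}): (?P<msg>.*)$"
)


def parse_reflog(text):
    """Return a list of (sha, ref, message) tuples, newest entry first."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append((match["sha"], match["ref"], match["msg"]))
    return entries


def commit_before_last_reset(text):
    """Find where HEAD was immediately before the most recent reset.

    The entry that follows a "reset:" line is the older one, so it records
    the state the reset moved away from. Returns (sha, ref) or None.
    """
    entries = parse_reflog(text)
    for i, (_sha, _ref, msg) in enumerate(entries):
        if msg.startswith("reset:") and i + 1 < len(entries):
            older_sha, older_ref, _ = entries[i + 1]
            return older_sha, older_ref
    return None


if __name__ == "__main__":
    print(commit_before_last_reset(SAMPLE_REFLOG))
```

With the selector or hash in hand, the usual recovery options are `git reset --hard HEAD@{1}` to move the current branch back, or `git branch rescued 9f8e7d6` to pin the commit to a new branch without touching the current one (the branch name `rescued` is arbitrary).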

Demo: Running a Patch Job with Puppet Advanced Patching

With Puppet, patching is faster and easier than ever. Watch this video to learn how to set up and run a patch job with Advanced Patching in Puppet Enterprise Advanced. Puppet's Barr Iserloth and Liam Sexton cover activating Advanced Patching, creating a patch group, and running a patch job from the Puppet Enterprise console. Highlights include the easy-to-use patching GUI, custom patch groups for cross-OS patching, streamlined scheduling that obeys your defined maintenance and blackout windows, and reporting that shows you where each patch was applied.

Automating High CPU Utilization Remediation with Resolve

High CPU utilization alerts can overwhelm IT teams and disrupt user productivity—especially in virtualized environments. In this video, see how Resolve automates the end-to-end remediation process for sustained CPU spikes. From detecting alerts and creating incidents to gathering host data, verifying VM configurations, and dynamically adding vCPUs—watch how Resolve eliminates manual effort and speeds up incident resolution.
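As a rough illustration of the "sustained spike, then add vCPUs" logic described above, the following Python sketch separates a sustained spike from a brief blip and proposes a capped vCPU increase. It does not represent Resolve's product or API; the function names, the 90% threshold, the run length, and the step size are all assumptions chosen for the example.

```python
def is_sustained_spike(samples, threshold=90.0, min_consecutive=5):
    """Return True if CPU stayed at or above `threshold` for at least
    `min_consecutive` consecutive samples (hypothetical defaults)."""
    run = 0
    for value in samples:
        # Extend the current run on a hot sample, otherwise start over.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


def plan_vcpus(current, maximum, step=2):
    """Propose a new vCPU count, never exceeding the configured maximum."""
    return min(current + step, maximum)


if __name__ == "__main__":
    cpu_percent = [95, 96, 50, 97, 98, 99]  # invented readings
    if is_sustained_spike(cpu_percent, threshold=90, min_consecutive=3):
        print("proposed vCPUs:", plan_vcpus(current=4, maximum=8))
```

Requiring a run of consecutive hot samples, rather than reacting to a single reading, is one simple way to avoid resizing a VM on a momentary burst; a real workflow would also verify the VM configuration before applying any change.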

Jaeger Metrics: Internal Operations and Service Performance Monitoring

You're monitoring a microservices-based system. Alerts trigger when response times exceed 2 seconds. But when you open Jaeger, you're faced with thousands of traces, and identifying which service or operation is responsible becomes time-consuming. Jaeger metrics help reduce this friction by exposing aggregated telemetry. Instead of scanning individual traces, you get service-level and operation-level performance metrics (latency, throughput, and error rates) that highlight where the issue lies.
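The idea of aggregating spans into service-level and operation-level metrics can be sketched in a few lines of Python. This is not Jaeger's implementation or API: the `Span` class, the `aggregate` and `slowest` functions, and the sample values are hypothetical, and the 95th percentile uses the simple nearest-rank method.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span record (not Jaeger's data model)."""
    service: str
    operation: str
    duration_ms: float
    error: bool


def aggregate(spans, window_seconds):
    """Group spans by (service, operation) and compute throughput,
    error rate, and nearest-rank p95 latency for each group."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span.service, span.operation)].append(span)

    metrics = {}
    for key, group in groups.items():
        durations = sorted(s.duration_ms for s in group)
        # Nearest-rank percentile: the ceil(0.95 * n)-th smallest value.
        p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
        metrics[key] = {
            "calls_per_second": len(group) / window_seconds,
            "error_rate": sum(s.error for s in group) / len(group),
            "p95_ms": durations[p95_index],
        }
    return metrics


def slowest(metrics):
    """Return the (service, operation) pair with the highest p95 latency."""
    return max(metrics, key=lambda key: metrics[key]["p95_ms"])


if __name__ == "__main__":
    spans = [
        Span("checkout", "pay", 100, False),
        Span("checkout", "pay", 200, False),
        Span("checkout", "pay", 3000, True),
        Span("checkout", "pay", 150, False),
        Span("cart", "add", 50, False),
        Span("cart", "add", 60, False),
    ]
    summary = aggregate(spans, window_seconds=60)
    print(slowest(summary), summary[slowest(summary)])
```

Ranking operations by p95 latency points straight at the operation breaching the 2-second alert, which is the shortcut the aggregated view offers over reading traces one by one.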