Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Librato on Heroku is Going Away and Hosted Graphite Is the Better Next Step

Librato (a SolarWinds product) is being sunset in summer 2025, which directly affects Heroku teams that have relied on the Librato add-on for “good enough” visibility into dynos, routers, and Postgres. If you’re in that group, you’ll need a replacement monitoring add-on that keeps you covered on Heroku and lets you grow beyond it without re-architecting how you ship metrics.

The strategic art of build vs. buy in software delivery ft. Tara Hernandez of MongoDB

Rob Zuber sits down with Tara Hernandez, VP of Developer Productivity at MongoDB and former Netscape engineer who helped create early continuous integration systems, to explore strategic frameworks for build vs. buy decisions in modern software delivery.

Jaeger Monitoring: Essential Metrics and Alerting for Production Tracing Systems

Your Jaeger setup is running. Traces are coming in, and the UI is helping you spot slow services or debug broken flows. But just like any part of your observability stack, Jaeger needs some basic monitoring to stay reliable. If the collector starts queueing spans or the agent runs out of buffer, it can lead to dropped traces, sometimes without any obvious sign in the UI. This blog focuses on the operational side of Jaeger.
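
As a rough illustration of what watching the collector can look like, the sketch below parses metrics in the Prometheus text format and flags dropped spans or a growing queue. The metric names, the `max_queue` threshold, and the sample values are assumptions made for this example, not details from the post; check the names your Jaeger version actually exposes.

```python
# Assumed metric names; verify against your own collector's metrics endpoint.
DROPPED = "jaeger_collector_spans_dropped_total"
QUEUE = "jaeger_collector_queue_length"

# Made-up sample in Prometheus text exposition format.
SAMPLE = """\
# TYPE jaeger_collector_spans_dropped_total counter
jaeger_collector_spans_dropped_total{host="c1"} 12
jaeger_collector_queue_length{host="c1"} 1800
"""


def parse_metrics(text):
    """Sum sample values per metric name, ignoring labels and comments.

    Simplification: lines carrying a trailing timestamp are not handled.
    """
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def check_collector(totals, max_queue=1000):
    """Return a list of warnings; max_queue is a hypothetical threshold."""
    alerts = []
    if totals.get(DROPPED, 0) > 0:
        alerts.append("spans dropped")
    if totals.get(QUEUE, 0) > max_queue:
        alerts.append("queue backlog")
    return alerts


if __name__ == "__main__":
    print(check_collector(parse_metrics(SAMPLE)))
```

In practice an alert would more likely fire on the rate of change of the dropped-span counter than on its absolute value; the check here is kept deliberately simple.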
Sponsored Post

When AI Becomes the Judge: Understanding "LLM-as-a-Judge"

Imagine building a chatbot or code generator that not only writes answers but also grades them. In the past, ensuring AI quality meant recruiting human reviewers or using simple metrics (BLEU, ROUGE) that miss nuance. Today, we can use generative AI itself to evaluate its own work. LLM-as-a-Judge means using one Large Language Model (LLM), such as GPT-4.1 or Claude 4 Sonnet/Opus, to assess the outputs of another. Instead of relying on a human grader, we prompt an LLM with questions like "Is this answer correct?" or "Is it on-topic?" and have it return a score or label. This approach is automated, fast, and surprisingly effective.
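
A minimal sketch of the judging loop described above, under stated assumptions: `call_llm` is a hypothetical callable standing in for whatever model client you use, and the prompt wording, the 1 to 5 scale, and the `SCORE:` reply convention are choices made for this example rather than anything the post prescribes.

```python
import re

# Illustrative prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is this answer correct and on-topic? "
    "Reply with a score from 1 to 5 as 'SCORE: <n>'."
)


def judge(question, answer, call_llm):
    """Ask a judge model to grade an answer; return an int 1-5 or None.

    call_llm is a hypothetical function: prompt string in, reply string out.
    """
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    reply = call_llm(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "The answer matches the question. SCORE: 4"


if __name__ == "__main__":
    print(judge("What is 2 + 2?", "4", fake_llm))
```

Returning `None` when the reply cannot be parsed keeps malformed judge output visible instead of silently counting it as a low score.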

Autoscaling Made Easy with Rancher Cluster API

Kubernetes has revolutionized application deployment and management. However, manually adjusting cluster sizes to meet fluctuating workloads, without constantly under- or over-provisioning resources, quickly drains platform teams’ time and energy. While traditional cloud provider autoscaling tools are functional, they often fall short when it comes to truly dynamic, Kubernetes-aware scaling, especially in a world with diverse infrastructure.

Is on-prem the top choice to run AI?

In this episode, we break down what we’ve learned from teams running AI at scale, and why on-premises infrastructure is making a strong comeback. We’re seeing a shift: performance, cost control, data sovereignty, and platform flexibility are driving conversations about on-prem strategies for AI. No one-size-fits-all answers, but if you’re building or scaling AI, this might help you think a few steps ahead.

Are you running AI the smart way?

Data locality: AI models often rely on large datasets. Locating compute close to the data reduces transfer times and improves training performance.

Latency sensitivity: Real-time AI applications, like recommendation systems or edge analytics, depend on low-latency environments. This can be more easily tuned in private or hybrid setups.

Hardware specialization: Some AI workloads benefit from custom hardware like GPUs or TPUs. Private cloud allows more control over this, while public cloud offers broader access but less customization.