Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

How Dartmouth avoided vendor lock-in and implemented LBaaS with HAProxy One

History is everywhere at Dartmouth College, and while the campus is steeped in tradition, its IT infrastructure can’t afford to get stuck in the past. In an institution where world-class research and undergraduate studies intersect, technology must be fast, invisible, and – above all – reliable. That reliability was put to the test when Dartmouth’s load balancing vendor was acquired twice in five years, as Avi Networks moved to VMware and VMware moved to Broadcom.

Custom Dashboard Creation: Step-by-Step Tutorial

Creating a custom dashboard is the best way to monitor metrics that matter most to your systems. Tools like MetricFire make this process straightforward by combining hosted Grafana and Graphite, eliminating the need for self-hosted solutions. Here's how you can build dashboards tailored to your needs.

The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases the operational cost, and reduces the features a team can afford to offer.

Top 9 Observability Tools for AI-Assisted Development & Deployment

AI-assisted development is rapidly becoming the default way software is built. Code generation, AI copilots, agentic pull requests, and automated refactoring are now embedded directly into engineering workflows. While this shift dramatically increases delivery speed, it also introduces a new operational reality: production systems are changing faster than humans can fully reason about them. This is where observability becomes mission-critical.

What AI Has Never Seen: The Context Gap in Code Generation

Your AI coding assistant has read the entire internet. It knows every programming language, every framework, every best practice documented in Stack Overflow answers and GitHub repositories. It can generate a REST API handler in seconds that looks perfect with clean code, proper error handling, following all the patterns. But here’s what it’s never seen: your production traffic. Data from a real API request. Someone filling out a form with messed up or incomplete data.

#052 - The "Short Long Path": Mastering Abstraction, Culture, and Kubernetes Scale with Shemer M...

In this episode, Itiel joins forces with Shemer, Director of Platform Solutions at the gaming giant Playtika, and Scott Rosenberg, Lead Architect at TeraSky, to discuss the realities of platform engineering at a massive scale. The trio dissects Playtika’s multi-year journey from a legacy, homegrown Kubespray infrastructure to a modern, holistic platform built on Spectro Cloud, all while running strictly on-premise to support 25+ games and high-volume traffic.

Your Test Data Environment: Build vs Buy - a conversation we need to have

After three decades of working with databases, one thing I’ve seen over and over is this: we don’t treat our development and test environments with the same respect we do our production systems. Not because people don’t care. Far from it. It’s usually because teams are under pressure, everyone’s juggling multiple priorities, and the quickest path forward often wins the day.