The latest news and information on DevOps, CI/CD, automation, and related technologies.

Fix flaky tests in your sleep with Chunk by CircleCI

A test fails. You rerun it and it passes. You shrug and move on. This is how most teams deal with flaky tests. The “rerun until green” approach works in the moment, and rerunning from failed tests is a useful way to confirm whether a failure is real. But reruns don’t fix the underlying issue. Over time, they burn CI resources and can hide real instability in your code. On the other hand, fixing flaky tests can mean hours of work.
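The rerun check described above can be sketched in a few lines. This is a minimal illustration, not how Chunk or CircleCI works: `classify_failure` and its `reruns` parameter are hypothetical names, and the test is modeled as a plain callable that returns `True` on a pass.

```python
from typing import Callable


def classify_failure(test: Callable[[], bool], reruns: int = 3) -> str:
    """Rerun an already-failed test and classify the original failure.

    Hypothetical helper: `test` returns True on pass, False on fail.
    """
    for _ in range(reruns):
        if test():
            # The test failed once and then passed: inconsistent, so flaky.
            return "flaky"
    # Every rerun failed as well, so the failure looks real.
    return "real"
```

A "flaky" result only confirms that the test is unstable; it says nothing about the cause, which is why rerunning alone does not fix the underlying issue.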

What Is Business Continuity?

A single outage can stop operations, affect customers, and damage trust. In a world of pandemics, cyberattacks, weather events, and supply chain delays, your team cannot simply hope that nothing breaks. Business continuity helps your team stay ready, recover faster, and keep downtime low. In this blog, we’ll explain what business continuity means, how to create a solid business continuity plan, and which approaches help teams stay operational during a disruption.

Simple Talk Podcast - Coffee Chat with Lee Brownhill

Steve sits down with Lee Brownhill, who by day helps clients optimize their SQL workloads in Azure and AWS at Cloud Rede, but is also a Redgate Ambassador, blogger and aspiring speaker. Lee talks about his interest in giving back to the SQL Server community through writing and speaking, having taken inspiration from others online and in-person at events, and naturally the conversation also touches upon AI, the cloud, and more.

What Is Incident Response Lifecycle?

The Incident Response Lifecycle is a step-by-step process that helps engineering teams detect, respond to, and recover from unexpected system disruptions or outages. It includes a series of six practical stages: Detection, Analysis, Impact Mitigation, Incident Resolution, Service Restoration, and Post-Incident Analysis. By following this lifecycle, teams can minimize downtime, reduce business impact, and continuously strengthen system reliability.
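As a rough sketch, the six stages can be modeled as an ordered enumeration. The names `Stage` and `next_stage` are illustrative assumptions, and the strictly linear progression is a simplification; real incidents may loop back to an earlier stage.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The six lifecycle stages, numbered in the order listed above."""
    DETECTION = 1
    ANALYSIS = 2
    IMPACT_MITIGATION = 3
    INCIDENT_RESOLUTION = 4
    SERVICE_RESTORATION = 5
    POST_INCIDENT_ANALYSIS = 6


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is Stage.POST_INCIDENT_ANALYSIS:
        return None
    return Stage(stage.value + 1)
```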

Why your Kubernetes clusters and GPUs should live under one roof

The world remains abuzz with AI hype, but the reality is that most modern applications aren’t purely AI workloads. The average company will have web services, APIs, databases, and background jobs running alongside its machine learning inference or training components. An architecture question everyone faces: should your Kubernetes cluster and GPU compute live in the same data center, or can you split them across providers?

How to manage ilert call flows via Terraform

Call flows let you design voice workflows with nodes like “Audio message,” “Support hours,” “Voicemail,” “Route call,” and many more. The ilert Terraform provider now includes an ilert_call_flow resource so you can version and promote these flows across environments. This blog post offers an overview of managing call flows in Terraform, detailing the benefits and key scenarios.

A quick recap of IDPCON 2025

Two weeks ago, we hosted IDPCON 2025, and the response has been overwhelming. Over 250 engineering leaders from 20+ countries joined us for 12 sessions featuring speakers from Canva, Skyscanner, Blackstone, and more. Attendees participated in discussions at 20+ roundtables, sharing strategies and challenges around engineering excellence and internal developer portals.

Cultural ROI In FinOps: People Drive Pivots

When I ask clients to picture cloud cost optimization, they think dashboards, policies, maybe a clever right-sizing purchase. What they don’t picture? Meetings. Misunderstandings. Mistrust. To avoid FinOps failures, we need a new starting line: one that gets to the root of spend misalignment.

From Code To Clicks: A Visual Way To Build Dimensions In CloudZero

In early October, we launched Dimension Studio, a new visual editor for engineers and others that brings point-and-click simplicity to the same powerful, precise allocation engine CloudZero is known for. Before that, CloudZero users built cloud cost allocations with our YAML-based CostFormation engine, a code-driven way to describe how cloud and AI costs roll up to products, customers, or teams.

Announcing HAProxy ALOHA 17.5

HAProxy ALOHA 17.5 is now available. This release delivers powerful new capabilities that improve security and performance while future-proofing HAProxy ALOHA to enable richer features and advanced functionality. With this release, we’re introducing HTTPS health checks to Global Server Load Balancing (GSLB), new partitioning for larger firmware updates, enhanced web application firewall (WAF) functionality, and our new Threat Detection Engine (TDE).