Cloud freedom with AI built in

Most cloud providers give you the hardware and leave you to figure out the rest. Civo AI is different. Chief Innovation Officer Josh Mesout explains how Civo thinks strategically about AI adoption, guiding organisations through the full lifecycle from planning and infrastructure through to running and scaling workloads, powered by best-in-class NVIDIA GPUs.

What is the sovereignty tax, and is your organization paying it?

Most organizations know cloud costs are rising. Fewer realize that some of what they're paying isn't for infrastructure at all; it's a penalty for not being in control of it. That penalty has a name: Sovereignty Tax. It isn't a line item on your invoice. It won't appear in your cloud dashboard. But it's accumulating quietly, in egress fees, outage exposure, audit blind spots, and the creeping realization that leaving your current provider would be harder, and more expensive, than you ever anticipated.

Building vs. Buying your platform: The honest framework nobody discusses

Most organizations get the build versus buy decision wrong in the same way. They underestimate the cost of building while overestimating the cost of buying. In the recent Konstruct monthly webinar with M R Rishi (Platform Engineer at Civo), we explored whether you should build or buy your platform. The full discussion is available as a recording.

How AI is changing platform engineering

AI is changing software development fast. But what does that actually mean for platform engineering teams? In this conversation, Civo's John Dietz and M R Rishi dig into what they're seeing on the ground: the 10x effect of AI on app count, what it means for platform team workloads, the debugging skills that are quietly being lost, and whether Kubernetes itself might eventually become just another abstraction.

How Kubernetes Operators May Conflict With Resource Optimization (And How to Avoid It)

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes application. It extends the native Kubernetes API by combining custom resources (CRDs) with a dedicated controller: a custom control loop that continuously watches the state of those resources. The primary purpose of an operator is to automate complex, stateful applications (like databases, message queues, or monitoring suites) that require human operational knowledge to maintain.
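
The control-loop idea can be sketched in a few lines. This is a toy, in-memory illustration rather than the Kubernetes API or any operator SDK: `reconcile`, `desired`, and `observed` are hypothetical names, and a real controller would watch the API server instead of comparing dictionaries.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, int], observed: Dict[str, int]) -> List[str]:
    """Compare desired and observed state, fix drift in place, return the actions taken."""
    actions: List[str] = []
    for key, want in desired.items():
        have = observed.get(key)
        if have != want:
            actions.append(f"set {key}: {have} -> {want}")
            observed[key] = want
    return actions


# Hypothetical custom resource spec: the fields this toy operator owns.
desired = {"replicas": 3, "cpu_millicores": 500}
# Assumed scenario: another tool (for example, a rightsizing tool) lowered the CPU request.
observed = {"replicas": 3, "cpu_millicores": 250}

first_pass = reconcile(desired, observed)   # drift found, so the change is reverted
second_pass = reconcile(desired, observed)  # already converged, so nothing to do
print(first_pass)
print(second_pass)
```

Under these assumptions, the first pass restores the CPU request the operator expects and the second pass does nothing. That is one way an operator may end up undoing a change made by a resource optimizer: the loop treats any out-of-band edit to a field it owns as drift.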

How we saved over $3 million in idle compute costs with Datadog Kubernetes Autoscaling

At Datadog, our broad Kubernetes footprint amplifies the significance of a familiar autoscaling tradeoff: Overprovisioning wastes cloud spend, while underprovisioning threatens reliability. We built Datadog Kubernetes Autoscaling (DKA) to help teams rightsize their workloads by generating intelligent resource recommendations and automating multidimensional workload scaling. Across Datadog, adopting DKA has eliminated more than $3 million in annualized idle compute costs while reducing reliability risks.

The debugging crisis nobody's talking about: AI, abstraction, and the skills gap

Here's a scenario that's playing out in engineering teams across the industry right now. A developer uses AI to rapidly prototype a microservice. The code works. They deploy it to production. Six months later, something breaks. The system is under load, a database connection pool is exhausted, and the service starts failing in subtle ways. The engineer pulls up the code, but here's the problem: they didn't write it. An AI assistant did. They don't understand the flow deeply. They don't know where to look first.

New in Kubex: KAI Scheduler Integration for Shared GPU Inference

Today, we’re launching Kubex support for the KAI Scheduler and automated GPU sharing for inference workloads. As AI inference moves into production, platform teams are being asked to serve more models, support more teams, and control GPU costs at the same time. But many inference workloads do not need an entire GPU all the time. When teams reserve full GPUs or oversized GPU fractions to stay safe, expensive capacity can sit idle across the cluster.

Why we built relaxAI, and where your AI data actually goes

Sandboxing your AI agent is only half the story. The other half is where your data goes when it hits your LLM provider's API. In this clip from our secure execution agents webinar, Ben Norris, founding engineer at relaxAI, explains why the sovereignty of your AI provider matters just as much as the security of your agent's environment and why relaxAI was built on a sovereignty-first principle, with inference running exclusively in the UK and no foreign data transfer.

What nobody tells you about platform engineering at scale

Platform engineering has become one of the most discussed topics in cloud native infrastructure. Yet despite the rising focus, most conversations around platform engineering skip over the uncomfortable truths. What actually works at scale? When should you build versus buy? And how do you avoid the traps that trip up even experienced teams?

How to build a hybrid private cloud strategy that scales with your business

Most hybrid cloud strategies fail not at launch but at scale. The architecture works fine for the first year. The team's workloads are modest, the integration points are limited, and the operational overhead is manageable. Then the business grows. Workloads multiply, data volumes climb, the team expands, and the seams between public cloud and private infrastructure start showing.

How to build sustainable AI infrastructure on GPU cloud

AI's environmental cost is real, and it's growing. Training a large language model can consume the electricity of hundreds of households for weeks. Inference at production scale runs continuously, with GPU clusters drawing power around the clock. The data centers that house all of this are some of the most concentrated energy consumers in the modern technology stack.

Platform engineering unplugged: What nobody tells you about platform engineering at scale

Most platform engineering stories are told in hindsight, with the rough edges smoothed out. On June 17th, we are doing it differently. Join us for Platform Engineering Unplugged, a frank conversation with a practitioner who has navigated the real challenges of building and scaling platform engineering. What worked, what didn't, and what they would do differently. If you lead engineering teams and are thinking seriously about platform engineering, this is the session for you.

How to build a secure AI agent sandbox with relaxAI and Claude Code

AI agents are powerful. They're also unpredictable, non-deterministic, and capable of doing things you didn't ask them to do, as the Rome Alibaba and Claude Mythos case studies make very clear. The answer isn't to avoid agentic AI. It's to run it properly. In this demo, Ben Norris, founding engineer at relaxAI, shows how to build a fully sandboxed AI agent environment from scratch: an ephemeral Civo VM provisioned via Terraform and GitHub Actions, locked down with egress policies, an unprivileged Linux user, and hard resource caps, running a Claude Code session pointed at the relaxAI API.

Klaudia Under the Hood: How We Built an AI SRE That Actually Earns Trust

In reliability engineering, being ‘mostly right’ is a liability. An AI SRE that sometimes misses the root cause or gives a confident, wrong answer at 2:17 AM has no place in an enterprise cloud environment. In this context, silence is better than noise. That’s the bar Klaudia is built to clear: genuine reliability that you can trust in production. The kind of reliability that earns a place alongside your best engineers. Getting there requires more than just a capable model.

Lock-in is not theoretical: What UK organizations told us about cloud exit barriers

For years, vendor lock-in has been discussed as a theoretical risk. A concern to acknowledge in architecture reviews. A box to tick in compliance frameworks. A future problem that might need addressing. Our latest research reveals something more urgent. For UK organizations, lock-in isn't theoretical anymore. It's structural. It's measurable. And it's preventing organizations from acting on their own strategic priorities.

Why We Built Lynx: Bringing Control to the Age of AI Agents

For a decade, one idea has guided everything we’ve built at Tigera: How do you secure a dynamic system with a lot of moving parts that is changing rapidly, with a programmatic approach? Calico has applied that idea for Global 2000 companies running the largest Kubernetes platforms in the world, securing tens of millions of mission-critical transactions every day. Today I’m excited to announce the next chapter of that work: Lynx, a unified control plane for Kubernetes-native AI agents.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.

The cloud bill explained: A guide for finance and engineering

The cloud bill arrives at the end of every month, and somewhere in it sits a line item that nobody outside the infrastructure team really understands. It might be called "data transfer," "egress," or "outbound bandwidth," and it might be 5% of the total or even 25%. Whatever it is, it tends to be the line that finance asks engineering about, and engineering struggles to explain in a way that finance can act on. The problem is that egress is a fee that hides in plain sight. It's not on the marketing page.

Why developer teams are rethinking their cloud provider this year

The default cloud choice for technically literate teams has shifted. It hasn't shifted dramatically; the major hyperscalers aren't going anywhere, and their enterprise position is still strong, but the conversation that used to start with "which hyperscaler" now genuinely starts with "what do we actually need." That's new.

How to monitor and optimize GPU utilization in the cloud

GPU utilization is one of the most expensive metrics in cloud infrastructure to get wrong. A GPU running at 30% utilization costs the same as one running at 90%, but it's doing a third of the useful work. For workloads measured in tens of thousands of GPU-hours, the difference between average utilization in the 30s and average utilization in the 70s is hundreds of thousands of dollars across the life of the workload.
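
The arithmetic behind that claim can be sketched as follows. The figures are illustrative assumptions, not numbers from the article: a hypothetical price of $2.50 per GPU-hour, 50,000 GPU-hours of useful work, and a simplified model in which useful work scales linearly with utilization.

```python
def billed_gpu_hours(useful_hours: float, utilization: float) -> float:
    """GPU-hours you pay for to obtain `useful_hours` of work at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return useful_hours / utilization


def workload_cost(useful_hours: float, utilization: float, price_per_hour: float) -> float:
    """Total bill for the workload under the linear-scaling assumption."""
    return billed_gpu_hours(useful_hours, utilization) * price_per_hour


USEFUL_HOURS = 50_000   # assumed size of the workload, in fully used GPU-hours
PRICE_PER_HOUR = 2.50   # assumed on-demand price in dollars, purely illustrative

cost_at_35 = workload_cost(USEFUL_HOURS, 0.35, PRICE_PER_HOUR)
cost_at_75 = workload_cost(USEFUL_HOURS, 0.75, PRICE_PER_HOUR)
savings = cost_at_35 - cost_at_75

print(f"cost at 35% utilization: ${cost_at_35:,.0f}")
print(f"cost at 75% utilization: ${cost_at_75:,.0f}")
print(f"difference:              ${savings:,.0f}")
```

With these assumed inputs the gap comes to roughly $190,000, which is the order of magnitude the paragraph describes. Real workloads rarely scale this cleanly, so treat the model as a first estimate rather than a forecast.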

A New Console for Qovery

We rebuilt large parts of the Qovery Console: new navigation, overviews at every level, dark mode, and a modernized frontend architecture with TanStack Router and React Suspense. Rémi is a staff frontend engineer at Qovery. He writes about frontend architecture, developer experience, and building scalable UI systems for platform engineering tools. Théo is a senior product designer at Qovery.

What is Cloud Security - Explained in 5 minutes

Cloud security isn't just about locking things down; it's about staying ahead of threats in fast-moving, dynamic environments. In this video, Kat breaks down what cloud security actually means in 2024 and why traditional approaches don't cut it anymore. Whether you're securing containers, Kubernetes workloads, or multi-cloud infrastructure, this is your foundation. Subscribe for more cloud security explainers, tutorials, and best practices from Sysdig.

The next era of telco clouds: get open infrastructure choice with Sylva and Canonical Kubernetes

The telco industry is undergoing a fundamental change. Over the past few years, the increasing maturity of cloud-native infrastructure has accelerated the movement from manually operated and hardware-centric systems to automated, software-defined platforms. Underpinning this change are open source initiatives such as the Sylva project. Sylva is hosted by Linux Foundation Europe and heavily backed by major telecom operators and vendors.

#060 - Beyond ELK: Elastic's 10-Year Evolution, Open-Source Licensing, and the AI Frontier with P...

In this episode of the Kubernetes for Humans podcast, Philipp shares his incredible 10-year journey at Elastic, witnessing the company's massive growth from 300 to 4,000 employees. Discover the fascinating origin story of how Elastic evolved from a simple recipe search project into a global powerhouse for observability, security, and vector databases.

How to run self-hosted AI on your own infrastructure with Konstruct

Civo Platform Engineer M R Rishi demonstrates how to go from zero to self-hosted AI in minutes using Konstruct. While most teams are stuck managing thousands of configuration values across multiple models and tools, Rishi shows how Konstruct eliminates that complexity with GPU cluster provisioning, GitOps catalog deployments, and production-ready infrastructure on day zero.

3 Platform Engineering Shifts From Devoxx France 2026

Three days, 20 talks at Devoxx France 2026. The through-line wasn't AI hype - it was discipline. Context engineering, code review under AI volume, and the local-vs-remote question now shaping security, cost, and sovereignty. Fabien is a senior software engineer at Qovery. He writes about platform engineering, AI tooling, context engineering, and the practical realities of running modern developer infrastructure.

Secret Manager Integration: One Source of Truth for Humans and Agents.

Production secrets should live in one place and stay there, whether your next deployment is triggered by a developer or an AI agent. The Secret Manager integration connects AWS Secrets Manager, AWS SSM, or GCP Secret Manager to Qovery so secrets are referenced, never copied, and enterprise governance holds regardless of who deploys. Alessandro leads product at Qovery. He drives the changelog, roadmap, and product strategy - turning customer feedback into platform capabilities.

The Two-Sided Scheduling Problem: Reaching the Next Layer of Cloud Savings

You’ve deployed Karpenter or Cluster Autoscaler and tightened your resource requests, but while you saw an initial dip in your cloud bill, your savings have flatlined. Organizations that thought they had the fundamentals of cloud cost under control are now seeing stagnation. The problem isn’t that they need another FinOps tool or better visibility. The problem is that the current state of enterprise cloud cost optimization strategy is fundamentally reactive.

The Inference Paradox: How Split-Brain LLMs Are Killing Your GPU ROI

During the Toronto KCD (Kubernetes Community Days), I attended an insightful talk on AI resource optimization that highlighted a staggering Gartner study: “AI infrastructure is adding $401 billion in new spending this year alone. Yet, real-world audits tell a much darker story, revealing that average GPU utilization in the enterprise is stuck at a dismal 5%”. While many people in the audience were shocked by that number, the data didn’t come as a surprise to us.

A field guide to the agents in your cluster

You know every service in your cluster by name. You know which team owns each one, what it talks to, how it scales, where its logs go. The agents are a different story. That’s not a criticism; it’s an observation, and it’s one we keep running into. Every company we talk to is shipping agents of some kind, at scales ranging from tens to thousands. Customer service bots that field tier-one tickets. Internal copilots that draft emails and summarise meetings and write the boring half of every PR.

Five Principles of an Accountable AI Agent Network: How to Evaluate Any Governance Platform

The first post in this series argued that AI agent governance hasn’t kept pace with deployment. The second laid out the five pillars of accountability, and what is required. The third walked through why network policies, API gateways, MCP/A2A protocols, DIY security patterns, and Role-based Access Control (RBAC) each leave critical accountability gaps. So what does good look like? The five pillars define what AI agent accountability requires.

Kubeflow MLOps tutorial: from notebook development to production inference

In this video, our engineering team takes you through a full end-to-end Kubeflow implementation, step by step, from data exploration to production inference. Follow the journey of a house price prediction use case and see how modern MLOps components work together: Kubeflow architectures and starter repositories; notebook-based development workflows; data exploration and model development; MLflow for experiment tracking; Katib for hyperparameter optimization; Kubeflow Pipelines for automated preprocessing and training; and KServe for scalable model inference.

Coding Agents Write the Code. Who Verifies It Works? We Built the Answer.

Coding agents are good at reading a spec and producing code. But producing code is one step in a longer process. The real loop is Spec -> Code -> Deploy -> Test -> Verify -> Ship. Agents stop at step two. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

Who really controls your data?

Digital sovereignty has moved from buzzword to boardroom priority. But most organisations are still asking the wrong question. Civo CEO Mark Boost cuts through the noise. Digital sovereignty isn't about marketing; it's about jurisdiction, accountability, and operational certainty. And it starts with where your data is hosted and how it's processed. Civo's UK Sovereign Cloud delivers public cloud, private cloud, and AI services, hosted and operated exclusively within the United Kingdom, under UK legal authority, with no exposure to foreign control.

From Visibility to Real Savings: Turning FinOps Insights into Measurable Cost Reduction

FinOps programs are maturing, and most organizations have better visibility into cloud spend than ever before. Dashboards are full of data. And yet costs keep climbing. The problem isn’t the data. It’s the gap between knowing where the waste is and actually eliminating it. In this joint session, Tangoe and Kubex come together to bridge that gap. Tangoe brings deep expertise in spend management and FinOps discipline, while Kubex delivers infrastructure-level optimization across cloud, Kubernetes, and the AI and GPU workloads that are rapidly becoming the next frontier of cost pressure.

Beneath the Stack: A Software Engineer's Journey into Infrastructure

A software engineer's hands-on journey building a private cloud on bare-metal: Incus clustering, K3s, OVN networking, the Gateway API, and everything that breaks along the way — and what it taught them about why platforms like Qovery exist. Antoine is a senior software engineer at Qovery. He writes about hands-on infrastructure engineering, Kubernetes internals, and the realities of running production systems.

Sovereign GPU cloud: Data residency across training, inference, and model weights

Sovereign cloud conversations usually center on where customer data sits at rest. The provider points at a UK data center, the contract gets signed, and procurement marks the box. For most workloads, that's a defensible position. For GPU workloads, it isn't.

GPU cloud for AI inference in production: How infrastructure requirements change after training

Training a model is a project with an end date. Inference is what happens for the rest of the model's working life. The two workloads share GPUs, frameworks, and a lot of vocabulary, but the infrastructure decisions that make sense during training are usually the wrong ones in production. Teams that treat inference as "training, but smaller" tend to discover the gap somewhere around their first traffic spike.

5 questions you should be asking about cloud dependency

Cloud infrastructure has become the backbone of modern business operations. But as organizations deepen their reliance on cloud providers, a critical question often goes unasked: just how dependent are we, and at what cost? For years, the cloud adoption narrative focused on agility, scalability, and cost efficiency. Those benefits remain real. But the landscape is shifting.

[Webinar] Building Regulated Infrastructure: How Lucis Standardized Security for Global Care

In Healthtech, downtime is more than a loss of revenue; it is a disruption to patient care. Whether supporting digital health platforms or AI-driven healthcare applications, infrastructure must remain secure, compliant, and highly available. Join Lucis and Qovery for a technical breakdown of building compliant and secure infrastructure that scales AI and healthcare workloads, handles traffic peaks, and maintains SOC 2, HDS, and HIPAA standards.

4 Best Chainguard Alternatives for Zero-CVE Images in 2026

Chainguard helped make zero-CVE and near-zero-CVE container images a mainstream topic in cloud-native security. For many engineering and security teams, the core appeal is clear: fewer vulnerabilities in base images, smaller attack surfaces, stronger software provenance, and less time wasted chasing noisy vulnerability reports.

AI inference vs. training: What they are and how they differ

AI inference and training are terms you'd run into if you have been around software engineering or even just scrolled through the news. Both are integral to delivering the AI-powered experiences we have come to expect from many of the applications we use daily. According to McKinsey, by 2030 inference will overtake training as the dominant workload in AI data centers, making up more than half of all AI compute and roughly 30-40% of total data center demand.

10 Enterprise AI Infrastructure Voices Worth Following

Enterprise AI has crossed an inflection point. The model problem is largely covered. What remains unsolved is the operational impact: how to run AI inference and agentic processes continuously, reliably, and at a cost that doesn’t cancel out the value. Most enterprises are discovering this the hard way. GPU utilization dashboards show 80%. Actual compute efficiency is half that. Token demand is compounding at 200-500% annually as agents multiply every action into dozens of model calls.

21 AI concepts every beginner should know before their first interview

If you’re prepping for your first AI or MLOps interview, the hardest part isn’t always the hands-on element. For me, it’s the vocabulary. Interviewers sometimes lob single-word concepts at you (“what’s quantization?”) and watch how far you can carry the thread. The questions sound clear-cut, but each one is really a doorway into a bigger topic, and the interviewer is judging how cleanly you walk through it.

Blackwell sold out in weeks. Here's what Rubin demand will look like.

"Blackwell sales are off the charts, and cloud GPUs are sold out. Compute demand keeps accelerating and compounding across training and inference, each growing exponentially. We've entered the virtuous cycle of AI." (Jensen Huang, CEO, NVIDIA)

When NVIDIA's CEO makes that statement in a quarterly earnings release, it is not marketing language.

How to deploy Canonical Managed Kubeflow on Microsoft Azure?

Learn how to deploy Canonical Managed Kubeflow on Microsoft Azure step by step. Canonical's Managed Kubeflow on Azure gives enterprise and startup AI teams a fully operational, open source MLOps platform in under an hour. It is managed 24/7 by Canonical's engineers. This means you can focus entirely on building models rather than running infrastructure.

What's new in Calico: Spring 2026 Release

Kubernetes has come a long way since its debut in 2014. It’s gone from running a couple of containerized microservices to orchestrating fleets of production workloads spanning everything from AI agents to full scale VMs running in pods. As Kubernetes adoption grows, and its use cases stretch to cover more ground, managing its increasingly complex networking and security landscape demands operational maturity and a platform that supports it.

Introducing Cycle's European Control Plane: Strict data sovereignty, lower latencies, and more

We're thrilled to announce that Cycle's European Control Plane is now live! While a few organizations have been utilizing it over the past month, we're eager to officially open access to all teams. Before diving deeper into the "why," let's clarify what a Cycle Control Plane actually is. If you visit our status page, you'll see a list of the core services powering Cycle. These services include everything from our APIs to our 'factory' build systems.