Understanding GPU cloud instance types: How to read a spec sheet for real-world ML performance

A GPU spec sheet is a confidence trick. It looks like an objective document - numbers, units, comparable rows - but most of the numbers on it don't map cleanly to the performance a real workload will see. Teams that pick GPUs by reading the headline figures usually find out the gap between spec and reality somewhere around the first production run. This is a working guide to reading GPU cloud instance specifications against actual ML workloads. The goal isn't to recommend a card.

The Lovable Experience. Enterprise Governance. Your Infrastructure. We Built It.

Introducing the AI Builder Portal - the governed alternative to Lovable and Bolt.new for enterprise. Same one-click builder experience, running on your Kubernetes cluster, under your governance. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

Keep ArgoCD. Get Qovery. ArgoCD Integration Is Here.

Moving to a new platform shouldn't mean weeks of migration work before you see any value. Qovery now lets you connect your ArgoCD server and manage your existing applications directly alongside your Terraform modules, lifecycle jobs, and Qovery-native services, from a single control plane. Alessandro leads product at Qovery. He drives the changelog, roadmap, and product strategy - turning customer feedback into platform capabilities.

Monitor LLM routing with the Kubernetes Inference Extension

If you serve LLMs on Kubernetes without inference-aware routing, your load balancer is likely wasting inference capacity. Generic HTTP traffic management routes requests blindly, assuming the backends in your cluster are interchangeable. But your model-serving backends are stateful and unevenly prepared to handle any given request. As a result, requests often land on a backend that isn’t the one best suited to respond.
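To make the difference concrete, here is a minimal sketch that contrasts blind round-robin with a load-aware choice. It is not the Kubernetes Inference Extension's algorithm: the `Backend` fields (`queue_depth`, `kv_cache_utilization`), the 0.9 saturation threshold, and the pod names are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, Sequence


@dataclass(frozen=True)
class Backend:
    """Hypothetical snapshot of one model-serving pod's state."""
    name: str
    queue_depth: int             # requests already waiting (assumed metric)
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0 (assumed metric)


def round_robin(backends: Sequence[Backend]) -> Iterator[Backend]:
    """Generic routing: rotate through backends and ignore their state."""
    return cycle(backends)


def pick_least_loaded(backends: Sequence[Backend], saturation: float = 0.9) -> Backend:
    """Load-aware routing: avoid saturated caches, then prefer short queues."""
    return min(
        backends,
        key=lambda b: (b.kv_cache_utilization >= saturation, b.queue_depth, b.kv_cache_utilization),
    )


backends = [
    Backend("pod-a", queue_depth=8, kv_cache_utilization=0.95),
    Backend("pod-b", queue_depth=1, kv_cache_utilization=0.30),
    Backend("pod-c", queue_depth=3, kv_cache_utilization=0.60),
]

print("round-robin picks:", next(round_robin(backends)).name)
print("load-aware picks: ", pick_least_loaded(backends).name)
```

With these invented numbers, round-robin starts with the busiest pod while the load-aware picker chooses the idle one. A real router would read such signals from the serving runtime rather than from a hard-coded list.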

Konstruct product updates: Global resources, MCP support, and smarter permissions

May has been one of our busiest months yet for Konstruct. Across three releases, 0.5, 0.5.1, and 0.5.2, we've shipped some of the most requested platform-level changes since we launched: a unified model for sharing resources across organizations, native support for AI-driven workflows via MCP, a completely redesigned API keys experience, and a cleanup of how permissions actually work in multi-org environments. Let's walk through what shipped and why it matters.

Deploy Datadog Kubernetes Autoscaling at scale

Every Kubernetes environment accumulates waste over time. Teams overprovision CPU and memory requests to avoid performance risk, run idle replicas to preserve headroom, and leave Horizontal Pod Autoscalers (HPAs) untouched long after workload behavior has changed. Some of this waste can be addressed at the node level, where Datadog Cluster Autoscaling helps teams rightsize capacity.

Secure execution: Agents in sandboxes with relaxAI

The hard part of deploying AI agents isn't the agent. It's the environment around it. As organisations move from AI experimentation into production, the question isn't just what agents can do; it's whether you can trust the environment they run in. Sandboxed execution gives you both the autonomy and the guardrails, keeping agents isolated, auditable, and under your control.

Digital sovereignty: Who's in control?

Digital sovereignty isn't a marketing buzzword. It's about jurisdiction, accountability, and operational certainty, and it starts with where your data is hosted and how it's processed. Civo's UK sovereign cloud delivers public cloud, private cloud, and AI services, all hosted and operated exclusively within the United Kingdom under UK legal authority, with no exposure to foreign control.

The Kubeshark Workflow That Doesn't Stop at the Dashboard

The Observability Gap shows up the moment you try to reproduce a production bug locally. Your traces tell you a request was slow. Your logs tell you which line printed. Neither tells you what was actually on the wire: the headers, the JSON body, the surprise field your client started sending last Tuesday. Until now, closing that gap meant SSHing to a node, attaching a debugger, or shipping a sidecar through change review.

The inside scoop on alerting changes in Kubernetes Monitoring

Kubernetes Monitoring in Grafana Cloud comes out of the box with preconfigured alert rules that notify you about issues like CPU throttling, crash-looping pods, and nodes going offline. These rules are installed automatically when you set up the app, and they start evaluating immediately. But if you've recently reinstalled the Kubernetes Monitoring app and your alert notifications stopped arriving, or started looking different, you're not alone.

Hybrid Cloud Monitoring Explained: On-Prem + Cloud + Kubernetes in One View

Understand what hybrid cloud monitoring is and why it’s critical for managing modern distributed IT environments. Hybrid cloud monitoring helps organizations unify visibility across on-prem infrastructure, public cloud platforms, virtual machines, containers, and Kubernetes clusters in a single monitoring platform. In this video, learn how fragmented monitoring tools create operational blind spots and slow down incident response across hybrid environments.

The AI Agent Accountability Gap: Why Network Policies, API Gateways, And RBAC Are Not Enough

In The Five Pillars of AI Agent Accountability: A Diagnostic Framework for Engineering Leaders, we walked through each pillar of AI agent accountability (traceability, authorization provenance, identity and ownership, policy at scale, and human oversight) and argued that most enterprises today sit at Level 0 or Level 1 of the Accountability Maturity Model. The most common reaction we get when we share that framework is some version of: “We’re already covered. We have network policies.

The Case for VM and Container Consolidation in 2026

Two platforms, two teams, two procurement relationships, all doing one job. There’s a reason it ended up this way. There isn’t a reason it has to stay this way. Ask anyone at a typical enterprise why the VM platform and the container platform are separate, and they’ll give you a sensible answer. The VM estate has been there for fifteen years. It runs the workloads the business depends on.

Kubernetes Optimization Beyond Requests and Limits - Node Scaling Blockers

Many of us understand the concept of Kubernetes Requests and Limits, and that by reducing oversized resource requests we can reduce waste in our clusters. For GKE Autopilot and EKS Fargate clusters, that is true: because you’re billed directly for the resources you request, driving down requests can result in immediate savings. However, in most hosted Kubernetes environments you’re not actually being billed for requests.
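A rough arithmetic sketch of that distinction follows. It assumes a node-billed cluster made of identical nodes, where the bill tracks node count rather than requests. The node size (8 allocatable CPUs), the price ($200 per node per month), and the request totals are invented for illustration, and the model assumes the autoscaler is actually able to remove an emptied node.

```python
import math


def nodes_needed(total_cpu_requests: float, allocatable_cpu_per_node: float) -> int:
    """Fewest identical nodes whose allocatable CPU covers the summed requests."""
    return math.ceil(total_cpu_requests / allocatable_cpu_per_node)


def monthly_cost(total_cpu_requests: float,
                 allocatable_cpu_per_node: float,
                 node_price_per_month: float) -> float:
    """Node-billed cost: you pay per node, not per requested CPU."""
    return nodes_needed(total_cpu_requests, allocatable_cpu_per_node) * node_price_per_month


CPU_PER_NODE = 8.0   # hypothetical allocatable CPUs per node
NODE_PRICE = 200.0   # hypothetical monthly price per node

for requests in (50.0, 49.0, 42.0):
    nodes = nodes_needed(requests, CPU_PER_NODE)
    cost = monthly_cost(requests, CPU_PER_NODE, NODE_PRICE)
    print(f"{requests:>5.1f} CPU requested -> {nodes} nodes -> ${cost:,.0f}/month")
```

Under these assumptions, trimming requests from 50 to 49 CPUs leaves the node count at 7 and the bill unchanged. Only the drop to 42 CPUs frees a whole node, and even then the saving appears only if nothing blocks that node from scaling down.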

Your Company Has 10x More Developers Than You Think

The low-code promise failed for 15 years. AI builders delivered in 15 months. Here's what actually changed, why the engineer in me resisted it, and what it means for every CTO. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

Don't Ban the Builders - Govern Them

AI tools turned everyone into a builder. Your sales team, your finance team, your CEO - they're all shipping apps now. The answer isn't to ban them. It's to give them a governed platform they actually want to use. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

The Five Pillars of AI Agent Accountability: A Diagnostic Framework for Engineering Leaders

You’re in a board meeting. The CISO is presenting on AI risk. The CFO asks a simple question: “When that finance agent we deployed last quarter accessed a customer payment record, can we tell who authorized it, what policy permitted it, and produce the full audit trail?” The CISO looks at the head of the platform. The head of the platform looks at security. Nobody answers. If you can picture that meeting happening at your company, you’re not alone.

Autonomous K8s Optimization Involves Both Compute and Storage Resources - Are You Doing Both?

One of the most powerful capabilities in K8s is the ability to autoscale resources to meet demand, scaling up during peak periods to ensure performance and down again during quieter periods to save money. In this joint session, Lucidity and Kubex walk through what end-to-end K8s optimization looks like when you address both layers together. Expect real examples, not slides full of theory. You’ll leave with a clear picture of where waste is hiding in your environment and a prioritized approach to addressing it.

Ubuntu Core 26 fleet observability

What is Ubuntu Core? Ubuntu Core is a minimal and strictly confined variant of Ubuntu powering devices around the world. Ubuntu Core 26 now integrates with the Canonical Observability Stack, streaming device logs and metrics to centralized Grafana, Loki, and Prometheus infrastructure, deployable in the cloud or on-premise, without burdening the device's primary workloads.

Canonical announces fully Managed Kubeflow AI operations platform on the Microsoft Azure Marketplace

Canonical, the publisher of Ubuntu, today announced the general availability (GA) of Managed Kubeflow on the Microsoft Azure Marketplace. This solution enables AI teams to get a fully managed, production-ready MLOps platform in their own tenant. Upstream Kubeflow is a powerful tool for machine learning, but it remains notoriously challenging to deploy and maintain.

Civo AI: Strategy over complexity

Most cloud providers think AI is just a hardware problem. They focus on the GPUs, the racks, and the raw compute, but they leave the strategy up to you. At Civo, we do AI differently. We don't just provide the hardware; we guide you through the full life cycle of AI adoption, from initial planning to scaling production workloads. By leveraging best-in-class NVIDIA models and GPUs, we give you the performance to unlock AI at scale without the fear of being bogged down by complexity. It's more than infrastructure, it’s cloud freedom with AI built-in.

Self-host AI on Kubernetes: GPU clusters, private models, and the GitOps Catalog

Spin up a GPU workload cluster using Konstruct's new GPU cluster templates, deploy a self-hosted LLM, and use it in your organization — all live on stream. This hands-on session shows how shipping AI workloads to GPU clusters is just as easy as deploying to Konstruct physical or virtual clusters, and how open source apps in the GitOps Catalog make it even faster. Walk away knowing how to cut your token spend by running models privately on your own infrastructure.

NVIDIA Vera Rubin: What is it, what's new, and when you can get it

NVIDIA's infrastructure roadmap moves fast, and the next major milestone is already here. The NVIDIA Vera Rubin platform is the company's next-generation AI compute architecture, the successor to Blackwell, and it's shaping up to be one of the most significant leaps forward in AI infrastructure NVIDIA has ever shipped. Whether you're planning your next training cluster, scaling inference pipelines, or building the infrastructure to power autonomous agents, Vera Rubin is worth understanding now.

What Vera Rubin means for AI infrastructure in 2027

Every so often, NVIDIA releases something that quietly changes the direction of the industry. CUDA did it. DGX did it. NVLink did it. Vera Rubin feels like one of those moments again. At first glance, Rubin looks like the natural successor to Blackwell: faster GPUs, larger memory pools, and eye-watering performance numbers. But the more you dig into the architecture, the clearer it becomes that NVIDIA is not simply shipping another accelerator generation.

The sovereignty without toil guide: why compliance shouldn't require a Kubernetes tax

True data sovereignty isn't about managing your own cloud accounts; it’s about where your data resides and how it is governed. By utilizing a unified configuration file to deploy on sovereign infrastructure like OVHcloud, Upsun provides standardized sovereignty without the complexity of “Bring Your Own Cloud”.

The Hidden Cost of Kubernetes: Why Your Cloud Bill Is 40% Higher Than It Should Be

The average enterprise running Kubernetes wastes between $2 million and $10 million annually — not from overspending, but from under-optimizing. This is the story of costs you can't see on your dashboard but that your CFO feels every quarter.

Cursor Cloud Agents Are Incredible - Until You Need Production Governance

Cursor Cloud Agents are the best AI coding environment for individual developers. But for enterprises that need AI-written code to ship through staging to production with audit trails, RBAC, and compliance - there's a gap. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

Solved: fatal: Not a git repository (or any of the parent directories): .git

The fatal: not a git repository (or any of the parent directories): .git error means Git cannot find a .git directory in your current folder or any parent folder. In most cases, you are either in the wrong directory, the project was never initialized with Git, or the .git folder is missing or corrupted.
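The lookup behind that error can be approximated in a few lines. This is a simplified sketch, not Git's actual discovery logic: it only walks from a starting folder up through its parents looking for a `.git` entry, and it ignores environment overrides and filesystem boundaries. The function name `find_git_root` is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains a .git entry."""
    current = start.resolve()
    for candidate in [current, *current.parents]:
        # .git is usually a directory, but it is a plain file in worktrees and submodules.
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    root = find_git_root(Path.cwd())
    if root is None:
        print("No .git found here or in any parent directory")
    else:
        print(f"Repository root: {root}")
```

If this reports no repository from the folder where the error appears, the likely causes are the ones listed above: the wrong directory, a project that was never initialized, or a missing .git folder.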

Multi-cloud vs. hybrid cloud: Which approach is right for your organization?

Cloud adoption has evolved from simple infrastructure outsourcing into a spectrum of deployment models designed to balance performance, resilience, compliance, and cost. Two of the most widely adopted approaches today are multi-cloud and hybrid cloud. While they are often discussed together, they solve different architectural problems.

What are the benefits of decentralized AI infrastructure?

Have you ever considered how you can utilize artificial intelligence (AI) without sacrificing control over your data and autonomy? As we continue to navigate the changes of AI in the 21st century, it is important to understand how decentralized AI infrastructure can empower individuals and organizations to harness the potential of AI while maintaining sovereignty over their data and decision-making processes.

KubeVirt Live Migration Done Right: What it Takes to Run VMs on Kubernetes

Running VMs in Kubernetes sounds like a crazy workaround for avoiding vendor lock-in and for standardizing legacy applications and newer containerized workloads on one control plane, with one set of security policies to govern them all. It is, however, a rapidly growing pattern, and KubeVirt live migration (moving running VMs between nodes without downtime) is increasingly central to platform engineering use cases that require full VMs, like on-demand CI/CD pipelines.

The AI Agent Accountability Crisis: Why Governance Isn't Keeping Up With Deployment

Every enterprise is building AI agents. Marketing has one summarizing campaign performance. Engineering has one triaging incidents. Customer support has one resolving tickets. Finance has one processing invoices. Each was built by a different team, using a different framework, with different assumptions about security. Now those agents are talking to each other through agent-to-agent (A2A) communication. The incident-triage agent calls the customer-support agent to check affected accounts.

The FinOps Competitive Landscape in 2026 - When Cost Optimization Meets Reliability

The dashboard says you can save 30%. The SRE team won’t sign off. You’ve probably been in this meeting. Finance has a number. The platform team has a scar. Somewhere between them sits a senior manager, maybe you, being asked to choose a cost optimization tool that one side will champion and the other side will quietly refuse to deploy in production. The standoff isn’t about price. It’s about trust.

Lovable, Bolt, and Replit Are Wonderful - Until Your CISO Finds Out

Non-technical teams are building apps on Lovable, Bolt.new, and Replit with company data and zero governance. Here's why that's a compliance nightmare - and what enterprise platform teams should deploy instead. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

AI DevOps in 2026: How AI Coding Tools Are Breaking Your CI/CD Pipeline (and How to Fix It)

AI coding tools turned every engineer into a 10x developer. Now your CI/CD pipeline is the bottleneck. Learn how to handle 10x more deploys per engineer with Qovery's dual deployment model. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

What's New in Calico v3.32

We’re excited to announce the release of Calico Open Source v3.32! This release corresponds with Kubernetes v1.36 (codename Haru), and it goes beyond sharing a cat as the release mascot: it extends the capabilities and features of Kubernetes to keep you up to date with the latest cloud innovations. This release brings some of the most significant architectural changes in Calico, from live-migrating KubeVirt VMs to an eBPF-based Maglev load balancer.

#058 - The Future of AI and Platform Engineering with Blake Sherwood (Smarsh)

In this episode, special guest Blake Sherwood joins the show to discuss his unique career trajectory from tourism and coal mining to leading massive-scale Kubernetes migrations. Blake shares insights from his experience managing petabytes of data in high-compliance environments, delving into the practical realities of integrating AI into enterprise workflows and observability systems.

Claude Code Sandbox: The Complete Guide to Sandboxing AI Agents in Production

How to sandbox Claude Code, Codex, and other AI coding agents for production use. Compare local Docker, Daytona, E2B, and Qovery approaches - with architecture diagrams and real-world examples. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

How are hyperscalers misleading the cloud industry?

In 2024, Mark Boost, CEO at Civo, introduced the concept of ‘cloud parity’, a cloud computing approach that ensures a consistent, identical experience, feature set, and operational model across public, private, hybrid, and edge environments. “Cloud parity gives teams the freedom the cloud was supposed to deliver in the first place. It gives enterprises the sovereignty they need. It gives public sector bodies the clarity they require.

AI startup on a budget? How to master GPU computing without overspending

This blog is based on the webinar, “Panel Discussion: Understanding the importance of GPUs for AI success”. Cheap GPUs don't kill AI startups. Cheap thinking about GPUs does. In 2026, the teams burning through runway fastest aren't the ones who can't afford compute; they're the ones measuring the wrong thing and scaling the wrong way.

What Architecture Ensures Long-Term Scalability in a Rails-Based B2B Platform?

Scalability is not a feature you add later; it is a choice made at the architectural level from day one. A Rails-based B2B platform that handles growing clients, data, and transactions without slowdowns or costly rewrites is built on a modular design, clear domain boundaries, background job processing, caching, and a database strategy that supports load distribution and horizontal scale. Get these foundations right, and you stay in control of growth instead of reacting to problems after they appear.

A New Era of Linux Kernel Vulnerabilities

Two major kernel vulnerabilities have been announced this week. Copy-Fail (CVE-2026-31431) was announced on April 30th. Dirty Frag (CVE-2026-43284), also known as 'Copy Fail 2: Electric Boogaloo', was announced literally hours ago. Both have already been patched on Cycle, and our users can receive this update simply by restarting their nodes. The Linux patch was released less than two hours ago, and we're the first to get it to our customers.

What is sovereign AI, and why does it matter for your business?

With AI reshaping every corner of the modern business, the highest-value workloads are often locked behind complex regulatory frameworks. Yet many organizations are still running them on infrastructure they don't fully control, trusting external platforms to decide where their data lives, where workloads run, and how their AI operates. Civo was built to change that.

The state of cloud and AI in 2026

Over the past decade, cloud computing has evolved from an emerging technology into the foundation of modern digital infrastructure. However, the latest industry research shows that the industry has now crossed a critical threshold. The conversation is no longer about whether to adopt cloud, cloud-native technologies, or AI. Instead, it has shifted toward operational efficiency, economic predictability, and infrastructure at scale.

Calculating The Kubernetes Integration Tax: What Your DIY Networking Stack Actually Costs

It was 11:47pm on a Thursday night, and a senior platform engineer at a large North American bank was rolling back a ‘simple’ configuration change. The change itself was small, a routine update approved through the usual review process, but when it was applied, pods began cycling and connections started dropping. For the next three seconds, mobile banking sessions already mid-transaction dropped. Customer support lit up.

No egress fees. No lock-in. That's cloud freedom

With hyperscalers, growth comes with a hidden cost. The more your data moves, the more you pay, by design. Egress fees are that cost. A model built to discourage migration, limit flexibility, and keep you trapped in their ecosystem. At Civo, we've eliminated that barrier completely. No egress fees, no hidden charges. Every cost is transparent and predictable, so you always know exactly what you're paying for. You stay because you choose to. That's cloud freedom.

What Is AWS EKS, and How Does It Work with Kubernetes?

Amazon Elastic Kubernetes Service (Amazon EKS) is AWS’s managed Kubernetes service. It simplifies deploying, scaling, and running containerized applications on AWS and on-premises. EKS automates Kubernetes control plane management, ensuring high availability and seamless integration with AWS services like IAM, VPC, and ALB.

#057 - From Pagers to Pair Programming: Navigating Massive Scale and AI with Stefana Muller (Sale...

In this episode of "Kubernetes for Humans," Stefana Muller, VP of Infrastructure & Operations at Salesforce, shares her fascinating journey from technical support to navigating the massive scale of the Own Backup acquisition. Stefana dives into the immense multi-cloud Kubernetes challenges of scaling from 18,000 to over 52,000 clusters, standardizing environments across AWS and Azure, and leveling up security to meet stringent Salesforce standards.

Hyperscaler vs. independent cloud: How startups should choose in 2026

A two-person startup signs up for the obvious hyperscaler because their last company used it, because Stripe runs on it, because the documentation is exhaustive, and because the free tier looks generous. Eighteen months later, with a small team and a healthy seed round, they discover they're spending $18,000 a month, and they don't quite know where most of it is going. Three engineers can describe the architecture in detail. Nobody can describe the bill.

Stop ECS Containers From Collapsing Into One Service in OpenTelemetry

Why ECS containers collapse under service.name = aws_ecs and how to fix it for both EC2 launch type and Fargate, including the resource-vs-log-record pitfall that quietly breaks log filtering. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Build with Claude Code, Deploy with Qovery

AI coding tools eliminated the 'writing code' bottleneck. But deploying that code? Still a mess. Here's how Claude Code + Qovery Skill lets you go from idea to production in a single prompt - with enterprise-grade guardrails. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

ISO 27001, G-Cloud and SOC 2: How to vet a sovereign cloud provider

A procurement officer at a mid-sized financial services firm spent six months last year negotiating with a cloud provider that turned out not to hold the certification it had implied in its sales deck. The contract collapsed during legal review. The firm lost the time, the provider lost the deal, and somewhere in the middle, a senior engineer learned the difference between "compliant with the principles of" and "audited to the standard of."

Shadow IT Is Back - And Vibe Coding Made It 10x Worse

AI coding tools are the new Shadow IT - but instead of rogue Trello boards, they have OAuth access to your code repos, cloud accounts, and production databases. Here's what's already gone wrong, and how platform engineering fixes it. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.