Operations | Monitoring | ITSM | DevOps | Cloud

FinOps in the Age of Kubernetes: When Everyone Owns the Bill

A FinOps analyst walks into a Monday morning meeting with a detailed spreadsheet showing $2.3M in potential Kubernetes cost savings. The recommendations look straightforward: reduce memory limits by 40%, scale down replicas during off-peak hours, consolidate workloads onto fewer nodes. The numbers are compelling, the methodology is sound, and the savings would make a material impact on quarterly cloud spend. The SRE team immediately objects.

AI SRE in Practice: Enabling Non-Experts to Troubleshoot Kubernetes

Kubernetes troubleshooting traditionally requires deep platform expertise. Understanding pod lifecycle, decoding error messages, correlating events across resources, and identifying root cause all demand experience that takes years to build. This expertise gap creates a bottleneck where only senior engineers can handle production issues, limiting how quickly teams can resolve incidents.

When AI Writes the Code, Who Pays the Cloud Bill?

This is part two of a series of the implications of AI generated code becoming mainstream. We recently wrote about how AI-generated code is overwhelming SRE teams with production complexity they can’t manage. Turns out that’s only half the problem. The other half shows up on the cloud bill. A prospect reached out to us last month. They’d been using Cursor and Claude Code for six months, shipping features at unprecedented velocity. Product was thrilled.

When AI Writes the Code, Who Keeps Production Running?

The production environment has become a minefield of code nobody really understands. Here’s what’s happening: Development teams are using Claude Code, Cursor, and GitHub Copilot to ship features at 10x their previous velocity. Product managers are ecstatic. Business stakeholders are thrilled. And somewhere in a war room at 2:17 AM, an SRE is staring at a stack trace for code that was AI-generated three weeks ago, trying to figure out why the payment service just fell over.

AI SRE in Practice: Accelerating Engineer Onboarding with Contextual Expertise

Onboarding new engineers to complex Kubernetes environments is expensive. Junior engineers need to learn cluster architecture, understand organizational conventions, navigate internal documentation, and build relationships with senior team members who can answer questions. The process takes weeks or months, and during that time, senior engineers spend significant time mentoring instead of working on complex problems.

AI SRE in Practice: Diagnosing AWS CNI IP Exhaustion Before Widespread Outage

IP address exhaustion in Kubernetes doesn’t announce itself with clear error messages. Pods fail to schedule, services degrade unpredictably, and the symptoms look like a dozen different problems before anyone realizes the cluster has run out of available IP addresses. By the time the root cause becomes clear, multiple services are affected and recovery requires coordination across infrastructure layers.

AI SRE in Practice: Tracing Policy Changes to Widespread Pod Failures

Policy changes in Kubernetes are supposed to improve security, enforce standards, or optimize resource usage. But when a policy change triggers cascading pod failures across multiple namespaces, the investigation becomes a race to identify what changed before more workloads are affected.

The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases the operational cost, and reduces the features a team can afford to offer.

From Blueprint to Production: Building a Kubernetes MCP Server

As Large Language Models (LLMs) evolve from simple chatbots into agentic workflows, the need for a standardized way to connect them to external data and infrastructure has become critical. In a recent workshop hosted by Nir Adler, Innovation Engineer at Komodor, we explored how to bridge this gap using the Model Context Protocol (MCP).

Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective, it can be dangerous.