Latest Posts

The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

Feb 5, 2026 By Udi Hofesh In Komodor

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. Past a certain point, however, increasing reliability makes a service (and its users) worse off rather than better. Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases operational cost, and reduces the features a team can afford to offer.

From Blueprint to Production: Building a Kubernetes MCP Server

Feb 5, 2026 By Nir Adler In Komodor

As Large Language Models (LLMs) evolve from simple chatbots into agentic workflows, the need for a standardized way to connect them to external data and infrastructure has become critical. In a recent workshop hosted by Nir Adler, Innovation Engineer at Komodor, we explored how to bridge this gap using the Model Context Protocol (MCP).

Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

Feb 4, 2026 By Itiel Shwartz In Komodor

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective; it can be dangerous.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

Feb 2, 2026 By Nir Adler In Komodor

Gartner predicts that AI agents will be implemented in 60% of all IT operations tools by 2028, up from fewer than 5% at the end of 2024. This acceleration has sparked an explosion of AI SRE solutions, from enterprise platforms to open-source alternatives, all promising faster root cause analysis and reduced MTTR.

How Cisco Revolutionized Platform Engineering with Komodor's Agentic AI

Jan 28, 2026 By Itiel Shwartz In Komodor

In the world of cloud-native infrastructure, complexity is the silent killer of innovation. For Cisco Outshift, the company’s incubation engine, managing a sprawling environment of AWS EKS clusters and edge-based MicroK8s workloads created a classic bottleneck: the Platform Engineering team was drowning in toil. Facing SRE burnout and the limits of human scaling, Cisco embarked on an ambitious journey to evolve its internal operations from standard DevOps to Agentic AI.

AI SRE in Practice: Resolving Node Termination Events at Scale

Jan 25, 2026 By Itiel Shwartz In Komodor

When a node terminates unexpectedly in a Kubernetes cluster, the immediate symptoms are obvious. Workloads restart elsewhere, services experience partial outages, and alerts fire across multiple systems. The harder question is why it happened and how to prevent it from recurring. This scenario walks through a node termination event where the entire node pool was affected, requiring investigation across infrastructure layers to identify the root cause and implement lasting remediation.

AI SRE in Practice: Diagnosing Configuration Drift in Deployment Failures

Jan 18, 2026 By Itiel Shwartz In Komodor

Deployments fail for dozens of reasons. Most of them are obvious from the error messages or pod events. But when a deployment rolls out successfully according to Kubernetes and your application still starts experiencing latency spikes and error-rate increases, the investigation becomes significantly harder. This scenario walks through a configuration drift incident where the deployment appeared healthy but available replicas were constantly flapping, creating cascading reliability issues.

AI SRE in Practice: Resolving GPU Hardware Failures in Seconds

Jan 11, 2026 By Itiel Shwartz In Komodor

When a pod fails during a TensorFlow training job, the investigation usually starts with the obvious questions. The answers rarely come quickly, especially when the failure involves GPU hardware that most engineers don’t troubleshoot regularly. This scenario walks through an actual GPU hardware failure and shows how AI-augmented investigation changes both the time to resolution and the expertise required to handle it.

When is it ok or not ok to trust AI SRE with your production reliability?

Jan 8, 2026 By Ilan Adler In Komodor

There’s a moment every engineer knows. An AI suggests a fix, it looks reasonable, maybe even obvious, but production is on the line and you hesitate before clicking execute. There’s a big difference between an AI that can recommend an action and one you’re willing to let take that action. All it takes is one bad call, one kubectl command that makes things worse, and suddenly every automated suggestion is a potential liability instead of a help.

From Promise to Practice: What Real AI SRE Can Actually Do When Production Breaks

Jan 4, 2026 By Itiel Shwartz In Komodor

We’ve written before about the advantages of training an AI SRE on real telemetry data rather than generic Kubernetes documentation. We’ve explained why RAG augmentation based on actual high-scale workload patterns produces better results than LLMs trained on generic scenarios or forum threads. The theory makes sense, the architecture is sound, and the approach is defensible.
