Cloud Outage History: Six Years of Recurring Failures

Cloud infrastructure has never been more reliable in theory. In practice, the last six years of cloud outage history have delivered some of the most disruptive incidents on record, not because cloud providers got worse, but because the systems built on top of them got larger, more interconnected, and more brittle in ways that don't show up until everything breaks at once.

Claude Code Sandbox: The Complete Guide to Sandboxing AI Agents in Production

How to sandbox Claude Code, Codex, and other AI coding agents for production use. Compare local Docker, Daytona, E2B, and Qovery approaches, with architecture diagrams and real-world examples. Romaric founded Qovery to make Kubernetes accessible to every engineering team. He writes about platform strategy, developer experience, and the future of cloud infrastructure.

#058 - The Future of AI and Platform Engineering with Blake Sherwood (Smarsh)

In this episode, special guest Blake Sherwood joins the show to discuss his unique career trajectory from tourism and coal mining to leading massive-scale Kubernetes migrations. Blake shares insights from his experience managing petabytes of data in high-compliance environments, delving into the practical realities of integrating AI into enterprise workflows and observability systems.

Your Metrics Look Fine. Your Engineers Are About to Quit.

Developer experience predicts what's coming 3 to 6 months before it shows up in your delivery metrics. So why are most engineering leaders measuring it last? In this session, GitKraken VP of Developer Research Jeremy Castile breaks down what developer experience (DevX) actually is, how to measure it across 6 key dimensions, and how it connects to velocity, code quality, and AI impact data your team is already tracking.

Addressing the Cold Start Problem in Travel Personalization for OTAs

In the high-stakes world of Online Travel Agencies (OTAs) like Expedia, Hopper, Priceline, and Airbnb, seconds matter. A traveler searching for a "beachfront stay in Hawaii" isn't just looking for a room — they are reacting to weather changes, fluctuating flight prices, and social media trends. Traditional travel platforms often rely on stale data: yesterday's search history or last week's preferences. To truly compete, travel platforms must pivot to Real-Time Context Engineering.

What Is an Incident Commander? Role, Skills, and Best Practices

The fastest incident response teams treat coordination as a craft. Someone owns the call, drives the decisions, and keeps everyone moving in the same direction while the team puts the system back together. That person is the incident commander (IC), and getting the role right is what separates your 15-minute fix from a four-hour war room where nobody’s sure who’s making the call.

What Is APM? A Guide to Application Performance Monitoring

A well-instrumented service tells your on-call engineer which deploy broke checkout, which span ate the latency budget, and which line to revert before the support queue fills up. Getting there depends on how cleanly your application performance monitoring layer turns telemetry into answers. The sections ahead walk through how APM works, the metrics and components worth tracking, the cloud-native challenges at scale, and how to evaluate APM tooling against your real workload.

From Monitoring to Observability: How DEX Integrations Strengthen IT Visibility and User Productivity

When I started working in IT in the late '90s, IT performance was always measured by the health of infrastructure: CPU utilization, network latency, and server uptime. For many organizations, little has changed in the last 30+ years. We became very good at keeping systems alive, yet users still struggled to get work done. That disconnect is exactly why Digital Employee Experience (DEX) has emerged as a critical discipline. But DEX on its own is not the end goal.

Honeycomb Innovation Week: Debugging Agentic Workflows with Ken Rimple

Canvas skills are how your team's runbooks and tribal knowledge become an active part of the investigation instead of a document someone has to remember to open. Pre-built skills cover the most common investigation patterns out of the box. Custom skills let you encode the specific context, thresholds, and decision logic your team has accumulated, so every auto-investigation starts with your best thinking already applied.