Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Metrics That Matter: Measuring Developer Productivity in the AI Era

In this episode, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at Ninety.io, for a conversation that cuts through the noise around developer productivity metrics and AI. Mark dives deep into how teams can measure what matters—without falling into the trap of turning every measure into a target. He shares how tools like Developer NPS, DORA metrics, and balanced scorecards can help teams optimize for both output and well-being—but only when framed with the right intent.

Java Util Logging Configuration: A Practical Guide for DevOps & SREs

Setting up proper logging is like having a good navigation system when you're driving through unfamiliar territory. For DevOps engineers and SREs managing Java applications, understanding how to configure the built-in java.util.logging framework is essential knowledge that can save you hours of troubleshooting headaches. Let's break down java util logging configuration in a way that makes sense — no fancy jargon, we promise!

How to View and Understand VPC Flow Logs

If you're running workloads in AWS, you've probably heard about VPC Flow Logs. These logs are your eyes and ears for network traffic in your Virtual Private Cloud, and knowing how to check them properly can save you hours of troubleshooting headaches. Whether you're tracking down connectivity issues or monitoring for suspicious activity, this guide will walk you through checking VPC flow logs step by step, with practical examples you can apply today.

Envoy vs HAProxy: Which Proxy Server Is Right for Your Infrastructure?

Choosing between Envoy and HAProxy isn't just about picking a proxy server. It's about deciding which tool will handle your traffic, balance your loads, and keep your services running when everything else wants to crash. If you're a DevOps engineer or system admin weighing these options, you're in the right place.

Incident management vs. problem management: A practical guide for SREs

In Site Reliability Engineering (SRE), distinguishing incident management from problem management is crucial. While both processes aim to maintain system reliability, they fulfill distinct roles: incident management focuses on quickly resolving immediate disruptions, whereas problem management identifies and rectifies root causes to prevent recurrence. Effectively combining these processes helps minimize downtime, enhances system resilience, and fosters a proactive operational approach.

Java GC Logs: How to Read and Debug Fast

When a Java application starts slowing down, garbage collection is often a good place to look. For engineers responsible for keeping systems stable and responsive, understanding GC logs can make a real difference. This guide walks through the basics—what to look for, what the logs mean, and how to troubleshoot common issues—so you can get ahead of problems before they impact performance.

Essential Unix Commands Cheat Sheet for DevOps Engineers

If you work in DevOps and spend time in the terminal, knowing Unix commands isn’t optional. It’s part of the job. Whether you're managing servers, setting up deployments, or fixing something that just broke in production, these commands help you move faster and work smarter. This cheat sheet keeps things simple. No filler. Just the commands you’ll use when you’re in the middle of real work.

An Easy Guide to Pausing Docker Containers

Docker containers have become essential tools in modern development workflows. While most DevOps engineers are familiar with starting, stopping, and removing containers, the "pause docker container" functionality often flies under the radar. Yet, this feature can be incredibly useful in various scenarios, from testing to resource management.