The latest News and Information on Service Reliability Engineering and related technologies.

An Easy Guide to Getting Started with Elastic APM

Jun 11, 2025 By Faiz Shaikh In Last9

Code in production will break. Maybe a request takes too long, maybe it fails quietly, or maybe it works fine one minute and falls over the next. Logs can help, sure, but they don’t always show the full picture, especially when performance issues are involved. Elastic APM gives you a clearer view. It traces what your application is doing, from incoming requests to database queries and everything in between.

How to Monitor Kafka Producer Metrics

Jun 10, 2025 By Anjali Udasi In Last9

Your Kafka producer pushed a million messages yesterday. Nice. But can you tell if they all made it? Or why latency spiked at 2 PM? Producer metrics help you determine that. They expose how long messages take to send, whether messages are getting stuck, and whether retries are piling up. Let’s go over which ones help while debugging and how to monitor them.

Introducing Bits AI SRE, your AI on-call teammate

Jun 10, 2025 By Kai Xin Tai In Datadog

Getting paged pulls engineers away from meaningful work, yet incident response in many organizations remains manual, reactive, and draining. An alert fires and teams scramble to find the root cause, relying on siloed knowledge, incomplete context, and a few on-call experts who are already stretched thin. The rise of AI coding agents has only intensified this challenge: As teams ship code faster with less human oversight, production systems grow increasingly complex and harder to understand.

How to Integrate OpenTelemetry Collector with Prometheus

Jun 9, 2025 By Prathamesh Sonpatki In Last9

Pulling observability data together is rarely clean. Metrics come from everywhere, formats vary, and making sense of it takes some work. OpenTelemetry Collector and Prometheus fit perfectly here. The Collector handles ingestion and processing from different sources, while Prometheus stores and queries the data. Simple, effective, and no vendor lock-in. In this blog, we cover how to integrate the Collector with Prometheus, common pitfalls, and ways to control costs.

A Complete Guide to Linux Log File Locations and Their Usage

Jun 9, 2025 By Anjali Udasi In Last9

Linux log files are text-based records that capture system events, application activities, and user actions. They're stored primarily in the /var/log directory and provide essential information for debugging issues, monitoring system health, and maintaining security. This guide covers the most important Linux log files and a few detailed techniques for reading and analyzing them.

How to Configure and Optimize Prometheus Data Retention

Jun 5, 2025 By Preeti Dewani In Last9

Prometheus can be lightweight to start with, but once it’s in production, storage usage tends to grow faster than expected. Managing how long data is kept becomes critical, especially when you're working with limited disk space or tight budgets. This guide outlines the key concepts behind Prometheus data retention, how to configure it effectively, and what to watch out for.

How to Log Into a Docker Container

Jun 4, 2025 By Anjali Udasi In Last9

When your Docker container isn't behaving the way you expect, you need to get inside and see what's going on. Maybe your app is throwing errors, a service won't start, or you just need to check some configuration files. Getting into a running Docker container is simpler than you might think, but there are several ways to do it depending on your situation. This guide shows you exactly how to log into Docker containers, troubleshoot common issues, and debug your applications effectively.

Graylog vs ELK: Which Log Management Solution Fits Your Stack?

Jun 3, 2025 By Faiz Shaikh In Last9

Your app logs start simple, maybe with a few print() or logging.info() calls. But in production, things get noisy. Thousands of log lines per minute are scattered across services, and it’s hard to know what matters. This is when tools like Graylog and the ELK stack help. They let you collect, search, and make sense of logs, but they do it in different ways. This guide breaks down how each one handles setup, scale, and day-to-day use.

How to Monitor and Manage Grafana Memory

Jun 3, 2025 By Anjali Udasi In Last9

It’s late, you get an alert, and Grafana is down. The reason? It ran out of memory. If you’ve ever watched Grafana slowly eat up RAM until it just stops responding, you know how frustrating that can be. Memory can spike quickly, especially with complex dashboards and multiple data sources. This guide will help you understand what’s going on and how to keep Grafana running without surprises.

Prometheus Alerting Examples for Developers

Jun 2, 2025 By Prathamesh Sonpatki In Last9

Everything looks fine: dashboards are green, logs are quiet. But users start reporting slow response times. No errors, no traffic spikes. Just a general slowdown. It’s a common situation. Not all problems show up as crashes or clear failures. Sometimes, performance degrades quietly, and standard metrics don’t catch it early. That's where Prometheus alerting can help, if you're monitoring the right signals.
