Operations | Monitoring | ITSM | DevOps | Cloud

Featured Post

6 Threat Detection Challenges for MDRs and How to Overcome Them

Managed Detection and Response (MDR) is a cybersecurity service offered by a Managed Security Services Provider (MSSP) that combines human security expertise with modern security tools to deliver managed threat detection, security monitoring, and incident response capabilities for both SMBs and enterprise clients. MDR services are especially valuable for organizations that need robust security monitoring and response capabilities, but may not have the resources or expertise to manage an in-house Security Operations Center (SOC).

AWS Centralized Logging: A Complete Implementation Guide

In cloud environments, logs are often spread across numerous services, making it difficult to track down issues or gather meaningful insights. For AWS users, this challenge can become especially time-consuming. Centralized logging in AWS helps by bringing all your logs into a single platform, making management and analysis easier.

Accelerating Observability Adoption: Why Self-Service Isn't Optional Anymore

For observability adoption to scale, you must eliminate the bottlenecks. A self-service approach is the only sustainable model, enabling all teams–not just a select few–to access, implement, and scale observability easily. But making the shift requires more than access: you have to design for it.

What Is a Logging Formatter and Why Use One?

Logs play a crucial role in DevOps and software development, especially when troubleshooting issues. However, raw, unformatted logs can quickly become overwhelming and difficult to navigate. This is where logging formatters help by turning messy log entries into clear, structured data, making it easier to pinpoint problems. In this guide, we’ll cover everything you need to know about logging formatters—how they work, why they matter, and tips for implementing them effectively in your workflow.
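As a rough illustration of the idea in this excerpt, here is a minimal sketch of a formatter built on Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the logger name, and the chosen JSON fields are assumptions made for this example and are not taken from the guide itself.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (field names are illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),  # applies %-style arguments
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Attach the formatter to a stream handler on a dedicated demo logger."""
    logger = logging.getLogger("formatter_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment failed for order %s", 1042)
print(buffer.getvalue().strip())
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same handler could point at a file or standard output.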

From Logs to Metrics Part 1: Building an Open-Source Logs-to-Graphite Pipeline

Monitoring doesn't always need to be complex. In this guide, we'll show you how to turn raw logs into usable metrics using a lightweight open-source setup with no ELK stack and no heavy lifting. We'll use Loki, Python, and Telegraf to convert logs into Graphite metrics you can easily monitor or alert on. This is perfect for system admins, DevOps beginners, or anyone curious about building more innovative monitoring pipelines from scratch.
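The core step of turning logs into metrics can be sketched in plain Python, without Loki or Telegraf. This is not the pipeline from the guide: the log line layout, the `demo.app.logs` metric prefix, the fixed timestamp, and both helper functions are assumptions chosen for illustration. The output follows Graphite's plaintext format of `path value timestamp`.

```python
import re
from collections import Counter

# Assumed log layout: "<timestamp> <LEVEL> <message>"
LEVEL_PATTERN = re.compile(r"\b(DEBUG|INFO|WARN|ERROR)\b")


def count_levels(lines):
    """Count how many log lines mention each level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def to_graphite_lines(counts, prefix, timestamp):
    """Format counts as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    return [
        f"{prefix}.{level}.count {value} {timestamp}"
        for level, value in sorted(counts.items())
    ]


sample_logs = [
    "2025-04-24T10:00:01Z INFO service started",
    "2025-04-24T10:00:02Z ERROR database timeout",
    "2025-04-24T10:00:03Z ERROR database timeout",
]

# The timestamp is a fixed example value in Unix epoch seconds.
metrics = to_graphite_lines(count_levels(sample_logs), "demo.app.logs", 1745488803)
print("\n".join(metrics))
```

Sending these lines to a real Graphite endpoint, whether directly or through an agent, is left out so the sketch stays runnable without any network setup.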

Cribl and Palo Alto Networks Launch Partnership with Cortex XSIAM Integration

Cribl’s powerful data processing engine is designed specifically for IT and Security teams, enabling organizations to take control of their ever-growing data volumes. It simplifies the management, processing, and analysis of telemetry data, such as logs, metrics, and traces, generated across complex digital environments. This gives organizations the choice, control, and flexibility to manage and analyze data, allowing them to adapt to evolving needs and strategies.

Australia Is Investing in Resilience - Are Businesses Ready?

The 2025-26 Australian Federal Budget sets out a clear priority: building a stronger economy and a more resilient nation. That includes investment in critical infrastructure, skills and services to help Australians navigate ongoing uncertainty. More than $3 billion has been committed to upgrade the National Broadband Network (NBN), extending high-speed fibre to 95% of homes and businesses.

AI and the Data Value Challenge: Why It's Time to Rewrite the Rules of Data Management

Like the sailor in Coleridge’s “The Rime of the Ancient Mariner,” surrounded by ocean water that he cannot drink, modern organizations contend with similar challenges: data is all around, but it’s not doing them much good (or as much as it could at least). Exploding data volumes have complicated the data management strategies for security and observability teams seeking to contain costs while meeting regulatory and compliance obligations.

Events, Alerts, and Incidents: What's The Difference? How Do They Relate?

Effectively managing events and alerts is essential for preventing or quickly resolving incidents, whether it’s a sudden service outage or an ongoing cyberattack. The three terms — events, alerts, incidents — are different but they are closely related. Read on to learn more. Ensuring the reliability, performance, and efficiency of IT systems is both the heart of operational excellence and an important strategic objective for digital organizations.

All about OTel and Logging on Kubernetes with Loki (Loki Community Call April 2025)

In this pre-recorded Loki Community Call, we talk all about OTel and logging on Kubernetes with Cyril Tovena, Ward Bekker, Jay Clifford, and Nicole van der Hoeven at KubeCon EU 2025 in London. We discuss when and why you should switch to OTel and why you shouldn't, what OTLP is exactly, and best practices for ingesting data through an OTLP endpoint.

Think Like a Query with Pablo Loaiza - Customer Brown Bag - April 24, 2025

Join us as we discuss how to approach real-world questions, translate them into queries, and refine them for maximum efficiency. Hands-on examples teach you how to filter effectively, compare historical data, correlate events, and troubleshoot common challenges.

Strategic Windows Event Routing with Bindplane

Windows event logs can provide valuable insight into day-to-day operations and potential security issues. But making sense of that data—and getting it to the right place without overloading your systems or driving up costs—takes some planning. Bindplane helps with this by providing a flexible way to collect, process, and route Windows events. It’s designed to support security and compliance needs without adding unnecessary complexity.

How Does OpenTelemetry Logging Work?

Modern systems throw off logs like confetti—and making sense of all that noise is half the battle. OpenTelemetry logging offers a way to bring some order to the chaos. It helps DevOps teams collect logs in a consistent format, no matter what language or framework they’re working with. In this guide, we’ll walk through what OpenTelemetry logging is, why it matters, and how to put it to work in your stack.

Elastic Cloud Serverless now generally available on Google Cloud

Elastic Cloud Serverless provides the fastest way to start and scale security, observability, and search solutions without managing infrastructure. Today, we are excited to announce the general availability of Elastic Cloud Serverless on Google Cloud, now available in the Iowa (us-central1) region.

Data Backup Strategies: The Ultimate Guide

Despite the nonstop warnings, millions of users still gamble with their data. A 2023 survey by Acronis revealed that 41% of people rarely or never back up their digital files, and businesses aren’t much better. Fewer than 20% of businesses back up their SaaS data, even though tools like Google Workspace and Microsoft 365 don’t guarantee full recovery after a loss or attack. The consequences?

Responsible AI: What It Means & How To Achieve It

The information age has leapt forward with the explosive rise of generative AI. Capabilities like natural language processing, image generation, and code automation are now mainstream, driving the business goals of winning customers, enhancing productivity, and reducing costs across every sector. New large language models emerge almost daily, and existing ones are being optimized in a frantic race to the top. There seems to be no stopping the AI boom.

How to Build a Successful SIEM Migration Strategy

At least once a week, a team reaches out to discuss migrating from an established SIEM or analysis platform. This major decision is influenced by several compelling factors, which can create significant work for engineering teams and pose risks to the business. The cost of switching to a new platform, often referred to as displacement costs, can be substantial.

Datadog: The Good, The Bad, The Costly

When things break, logs are often the first place you turn to figure out what's going on, which is why Datadog makes it easy to find them. The ability to pivot between traces, metrics, and logs in one place speeds up investigations and helps teams move faster during incidents. That level of correlation is a big reason so many teams rely on Datadog.

One year in: How Flex Licensing is transforming log management and visibility

A year ago, we set out to transform log analytics pricing by making it as flexible, transparent, and customer-friendly as possible. We built a model that aligns cost with business value, charging only for data storage and analytics executed. With Flex Licensing, customers can scale usage up or down without breaking the bank, eliminating hidden costs and inefficient licensing structures. There is no more pre-planning or tiering of log data; there is just log ingest with sensible pricing.

How a global bank turned a search engine into its data backbone

BBVA transformed customer experience and operational insight by using Elastic to unify 45B+ data points across 50+ banking services, with sub-second response times. When BBVA's David Jiménez Ausin looks back at 2014, he sees a very different banking landscape. “Almost everything was still via web channel, as the app wasn't as developed as it is now, and each service had its information in its own systems,” he recalls.

New in Adaptive Logs: user-facing temporary pauses, exemptions, and per-service recommendations

We launched Adaptive Logs last year to help you optimize your log volumes and costs in Grafana Cloud, and we’ve been hard at work ever since making improvements based on your feedback. Over the past couple of months, we’ve delivered several new features to help reduce toil, apply recommendations with precision, and—what we’re most excited about—confidently optimize your log ingestion while still providing peace of mind to your end users!

Advanced Python Logging: Mastering Configuration & Best Practices for Production

Python's logging system provides powerful tools for application monitoring, debugging, and maintenance. This comprehensive guide covers everything from basic setup to advanced implementation strategies, helping you build robust logging solutions for your Python applications.

GDPR Log Management: A Practical Guide for Engineers

GDPR compliance for logs can be tricky—especially when you're trying to maintain system visibility and protect user data at the same time. For SREs and IT teams, it’s a balancing act between staying on the right side of privacy laws and not losing the context you need to troubleshoot. This guide walks through practical ways to handle personal data in logs, set up retention rules that make sense, and stay compliant without creating unnecessary friction.

Serverless Monitoring In The Cloud With Bindplane and OpenTelemetry

Almost two years ago I wrote the first installment of what was supposed to be a three-part series on Serverless Monitoring. Parts two and three never materialized. Today, however, I am revisiting that original idea and expanding upon it. I hope to succeed this time in making it a full three-part series. For this first installment (Revisited), I will again work with Google Cloud Run to monitor MongoDB Atlas.

Mezmo Recognized with 25 G2 Awards for Spring 2025

We’re thrilled to share that Mezmo has been recognized by G2 with 25 badges across four key categories: Enterprise Monitoring, Log Monitoring, Log Analysis, and Cloud Infrastructure Monitoring. These awards are more than just a celebration of our platform—they’re a reflection of you, our customers. Your feedback, support, and insights push us to build better solutions and deliver the highest standards of performance and service.

A Closer Look at Docker Build Logs for Troubleshooting

In the world of containerization, understanding what's happening under the hood during image builds can mean the difference between smooth deployments and frustrating debugging sessions. Docker build logs are your window into this process, offering crucial insights that help you optimize builds, troubleshoot errors, and maintain robust container infrastructure.

An easier way to configure the OpenTelemetry SDK in your applications | Declarative Configuration

In this video, we'll explore OpenTelemetry's declarative configuration feature, a powerful new method to configure the OpenTelemetry SDK using a YAML file without the complexity and overhead of programmatic instrumentation. I'll demonstrate this with a simple Go application instrumented using declarative configuration, sending metrics, traces, and logs to Splunk Observability Cloud.

What Is Hybrid Cloud? Trends, Benefits, and Best Practices

Over the past decade, businesses have realized that relying solely on their data centers has limitations. That’s why 38% of organizations turned to private clouds in 2024 to control their data. However, as the need for more flexibility and scalability grew, they started integrating public cloud services. In this article, we’ll explore hybrid cloud computing, what it is, how it works, and why it’s a hot future trend for businesses.

How to create and monitor an AWS Lambda function in Java 11

Serverless computing is a modern cloud-based application architecture in which the application’s infrastructure and support services layer are completely abstracted from the software layer. While every application still relies on physical servers to run, serverless applications shift that responsibility to cloud service providers like Amazon Web Services (AWS).

Log Consolidation Made Easy for DevOps Teams

Managing multiple systems that each generate their own alerts and logs can quickly become overwhelming. The challenge of scattered logs is a real headache, especially in the fast-paced world of DevOps. Log consolidation is not just a convenience; it's an essential practice that can save you from chaos and improve your operational efficiency. This guide covers everything you need to know about log consolidation, from understanding what it is and why it matters, to practical steps for making it work.

Elastic Observability 9.0/8.18: Elastic Distributions of OpenTelemetry (EDOT) now GA, LLM observability, and more

Elastic Observability 9.0/8.18 announces several key capabilities. Elastic Observability 8.18 and 9.0 are available now on Elastic Cloud, the only Elasticsearch offering to include all of the new features in this latest release. You can also download the Elastic Stack and our cloud orchestration products, Elastic Cloud Enterprise and Elastic Cloud for Kubernetes, for a self-managed experience. What else is new in Elastic 9.0/8.18? Check out the 9.0/8.18 announcement post to learn more.

The hidden costs of tool sprawl: An SRE's guide to observability consolidation

An overview of the benefits, challenges, and philosophy behind consolidating your observability tools. Picture this: It's 3:00 a.m., and your phone is buzzing with alerts from what seems like a dozen different monitoring tools. As you blearily scroll through the notifications, you can't help but wonder, "How did we end up with so many tools, and why can't they just talk to each other?"

Logging vs Monitoring: What's the Real Difference?

Let's talk about something central to DevOps work: logging vs monitoring. While both are essential components of maintaining system health and reliability, they serve distinct purposes and complement each other in different ways. The distinction between them isn't always clear-cut, especially as tooling continues to evolve. This guide talks about the practical applications, technical differences, and implementation strategies for both logging and monitoring in modern DevOps environments.

KubeCon 2025 London: OpenTelemetry Steals the Show and Splunk's Bold Moves

I was lucky enough to attend KubeCon Europe 2025 in London, where the energy around OpenTelemetry (OTel) reached fever pitch. From packed sessions to buzzing hallway conversations, it’s clear: OpenTelemetry isn’t just the future—it’s the present. Here’s what stole the spotlight.

AI assistant: From generalist to specialist

In the AI world, there’s a lot of buzz about creating custom large language models (LLMs) tailored for specific domains, perhaps for better security, context, expertise, or accuracy. It’s an appealing idea: What better way to solve your niche challenges than with a bespoke AI designed just for you? But here’s the thing — building a great LLM isn’t just challenging; it’s prohibitively expensive and resource-intensive.

Debug Logging: A Comprehensive Guide for Developers

When an app breaks and there's no clear clue why, debug logs often hold the answers. They record what the code was doing at each step, making it easier to trace back and spot what went wrong. This guide covers what debug logging is, why it’s useful, and how to use it without turning logs into a wall of noise.

Reducing Telemetry Toil with Rapid Pipelining

Intellyx BrainBlog by Jason English for Mezmo. “Bubble bubble, toil and trouble” describes the mysterious process of mixing together log data and metrics from multiple sources as they enter an observability data pipeline. Customers demand high-performance, functionality-rich digital experiences with near-instantaneous response times.

Flexible Log Management at Scale for Government

As government agencies scale their IT modernization initiatives and deepen their focus on security, managing and maximizing the value of growing log volumes becomes more challenging. During this webinar, Datadog experts examined how to collect, process, and store large machine-generated data sets, transforming them from noise into actionable intelligence.

Elastic extends production-ready AI capabilities for all!

Elastic Security is making your organization safer with general availability of our favorite AI features. Elastic Security is announcing the general availability (GA) of two of our most widely deployed generative artificial intelligence (GenAI) capabilities: Attack Discovery, launched in May, and Automatic Import, launched in August. Elastic’s AI-driven security analytics are providing immense value to many organizations.

Building a Self-Service and Scalable Observability Practice

Join us in this session and learn how Splunk can help you build a standardized observability practice. From implementing an observability-as-code service to role-based access controls (RBAC), Token Management, Metrics Pipeline Management, and OpenTelemetry, learn how to create an Observability platform to optimize your metrics usage and costs while managing workloads efficiently.

What Is Synthetic Data? A Tech-Savvy Guide to Using Synthetic Data

Synthetic data is gaining attention as artificial intelligence (AI) continues to evolve. But what exactly is it, and why is it so important today? At a high level, synthetic data refers to data that's generated by algorithms or mathematical models. It is not data collected from the real world.

How Cribl Partners with Google Cloud Security to Transform Telemetry Data Management for Google Security Operations

Organizations today are grappling with an explosion of telemetry data as cloud adoption accelerates, digital infrastructure expands, and operational complexity increases. More data creates more challenges for IT and security teams as they struggle to separate signal from noise while maintaining compliance and efficiency within constrained budgets. It often feels like being caught in the deep end of a wave pool without a floatie, with each new data source sending another wave crashing down.

Java Util Logging Configuration: A Practical Guide for DevOps & SREs

Setting up proper logging is like having a good navigation system when you're driving through unfamiliar territory. For DevOps engineers and SREs managing Java applications, understanding how to configure the built-in java.util.logging framework is essential knowledge that can save you hours of troubleshooting headaches. Let's break down java util logging configuration in a way that makes sense — no fancy jargon, we promise!

How to View and Understand VPC Flow Logs

If you're running workloads in AWS, you've probably heard about VPC Flow Logs. These logs are your eyes and ears for network traffic in your Virtual Private Cloud, and knowing how to check them properly can save you hours of troubleshooting headaches. Whether you're tracking down connectivity issues or monitoring for suspicious activity, this guide will walk you through checking VPC flow logs step by step, with practical examples you can apply today.

Comprehensive Guide to Log Aggregation Techniques and Tools

Logs can provide vital insights to help you monitor system health, pinpoint and resolve issues, and improve cybersecurity. They capture real-time errors and record information about events and other system activities, shedding light on everything from application performance to security threats. However, managing logs can be overwhelming. To get the most out of your logs, you need to aggregate them into a centralized system where they can be organized, searched, and analyzed effectively.

Application Logging Best Practices for Network Technicians: A Comprehensive Guide

If you need to monitor your application’s health, troubleshoot issues quickly, and ensure compliance with various security policies, application logging is compulsory. Without proper logging, identifying the root cause of failures, tracking suspicious activity, or optimizing application performance will become significantly more challenging, if not impossible.

The Role of Log Shippers in Your Stack

Log shippers are essential components in modern infrastructure, serving as the critical connection between the systems that generate logs and the platforms that store and analyze them. They operate behind the scenes to ensure that important system and application information reaches its destination reliably. This guide provides a comprehensive overview of log shippers, including their functionality, implementation considerations, and selection criteria for different environments.

Splunk Federated Data Management - Process, Route and Search Cisco ASA logs

Imagine you have Cisco ASA logs that you want to onboard to the Splunk platform and Observability Cloud, but not all the logs need to be onboarded; some need to stay on low-cost storage like S3. In addition, you must mask or encrypt data before the logs are onboarded to these platforms. In this video, we will explore how Splunk Federated Data Management can assist with this challenge and help maximize the value of your data.

IIS log files: How to find, analyze, and centralize IIS logs

Microsoft Windows Internet Information Services (IIS) log files hold a wealth of data on web application activity and performance. But locating and managing these logs can be challenging for busy sites with constant traffic and complex infrastructures. IT operations teams rely on IIS logs to troubleshoot web applications, track server requests, identify users, and address other user traffic concerns for optimal security.

How to Master Log Management with Logrotate in Docker Containers

Docker containers continuously generate logs during operation, and without proper management, these logs can consume significant disk space, impact system performance, and create operational issues. Logrotate offers an effective solution for managing these logs in containerized environments. This guide covers the implementation of logrotate in Docker containers – from initial setup through advanced configurations that ensure stable, maintainable container deployments.

Announcing BYOC and the OpenTelemetry Distribution Builder

Instead of deploying a patchwork of proprietary agents for every platform, a telemetry pipeline lets you route your data through a single, consistent layer—and send it to any backend you choose. Flexibility, achieved. But there’s a catch. If your pipeline is proprietary, you’ve only shifted the lock-in left. Sure, you can now add or swap destinations freely—but you’re still deeply dependent on a vendor in the middle of your data flow.

Observability Costs: Tips for More Efficient Data Management

Can you ever get too much data? With modern architectures growing increasingly complex, with hundreds of microservices and containers, data volume grows at an exponential rate, and there’s no pause in sight. In this era of ever-expanding telemetry volume, it’s nearly impossible to separate valuable data from noise, which makes tasks like root cause analysis and alerting needlessly complicated while putting pressure on your stack's performance, scalability, and budget.

Leverage Cloudflare logs for cost optimization, troubleshooting, and security

Cloudflare is a content delivery network (CDN) that helps businesses accelerate, protect, and optimize their websites, applications, and APIs. It acts as a reverse proxy, sitting between users and a website’s origin server to provide DDoS protection, web application firewall (WAF), CDN caching, and load balancing.

Essential Steps for Troubleshooting Network Problems

Everyone has a story about that one road trip where traffic got backed up, making people late to the event. When you have network connectivity problems, your information highway gets clogged up, making it difficult for users to access resources efficiently. While network troubleshooting strategies may seem simple, a lot of nuance and complexity lies in the activities when you dig into your data.

Deployment Tracking with Mezmo Live Streaming Tail

You've deployed a new feature into production. You've done your unit testing, fixed lots of bugs, your code is awesome. Now it's time for hundreds/thousands/millions of users to break...err...use your feature. You're diligent about tracking usage in real-time, and getting customer feedback when something goes wrong. You track the performance and response time impacts on the server. All is good...except...that feature isn't quite working for a specific group of users. Now what?

When Should You Enable Trace-Level Logging?

There’s nothing like debugging a broken system at 2 AM, running on caffeine and frustration. When everything’s on fire, logs are your lifeline. That’s where trace-level logging comes in. Unlike standard logs, it captures the step-by-step execution of your code—think of it as the difference between a crime report and full CCTV footage. But more logs don’t always mean better debugging. Too much detail, and you’re drowning; too little, and you’re guessing.

Webinar: Petabyte Scale, Gigabyte Costs: Mezmo's ElasticSearch to Quickwit Evolution

Many engineering teams rely on ElasticSearch for search and analytics, but as data volumes grow, so do the challenges of scale, cost, and performance. At Mezmo, we faced this reality head-on, recognizing the need for a more efficient and scalable solution to support our multi-cluster, multi-petabyte telemetry data backend. After extensive evaluation, we made the leap to Quickwit, an open-source, cloud-native search engine for logs. But making such a fundamental architectural shift—without disrupting customers—was no small feat.