Syslog Implementation: Servers, Integration and Best Practices

Syslog is a fundamental protocol for collecting messages and event data from various devices and applications across a network. Think of it as a universal language that allows your servers, routers, firewalls, and software to send their operational insights to a central logging point. Born from Unix systems, Syslog has evolved to become the industry standard, forming the backbone of effective log management and providing a unified view of your infrastructure's activity.
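
As a small illustration of how the protocol labels messages: every syslog message starts with a priority value (PRI) that packs a facility and a severity into one number, computed as facility * 8 + severity (RFC 3164 and RFC 5424). The sketch below is not taken from the article; `encode_pri` and `decode_pri` are hypothetical helper names chosen for this example.

```python
from typing import Tuple

# Ranges defined by the syslog RFCs: facilities 0-23, severities 0-7.
MAX_FACILITY = 23
MAX_SEVERITY = 7


def encode_pri(facility: int, severity: int) -> int:
    """Combine facility and severity into the PRI value sent as <PRI>."""
    if not 0 <= facility <= MAX_FACILITY:
        raise ValueError(f"facility out of range: {facility}")
    if not 0 <= severity <= MAX_SEVERITY:
        raise ValueError(f"severity out of range: {severity}")
    return facility * 8 + severity


def decode_pri(pri: int) -> Tuple[int, int]:
    """Split a PRI value back into (facility, severity)."""
    if not 0 <= pri <= MAX_FACILITY * 8 + MAX_SEVERITY:
        raise ValueError(f"PRI out of range: {pri}")
    return divmod(pri, 8)


if __name__ == "__main__":
    print(encode_pri(16, 6))   # local0 (16), informational (6)
    print(decode_pri(165))     # PRI used in the RFC 5424 examples
```

Here `encode_pri(16, 6)` yields 134, and `decode_pri(165)` yields `(20, 5)`, that is, facility local4 with severity notice.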

Kubernetes observability: How to enrich logs with GeoIP using the Kubernetes Monitoring Helm Chart

When your Kubernetes app suddenly has traffic spikes in a distant country, it can be difficult to determine why. Let’s say, for example, we have an e-commerce app that started to receive an unusual surge of visitors from Australia — something we never anticipated. We search for answers in our logs, but without geographic context, we don’t have the full insights we need.

Detect hallucinations in your RAG LLM applications with Datadog LLM Observability

Hallucinations occur when a large language model (LLM) confidently generates information that is false or unsupported. These responses can spread misinformation that jeopardizes safety, causes reputational damage, and erodes user trust. Augmented generation techniques, such as retrieval-augmented generation (RAG), aim to reduce hallucinations by providing LLMs with relevant context from verified sources and prompting the LLMs to cite these sources in their responses.

Simplifying Observability: Streamlining Telemetry with a Centralized Pipeline

Modern applications generate a deluge of telemetry data—logs, metrics, and traces—that hold the key to understanding system performance and reliability. However, managing this data effectively is a growing challenge for DevOps teams. Raw telemetry can overwhelm teams with complexity and noise even when collected via robust standards like OpenTelemetry.

How to Choose an APM Solution: 5 Critical Questions for 2025

An APM solution, or Application Performance Monitoring tool, is a software application that helps businesses monitor and manage the performance and availability of software applications. APM tools gather data from systems, servers, databases, APIs, and end-user devices to provide deep insights into the root causes of performance issues. APM solutions have evolved far beyond basic monitoring.

Grafana Campfire - Hiring with AI and more about Grafana MCP (Grafana Community Call - May 2025)

In this Campfire community call, we will talk about what is new and what is next for AI in the observability space, and discuss the Grafana MCP server, which provides access to your Grafana instance and the surrounding ecosystem. Join me (Usman), Mat Ryer, Carl Bergquist, and David Kaltschmidt for this exciting session. Special guests: Sarah Zinger, Cyril Tovena and Ben Sully.

Harnessing Network Observability to Enhance Grid Resilience

Within the utility sector, a lot is changing. Utilities continue to pursue digital transformation, altering the way services are delivered and operations are managed. What hasn’t changed is the criticality of the services provided. These organizations deliver essential resources like natural gas, electricity, and water—services that we as consumers rely upon constantly for our comfort, sustenance, communications, and more.

Inside the Observability Journey: Lessons from CarGurus, Nearform & More

Join us for a dynamic panel from Observability Sessions Boston where leaders from CarGurus, Nearform, and Grafana Labs share their real-world experiences with observability. In this candid discussion, David Frankel (CarGurus) and Joe Szodfridt (Nearform) delve into the challenges of implementing scalable observability practices, moving from centralized models to federated teams, and navigating cloud migration with a focus on performance and cost.

Observability 2.0 in the Real World: Lessons from SimpliSafe's Engineering Journey

In this candid and insightful talk from Observability Sessions Boston, Laban Eilers, a platform engineer at SimpliSafe, takes us on a practical deep dive into the evolution of observability—from the traditional “three pillars” model to the emerging promise of Observability 2.0.

Using the OpenTelemetry Operator to boost your observability

If you’ve ever wrangled sidecars or sprinkled instrumentation code just to get basic trace data, you know the setup overhead isn’t always worth the payoff. But what if it was… just easier? That’s where the OpenTelemetry Operator for Kubernetes steps in… and it plays great with Coralogix out of the box!

How to implement business observability

It sounds simple: You define metrics for success, you track them, and if they fail, you fix them. For decades, this was how businesses monitored their systems. However, a reactive monitoring approach, which alerts businesses about failures only after the issue has already impacted operations, became insufficient as digital architectures grew more complex.

Is There an Existential Crisis in Network Observability?

We've all been there. Users report that applications are slow, calls are dropping, or that "the internet is broken." Yet, a glance at the network dashboards shows a sea of green—latency looks acceptable, packet loss is minimal, and bandwidth seems fine. This common scenario highlights a fundamental challenge in network observability: the perceived disconnect between the technical measurements we gather and the actual experience of the people using our digital services.

Sneak Peek: MetricFire's New Logging Tool for Scalable, Open-Source Observability

Take a first look at MetricFire’s brand-new logging tool — designed to simplify log ingestion, storage, and visualization using open-source components like Loki, Python, Telegraf and Grok. Collect logs, search across services, and correlate them with your metrics — all inside your existing Hosted Graphite environment. Whether you're an SRE, DevOps engineer, or running logs on a budget, this sneak peek reveals how MetricFire is evolving toward full observability.

Logz.io AI Agents: Transforming Observability Through Intelligent Automation

Let’s be honest. AI features can sound cool on paper, but too many tools overpromise and underdeliver. At Logz.io, we didn’t want to build “yet another AI chatbot.” We wanted to create something our engineers and yours would actually use when incidents hit, logs explode, or someone asks, “What just happened to production?” Here’s how our AI Agent evolved from a basic chat interface to an incident-resolving, log-analyzing, doc-digging, context-aware assistant.

Grafana Cloud updates: New observability as code tools, Grafana Drilldown enhancements, and more

We consistently roll out helpful updates and fun features in Grafana Cloud, our fully managed observability platform powered by the open source Grafana LGTM Stack: Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics. With GrafanaCON 2025 — and the release of Grafana 12 — earlier this month, there are a ton of Grafana Cloud updates to share.

Observability vs Monitoring: Enhancing, Not Replacing

In the dynamic world of IT operations, a common misconception has emerged: Observability vs Monitoring is often framed as a battle where one replaces the other. At Icinga, where open-source monitoring is our expertise, we aim to clarify this misunderstanding. Observability doesn’t supplant monitoring—it complements and enhances it. The term “Observability” has become a buzzword in the tech industry, often touted as the modern solution to outdated, static monitoring practices.

Introducing Native Mobile Support in Honeycomb for Frontend Observability

You shipped your latest release. You tested it on emulators, QA devices, and the latest OS versions. But now it’s live and running on thousands or millions of mobile devices, across a jungle of screen sizes, hardware specs, OS versions, and network conditions. A user reports a crash on an old Samsung device over 3G. Someone else complains the app feels “sluggish” after updating. You dig through logs. Rebuild test cases. Ping the backend team. Try to reproduce. Yet, still no answers.

Why a No-Index Observability Architecture is Essential

When was the last time you asked about the architecture behind your observability provider? For most IT professionals, whether in development, operations, or security, it’s not a question that naturally comes up. Yet this architectural detail could be the difference between insight at scale and runaway costs. People are drawn to the features, the shiny things. They promise to unlock insight, drive faster response times, and tighten security.

AI's Unrealized Potential: Honeycomb and DORA on Smarter, More Reliable Development with LLMs

Charity Majors, CTO and Co-founder at Honeycomb, and Phillip Carter, Principal Product Manager at Honeycomb, recently hosted a webinar with DORA's Nathen Harvey on AI's unrealized potential. As part of this, we created a 3-minute highlight reel of the webinar that you can watch.

Evaluating Synthetic Monitoring Platforms: What to Look for in 2025

Synthetic monitoring simulates user interactions with applications to proactively identify performance issues before they impact real users. Modern distributed systems require sophisticated monitoring capabilities to effectively test microservices, APIs, and complex user journeys across diverse environments. This article provides a framework to evaluate synthetic monitoring platforms in 2025.

Transforming Observability: Simpler, Smarter, and More Affordable Data Control

At Mezmo, we’ve always believed that observability should empower innovation, not hold it back with complexity and unpredictable costs. However, as organizations scale and data volumes continue to explode, the old ways of managing telemetry data aren’t sustainable.

A Mindset Shift: Making Observability Integral to DevOps Practices: Datev & OpenTelemetry | Grafana

In the evolving landscape of DevOps, observability is no longer optional—it’s a fundamental pillar of success. During this session, Gunter from Datev explores the critical mindset shift required to make observability an integral part of DevOps practices.

Your Observability Platform Has a Blind Spot: Don't Risk Your Operations on Bolt-on Incident Response Modules

Observability platforms want to do it all, from data collection to incident response. Their pitch is appealing: one platform to eliminate context switching and reduce overhead. But when critical systems fail (and they will fail), add-on incident management modules won’t save you. You need an end-to-end system built specifically for high-stakes incident management.

CI/CD Observability Powered by OpenTelemetry

Modern engineering teams spend a lot of time and resources setting up monitoring for their production systems: tracking uptime, catching errors, and responding to incidents before customers ever notice. But what about the journey before code reaches production? For most teams, observing the CI/CD pipeline is either an afterthought or completely overlooked. While we recognize its importance, do we truly understand how well our CI/CD process is functioning?

Understanding Your App's Health With Core Mobile Vitals

Mobile apps are a little different from services run on servers. You build your mobile app, you ship it off to the world, and then it gets run by the end user on their own machine. If your app is running poorly on some percentage of users’ devices, you may never know. That’s where observability comes in. There are certain important metrics that every mobile app has in common.

State of the Observability Databases with Dee Kitchen (Grafana Office Hours #30)

In this Grafana Office Hours, we talk about the state of observability databases (Grafana Loki, Mimir, Tempo, and Pyroscope) and where they're going. We talk about current and upcoming architectural changes in all four, how we're making them more performant, how compatible they are with OpenTelemetry, and what we're working on next for each database. In this conversation are Dee Kitchen (VP of Engineering for Databases) and Senior Developer Advocates Jay Clifford and Nicole van der Hoeven.

Contextual Observability: Using Tagging and Metadata To Unlock Actionable Insights

Observability isn’t about collecting more telemetry — it’s about making that telemetry data meaningful. Contextual observability transforms raw telemetry into actionable insights by enriching it with consistent tagging and metadata. Without context, telemetry data remains fragmented, troubleshooting slows, and aligning with business priorities is nearly impossible.
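
A minimal sketch of the enrichment idea described above, assuming telemetry records are plain dictionaries. The tag names (`service`, `env`, `team`) and the helper `enrich` are hypothetical and not drawn from the article; real pipelines typically do this in a collector or agent rather than in application code.

```python
from typing import Any, Dict


def enrich(record: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with shared context tags filled in.

    Fields already present on the record win over the shared context,
    so a record can override a default tag when it knows better.
    """
    enriched = dict(context)   # start from the shared tags
    enriched.update(record)    # record-level fields take precedence
    return enriched


if __name__ == "__main__":
    # Hypothetical shared metadata attached to everything a service emits.
    context = {"service": "checkout", "env": "prod", "team": "payments"}
    log = {"level": "error", "message": "card declined", "env": "staging"}
    print(enrich(log, context))
```

With consistent tags on every record, a query such as "all errors for team payments in prod" becomes a simple filter instead of a cross-referencing exercise.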

CI/CD Observability Powered by OpenTelemetry and SigNoz

Most teams have strong monitoring for production, but what about the journey before your code gets deployed? The CI/CD pipeline is where bottlenecks, flaky tests, and process gaps silently waste your team’s time. Until now, this part of the workflow has mostly been a black box. We’re excited to announce CI/CD Observability in SigNoz - a new way to track, analyze, and improve your software delivery process, powered by OpenTelemetry.

Unifying OpenTelemetry & Datadog

Previously, teams had to choose between adopting the OpenTelemetry Collector’s capabilities and fully leveraging our advanced features. On This Month in Datadog, we’re spotlighting our OTel Collector distribution, which unifies OTel and Datadog.

Deep Temporal Observability - Correlate Metrics with Logs & Traces

Temporal lets you orchestrate complex, reliable workflows, but when something breaks or slows down, the built-in dashboards only give you a list of events and some basic filters. You can see what happened and filter by attributes like workflow type or namespace, but you can't drill deeper. There's no way to jump straight from a metric spike to the exact trace or log line you care about.

Building a Culture of Observability Through Ownership

There’s a problem in engineering culture that we don’t talk about enough: observability is an afterthought. It’s treated as tooling, not thinking. As a checkbox, not a habit. And that mindset gap creates real consequences: longer outages, frustrated teams and massive business costs. Atlassian’s Incident Management for High-Velocity Teams overview cites a 2014 Gartner study estimating the average cost of IT downtime at $5,600 per minute.

Gotta Go Slow

The last few months have been wild. Some of the busiest of my life, actually. For context: I’m Canadian, and all of this happened during the continued threats of annexation. All this to say, it’s been rough. I anticipated this would be a challenging time and that I would be exhausted. So, the plan became: do all the demanding things, take my sabbatical in May, and use April as an ‘in-between’ period with a bit less pressure.

Splunk Observability Cloud's AI Assistant in Action | Practical Examples | Part 2

In this video, we'll explore practical ways to utilize the AI Assistant in Splunk Observability Cloud. Through real-world scenarios, learn how the AI Assistant can help you interpret metrics, contextualize data, onboard new team members to your organization, and automate tasks via the Splunk Observability Cloud API. AI Assistant in Splunk Observability Cloud enhances observability by providing actionable insights and streamlining workflows.

Establishing SD-WAN Observability to Fuel SASE Success

For today’s enterprises, optimized network connectivity and robust network security are key imperatives. Given that, it makes sense that use of solutions like secure access service edge (SASE) is growing rapidly. In fact, the SASE market is expected to grow to $5.9 billion by 2028. SASE delivers converged network and security capabilities as a cloud-based offering, primarily on an as-a-service basis.

The Complete Guide to Observing RabbitMQ

Message queues quietly power a lot of what happens behind the scenes in distributed systems. RabbitMQ is no exception—when it’s working, you don’t notice it. But when it’s not, things break in ways that are hard to trace. This guide walks through what you need to monitor in RabbitMQ, how to set it up, and how to troubleshoot when things go wrong—so you’re not stuck guessing when messages go missing.

Unleash SaaS Data With the Webhookevent Receiver

There are many vendors, Honeycomb included, where actions on the application can emit a web request that goes to another service for coordination or tracking purposes. Many vendors have pre-built integrations, but some have a fallback that says “Custom Webhook” or similar. If you’re looking to create a full picture of your request flow, you would want these other services to show up in your trace waterfall.

Splunk Observability Cloud's AI Assistant in Action | Practical Examples | Part 1

In this video, we’ll provide practical, real-time examples demonstrating how to effectively use the AI Assistant in Splunk Observability Cloud. You'll learn how the AI Assistant can quickly identify unknown issues in your environment, perform detailed root cause analysis, analyze service performance and deployment impacts, and even help manage infrastructure costs and compliance.

Google's Agent-to-Agent (A2A) Protocol Is Here: Now Let's Make It Observable

Can your AI tools really work together, or are they still stuck in silos? With Google’s new Agent-to-Agent (A2A) protocol, the days of isolated AI agents are numbered. This emerging standard lets specialized agents communicate, delegate, and collaborate—unlocking a new era of modular, scalable AI systems. Here’s how A2A could transform your workflows, and why making it observable is just as important as making it possible.

Observability Best Practices: Balancing Sustainability and Cost in a Data-Driven World

Imagine this: Your IT team has invested in cutting-edge observability tools to keep systems running smoothly. But does that imply you are following observability best practices? As your business grows, so does the flood of logs, traces, and metrics—along with a skyrocketing cloud bill. What started as a way to gain better visibility is now a major expense, and suddenly, you’re asking: Are we paying too much for too little value? This challenge is becoming all too common.

Optimising OpenTelemetry Pipelines to Cut Observability Costs and Data Noise

Fat bills from observability vendors and tons of not-so-insightful telemetry data have turned out to be a very common issue today. This often leaves teams having to explain the lack of clear ROI, despite the growing costs. If you’re using OpenTelemetry to record your observability data, there are some practical methods you can apply to keep those costs from piling up.

We built AI-powered Root Cause Analysis that actually works

Figuring out why things break still sucks. We’ve got all the data: metrics, logs, traces, but getting to the actual root cause still takes way too long. Observability tools show us everything, but they don’t really tell us what’s wrong. So why do we even need to automate root cause analysis? First, time. Outages are expensive. And if your system has hundreds or thousands of services, digging through everything by hand just takes way too long.

SQL Server Observability: Monitoring, Troubleshooting, and Best Practices

For DevOps teams managing mission-critical databases, SQL Server observability is a fundamental capability that provides comprehensive insight into database performance and health. Effective observability practices enable teams to identify potential issues before they impact end users and provide the context necessary to resolve problems efficiently. SQL Server observability involves collecting and analyzing metrics, logs, and traces to build a complete picture of database behavior.

Reporting CSP Errors in Honeycomb With the OpenTelemetry Collector

The HTTP Content-Security-Policy response header is used to control how the browser is allowed to load various content types. It is used to control which URLs, fonts, images, scripts, and more can be loaded onto the page. It’s a great defense against XSS (cross-site scripting), clickjacking, and cross-site vulnerabilities. The header can also specify a URL that will be used to send reports on violations of these properties.
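
To make the header format concrete, here is a minimal sketch that assembles a Content-Security-Policy value with a reporting endpoint. The helper name `build_csp_header`, the example hosts, and the `/csp-reports` path are assumptions for illustration; `report-uri` is the long-standing reporting directive, which newer browsers are replacing with `report-to`.

```python
from typing import Dict, List, Optional


def build_csp_header(directives: Dict[str, List[str]],
                     report_uri: Optional[str] = None) -> str:
    """Join CSP directives into a single header value.

    Each directive is rendered as "name source1 source2" and the
    directives are separated by "; ", as the header syntax requires.
    """
    parts = [f"{name} {' '.join(sources)}" for name, sources in directives.items()]
    if report_uri is not None:
        # Violation reports are sent by the browser to this URL.
        parts.append(f"report-uri {report_uri}")
    return "; ".join(parts)


if __name__ == "__main__":
    policy = build_csp_header(
        {
            "default-src": ["'self'"],
            "img-src": ["'self'", "https://images.example.com"],
        },
        report_uri="/csp-reports",  # hypothetical collection endpoint
    )
    print(f"Content-Security-Policy: {policy}")
```

The resulting value is `default-src 'self'; img-src 'self' https://images.example.com; report-uri /csp-reports`, which a server would send as the `Content-Security-Policy` response header.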

Logz.io Integration for AWS and Kubernetes Observability

Ever feel like you’re flying blind in your AWS environment? You’re not alone. In the sprawling universe of microservices, containers, and serverless functions, trying to troubleshoot without proper observability is like trying to find a bug in a datacenter… with the lights off… while wearing sunglasses.

Cribl Edge: Unify Telemetry Collection | Lightboard Demo

Cribl Edge is a vendor-neutral, intelligent agent designed for the variety and scale of today’s modern architectures. With a unified telemetry collection system, you can have hundreds of thousands of agents at your fingertips to automatically discover and collect data from your Windows, Linux, and Kubernetes environments. Featuring a rich UI, centralized fleet management, and seamless upgrades, it’s time to transform your agent management.

Mission-Critical Visibility: How Observability Empowers the DoD

Tech is entering another wave of innovation with AI. With accelerated innovation comes increased complexity in already disparate environments. For Defense, those complexities are compounded by the need to maintain and operate mission-critical infrastructure with highly sensitive data in air-gapped environments, often running on custom digital systems and applications. Accelerating the speed of innovation with leading technology is key for the military to maintain its competitive edge.

Why does no one talk about querying across signals in observability?

In today’s complex distributed systems, observability has evolved from a nice-to-have feature to a mission-critical engineering discipline. Engineering teams across organizations depend on robust observability to maintain system reliability and quickly diagnose issues when they inevitably arise. However, current observability tooling significantly lags behind user expectations by failing to support a critical capability: querying across different telemetry signals.