Operations | Monitoring | ITSM | DevOps | Cloud

Sponsored Post

How to Reduce Continuous Monitoring Costs

Continuous monitoring is a crucial practice in the fields of DevOps, cybersecurity, and compliance. It involves the proactive and ongoing process of observing, assessing, and collecting data from various systems, applications, and infrastructure components in real time or near real time. Continuous monitoring is closely related to observability, which goes beyond simple monitoring to provide a deep understanding of complex and dynamic systems.

How Data Ingestion Works in Elasticsearch (Quick Guide)

Before you can search, analyze, or visualize anything in Elasticsearch, you need data ingestion. In this quick guide, we explain how data moves from raw logs, metrics, or JSON into an index using tools like Logstash, Beats, or language clients. Learn why consistency matters more than perfection and how, once data is ingested, it’s ready for search, analysis, and insight.

Set up Splunk AI Assistant for SPL in Enterprise environments with Cloud Connected Integration

Unlock the power of the Splunk AI Assistant for SPL in your enterprise environment! In this quick tutorial, we'll walk you through the entire process, from downloading the app on Splunkbase, accepting the license agreement, and installing it in your environment, to completing the cloud-connected configuration which now allows you to use the AI Assistant in even more environments!

What Are Mappings in Elasticsearch? (Explained Simply)

Elasticsearch mappings turn logs from unstructured text into usable data. In this video, we explain what mappings are, how they define fields like text, number, and date, and why they matter. With the right mappings, Elasticsearch can filter error codes, sort by response time, and group results by browser, region, or version.
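To make the idea of a mapping concrete, here is a minimal sketch of what a mapping body might look like, written as a Python dictionary and serialized to JSON. The field names (`message`, `status_code`, `response_time_ms`, `browser`, `@timestamp`) are assumptions chosen to match the examples in the blurb, not fields taken from the video; the field types (`text`, `integer`, `float`, `keyword`, `date`) are standard Elasticsearch types.

```python
import json

# Hypothetical mapping for a web access log index.
# Field names are illustrative assumptions; the types are standard Elasticsearch field types.
access_log_mapping = {
    "mappings": {
        "properties": {
            "message": {"type": "text"},            # full-text searchable log line
            "status_code": {"type": "integer"},     # lets you filter on error codes
            "response_time_ms": {"type": "float"},  # lets you sort by response time
            "browser": {"type": "keyword"},         # exact values, suitable for grouping
            "@timestamp": {"type": "date"},         # enables time-range queries
        }
    }
}

# This JSON is the kind of body you would send when creating an index.
mapping_json = json.dumps(access_log_mapping, indent=2)
print(mapping_json)
```

The split between `text` and `keyword` is the main design choice here: `text` fields are analyzed for full-text search, while `keyword` fields keep exact values for filtering and grouping.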

Understanding Incident Response vs Incident Remediation

At a high level, incident remediation is a part of the incident response process. An incident response plan manages the incident lifecycle across planning, detection, investigation, and recovery. Meanwhile, incident remediation focuses on identifying root causes and implementing measures to prevent future occurrences.

OpenTelemetry Deep Dive: Resilience & High Availability in the OTel Collector

Missed it live? Catch the full recording of OpenTelemetry Deep Dive: Resilience & High Availability in the OTel Collector — a 1-hour workshop on building telemetry pipelines that never drop a signal. We’ll show you why resilience matters, how to design high-availability architectures, and how to configure the OpenTelemetry Collector with retries, batching, and persistent queues. Plus, you’ll see live demos in both Docker and Kubernetes — including scaling Gateway collectors with an HPA — and how Bindplane makes large-scale management seamless.

Tech Talk - Mastering Data Pipelines Unlocking value with Splunk

Join this Tech Talk to learn how Splunk can help you unlock the value of your security and observability data by building an effective data management strategy. Understand how Splunk’s approach to federated data management can help you maximize the value of data. Build effective pipelines using our latest SPL2-powered data processing capabilities to collect, transform, and route data based on your business needs. Run effective searches on data in Amazon S3 without having to ingest or index data into Splunk.

Tech Talk - Aligning Observability Costs with Business Value Practical Strategies

Learn how to tackle the challenges of growing telemetry data and optimize your observability model to maximize value while minimizing costs. This session will explore strategies to reduce log ingestion, centralize pipeline management, and gain visibility into metric usage to identify waste.

The business impact of Elasticsearch logsdb index mode and TSDS

The Elasticsearch storage engine team has made significant strides in improving storage efficiency and performance in Elasticsearch 8.19 and 9.1. Now that these changes are available, what impact can they have on your business? And how do you make the most of them?

Tech Talk - Holistic Visibility and Effective Alerting Across IT and OT Assets

Join this Tech Talk to learn how to gain complete visibility into all hosts and their potential vulnerabilities, misconfigurations, and unpatched components in a single analytics platform, how adding Tenable asset and exposure risk context improves alert prioritization, and how joint customers use Splunk for centralized reporting.

How Elasticsearch Works: Documents, JSON & Index Explained

Ever wondered how Elasticsearch can search any kind of data? In this video, we break it down with a simple deck of cards analogy that makes indexing easy to understand. Each card is like a JSON document with fields and values: suit, color, number, type. Combine them and you’ve built an index, giving Elasticsearch the power to answer queries like “show me all the red cards” or “show me only the face cards.” If you can describe it, you can index it, and if you can index it, you can search it.

Visualize Logs Alongside Metrics: Complete Observability for Slow PostgreSQL Queries

When latency creeps into your app, metrics tell you that performance regressed, but logs tell you why. PostgreSQL’s slow-query logging gives you the exact statement, duration, user, and database, which is perfect for hunting down missing indexes, inefficient filters, or N+1 patterns.

Caddy Webserver Data in Graylog

If you’re running Caddy Webserver on Ubuntu, Graylog now has a new way to make your access logs more actionable without tedious parsing or manual setup. The new Caddy Webserver Content Pack, available in Illuminate 6.4 with a Graylog Enterprise or Graylog Security license, delivers ready-to-use parsing rules, streams, and dashboards so you can quickly turn raw logs into structured, searchable insights.

Raising the bar in observability and security: Coralogix extensions at scale

In today’s high-velocity digital ecosystem, visibility isn’t enough. SREs and engineering leaders need real-time insights, actionable signals, and automated workflows to operate at scale. As systems grow more distributed and cloud-native, the demand for intelligent observability and security has never been higher. Extensions are solutions that deliver instant observability with prepackaged parsing rules, alerts, dashboards, and more.

Elasticsearch Explained for Beginners: From Spreadsheets to JSON, Indices & Shards

Ever wondered how Elasticsearch actually works? In this quick breakdown, I’ll use a simple spreadsheet analogy to explain the basics, from documents and indices to shards, CRUD operations, and mappings. You’ll see how Elasticsearch stores data as JSON documents, splits indices into shards for scalability, uses CRUD with ID hashing for fast lookups, and applies mappings to organize text, numbers, and labels.
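As a rough illustration of why ID hashing makes lookups fast, the sketch below routes each document to a shard by hashing its ID and taking the remainder modulo the shard count. It is a toy model under stated assumptions: `pick_shard`, `index_doc`, and `get_doc` are hypothetical names, the shards are plain in-memory dictionaries, and CRC32 is used only as a convenient standard-library stand-in. Elasticsearch's real hash function and routing formula differ in detail.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # assumed shard count for this toy example

# Each "shard" is just an in-memory dict mapping document ID -> document.
shards = {n: {} for n in range(NUM_PRIMARY_SHARDS)}

def pick_shard(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Map a document ID to a shard number deterministically."""
    # CRC32 is a stand-in hash (assumption), not the hash Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def index_doc(doc_id: str, doc: dict) -> None:
    """Create or overwrite a document on the shard its ID hashes to."""
    shards[pick_shard(doc_id)][doc_id] = doc

def get_doc(doc_id: str):
    """Read a document by ID, touching only the one shard that can hold it."""
    return shards[pick_shard(doc_id)].get(doc_id)

index_doc("order-1001", {"status": "shipped", "total": 42.5})
print(get_doc("order-1001"))
```

Because the same ID always hashes to the same shard, a read by ID never has to scan the other shards. It also suggests why the number of primary shards is not something you can change casually: a different modulus would send existing IDs to different shards.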

How Tipalti mastered Elasticsearch performance with AutoOps

From manual monitoring to proactive optimization, learn how Tipalti used AutoOps to save 10% in annual costs. For a global payables automation leader like Tipalti, where financial transactions are the lifeblood of the business, infrastructure performance isn't just a technical goal; it's a core business requirement. Managing a complex ecosystem of databases, including Postgres, SQL Server, MongoDB, Kafka, and Elasticsearch, with a lean team of four engineers demands efficiency and powerful tooling.

APM Logs: How to Get Started for Faster Debugging

When application performance monitoring detects a spike in latency or error rates, the immediate challenge is determining the underlying cause. APM logs address this by correlating performance metrics with the specific log events that occurred at the same time. Instead of switching between monitoring dashboards and manually searching through log files, APM log correlation consolidates both views.

What Is Vector Search? Difference Between Vector & Semantic Search Explained [Quick Question Ep. 5]

What is vector search? In this breakdown, learn how vector search leverages machine learning to capture the meaning and context of unstructured data by transforming it into a numeric representation that is stored in a vector database. This video also explains the difference between sparse and dense embeddings, and how vector search differs from semantic search and lexical search.

The Smartest Member of Your Developer Ecosystem: Introducing the Mezmo MCP Server

Building a great developer experience is about more than just the code. It’s about creating a unified ecosystem where your tools work together seamlessly. That’s been the vision behind our work on the Mezmo MCP Server, and I’m excited to share it with you. At its core, the MCP Server is a universal remote for your data pipeline.

Fix It Fast: Tips, Tricks & Tools for Sumo Logic Success -- Customer Brown Bag -- August 21st, 2025

Led by Sumo Logic experts Andrei and Austin, this session dives into troubleshooting dashboards, silent failure scenarios, and missing collector data—helping your team spot blind spots, catch incidents you never knew you missed, close visibility gaps, and ensure dashboards reflect the full picture for faster resolution.

How to go from ingestion to insights in 10 minutes

When assessing SaaS observability solutions, customers often explore features that are built into the platform, but there is a whole collection of deployable libraries across all SaaS vendors. In Coralogix, we lead the way in deployable assets, with 4400+ alerts, dashboards, parsing rules, metric generation rules, and more. But why should you care about these deployable assets, and why do they accelerate insight generation so profoundly?

Log Files Explained: Types, Uses, and Best Practices for IT Teams

Every system in your environment—cloud, on-prem, or hybrid—generates log files. They capture everything from user actions to system failures, security events, and performance issues. But with so many log types and so much raw data, it’s easy to get buried in noise and miss what matters.

Supercharge your Android app

In today’s technological landscape, mobile applications are on the rise, boosting efficiency, portability, and accessibility in daily life across a spectrum of industries, from financial services to food delivery. As mobile apps become more essential, the quality of their features, performance, and user experience is critical.

Nginx Logs & Performance Monitoring with Loki and Telegraf | MetricFire

When a web service slows down or errors spike, metrics can tell you what changed (active connections rise, error rate increases), but the root cause can sometimes be found in your logs (which IPs are hammering POST endpoints, 4XX/5XX occurrences). Put the two together and you get the full observability picture: time-series metric trends to spot incidents, and line-level details to fix them fast.

Kafka Performance Crisis: How We Scaled OpenTelemetry Log Ingestion by 150%

When your telemetry pipeline starts falling behind, the countdown to production impact has already begun. One Bindplane customer operating a large-scale log ingestion pipeline built on the OpenTelemetry Collector and Kafka hit that breaking point. Instead of keeping pace with incoming data, their pipeline was ingesting just 12,000 events per second (EPS) per partition/collector—and this Kafka topic had 16 partitions. In aggregate, that was roughly 192K EPS.
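The aggregate figure follows directly from the per-partition rate and the partition count. A small sketch of that arithmetic, using only the two numbers quoted above (the function name `aggregate_eps` is a hypothetical helper, and the calculation assumes every partition ingests at the same rate):

```python
def aggregate_eps(eps_per_partition: int, partitions: int) -> int:
    """Total events per second, assuming every partition ingests at the same rate."""
    return eps_per_partition * partitions

# Figures quoted in the post: 12,000 EPS per partition/collector, 16 partitions.
total = aggregate_eps(12_000, 16)
print(f"{total:,} EPS")
```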

Logs & Search slowing you down? Simplify and accelerate with Aiven for OpenSearch

For many growing businesses, data infrastructure grows and evolves organically. This often results in teams running one technology for log analytics, like a self-managed ELK cluster, and a completely separate technology for application search. While functional, these disparate tech stacks begin to eat into the bottom line. Businesses grapple with fragmented skill sets, inconsistent security models, multiple vendors, and a constant operational tax.

The Observability Problem Isn't Data Volume Anymore-It's Context

For years, the observability industry has been obsessed with one thing: data volume. We've built incredible pipelines, optimized agents, and scaled storage to handle petabytes of logs, metrics, and traces. The promise was simple: collect more data, get more visibility. But we've hit a wall.

Elastic Powers GitHub's Seamless Developer Experience

David Tippet, Search Engineer at GitHub, shares how Elastic powers GitHub’s massive search platform and enables a seamless developer experience. He explains how GitHub balances AI-driven semantic search with traditional keyword search, ensuring accuracy for millions of diverse users, from engineers to security researchers.

How to Effectively Monitor Kubernetes in 2025

As Kubernetes environments continue to grow in scale and complexity, having a robust monitoring strategy is no longer just good practice; it’s essential for survival. For engineering teams in 2025, effective monitoring and observability are the bedrock of performance, reliability, and cost control. This guide dives into the critical aspects of modern Kubernetes monitoring, from key metrics to the top tools and frameworks and the rising role of AI in managing these complex systems.

Visualize Logs Alongside Metrics: A Complete Guide for Monitoring Slow MySQL Queries

When a service slows down, metrics will tell you that it’s happening, but logs tell you why. For MySQL, slow queries can be a silent performance killer, gradually chewing through resources until users start complaining. By enabling MySQL’s slow query log and forwarding it to Loki (via Promtail), you can visualize query-level details right alongside your metrics on Grafana dashboards. This makes it easy to correlate what is slow (metrics) with what is causing the slowdown (logs).

How to Adjust Semantic and Lexical Search Weights in Elasticsearch

In this session, we’ll show you how hybrid search using Elastic lets you assign weights to different search types, for example, giving semantic search three times more influence than lexical search. This lets you fine-tune the balance between precise keyword matching and broader, context-aware results.
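The weighting idea can be sketched as a simple weighted sum of two per-document scores. This is a conceptual illustration only, not Elastic's API: `hybrid_score` and `rank` are hypothetical helpers, the document IDs and scores are made up, and both scores are assumed to be already normalized to the 0 to 1 range.

```python
def hybrid_score(semantic: float, lexical: float,
                 semantic_weight: float = 3.0, lexical_weight: float = 1.0) -> float:
    """Weighted sum of two scores that are assumed to be normalized to [0, 1]."""
    return semantic_weight * semantic + lexical_weight * lexical

def rank(docs: dict, semantic_weight: float = 3.0, lexical_weight: float = 1.0) -> list:
    """Return document IDs ordered from highest to lowest combined score."""
    return sorted(
        docs,
        key=lambda doc_id: hybrid_score(*docs[doc_id], semantic_weight, lexical_weight),
        reverse=True,
    )

# Made-up (semantic, lexical) scores for three hypothetical documents.
docs = {
    "doc-a": (0.9, 0.1),  # strong semantic match, weak keyword match
    "doc-b": (0.3, 0.9),  # weak semantic match, strong keyword match
    "doc-c": (0.6, 0.6),  # moderate on both
}

print(rank(docs))                                          # semantic counts 3x
print(rank(docs, semantic_weight=1.0, lexical_weight=3.0))  # lexical counts 3x
```

Swapping the weights flips the ordering of `doc-a` and `doc-b`, which is the kind of tuning between keyword precision and context-aware recall that the session describes.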

How Elastic Powers Search in Real-Time (Explained in 52 Seconds)

Ever wondered how Wikipedia loads answers instantly? Or how your Uber updates in real time? That’s Elasticsearch working behind the scenes. In this video, I break down how Elastic powers lightning-fast, scalable search for complex data, from ride requests to stock prices.

Inside the Coralogix AI Center: Solving AI's Silent Failure Crisis

Observability has always answered one core question: Is it running? But in the era of LLMs, autonomous agents, and AI-powered workflows, that’s no longer enough. We need to ask a harder, scarier question: Is it right? And right now, most teams can’t answer that. Let’s fix it. In our last post, “The AI Monitoring Crisis No One’s Talking About,” we outlined why prompt injection, hallucinations, and context drift create invisible failures.

What Is an MCP Server?

OK, MCP servers. If you’ve been following AI development lately, you’ve probably heard whispers about “MCP Servers” floating around developer circles. The protocol has been around a little while now, and I myself have finally gotten round to using it. Boy, do we need to talk about it. MCP (Model Context Protocol) is Anthropic’s open standard that lets AI assistants connect directly to your tools and data sources, not just static documentation or code snippets.

Introducing the Coralogix Transactions processor

Coralogix Transactions are a trace segmentation strategy, unique to the Coralogix platform. They allow users to analyze the performance, over time, of a collection of related spans, across billions of traces. Coralogix has introduced a transactions processor into the OpenTelemetry contrib image, enabling users to activate this unique feature using nothing more than OpenTelemetry configuration.

REST easy with REST Packs

The countdown to CriblCon 25 is on and we’re giving you an exclusive first look at the expert insights, innovative solutions, and success stories you’ll see on the big stage. REST collector configuration can be painful, requiring navigating to multiple screens and importing multiple configuration files, but it’s about to get a lot easier. Join Cribl experts to preview how easily you can install and build new packs with new enhancements.

What is Data Logging? A Complete Guide to Process and Practical Uses

Log management encompasses the practice of gathering, organizing, archiving, and maintaining access to logs. As devices and services multiply, so does the data they emit. It demands structured systems that can ingest logs in various formats, sort them for clarity, and retain them based on policies set by security, audit, or technical teams. In this blog, you’ll better understand data logging, its use cases, challenges, as well as trends.

LogRocket - The Ultimate Toolkit for Front-End Insight and Performance

When you need to get beyond surface-level metrics and see what users are actually going through when using your web application, LogRocket provides a potent set of tools. It was built with designers, developers, marketers, ecommerce managers, and website owners in mind. In a nutshell, it combines session replay, error tracking, product-level analytics, and AI-driven insight all in one place.

Elastic wins 2025 Google Cloud DORA Award for Architecting for the Future with AI

Applying DORA principles to improve software delivery and operational performance with Google Cloud. We’re thrilled to announce that Elastic has been honored with the 2025 Google Cloud DORA Award for Architecting for the Future with AI. Google Cloud DORA awards recognize organizations that have demonstrated significant advancements by applying DORA principles to improve their software delivery and operational performance with Google Cloud.

How ELSER Transforms One Keyword into Better Search Results

In this session, we’ll show you how Elastic's ELSER takes a single token like “Terminator” and expands it into semantically related terms such as software, alien, computer technology, and Connor (for John Connor). This makes search results more relevant, even when the exact keyword isn’t used.

How to Monitor NVIDIA GPU Metrics with Cribl Edge & Stream (Complete Tutorial)

If you’re running AI, ML, or data-intensive workloads on GPUs, monitoring their performance is critical. Overheating, under-utilization, or memory bottlenecks can cost you thousands in cloud bills and potential downtime. This guide walks you through collecting real-time GPU telemetry using nvidia-smi, sending it to Cribl Edge, routing it through Cribl Stream, and using Cribl Search to analyze the data—step by step.

What Is Semantic Intent? Interpreting User Intent in AI Search [Quick Question Ep. 4]

What is semantic intent, and why is it crucial in the age of AI search? In this episode of Quick Question, we break down how semantic intent interprets the meaning behind your query in semantic search. About Elastic: Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform, the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Beyond the Pipeline: Data Isn't Oil, It's Power.

Originally published on Medium, this piece by Winston Hearn dives into a philosophical discussion on why the "data is oil" metaphor is no longer serving the tech industry. Hearn argues that by reframing our thinking to "data is power," we can better understand and manage today's complex data systems. For more than a decade, we in the tech industry have referenced a common metaphor: data is the new oil. It’s a concept that’s easy to grasp.

What Is Log Monitoring (and Why IT Teams Are Shifting to Log Intelligence)

Your infrastructure isn’t confined to a single location anymore. It’s spread across clouds, containers, and on-prem systems, and every layer is spitting out logs: access attempts, performance spikes, error codes, config changes. That data is invaluable if you can find the signal in the noise. But with millions of logs flying by every day, that’s easier said than done.

RUM measurements: Start with the data, discover the story

When something breaks in your application, a slow page, a spike in errors, or a drop in engagement, the typical response is to chase the symptoms. But what if we flipped that process? What if we started not from user complaints, but from actual performance measurements, collected from real sessions in real time? That’s exactly the idea behind Coralogix RUM Measurements.

How To Use Alloy and Hosted Graphite's Loki to Store and Visualize Logs

In a modern DevOps environment, having just metrics or just logs is like trying to navigate with half a map because you’re missing important context that makes decisions faster and smarter. Metrics tell you what is happening (CPU spikes, request rates, failed logins) but logs tell you why it’s happening, with the timestamps to prove it.

Introducing Logz.io Open 360 AI: The Next Generation of Observability Is Here

Traditional observability tools can’t keep up with modern complexity. Dashboard and alert-based approaches still rely heavily on manual processes, resulting in longer troubleshooting cycles, slower decisions, and higher MTTR. Engineering teams need something better. Today we’re launching Open 360 AI, the first observability platform designed for both humans and AI agents working together.

VictoriaLogs Practical Ingestion Guide for Message, Time and Streams

This VictoriaLogs article serves as a quick way to grasp the core concepts of VictoriaLogs. It covers only the most important information from the documentation, along with common cases identified after troubleshooting many real-world scenarios. If you’re just getting started with VictoriaLogs, this is a great place to begin. For more in-depth or advanced details, refer to the official documentation.

Visualizing Logs Alongside Metrics: A Practical Use Case

Security threats aren’t always loud and don’t always crash systems or trigger alarms. Sometimes they creep in quietly as a steady stream of unauthorized login attempts, slow brute-force probes, or unknown IPs scanning your server for vulnerabilities. These behaviors often show up in logs before they surface in metrics, but if you're only watching logs or only tracking metrics, you're missing part of the story.

AI-driven alert triage and root cause analysis (RCA) that proactively responds to production alerts

Watch AI transform alert management in real-time. This technical demonstration compares manual alert investigation with AI alert investigation. It shows how AI agents automatically investigate production alerts, correlate telemetry across distributed systems, and identify root cause, faster and with more insights than manual processes. Watch and learn how to shift your team from reactive firefighting to proactive system reliability management with agentic AI.

Introducing the Coralogix SLO Center

Are you struggling to define reliability targets? Teams nowadays are turning to Service Level Objectives (SLOs), reliability targets that can be used to define how much you can play around with your systems before users are affected too much. While they're a great way of defining reliability targets, they are difficult to manage. That's why we built the SLO Center. One place to define, track, zoom into, and stay on top of all your reliability targets and error budgets - so you can be sure when you can experiment, and when it's best to stay safe.

Manual vs. AI-Driven Alert Triage and RCA: Who Will Win?

Curious to see how AI actually performs in a real-world production scenario? Watch the webinar “AI-Driven Alert Triage and RCA” with Logz.io Customer Success Engineer, Seth King. Below, we also share the main highlights of the webinar. AI claims to make engineers more efficient and agile by shortening processes and surfacing insights that help drive decisions.

Coralogix SLO Center & SLO Alerts are now available

Coralogix has released a new flagship service management product, the SLO Center. The SLO Center allows customers to define service level objectives (SLOs) for their teams. SLOs can be defined across multiple services or metric streams. Powered by the Coralogix Streama engine, this unlocks full coverage SLOs for every team, regardless of volume and with very high cardinality limits.

Leaning into AI, ML, and observability to manage your ever-growing infrastructure

The complexity and scale of modern infrastructure require an equally intelligent set of observability tools to effectively monitor it. Remember when scaling meant ordering new servers and racking them in a data center? Remember when cloud providers first offered access to seemingly infinite virtual machines at the click of a button? Remember when Kubernetes made it trivial for infrastructure to automatically scale itself based on demand?

Resilience with Zero Data Loss in High-Volume Telemetry Pipelines with OpenTelemetry and Bindplane

This was the problem one Bindplane customer had with processing enormous S3-stored log files. Our engineering team tackled the problem head-on, enhancing the S3 event receiver with offset tracking and chaos testing methodologies.

Semantic Search Explained: Search with intent [Quick Question Ep. 3]

In this video, I’ll explain what semantic search is and how it’s different from traditional keyword search. I’ll walk you through the limitations of lexical search, what we mean by semantic intent, and how vector search plays a role under the hood.

Coralogix becomes first observability vendor to earn ISO/IEC 42001:2023 certification for responsible AI

We’re proud to announce that Coralogix is now officially ISO/IEC 42001:2023 certified, becoming the first observability vendor to achieve this globally recognized standard for responsible AI management. ISO/IEC 42001:2023 is the world’s first international standard for Artificial Intelligence Management Systems (AIMS). It provides a comprehensive framework for how organizations should govern AI, focusing on transparency, ethical use, accountability, and regulatory compliance.

The Platform Engineer's Playbook: Mastering OpenTelemetry & Compliance with Mezmo and Dynatrace

The rise of platform engineering has put a new team at the center of the developer experience. These teams are tasked with building the "paved road" for developers, which includes providing a robust, self-service observability stack. However, they face a dual mandate: provide a great developer experience and manage the ever-growing costs and complexity of the tools involved.

Introducing Cribl Guard

Does sensitive data flowing through your network feel like a ticking time bomb? Well, it just might be. Legal mandates, security frameworks, and customer expectations have made the stakes higher than ever. One leaked spreadsheet of personally identifiable information (PII) can wipe out years of customer trust, rack up regulatory fines, and invite ransomware actors to your doorstep.

SLF4J and Log4j - Understanding the Differences

Good logging isn’t optional when building Java applications—it’s critical. Logs are often the first place we turn to when something breaks and are essential for performance tuning, security audits, and long-term maintainability. Two names come up in the Java logging conversation: Simple Logging Facade for Java (SLF4J) and Log for Java (Log4j). They sound similar and often work together, but they serve distinct roles.