Operations | Monitoring | ITSM | DevOps | Cloud

Sponsored Post

What is a Real-Time Data Lake?

A data lake is a centralized data repository where structured, semi-structured, and unstructured data from a variety of sources can be stored in their raw format. Data lakes help eliminate data silos by acting as a single landing zone for data from multiple sources. But what's the difference between a traditional data lake and a real-time data lake? Some traditional data lakes use batch processing, which involves processing and analyzing a collection of data that has been stored over a specific timeframe. For example, payroll and billing systems that are handled on a weekly or monthly basis might use batch processing.

Your Data is Whispering and Needs a Human to Listen

If you have ever owned, operated, or supported a piece of technology, you have probably built a dashboard. Maybe it started as a quick chart to answer a simple question, then quietly grew into something more important. Dashboards are often created by the people who know the systems best, the ones who can wire together data sources and click all the right buttons. But those same builders are rarely trained in how humans actually interpret data.

Talk to Your Logs: LLM-Powered Chat UI in DSDL 5.2.3

We are excited to announce the release of the Splunk App for Data Science and Deep Learning (DSDL) version 5.2.3. Since 2018, DSDL has served as an innovation hub for custom AI integrations within Splunk. In 2025, the release of DSDL 5.2.0 introduced customizable Large Language Model (LLM) integrations, bringing Retrieval Augmented Generation (RAG) and Agentic AI workflows to Splunk users.

AI can do what now?! What an ethical hacker says about deepfakes and AI

Real-time camera deepfakes are no longer science fiction. High-fidelity, AI-generated impersonation may be advancing quickly — but that's not the only AI risk financial services companies should be thinking about. In this episode of AI Can Do What Now?!, Lisa Jones-Huff, director of security solutions architecture at Elastic, sits down with ethical hacker Freakyclown (FC) to explore what is technically possible today with AI, where reality still falls short of the hype, and what security teams should be worried about.

AI can do what now?! The real risks of AI in social engineering

What is the most immediate risk financial services companies face today? AI-enabled social engineering is already accelerating real-world attacks. Scale, personalization, speed, and automation are lowering the barrier for attackers while making fraud detection more complex for defenders. In this episode of AI Can Do What Now?!, Lisa Jones-Huff, director of security solutions architecture at Elastic, is joined by ethical hacker Freakyclown (FC), and principle solutions architect Joe Murin to explore what is actually happening right now — beyond the hype.

Build a Unified Operational Ecosystem with ServiceNow and Coralogix

During high-priority incidents, SRE teams frequently lose critical time switching between monitoring platforms and ticketing systems. Context switching like this forces engineers to manually update incident states by copying and pasting data. The inevitable result is increased risk of information gaps and slower Mean Time to Recovery (MTTR).

Powering Security Innovation: Executive Q&A on Splunk Joining AWS Security Hub Extended

To succeed in the AI era, customers need fast, easy access to security solutions that can harness the power of agentic AI and deliver business outcomes. They need seamless access to their data for faster threat detection, simpler incident response, and reduced risk. They need technology vendors to work together and not in silos.

Colsubsidio transforms business process monitoring with Elastic Observability

Colsubsidio is one of the largest and most representative family compensation funds in Colombia. The organization manages and delivers essential social services to millions of users through a broad network spanning health, education, subsidies, recreation, tourism, credit, housing, pharmacies, retail supply, culture, and labor welfare.

Cut Costs, Not Visibility. Use S3 for Low-Cost Log Retention and Faster Response.

Why pay for continuous ingestion of data you rarely use? Learn how to maintain a lean data strategy by keeping long-term logs in cheap S3 storage, while retaining the power to "promote" specific slices into Splunk whenever an audit or investigation arises. See how Promote for Amazon S3 gives you the speed of local indexing without sacrificing speed in investigations.

Catch Every Moment in Kubernetes: Splunk's Observability Advantage

Discover why real-time, unsampled observability is critical for Kubernetes environments with Stephane Estevez from Splunk at KubeCon Europe 2026. Learn how Splunk’s unique approach helps you catch every important moment—even when containers vanish in milliseconds. Watch now for expert insights on cloud-native monitoring, observability, and Kubernetes best practices!

Claude Code + OpenTelemetry: Per-Session Cost and Token Tracking

I was looking at our Claude Code spend in the Anthropic console the other day. Aggregate cost, aggregate tokens — no breakdown by developer, no breakdown by session. I knew my Hackathon team had been using it heavily on building out new features for the OpenTelemetry Distro Builder. But heavily how? I had no idea. Turns out Claude Code has been emitting OpenTelemetry signals the whole time. Per-session cost, token counts, every tool call it makes on your codebase.

Log Management for your Homelab with Fluentd and Aiven for PostgreSQL

Tired of not knowing why your home lab containers are crashing? In this video, we’ll walk through how to set up a consolidated logging pipeline using Aiven for PostgreSQL (with TimescaleDB), Fluentd and Grafana — all running on my upgraded Mac Mini home lab. The key insight: never keep your troubleshooting tools on the same machine that's giving you problems!

Reinventing the Incident Responder's Day: Empowering Tier 2 SOC Analysts with Splunk's Agentic SOC Platform

The Tier 2 SOC Analyst or the Incident Responder (often hailed as the "Sherlock Holmes of the network") faces an increasingly complex and relentless digital landscape. In a world where analysts are being overwhelmed by alerts, held back by fragmented, manual tooling and inefficient workflows, incident responders are charged with the critical task of identifying, analyzing, and mitigating security threats.

Bindplane + VictoriaMetrics: Unified Telemetry for Metrics, Traces, and Logs at Scale

We’re excited to announce new native Bindplane destinations for the VictoriaMetrics ecosystem. It’s now easier to collect, process, and route OpenTelemetry metrics, traces, and logs at scale. You can directly connect VictoriaMetrics’ high-performance storage engines to Bindplane’s vendor-neutral, OpenTelemetry-native telemetry pipeline.

From Alerts to Answers: Introducing Coralogix Cases

Modern incident response doesn’t fail due to a lack of alerts firing. It fails because teams are overwhelmed by the sheer volume and the lack of context around them. Today, most observability and monitoring platforms generate a flood of alerts. Each one is triggered independently, even when they are symptoms of the same issue. Engineers are left trying to reconstruct the full picture while jumping between dashboards, Slack messages, and tickets.

A 4-Month Bug Fixed in <10 Minutes with Olly

In today’s highly interconnected systems, the subtle relationships between services are rarely obvious. Modern, complex architectures generate telemetry that functions less as “flashing signs” and more as faint “breadcrumbs” to be followed across a vast network of signals. In 2025, about two-thirds of outages involved third-party systems like cloud platforms and APIs.

The limits of MCP and how Olly surpasses them

Model Context Protocol (MCP) servers act as adapter layers between clients and AI based workloads. MCP installation into an IDE, such as Cursor, brings a wealth of information directly into the developers primary tool, minimizing context switching and, especially in the world of observability, bringing telemetry closer to the code. MCP is not without its limits. These limits initially seem trivial, but in time, some of the inherent limitations to a basic MCP implementation become apparent.

How Coralogix's Data Pipeline Turns Obscure Data into Clear Business Value

Observability data arrives as a flood of signals, full of potential, but rarely consistent. Error messages and debug logs can reveal what businesses care about: reliability, customer experience, and revenue. The challenge is turning raw technical events into information the whole organization can act on. Many observability systems store data first and structure it later, forcing teams to rebuild context in dashboards and queries, often duplicating logic across services.

Bindplane Blueprints for Elasticsearch: Production-Ready NGINX Log Pipelines for Kibana

We've just released new and easy-to-use Bindplane blueprints designed specifically for Elasticsearch as a destination. These blueprints empower teams to quickly transform raw events such as those from NGINX access and error logs into clean, structured, and ECS-compliant data optimized for high-performance visualization in Kibana.

The 2025 Wake-Up Call for Engineering Teams

For years, organizations tried to solve operational pain by collecting more data, adding more dashboards, and consolidating more tools. But 2025 exposed a deeper mismatch. Systems had become more distributed, AI-assisted, and interdependent than ever before, while teams had shrunk and on-call pressure had intensified. This wasn’t a tooling failure. It was an architectural and cognitive one.

What is OpenTelemetry and Why Do Organizations Use it?

Mining for information about environments is like trying to find gold. Looking for gold can be sifting through silty waters or blasting through a mine. In some cases, the gold nuggets are so small as to be almost invisible, some things look like gold but aren’t, and others are larger nuggets where the miner strikes it rich. Trying to understand how a distributed system works means sifting through vast amounts of telemetry, looking for patterns.

Turn Raw Data into Reliability by Changing Performance Perspectives

In a global microservices architecture, technical performance initially presents as a chaotic stream of disconnected telemetry. For a Technical Program Manager (TPM), success depends on the ability to move past these disconnected individual data points to identify stable patterns. If they have services entering critical states, looking at individual logs or traces is inefficient. Protecting system reliability requires an engine that can automate pattern recognition at scale.

OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It

OpenTelemetry almost always works beautifully in staging, demos, and videos. You enable auto-instrumentation, spans appear, metrics flow, the collector starts, and dashboards light up. Everything looks clean and predictable. However, production has a way of humbling even the most carefully prepared setups. When real traffic hits, and it always spikes sooner or later, you start seeing dropped spans.

How a Singleton Pattern Broke Our Django Logging

With modern tooling and agentic coding assistants, straightforward bugs are almost a relief. If a test can catch it, or a user can reproduce it, chances are you can squash it quickly. The harder category — and the one worth writing about — are the bugs where everything looks correct. Your code runs, no exceptions are thrown, your debug statements confirm the right functions fire at the right times, and yet nothing works.

What is the Model Context Protocol (MCP)

The Iron Man’s J.A.R.V.I.S. is the artificial intelligence (AI) that almost every person wants to see. A conversational technology that answers questions like a friend would. The rise of large language models (LLMs) almost seems to give people the friendly robotic sidekick that generations of children grew up dreaming about.

Unlocking business resilience with full-stack observability in hybrid IT environments

For CIOs and technology leaders across the Gulf Cooperation Council (GCC), full-stack observability is a strategic lever for achieving faster ROI, operational resilience, and digital maturity. By integrating AI-powered insights and automation, IT leaders can streamline operations and align technology outcomes with business goals. Demonstrating ROI within tight timelines is critical, as is leveraging observability to maintain competitive advantage in a rapidly evolving market.

Bindplane | Blueprints for ClickHouse: Optimize Telemetry Before It Hits ClickStack

Chelsea from the Customer Success team walks through the Bindplane Blueprints for ClickHouse guide — showing how to optimize logs, metrics, and traces before they land in ClickStack. You’ll see how to: ClickHouse is powerful. But raw telemetry at scale gets expensive fast. Bindplane acts as the control plane for your OpenTelemetry infrastructure. Blueprints let you apply production-ready processing logic instantly without YAML sprawl or config drift.

Introducing "Explain Flame Graph": Stop Fighting Fires and Start Explaining Them

In a modern observability deployment, it’s simple to get data that helps you understand where your system is failing. However, when we try to understand why, the answer is often buried beneath a mound of stack traces. For many developers, attempting to interpret a flame graph by manually calculating self-time (the resources consumed by the function itself) versus child-frame latency (the time spent waiting on called sub-functions) is both confusing and time-consuming.

Troubleshooting Microservices with OpenTelemetry Distributed Tracing

Distributed tracing doesn’t just show you what happened. It shows you why things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or the missing cache hit that turned a 50ms call into a 3-second database roundtrip.

AI observability: The backbone of mission resilience in the public sector

Downtime cost the public sector $193 million last year — and the financial hit is only the beginning. Beyond the numbers, downtime in the public sector can also lead to severe consequences for citizens: interrupted access to critical online services, delayed benefits, and stalled emergency response. When citizens cannot rely on government services, downtime becomes more than an inconvenience; it becomes a matter of trust. More than uptime, resilience is the new success metric for modern government.

Troubleshooting & RCA with Olly

If troubleshooting still feels harder than it should, check on these two numbers: how many dashboards you have, and how many alerts fire every day. For most teams, it’s hundreds of dashboards and thousands of alerts, a sign of maturity, coverage, and good intentions. On the other hand, we also see that when something actually breaks, that coverage rarely turns into clarity fast enough.

Splunk Attack Range v5 Demo

The Splunk Attack Range is an open source project that lets security teams spin up instrumented cloud environments, simulate adversary behavior, and use the generated telemetry to build and test detections in Splunk. Whether you are a detection engineer tuning rules, a purple team validating coverage, or a developer automating tests, Attack Range gives you a repeatable, cloud-based lab. This post highlights what Attack Range does, how it works, and how to get started - whether you prefer a web UI, a REST API, or the command line.

Will humans be replaced by AI? The truth

Agentic AI doesn’t replace analysts, it augments them. The real value comes from making teams more efficient, not smaller. This is the perspective most people miss. Additional Resources: About Elastic Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Agent vs Assistant: The key distinction between Olly and the competition

The market is saturated with agents and assistants, making it difficult to tell them apart. However, the difference between these two approaches is significant. They offer radically distinct levels of impact, reflecting major differences in both their technical complexity and the quality of their inferences. Let’s figure out the distinction.

OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival

A lot of talk around OpenTelemetry has to do with instrumentation, especially auto-instrumentation, about OTel being vendor neutral, being open and a defacto standard. But how you use the final output of OTel is what makes business difference. In other words, how do you use it to make your life as an SRE/DevOps/biz person easier? How do you have to set things up to truly solve production issues faster?

What Agentic AI Is Really Made Of (Most People Miss This)

Agentic AI isn’t just an LLM. Without the right context, it gives generic answers. This is the component that makes its decisions actually useful. Additional Resources: About Elastic Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Exploring Splunk Alternatives [2026]: Deep Dive into Log Analysis

Splunk isn't bad software. It's genuinely powerful. But in 2026, a lot of engineering teams are asking a fair question: are we getting $300K worth of value out of this? More often than not, the answer is no. We went through 15 alternatives - read the docs, tested where we could, and talked to engineers who made the switch. This is what we found.

What problem is agentic AI trying to solve?

Agentic AI isn’t limited to security operations. It’s already improving hospitals, financial systems, and service industries by reducing overload and filling skill gaps. Here’s the problem it was actually built to solve. Additional Resources: About Elastic Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

VictoriaLogs in VictoriaMetrics Cloud: Fast, Cost-Effective Log Management is Here

Yes, you got it right: VictoriaLogs is now Generally Available in VictoriaMetrics Cloud! We believe that this is a huge milestone in our journey to deliver what our users are expecting from us: a complete, managed observability solution. If you’ve been following our quarterly updates, you know we’ve been after this launch for a while. In our latest update a few weeks ago we already announced that we were ready and today we’re making it official.

What is agentic AI? (explained in 60 seconds)

Agentic AI is the next evolution of artificial intelligence. Unlike traditional AI, it can act autonomously and make decisions on its own. Here’s what that actually means, without the hype. Additional Resources: About Elastic Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Beyond a Billion Spans: Using Highlights for High-Speed Root Cause Analysis at Scale

In late 2025, we introduced Trace Highlight Comparison. This capability was designed to solve the problem of having too many spans. This causes technical and financial challenges when identifying performance patterns within high-volume telemetry streams. The goal is to avoid massive indexing costs and eliminate the ingestion latency associated with indexing every record. However, knowing these trends is only half the battle.

What you missed at OTel Unplugged 2026 in 8 minutes!

OTel Unplugged 2026 was different by design. Held alongside FOSDEM in Brussels, this was an unconference built by the OpenTelemetry community, for the community. No sales pitches. No product demos. Just honest conversations about what’s working, what’s broken, and where OTel needs to go next. In this recap, you’ll hear short interviews and reflections from engineers, maintainers, and practitioners on.

The Human-Centric Stack: Why Logs Are the Great Equalizer in the Age of AI

In 2026, we are seeing incredible feats of engineering with agentic AI, impacting metrics and distributed traces that map thousands of microservices. Our systems have never been more intelligent and complex. However, as our observability becomes more intelligent, fewer employees know how to manage and troubleshoot complex systems. These employees, who often bear the brunt of an error’s impact, may need to rely on specialists to interpret the system.

Observability trends for 2026 (Part 2): GenAI and OpenTelemetry reshape the landscape

Over the course of my 20 years as a developer, SRE, and now observability product leader, software has typically progressed at a good pace. But now, the emergence of two transformative technologies are fundamentally reshaping enterprise observability: generative AI (GenAI) and OpenTelemetry (OTel). We surveyed over 500 IT decision-makers for a new report:The Landscape of Observability in 2026: Balancing Cost and Innovation.

ISO 27K Without the Bloat: An Open Source Approach

It’s often framed as an enterprise-only exercise: long timelines, expensive tooling, consultants everywhere, and a lot of compliance work that exists mainly to survive an audit. As a ~40-person, engineering-driven SaaS company, we needed the same level of trust and rigor as much larger organizations — but we weren’t willing to accept shelfware, parallel compliance infrastructure, or controls that only exist on paper. We also didn’t stop at ISO 27001.

The Grok-to-AI Evolution: Why Modern SREs Are Moving Beyond Manual Parsing

Grok structures logs. Context engineering connects systems. AI explains behavior. For years, Grok patterns have been the workhorse of the SRE world. Built on regular expressions, Grok helps teams extract structure from unstructured logs. As we explored in "Do You Grok It?", Grok is the key to turning messy log lines into usable fields. It's why our Grok Pattern Reference remains one of our most-visited resources — SREs are hungry for structure.

OpenTelemetry Instrumentation Best Practices for Microservices Observability

OpenTelemetry instrumentation is the foundation of modern microservices observability, but getting it right in production requires more than just enabling auto-instrumentation. This guide covers production-tested OpenTelemetry best practices that help engineering teams achieve reliable distributed tracing, control observability costs, and extract maximum value from their telemetry data.

How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation

This guide shows you how to implement OpenTelemetry’s auto-instrumentation for complete distributed tracing across your microservices, from initial setup through production optimization and troubleshooting.

Tool Consolidation Is Dead. Long Live Agentic AI.

It’s 2026, and developers have more tools at their disposal than at any point in the industry’s history: CI/CD platforms are richer; observability stacks are deeper; security, data, and AI tooling have exploded into crowded, competitive ecosystems. And yet, delivery is still slow, incidents are still noisy, workflows are still brittle. The problem is no longer tool scarcity or feature depth. It’s integration debt.

How does Coralogix go beyond basic migration?

When a team, division or organization is assessing a new vendor, there are some basic questions that must be answered. At Coralogix, we look at migrations in a different way. It isn’t about transporting the current state of play into a new vendor, often called a “lift and shift”. These are the basics. There is a whole new level of onboarding and support that doesn’t just replicate value across platforms – it expands it.

Elastic 9.3: Chat with your data, build custom AI agents, automate everything

Today, we are pleased to announce the general availability of Elastic 9.3 as the latest version of the Elasticsearch Platform — the world’s most popular open source platform for transforming both structured and unstructured data into trusted answers and outcomes. In addition to including new features that help developers with context engineering and agent building, Elastic 9.3 introduces a broad set of new capabilities to Elastic Search & AI, Elastic Observability, and Elastic Security.

How to Build AIPowered Search with Elasticsearch [2 Min Live Demo]

In this demo, we show how Elasticsearch enables production‑ready GenAI and AI‑powered search applications—from indexing and embedding your data to grounding large language models with RAG. You’ll see how developers can go from raw data to a fully functional GenAI search experience—fast Additional Resources.

Observability vs Monitoring: Getting a Full Picture of the Environment

Driving down the highway, you usually glance intermittently at your speedometer to ensure that you stay within the speed limit, or whatever window above the speed limit you’re willing to drive. While monitoring your speed mitigates the risk of a ticket, you still need to look out for various threats on the road, like cars going through stop signs. By observing your surroundings, you take in real-time information that can help prevent a crash.