
February 2023

Webinar Recap: Taming Data Complexity at Scale

As a Senior Product Manager at Mezmo, I understand the challenges businesses face in managing data complexity and the higher costs that come with it. The explosion of data in the digital age has made it difficult for IT operations teams to control this data and deliver it across teams to serve a range of use cases, from troubleshooting issues in development to responding quickly to security threats and beyond.

CDMs for Enterprise Data: Canonical Data Model Explained

On their own, enterprise applications and systems are not always straightforward. Writ large, they are complex, integrated environments, full of multiple data formats and structures. You spend a great deal of effort and time to define and maintain diverse data models among these integrated components. A Canonical Data Model helps reduce that burden significantly — by promoting a standard and consistent data model between connecting components. This article describes a few things to get you started.
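
A minimal sketch of the idea (ours, not from the article): define one hypothetical canonical Customer record and translate each system's payload into it, so every new integration needs only one mapping to and from the canonical model instead of one mapping per connected component.

```python
from dataclasses import dataclass

@dataclass
class CanonicalCustomer:
    """The one shape every integrated component agrees on."""
    customer_id: str
    full_name: str
    email: str

def from_crm(record: dict) -> CanonicalCustomer:
    # Hypothetical CRM payload: {"id": ..., "first": ..., "last": ..., "mail": ...}
    return CanonicalCustomer(
        customer_id=str(record["id"]),
        full_name=f'{record["first"]} {record["last"]}',
        email=record["mail"],
    )

def from_billing(record: dict) -> CanonicalCustomer:
    # Hypothetical billing payload: {"customerNumber": ..., "name": ..., "emailAddress": ...}
    return CanonicalCustomer(
        customer_id=str(record["customerNumber"]),
        full_name=record["name"],
        email=record["emailAddress"],
    )

print(from_crm({"id": 7, "first": "Ada", "last": "Lovelace", "mail": "ada@example.com"}))
print(from_billing({"customerNumber": 7, "name": "Ada Lovelace", "emailAddress": "ada@example.com"}))
```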

E-Commerce and Log Management

As an e-commerce website owner, you care about how your customers behave: why they come to your website, which items or services they are most interested in, how much time they spend on certain pages, and whether their user experience is up to par. Keeping your website secure is just as important; rest assured, no one wants to leave their payment details on an unsecured site.

Using Cribl Search for Anomaly Detection: Finding Statistical Outliers in Host CPU Busy Percentage

In this video, we'll demonstrate how to use Cribl Search for anomaly detection by finding statistical outliers in host CPU usage. By monitoring the "CPU Busy" metric, we can identify unusual spikes that may indicate malware penetration or high load/limiting conditions on customer-facing hosts. The best part? This simple but powerful analytic is easily adaptable to other metrics, making it a versatile tool for any data-driven organization.
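
The video uses Cribl Search itself, but the statistic behind it is simple to sketch in plain Python: flag any sample whose z-score against the series mean exceeds a threshold. The CPU values and the threshold below are illustrative only.

```python
import statistics

def find_outliers(samples, threshold=2.5):
    """Return (index, value, z-score) for samples far from the mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return []
    return [(i, v, round((v - mean) / stdev, 2))
            for i, v in enumerate(samples)
            if abs(v - mean) / stdev > threshold]

# Hypothetical "CPU Busy" percentages, one per minute; the 97.0 spike gets flagged.
cpu_busy = [12.1, 11.8, 13.0, 12.4, 12.9, 11.5, 97.0, 12.2, 12.6, 11.9]
print(find_outliers(cpu_busy))
```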

Exploring DORA: Why creating a path to resilience maturity is a critical success factor for financial services organisations

DORA (the Digital Operational Resilience Act) recently came into force and will soon impact thousands of financial services organisations across the European Union (EU). In this blog, my colleague Clara Lemaire and I share some insights about the requirements of DORA, as well as how Splunk can support financial services organisations on their resilience journey. Let’s explore DORA!

How to choose and track your security KPIs

There's no denying that Key Performance Indicators (KPIs) can be critical for any security program, and many of us are fully aware of that. Nonetheless, in practice, confusion remains about which security KPIs are crucial to track and how to choose the right ones to measure and improve the robustness of your security program. Here we'll propose a few ideas about how to select and track the right KPIs for your organization.
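
As a toy illustration of tracking a KPI rather than just naming one, the sketch below computes mean time to detect (MTTD) and mean time to respond (MTTR) from hypothetical incident timestamps; the record layout is ours, not a standard.

```python
from datetime import datetime

# Hypothetical incident records: when each incident started, was detected, and was resolved.
incidents = [
    {"started": "2023-02-01T08:00", "detected": "2023-02-01T08:45", "resolved": "2023-02-01T11:00"},
    {"started": "2023-02-07T14:10", "detected": "2023-02-07T14:20", "resolved": "2023-02-07T16:50"},
]

def hours_between(a: str, b: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 3600

mttd = sum(hours_between(i["started"], i["detected"]) for i in incidents) / len(incidents)
mttr = sum(hours_between(i["detected"], i["resolved"]) for i in incidents) / len(incidents)
print(f"MTTD: {mttd:.2f} h, MTTR: {mttr:.2f} h")
```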

How the All in One Worker Group Fits Into the Cribl Stream Reference Architecture

Join Ed Bailey and Eugene Katz as they go into more detail about the Cribl Stream Reference Architecture, designed to help observability admins achieve faster and more valuable stream deployment. In this live stream discussion, Ed and Eugene will explain guidelines for deploying the all-in-one worker group. They will also share different use cases and discuss the pros and cons of using the all-in-one worker group.

Distributed alerting with the Elastic Stack

Modern computing environments and distributed workforces have produced new challenges to traditional information security approaches. Many traditional threat detection and response strategies rely on homogeneous environments, system baselines, and consistent control implementations. These strategies were built on assumptions about the environment that may no longer hold in yours, given the evolution of cloud computing, remote work, and modern workplace culture.

Elastic Synthetics Projects: A Git-friendly way to manage your synthetics monitors in Elastic Observability

Elastic has introduced an entirely new Heartbeat/Synthetics workflow that is superior to the current one. If you’re a current user of the Elastic Uptime app, read on to learn about the improved workflow you can use today and should eventually migrate toward.

FinOps Observability: Monitoring Kubernetes Cost

With the current financial climate, cost reduction is top of mind for everyone. IT is one of the biggest cost centers in organizations, and understanding what drives those costs is critical. Many simply don’t understand the cost of their Kubernetes workloads, or even have observability into basic units of cost. This is where FinOps comes into play, and organizations are beginning to adopt its best-practice standards to understand their costs.
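
A basic unit of cost can be as simple as a price per requested CPU core and per gigabyte of memory. The sketch below, with entirely made-up unit prices and pod requests, shows the kind of per-namespace roll-up a FinOps practice starts from; real implementations pull requests and usage from the Kubernetes API or a metrics store.

```python
from collections import defaultdict

# Made-up unit prices (per hour) and pod resource requests.
CPU_PRICE_PER_CORE_HOUR = 0.031
MEM_PRICE_PER_GB_HOUR = 0.004

pods = [
    {"namespace": "checkout", "cpu_cores": 2.0, "mem_gb": 4.0},
    {"namespace": "checkout", "cpu_cores": 0.5, "mem_gb": 1.0},
    {"namespace": "search",   "cpu_cores": 4.0, "mem_gb": 8.0},
]

cost_per_namespace = defaultdict(float)
for pod in pods:
    hourly = pod["cpu_cores"] * CPU_PRICE_PER_CORE_HOUR + pod["mem_gb"] * MEM_PRICE_PER_GB_HOUR
    cost_per_namespace[pod["namespace"]] += hourly * 24 * 30  # rough monthly cost

for ns, cost in cost_per_namespace.items():
    print(f"{ns}: ~${cost:.2f}/month")
```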

The Power of Combining a Modular Security Data Lake with an XDR

The average cost of a data breach is expected to hit $5 million in 2023. For many organizations, it is a matter of when, not if, a cybersecurity incident will occur. Attackers are becoming more sophisticated and relying on weak links to exploit company applications and infrastructure. Combine this with the fact that the traditional network security perimeter has changed, and all but disappeared, driven by cloud computing and remote work.

Deciding Whether to Buy or Build an Observability Pipeline

In today's digital landscape, organizations rely on software applications to meet the demands of their customers. To ensure the performance and reliability of these applications, observability pipelines play a crucial role. These pipelines gather, process, and analyze real-time data on software system behavior, helping organizations detect and solve issues before they become more significant problems. The result is a data-driven decision-making process that provides a competitive edge.

Fixing Security's Data Problem: Strategies and Solutions with Cribl and CDW

Cribl's Ed Bailey and CDW's Brenden Morgenthaler discuss a foundational issue with many security programs: they lack the right data to detect issues and make fast decisions. Data drives every facet of security, and bad or incomplete data weakens your overall program. Ed and Brenden will discuss common issues and strategies for solving security's data problem.

See how reliability management enhancements expand your SLO value

When we announced the general availability of reliability management in September 2022, you saw how crucial this functionality is for the digital customer experience. Since then, unique insights from users have helped us improve the experience and usability, and we’ve incorporated those improvements into our latest release. Now you can use a wide range of features to help you on your reliability management journey.

Importing your CloudWatch Metrics into Prometheus

CloudWatch is the de facto method of consuming logs and metrics from your AWS infrastructure. The problem is that it is not the de facto method of capturing metrics for your applications. This creates two places where observability data is stored, which can make it difficult to understand the true state of your system. That’s why it has become common to unify all data in one place, and Prometheus offers an open-source, vendor-agnostic solution to that problem.
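
A common pattern is a small exporter that pulls a CloudWatch statistic with boto3 and republishes it as a Prometheus gauge. The sketch below assumes the boto3 and prometheus_client packages, AWS credentials in the environment, and a placeholder instance ID and port.

```python
import time
from datetime import datetime, timedelta

import boto3
from prometheus_client import Gauge, start_http_server

# Placeholder instance ID; swap in your own dimensions and metrics.
INSTANCE_ID = "i-0123456789abcdef0"
cpu_gauge = Gauge("aws_ec2_cpu_utilization_average", "EC2 CPUUtilization from CloudWatch", ["instance_id"])

cloudwatch = boto3.client("cloudwatch")

def scrape_once():
    end = datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=end - timedelta(minutes=10),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    if datapoints:
        cpu_gauge.labels(instance_id=INSTANCE_ID).set(datapoints[-1]["Average"])

if __name__ == "__main__":
    start_http_server(9106)  # Prometheus scrapes this port
    while True:
        scrape_once()
        time.sleep(60)
```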

Introducing the Cribl Stream Reference Architecture

Join Ed Bailey and Eugene Katz as they unveil the first Cribl Stream Reference Architecture, designed to help observability admins achieve faster and more valuable stream deployment. In this live stream discussion, Ed and Eugene will explain the importance of a quality reference architecture in successful software deployment, and guide viewers on how to begin with the Cribl Stream Reference Architecture by first establishing end-state goals. They will also share different use cases and help viewers identify which parts of the reference architecture are applicable to their specific situation.

What is Syslog and how does it work?

When you’re adding or subtracting fractions, you need to make sure that they have a common denominator, a number that allows you to compare values. In the same way, your IT environment needs a common “language” for your event log data. Your environment consists of various devices running different operating systems, software, and firmware.
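
To make the "common language" point concrete, the standard-library sketch below sends application log records to a syslog daemon, assuming one is listening on UDP port 514 (on Linux the address could instead be the "/dev/log" socket).

```python
import logging
import logging.handlers

# Send log records to a syslog daemon listening on UDP 514.
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user login succeeded for id=42")
logger.error("payment gateway timeout after 30s")
```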

Business Resilience: How To Build Resilience Strategically, Tactically & Operationally

The ability to continue business operations for the foreseeable future is a key metric from a financial standpoint. But from a risk management perspective, all dimensions of an organization’s strategic and operational framework must be analyzed in order to… The last part relates to business resilience — and it’s what we’re going to explore here. (This article was written by Joseph Nduhiu. See more of Joseph’s contributions to Splunk Learn.)

How to Create a Dashboard in Kibana

Wondering how to create a dashboard in Kibana to visualize and analyze your log data? In this blog post, we’ll provide a step-by-step explanation of how to create a dashboard in Kibana. You’ll learn how to use Kibana to query indexed application and event log data, filter query results to highlight the most critical and actionable information, build Kibana visualizations using your log data, and incorporate those visualizations into a Kibana dashboard.
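
Dashboards are assembled in the Kibana UI, but every panel is backed by an ordinary Elasticsearch query. As a minimal sketch, assuming the official elasticsearch Python client (8.x), a local cluster, and a hypothetical app-logs index, this is the kind of filtered query a panel might run.

```python
from elasticsearch import Elasticsearch

# Assumes a local cluster and a hypothetical "app-logs" index with "level" and "@timestamp" fields.
es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="app-logs",
    query={
        "bool": {
            "must": [{"match": {"level": "error"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
        }
    },
    size=10,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```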

The Best Graphite Dashboard Examples

Graphite provides time-series metrics in an open-source database. With Graphite dashboards, you can see key performance indicators (KPIs) as well as other metrics visually. Dashboards typically display data as graphs, charts, and tables and can be customized to meet the specific needs of an organization. Using dashboards, organizations can monitor and analyze various aspects of their performance, such as system utilization, application performance, and resource utilization, using web interfaces.

Best Practices for MongoDB Monitoring with Prometheus

The MongoDB document-oriented database is one of the most popular database tools available today. Developed as an open-source project, MongoDB is highly scalable and can be set up in your environment in just a few simple steps. When running and managing databases, monitoring is a key requirement.
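
Most teams run the community mongodb_exporter for this, but the underlying pattern is easy to sketch: read a few counters from MongoDB's serverStatus command and expose them to Prometheus. The snippet assumes the pymongo and prometheus_client packages and a local mongod; a real exporter would model the opcounters as Prometheus counters rather than gauges.

```python
import time

from pymongo import MongoClient
from prometheus_client import Gauge, start_http_server

client = MongoClient("mongodb://localhost:27017")

connections = Gauge("mongodb_connections_current", "Current open connections")
inserts = Gauge("mongodb_opcounters_insert", "Insert operations since startup")

def scrape():
    status = client.admin.command("serverStatus")
    connections.set(status["connections"]["current"])
    inserts.set(status["opcounters"]["insert"])

if __name__ == "__main__":
    start_http_server(9216)  # port is arbitrary; pick one Prometheus can scrape
    while True:
        scrape()
        time.sleep(15)
```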

Webinar Recap: Observability Data Orchestration

Today, businesses are generating more data than ever before. However, with this data explosion comes a new set of challenges, including increased complexity, higher costs, and difficulty extracting value. With this in mind, how can organizations effectively manage this data to extract value and solve the challenges of the modern data stack?

Empowering SecOps Admins: Getting the Most Value from CrowdStrike FDR Data with Cribl Stream

Join Ed Bailey and Sidd Shah as they discuss how Cribl Stream can empower Security Operations Admins to make the most of their CrowdStrike FDR data. During the discussion, Ed and Sidd will address the challenges faced by CrowdStrike customers who generate a vast amount of valuable data each day but struggle to leverage it fully due to complexity and size. They will explain how Cribl Stream can help SecOps admins extract the right data for their SIEM, while moving the rest to their Security Data Lake, enabling them to get the maximum value from their data and be cost-effective at the same time.

10 Best Apache Log Analyzers: Free & Paid Tools [2023 Comparison]

Apache is the second most popular web server, after …., with its roots and official release going back as far as 1995. Throughout the years, it gained features, including HTTP/2, caching, and many more, while retaining its most appreciated capabilities: speed, modularity, and great stability. To fully leverage its features, you need to understand the environment, bottlenecks, traffic, and user behavior; in that respect, Apache is no different from any other software in your infrastructure.

The Best OpenSearch Dashboard Examples

OpenSearch dashboards are a powerful tool for visualising and exploring data stored in an OpenSearch-compatible data store such as Elasticsearch. With OpenSearch's intuitive interface and advanced analytical tools, this visualisation tool makes it easy to gain insights into your data and to monitor and alert on key metrics. Throughout this article, we'll look at some of the most impressive OpenSearch dashboard examples that showcase its capabilities and versatility.

Trace-based testing with Elastic APM and Tracetest

This post was originally published on the Tracetest blog. Want to run trace-based tests with Elastic APM? Today is your lucky day. We're happy to announce that Tracetest now integrates with Elastic Observability APM. Check out this hands-on example of how Tracetest works with Elastic Observability APM and OpenTelemetry! Tracetest is a CNCF project aiming to provide a solution for deep integration and system testing by leveraging the rich data in distributed system traces.

How to "Live Tail" Kubernetes Logs

DevOps engineers wishing to troubleshoot Kubernetes applications can turn to log messages to pinpoint the cause of errors and their impact on the rest of the cluster. When troubleshooting a running application, engineers need real-time access to logs generated across multiple components. Collecting live streaming log data makes that possible; the challenge engineers face is accessing comprehensive, live streams of Kubernetes log data.
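
The most direct way to live-tail is kubectl logs -f; the sketch below simply wraps that command from Python so the stream can be filtered or forwarded elsewhere. The namespace and deployment name are placeholders.

```python
import subprocess

# Placeholders: point these at your own workload.
NAMESPACE = "production"
TARGET = "deployment/checkout-service"

proc = subprocess.Popen(
    ["kubectl", "logs", "-f", TARGET, "-n", NAMESPACE, "--all-containers=true"],
    stdout=subprocess.PIPE,
    text=True,
)

# Follow the stream, surfacing only error lines.
for line in proc.stdout:
    if "error" in line.lower():
        print(line, end="")
```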

Predictions: AI and Automation

Artificial Intelligence (AI) - or more specifically Machine Learning (ML) - and automation were big topics for many of our customers in 2022. Common reasons for the interest in AI and automation were to: increase efficiency, reduce manual processing, minimise human error and - especially for the use of ML - identify ‘unknown unknowns’.

Loki vs Prometheus - Differences, Use Cases, and Alternatives

Loki and Prometheus are both open source tools. While Loki is a log aggregation tool, Prometheus is a metrics monitoring tool. Loki’s design is inspired by Prometheus, but for logs. This blog post compares these two common monitoring tools to help you understand their key differences. Log management and metrics monitoring are both critical aspects of monitoring a software system effectively.

Logging, Traces, and Metrics: What's the difference?

Several tech giants like Amazon and Netflix have moved from their monolithic applications to microservices. This has allowed them to expand their business interfaces tremendously and improve their services. And it's not only them: most businesses today depend on microservices. Twitter currently has about a thousand such services working together to deliver meaningful output.

Beginner's Guide to Prometheus Metrics

Over the past decade, Prometheus has become the most prominent open source monitoring tool in the world, allowing users to quickly and easily collect metrics on their systems and help identify issues in their cloud infrastructure and applications. Prometheus was originally developed by SoundCloud when the company felt their metrics and monitoring solutions weren’t meeting their needs.
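
The beginner-level mechanics look roughly like this: instrument the code with the official prometheus_client package and let Prometheus scrape the exposed endpoint. The metric names and port below are placeholders.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("myapp_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("myapp_request_duration_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request():
    time.sleep(random.uniform(0.01, 0.2))  # simulate work
    status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```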

10 Best Log Management Tools

Logs are imperative for troubleshooting, performance analysis, health monitoring, and application integrity and security. Log management tools provide a clear picture of how users interact with apps and systems, and offer insight into improving software reliability, increasing productivity, reducing risks, and ultimately improving the user experience. Through log management tools, users can further integrate and enrich all of their logs, making queries quicker and more effective.

It's time for government to move beyond monitoring and into observability

When thinking about holistic end-to-end observability, it can help to start with what you already have. Many government agencies are already strategically ingesting and storing logs — a key component of observability. More than a year and a half after the release of M-21-31, US government agencies continue to work through the logging maturity models outlined in the memorandum.

The best Elasticsearch training and support available

Sematext offers professional-level consulting, production support, training, and monitoring tools for your Elasticsearch cluster. With over 10 years of experience in the field, Sematext has worked with some of the largest companies in the world to help optimize their Elasticsearch setups. When you work with Sematext, you get expertise that comes straight from the source.

Why Culture and Architecture Matter with Data, Part I

We are using data wrong. In today’s data-driven world, we have learned to store data. Our data storage capabilities have grown exponentially over the decades, and everyone can now store petabytes of data. Let’s all collectively pat ourselves on the back. We have won the war on storing data! Congratulations!

How Structured, Unstructured & Semi-Structured Data Change Your Data Analytics Practice

Many business organizations begin their data analytics journey with great expectations of discovering hidden insights from data. The concept of unified storage — data lake technologies in the cloud — has gained momentum in recent years, especially with the ever-expanding range of cost-effective cloud-based storage services. Big data is readily available. In fact, 2.5 quintillion (2.5 x 10^18, or 2.5 billion billion) bytes are generated every day!

How we reduced flaky tests using Grafana, Prometheus, Grafana Loki, and Drone CI

Flaky tests are a problem found in almost every codebase. By definition, a flaky test is one that both succeeds and fails without any changes to the code. For example, a flaky test may pass when someone runs it locally, but then fail on continuous integration (CI). Or it may pass on CI, and then fail after someone pushes a commit that hasn’t touched anything related to it.

15 Best Tools to Test and Measure Core Web Vitals [2023 Comparison]

User experience is key to ensuring the success of your website. There are many metrics that help you gauge and improve it, but Core Web Vitals are probably the most important ones. They are a set of real-world, user-centered metrics that quantify key aspects of the user experience. By measuring dimensions of web usability such as load time, interactivity, and the stability of content as it loads, Core Web Vitals help you understand how your website is doing in terms of performance.

Complete Guide on Docker Logs [All access methods included]

Docker logs play a critical role in the management and maintenance of containerized applications. They provide valuable information about the performance and behavior of containers, allowing developers and administrators to troubleshoot issues, monitor resource usage, and optimize application performance. By capturing and analyzing log data, organizations can improve the reliability, security, and efficiency of their containerized environments.
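
Beyond docker logs <container> on the command line, the same data can be pulled programmatically. A minimal sketch using the Docker SDK for Python (the docker package), with a placeholder container name:

```python
import docker

client = docker.from_env()

# Placeholder container name; list running containers with client.containers.list().
container = client.containers.get("my-web-app")

# Print the last 50 lines, then keep following the stream.
for chunk in container.logs(stream=True, follow=True, tail=50):
    print(chunk.decode("utf-8", errors="replace"), end="")
```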

How Can the Right Log Aggregator Help Your Enterprise?

The Internet of Things (IoT) revolution has marked the beginning of a new age of data transfer. Each day, a massive number of new devices get added to all kinds of network infrastructures, transferring gargantuan amounts of data back and forth. In the next decade, the number of IoT devices is expected to grow to a staggering 80 billion – practically outnumbering the human population tenfold.

Guide on Structured Logs [Best Practices included]

Structured logging is the practice of using a consistent log format for your application logs so that they can be easily searched and analyzed. Structured logs allow for more efficient searching, filtering, and aggregation of log data, and make it easy to extract meaningful information from it. Logging is an essential aspect of system administration and monitoring: it lets you record data about the application's activity.
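
A minimal way to get structured logs without extra dependencies is to emit one JSON object per line. The sketch below uses only the Python standard library; the field names are a common convention, not a standard.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields passed via logger.info(..., extra={...}) land on the record.
        for key in ("user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"user_id": 42, "request_id": "abc-123"})
```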

How Security Engineers Use Observability Pipelines

In data management, numerous roles rely on and regularly use telemetry data. The security engineer is one of these roles. Security engineers are the vigilant sentries, working diligently to identify and address vulnerabilities in the software applications and systems we use and enjoy today. Whether it’s by building an entirely new system or applying current best practices to enhance an existing one, security engineers ensure that your systems and data are always protected.

Democratizing Machine Data & Logs: How Infor saves millions by leveraging Sumo Logic's data-tiering

Infor developed a decentralized governance model for managing its vast Sumo landscape: thousands of users, tens of thousands of Collectors, and petabytes of log ingestion. By democratizing log management under that decentralized governance model, we succeeded in doubling our log ingestion year-over-year while reducing our log ingestion cost by more than 50%.

Is Kubernetes Monitoring Flawed?

Kubernetes has come a long way, but the current state of Kubernetes open source monitoring is in need of improvement. This is partly due to the unnecessary volume of data that monitoring generates. For example, a 3-node Kubernetes cluster with Prometheus will ship around 40,000 active series by default. Do we really need all that data?

Connecting OpenTelemetry to AWS Fargate

OpenTelemetry is an open-source observability framework that provides a vendor-neutral and language-agnostic way to collect and analyze telemetry data. This tutorial will show you how to integrate OpenTelemetry with Amazon AWS Fargate, a container orchestration service that allows you to run and scale containerized applications without managing the underlying infrastructure.
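
The tutorial covers the Fargate side; on the application side, the setup typically looks like the sketch below, which uses the OpenTelemetry Python SDK to export spans over OTLP to a collector running as a sidecar in the same task (the endpoint is a placeholder, and the opentelemetry-sdk and opentelemetry-exporter-otlp packages are assumed).

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Placeholder endpoint: an OpenTelemetry Collector sidecar in the same Fargate task.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-order"):
    # application work happens here
    pass
```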

Root cause log analysis with Elastic Observability and machine learning

With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time-consuming given the tremendous volumes being generated. Traditional methods of alerting and simple pattern matching (visual inspection, simple searching, etc.) are not sufficient for IT operations teams and SREs.

A Snapshot of our IT Ops Predictions for 2023

Today executives and customers expect IT and digital services to be available and performant at all times; compromised availability or performance is no longer tolerable. Think about it; when was the last time a digital service was unavailable and it didn’t make the news or social media? When was the last time you visited a website that was unavailable and you waited for the outage to be over, rather than finding an alternative in the moment?

Communicating Context Across Splunk Products With Splunk Observability Events

When an IT or security issue impacts a development team's software, how are they notified? Is your organization still relying on mass emails that lack context and that most engineers have probably already filtered out of their inboxes? Communicating between siloed tools and teams can be difficult. How would you like to put IT, security, legacy-process, and business notifications specific to development teams right into one of their most important tools? Now you can!

Two sides of the same coin: Uniting testing and monitoring with Synthetic Monitoring

Historically, software development and SRE have worked in silos with different cultural perspectives and priorities. The goal of DevOps is to establish common and complementary practices across software development and operations. Sadly, in some organizations true collaboration is rare and we still have a way to go to build effective DevOps partnerships.

Apache Tomcat Logging Configuration: How to View and Analyze Log Files

Apache Tomcat is a Java web server that implements many Java web features, such as Java Servlets and Java Server Pages (JSP). It's open-source software widely used in the industry. Tomcat sits in front of your application code and is the entry point for reaching it. It is crucial to monitor its performance, make sure everything works, get notified when unexpected errors occur, and take action in real time.
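
One small but useful piece of that monitoring is parsing Tomcat's access log, which uses the Common Log Format by default. A sketch with a made-up sample line, assuming the default AccessLogValve pattern:

```python
import re

# Matches Tomcat's default AccessLogValve pattern: %h %l %u %t "%r" %s %b
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

# Made-up sample line.
sample = '10.0.0.7 - - [15/Feb/2023:10:12:43 +0000] "GET /api/orders HTTP/1.1" 500 1024'

match = LOG_PATTERN.match(sample)
if match:
    fields = match.groupdict()
    if fields["status"].startswith("5"):
        print(f"server error: {fields['request']} from {fields['host']}")
```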

Cyber Resilience: The Key to Security in an Unpredictable World

Join Ed Bailey and Jackie McGuire as they delve into the topic of cyber resilience and its growing significance in today's digital landscape. In this informative video, you will learn what cyber resilience means, why it's important, and how to manage and improve it in an increasingly unpredictable world. With cyber threats becoming more sophisticated and frequent, cyber resilience has become a critical aspect of protecting personal and business assets. This discussion is perfect for anyone looking to better understand the importance of cyber resilience and how to safeguard against potential threats.

Optimizing VPC Flow Logs - Part 2

As cloud deployments scale, Amazon Web Services (AWS) VPC flow logs become an invaluable network visibility and security tool. They are also one of the most voluminous classes of data, making them an expensive choice to add to analytics platforms. With growing infrastructure and traffic, managing these logs presents significant challenges. In part 1 of this series, we took a look at common use cases and problems associated with storing and processing VPC Flow Logs.
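
For readers who haven't looked at the raw records: a default-format flow log line is space-delimited, with the version 2 fields in AWS's documented order. The parsing sketch and sample line below are ours.

```python
# Default (version 2) VPC flow log fields, in documented order.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log(line: str) -> dict:
    return dict(zip(FIELDS, line.split()))

# Made-up sample record: a rejected SSH attempt.
sample = "2 123456789012 eni-0a1b2c3d 203.0.113.12 10.0.1.5 44321 22 6 10 840 1676457600 1676457660 REJECT OK"

record = parse_flow_log(sample)
if record["action"] == "REJECT":
    print(f"rejected: {record['srcaddr']} -> {record['dstaddr']}:{record['dstport']}")
```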