June 2023

Empowering Observability Engineers: Using Mezmo to Overcome Critical Challenges

The dynamic nature of the IT landscape poses complex challenges for organizations, necessitating the involvement of observability engineers. These skilled professionals have become indispensable in addressing critical pain points and optimizing system performance. In this blog post, we delve into the challenges observability engineers face and showcase how Mezmo's comprehensive telemetry solution empowers them to overcome these hurdles and achieve optimal results.

Q&A: The importance of an end-to-end observability platform

Pressure is mounting for organizations to deliver exceptional customer and employee experiences, rapid event resolution, and better business outcomes. Observability will be a key strategy play to achieve end-to-end visibility of complex tech estates, says Pablo Stern, senior vice president of technology workflows at ServiceNow. In an interview at Knowledge 2023, Stern talked about how an observability platform can encompass both DevOps workflows and business processes across the enterprise.

Streamlining Observability: The Journey Towards Query Language Standardization

One of the most captivating discussions I had at KubeCon Europe 2023 in Amsterdam was about standardization of a query language for observability. This query language standard aims to provide a unified way of querying observability data across logs, metrics, traces, and other relevant signals. The conversation shed light on the pressing need for a standardized approach to overcome the challenges posed by the plethora of query languages currently in use.

Lightrun Attendance at FinOps X 2023: Unveiling Key Insights, Highlights and Takeaways from the Show

This week Lightrun attended the annual FinOps X event. The event was sold out and packed with great speakers, practitioners, and an amazing atmosphere. Compared to last year, which had over 300 attendees, this year the event brought over 1200! Above is a screenshot taken from the venue entrance reminding the audience of the core principles of FinOps.

Lightrun's Product Updates - Q2 2023

During the second quarter of this year, Lightrun continued to produce a wealth of developer productivity solutions and enhancements, aiming for better troubleshooting of distributed workload applications, reduction of MTTR for complex issues, and cost optimization within cloud computing. Read more below about the main new features as well as the key product enhancements that were released in Q2 of 2023!

Azure Virtual WAN Observability with Kentik Cloud

Unravel the complexities of managing your corporate network in the cloud with Kentik Cloud. This video highlights how Kentik Cloud provides a comprehensive, always up-to-date visualization of your hybrid Azure infrastructure, including your Virtual WAN configuration. Learn how to quickly and easily navigate through your infrastructure's architectural blueprint, delve into the performance of specific VWAN hubs, and access vital utilization details. See how Kentik Cloud can help turn tedious troubleshooting into an efficient, user-friendly process. Experience firsthand the power and convenience of having crucial network insights right at your fingertips.

Putting the Network in Observability

With the accelerating use of DevOps and cloud-native infrastructure, observability is all the rage. Organizations, large and small, are doing their best to make sense of the logs, metrics and traces generated by their applications to identify performance and availability issues. But what about the network? It seems that many organizations forgot that network telemetry has always been the foundation of any monitoring initiative relating to performance, security, or availability. In this Techstrong Learning Experience, Techstrong Research GM Mike Rothman is joined by Phil Gervasi and Rosalind Whitley from Kentik to discuss how network observability adds depth and context to any APM or security analysis environment. Mike also highlights data from a recent network observability survey done by Techstrong Research. In this learning experience, you’ll learn.

Leveraging Calico flow logs for enhanced observability

In my previous blog post, I discussed how transitioning from legacy monolithic applications to microservices-based applications running on Kubernetes brings a range of benefits, but that it also increases the application’s attack surface. I zoomed in on creating security policies to harden the distributed microservice application, but another key challenge this transition brings is observing and monitoring workload communication along with known and unknown security gaps.

The Future of Logz.io: Simple, Cost-effective Observability

Asaf and I founded Logz.io in 2015 to provide developers with the ultimate open source log management experience. With our product, logging with the ELK Stack was simple, efficient, and automated for the first time – so customers could save engineering costs and accelerate MTTR.

How to Trial Honeycomb and OpenTelemetry

Insightful proof-of-concepts with a tool can be difficult to undertake due to the demands on valuable resources: time, energy, and people. With a task as grand as observability, how could one truly test if Honeycomb and OpenTelemetry are right for their organization and meet their requirements? For this thought experiment, here’s a comprehensive description of the ideal product evaluation over the course of four weeks, given unlimited resources.

How AIOps Revolutionizes Observability for TechOps Teams

Managing over 1000 services and applications is daunting for any organization’s IT and Tech operations team. With a diverse mix of on-premises legacy systems and modern cloud stacks, the sheer volume of activity can overwhelm even the most skilled ITOps teams. The task is made more difficult by the fact that observability is fragmented. On average, organizations depend on 21 systems that produce metrics, logs, traces, and alerts for various services.

Centralized Observability: What, Why, and How?

Centralized Observability may not be a buzzword, but its practicality and importance can’t be denied. Let’s see why that is. As DevOps and IT teams recognize the importance of Observability, it becomes a critical component to monitor the stack and ensure data reliability. That being said, enterprises are rapidly embracing modern data stacks to harness the power of data. Therefore, a host of platforms require data observability as a tool for reliable and trustworthy data management.

Harnessing an observability solution to gain valuable insights into business operations

In my previous articles, I discussed design considerations for observability solutions and how observability can augment your security implementation. In this article, I will discuss how an observability solution can provide valuable insights into your business operations through the data collected from various systems, applications, and services.

There Are No Repeat Incidents

People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with nearly identical failure modes, impacting the same components, and with no significant action items, as repeat incidents. However, when we look at the responses and their variations, we can find key distinctions that show the incidents as related, but not identical.

Streamlining Data Management for Enterprise Security | SpyCloud

In this customer story, Ryan Sanders, lead security engineer at SpyCloud, shares his experience using Cribl to centralize and store data for account takeover protection and online fraud prevention. Ryan discusses the challenges he faced in managing data across multiple platforms and the solutions Cribl provided. Cribl acts as the Swiss Army knife for observability engineers, empowering them to collect data from various sources and perform custom integrations.

Modernizing your APM solution with a unified observability platform

DevOps teams can get complete visibility into application performance and the various components of a web application using application performance monitoring (APM). APM insights depend on how well you can observe your application stack, using comprehensive metrics, distributed traces, and detailed logs. Join our webinar to learn how Site24x7's APM Insight offers in-depth visibility to troubleshoot performance issues and provide reliability to your app users around the world.

VictoriaMetrics bolsters move from monitoring to observability with VictoriaLogs release

Today we’re happy to announce our new open source, scalable logging solution, VictoriaLogs, which helps users and enterprises expand their current monitoring of applications into a more strategic ‘state of all systems’ enterprise-wide observability. Many existing logging solutions on the market today offer IT professionals a limited window into live operations of databases and clusters.

Using the Elastic Agent to monitor Amazon ECS and AWS Fargate with Elastic Observability

AWS Fargate is a serverless pay-as-you-go engine used for Amazon Elastic Container Service (ECS) to run Docker containers without having to manage servers or clusters. The goal of Fargate is to containerize your application and specify the OS, CPU and memory, networking, and IAM policies needed for launch. Additionally, AWS Fargate can be used with Elastic Kubernetes Service (EKS) in a similar manner.

What Observability-Driven Development Is Not

At Honeycomb, we are all about observability. In the past, we have proposed observability-driven development as a way to maximize your observability and supercharge your development process. But I have a problem with the terminology, and it is: I don’t want observability to drive your development.

Why Observability is Better with a Storage-less Architecture

In today’s data-driven world, the need for comprehensive observability has never been greater. Organizations rely on observability to gain insights into their systems’ and applications’ performance, availability, and behavior. However, the traditional approach to observability, which involves ingesting, processing, and storing massive amounts of data, is becoming increasingly challenging and expensive.

Cloud Native Application Observability - Trace-Logs Correlation

There is a brand new feature for Cloud Native Application Observability (formerly known as AppDynamics Cloud) that will reduce the effort it takes to resolve performance issues within business transactions. We are improving modern application troubleshooting by aligning traces that are performing sub-optimally with their associated logs so one can effortlessly discover the root cause. Watch how we quickly move from poor-performing business transactions, through their associated traces and spans, to the logs relevant to fixing performance issues, never having to switch tools or context.

OpenTelemetry Security: How To Keep Telemetry Data Safe

Organizations implementing observability in their digital services architecture should be familiar with the OpenTelemetry (OTEL) framework. While our OTEL guide provides an in-depth examination of the benefits of this open-source framework, the potential security challenges with OpenTelemetry warrant a separate guide.

Business Observability: Everything Fintech Companies Want to Know

Fintech companies operate in a complex technological and regulatory environment. They rely heavily on cloud-native technologies and microservices architectures to handle financial transactions and data, often at a massive scale. To maximize application reliability, fintech companies need full visibility into their software systems and applications. An agile monitoring solution like observability is crucial to improving performance and user experience.

The 2023 Observability Market Map - Key Trends, Players, and Directions

Cribl has a unique position right in the middle of the observability market, giving us a distinct view of all things security, APM, and log analysis. Observability as a concept has exploded into specialized areas over the past two years, and making sense of the players and market forces, particularly in a difficult macro environment, can be tricky. Let’s break it down.

How To Perform Dynamic Code Instrumentation in a Python Application

Code instrumentation is an essential practice in modern software development. Not only does it aid in debugging, it ultimately impacts the MTTR (Mean Time to Resolve) for software running in production. With changing software architectures and deployment patterns over the years, approaches to code instrumentation have also undergone a significant shift.

Stile Education's Best-of-Breed Observability Strategy

"One of the best things we’ve gotten out of ChaosSearch is the ability to keep all of our data in S3. It’s cheap and easy to keep all of our data available and indexed. We can search through it at any time to dig deeper into problems that crop up." Learn more about how the Stile's team can now retain log data indefinitely, versus saving only a week or two of data in Elasticsearch. That change has increased the team’s capacity to use log data to solve business problems, and unlocked new opportunities to discover deeper product insights.

Improving LLMs in Production With Observability

Quickly: if you’re interested in observability for LLMs, we’d love to talk to you! And now for our regularly scheduled content: In early May, we released the first version of our new natural language querying interface, Query Assistant. We also talked a lot about the hard stuff we encountered when building and releasing this feature to all Honeycomb customers. But what we didn’t talk about was how we know how our use of an LLM is doing in production!

Fundamentals of Searching Observability Data: Understanding the Search Process Can Save Time, Complexity, and Money!

On June 28th I will be hosting a webinar, ‘The Fundamentals of Searching Observability Data’. So why should you attend? Because things have changed, and will continue to change, in the way we manage the IT data collected across the enterprise. A recent study shows that enterprises create over 64 zettabytes (ZB) of data, and that number is growing at a 27 percent compound annual growth rate (CAGR). The scary part?

The Top 5 Trends on SRE Leaders' Minds in 2023: Insights from a Seasoned Executive

I've spent most of my career trying to solve big problems for people. In the early days at New Relic, we were trying to help people scale their systems without compromising on performance, cost, or the customer experience. Not an easy feat, but we gave them a solution that allowed them to accomplish their goals. The key was religiously listening to our customers talk about their wants, needs, hopes and fears. While I am rarely the smartest person in the room, which my partner rarely misses a chance to lovingly remind me, I always do my best to listen to what the brilliant folks in my sphere are talking about.

Simplifying log data management: Harness the power of flexible routing with Elastic

In Elasticsearch 8.8, we’re introducing the reroute processor in technical preview that makes it possible to send documents, such as logs, to different data streams, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the data stream naming scheme. While optimized for data streams, the reroute processor also works with classic indices.

Dynamic Observability Tools for API Live Debugging

Application Programming Interfaces (APIs) are a crucial building block in modern software development, allowing applications to communicate with each other and share data consistently. APIs are used to exchange data inside and between organizations, and the widespread adoption of microservices and asynchronous patterns boosted API adoption inside the application itself.

Ep. 3: Who's Watching Your Cloud? Featuring Bill Mulligan

In this episode of the Cloud Control Podcast, Shon dives into the thrilling universe of eBPF with expert Bill Mulligan. Explore eBPF's transformative impact on cloud computing, development environments, and security. Discover its usage from tech giants like Facebook to everyday Android devices. Venture into the open-source journey of Cilium, an eBPF-based project with diverse applications such as multicluster networking and observability.

Getting Started with Honeycomb Buildevents and GitHub Actions

Buildevents is a small binary used to help instrument builds to generate trace telemetry. It populates the trace with metadata from the GitHub Actions environment so you have details about what occurred throughout the entire build. In this tutorial, learn how to instrument with Buildevents and GitHub actions.

How Our Love of Dogfooding Led to a Full-Scale Kubernetes Migration

The benefits of going cloud-native are far reaching: faster scaling, increased flexibility, and reduced infrastructure costs. According to Gartner®, “by 2027, more than 90% of global organizations will be running containerized applications in production, which is a significant increase from fewer than 40% in 2021.” Yet, while the adoption of containers and Kubernetes is growing, it comes with increased operational complexity, especially around monitoring and visibility.

The Rise of Open Standards in Observability: Highlights from KubeCon

Today’s IT systems are ever more fragmented. It is commonplace to see polyglot systems, written in multiple programming languages, and using a plethora of tools and cloud services as infrastructure building blocks, whether data stores, web proxy or other functions. In this dynamic cloud-native realm, open standards and open specifications have become integral drivers of compatibility, collaboration, and convergence – the Three C’s of Open Standards, if you will.

Understanding Multi Cloud Observability

IT, DevOps, and security teams are figuring out the best ways to manage their complex, ever-growing, ever-changing environments. And one contributing factor to all the complexity is the rise of using multiple cloud services. One cloud service to manage is difficult enough, but adding more to the mix — each with its own interface and set of tools — makes everyone’s job significantly more difficult.

Don't Let Observability Inflate Your Cloud Costs

We saw a shift this year in how the technology sector honed in on sustainability from a cost perspective, looking in particular at where companies spend their revenue in the infrastructure and tooling space. Observability tooling comes under a lot of scrutiny because it’s perceived as a large cost center, and one that could be cut without affecting revenue. After all, if the business hasn’t had a problem in the last few months, we mustn’t need monitoring, right?

How Honeycomb Monitors Kubernetes

While Kubernetes comes with a number of benefits, it’s yet another piece of infrastructure that needs to be managed. Here, I’ll talk about three interesting ways that Honeycomb uses Honeycomb to get insight into our Kubernetes clusters. It’s worth calling out that we at Honeycomb use Amazon EKS to manage the control plane of our cluster, so this document will focus on monitoring Kubernetes as a consumer of a managed service.

Serverless observability, monitoring, and debugging - Overview and best practices

Serverless, as you may already know, is a cloud computing model where the cloud provider dynamically manages and allocates resources to execute code, without the need for server provisioning or infrastructure management by the developer. This article gives an overview of serverless observability, monitoring, and debugging, based on distributed tracing and OpenTelemetry (OTel).

Broadcom Recognized as Outperformer in the 2023 GigaOm Radar Report for Cloud Observability

We are excited to share that the AIOps and Observability solution from Broadcom has earned a leader position for platform play and maturity in the GigaOm Radar Report for Cloud Observability, 2023. This report reviewed solutions from 20 vendors on 13 criteria, across areas such as innovation, understanding of emerging trends, solution capabilities and features, and deployment models.

How FireHydrant Implemented Honeycomb to Streamline Their Migration to Kubernetes

Kubernetes is the gold standard for container orchestration at scale. While massive global companies like Google, Spotify, and Pinterest rely on Kubernetes to run their software in production, so do many small but mighty developer teams. (Full disclosure: Honeycomb joined the Kubernetes brigade last year, when we migrated some of our services.)

My Perspective on CloudFabrix Collaboration with the Cisco Full-Stack Observability Platform

I am thrilled that CloudFabrix is a pioneering design partner for Cisco’s Full-Stack Observability Platform (FSO). The Cisco FSO Platform has been designed with a vision of providing a unified observability experience across all application and infrastructure aspects, thereby dismantling silos. The platform’s choice to adopt OpenTelemetry as the protocol for data ingestion via MELT opens up the possibility for comprehensive insights on the complete stack.

Observability in Nutanix AHV environments and Hyper Converged Infrastructures (HCI)

Today, I’ll cover the benefits of monitoring and observability in Nutanix AHV environments and Hyper Converged Infrastructures (HCI) and how observability can help IT teams run cost-efficient, performant Nutanix deployments. Modern enterprises need infrastructures designed for resilience, cost-effectiveness, and application performance. Organizations are adopting hybrid multi-cloud strategies and looking to simplify and optimize on-premises and data center operations.

Multi-Cloud Made Simple: Announcing Kentik Observability Enhancements for AWS and Google Cloud

Limited visibility into network performance across multi-clouds frustrates even the best teams. That’s why we’re thrilled to announce enhanced AWS and GCP support for Kentik Cloud, enabling network, cloud, and infrastructure teams to rapidly troubleshoot and understand multi-cloud traffic.

Shrink your IT budgets, not your observability needs

Are you getting value for every dollar spent on IT monitoring tools? Amidst the prevailing global economic turbulence, budgets are shrinking, and every dollar spent counts. However, Gartner forecasts a 5.1% growth in worldwide IT spending for 2023. Enterprises implement digital technologies to cope with layoffs and keep their systems up. The million-dollar question is: Is the monitoring output worth the cost of the monitoring solution?

Collecting Kubernetes Data Using OpenTelemetry

Running a Kubernetes cluster isn’t easy. With all the benefits come complexities and unknowns. In order to truly understand your Kubernetes cluster and all the resources running inside, you need access to the treasure trove of telemetry that Kubernetes provides. With the right tools, you can get access to all the events, logs, and metrics of all the nodes, pods, containers, etc. running in your cluster. So which tool should you choose?

DNS observability and troubleshooting for Kubernetes and containers with Calico

In Kubernetes, the Domain Name System (DNS) plays a crucial role in enabling service discovery for pods to locate and communicate with other services within the cluster. This function is essential for managing the dynamic nature of Kubernetes environments and ensuring that applications can operate seamlessly. For organizations migrating their workloads to Kubernetes, it’s also important to establish connectivity with services outside the cluster.

Setting Up a Data Loop using Cribl Search and Stream Part 1: Setting up the Data Lake Destination

In the very first video of the series, we delve into the concept of a data loop and why it is beneficial to use Cribl Search and Cribl Stream to optimize the use of a data lake. The video gives a concise overview of Cribl Search and Cribl Stream, and how they work in tandem to create a data loop. We then provide step-by-step instructions on how to configure the Cribl Stream "Amazon S3 Data Lake" Destination to transfer data from Stream to an S3 bucket that has been optimized specifically for Cribl Search's access. Finally, we demonstrate sending sample data to the S3 bucket and present a before-and-after view of the bucket to showcase the impact of the test data.

Setting Up a Data Loop using Cribl Search and Stream Part 2: Configuring Cribl Search

In the second video of our series, we delve into the nuts and bolts of configuring Cribl Search to access the data that we've stored in the S3 bucket. The video guides you step-by-step through the process of configuring the Search S3 dataset provider by using the Stream Data Lake destination as a model for the authentication information. From there, we proceed to walk through the process of creating a Dataset to access the Provider that we've just established. To wrap things up, we demonstrate how to search through the test data that we've previously stored in the S3 bucket.

Setting Up a Data Loop using Cribl Search and Stream Part 3: Send Data from Cribl Search to Stream

The third video of our series focuses on utilizing Cribl Stream to manage data. The presenter takes us through the process of configuring the Cribl Stream in_cribl_http Source in tandem with the Cribl Search send operator to collect data. We are able to witness live data results being sent from Search to Stream. Afterward, we demonstrate creating a Route in Stream to direct the incoming data from Search (via the in_cribl_http Source) to the Data Lake by using the Amazon S3 Data Lake Destination. This step employs a passthru pipeline to ensure that the data is not altered in transit.

Setting Up a Data Loop using Cribl Search and Stream Part 4: Putting it All Together

The final section of our video series showcases how to put the data loop to use with a real-world dataset. We utilize the public domain “Boss of the SOC v3” dataset, which is readily available on GitHub. First, we employ Cribl Search to sift through and explore the BOTSv3 data that is stored in an S3 bucket to locate some specific data.

Observability: Working with Metrics, Logs and Traces

The concept of observability centers around collecting data from all parts of the system to provide a unified view of the software at large. Fault tolerance, no single point of failure and redundancy are prominent design principles in modern software systems. But that doesn’t mean errors, degradation, bugs or even the occasional catastrophe don’t happen.

Customer-Centric Observability: Experiences, Not Just Metrics

Martin and Jess recently conversed with Todd Gardner of RequestMetrics as part of the O11ycast podcast. We don’t normally write blogs based on these conversations, but there were impactful comments in that episode that bear repeating. You can listen to the full conversation if you wish. Let’s get into it!

API monitoring vs. observability in microservices - Troubleshooting guide

Monitoring APIs through enhanced observability has gained traction with the popularity of microservices. Since microservice applications are built as independent and scalable modules, the number of microservices can grow dramatically as the application grows, increasing the complexity drastically. Since APIs work as the connective tissue between microservices, the number of APIs also grows in parallel.

Modernize Your SIEM Architecture

Join Ed Bailey from Cribl and John Alves from CyberOne Security as they discuss the struggles faced by many SIEM teams in managing their systems to control costs and extract optimal value from the platform. The prevalence of bad data or an overwhelming amount of data leads to various issues with detections and drives costs higher and higher. It is extremely common to witness a year-over-year cost increase of up to 35%, which is clearly unsustainable.

A Step-by-Step Guide to Standardizing Telemetry with the BindPlane Observability Pipeline

Adding additional attributes to your telemetry not only provides valuable context to your observability pipeline but also enhances the flexibility and precision of your data operations. Consider, for example, the need to route data from specific geographical locations, like the EU, to a designated destination. With a ‘Location’ attribute added to your logs, you can seamlessly achieve this.
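
The routing idea in this excerpt can be sketched in a few lines of plain Python. This is not BindPlane's API or configuration format; it is a minimal illustration, assuming log records are dictionaries. The destination names `eu-archive` and `global-archive` and the helpers `add_location` and `route` are hypothetical and invented for this example.

```python
# Hypothetical destination names, used for illustration only.
EU_DESTINATION = "eu-archive"
DEFAULT_DESTINATION = "global-archive"


def add_location(record, location):
    """Return a copy of the log record with a 'Location' attribute added."""
    enriched = dict(record)
    enriched["Location"] = location
    return enriched


def route(record):
    """Pick a destination based on the record's 'Location' attribute."""
    if record.get("Location") == "EU":
        return EU_DESTINATION
    return DEFAULT_DESTINATION


# Enrich two sample records, then decide where each one should go.
logs = [
    add_location({"message": "login ok"}, "EU"),
    add_location({"message": "login ok"}, "US"),
]
routed = [(log["Location"], route(log)) for log in logs]
```

Enrichment and routing are kept as separate steps here, so the attribute is added once and any later stage can read it. Records without a `Location` attribute fall through to the default destination.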

Rollouts in BindPlane OP

Learn how easy it is to edit and roll out changes to your configurations, deploying in batches, while also being able to look back at the entire version history. About observIQ: observIQ is developing the unified telemetry platform: a fast, powerful and intuitive next-generation platform built for the modern observability team. Rooted in OpenTelemetry, our platform is designed to help teams reduce, simplify, and standardize their observability data.

Performance Ratings and Experience Scores for Meaningful Alerting and Rapid Observability

Administrators and IT management are increasingly leveraging simple quantifiable KPI indicators such as “Performance Ratings” to gain rapid overviews and track key outcomes. Modern IT architectures are designed and built to scale and be resilient. Systems are now usually built to handle failover and auto-scale up and down to handle varying demand and workloads with very different properties and needs.

What Is a Telemetry Pipeline?

In a simple deployment, an application will emit spans, metrics, and logs which will be sent to api.honeycomb.io and show up in charts. This works for small projects and organizations that do not control outbound access from their servers. If your organization has more components, network rules, or requires tail-based sampling, you’ll need to create a telemetry pipeline.