The New Era of Autonomous Debugging: Transforming the SDLC

The software world is changing rapidly due to advancements in GenAI. These technologies are disrupting traditional processes and driving automation across every part of the SDLC. The market for AI code tools is estimated to reach $30 billion by 2032. It started with code generation, then moved to testing, QA, automatic pull requests, and beyond.

Unleashing Deep Observability with eBPF-Based Topology in Virtana AM

In today’s dynamic and complex IT landscapes, maintaining visibility into application topologies is crucial for ensuring optimal performance, troubleshooting issues, and delivering exceptional end-user experiences. Did you know that 73% of IT leaders report increased difficulty in managing application performance due to rising complexity?

Monitoring, Observability, & Debuggability Explained

Monitoring tools are great at letting you know when something is broken and what the overall impact is. We should know: we make an error monitoring tool. Observability tools are good for, well, observing. But here’s the thing: you (we) don’t observe code. We (you) push code. So what the collective “we” need is a tool that makes it easy to ship, improve, and maintain reliable and performant code.

See Your Structured Logs in the Explore Data tab

There's a new way to flip through your data in Honeycomb, released this week! It's super for looking at structured logs. It's called: Explore Data. Get directly at the logs, spans, events, or metrics that power the fast analysis you can do with Honeycomb. See all the fields, the whole variety of values — now ordered by timestamp, with pagination. Modify your query and graphs right from the data table. It's all connected!

Making Room for Some Lint

It’s one of my strongly held beliefs that errors are constructed, not discovered. However we frame an incident’s causes, contributing factors, and context ends up influencing the shape of the corrective items (if any) that get created. I’ll cover these ideas by using our June 3rd incident where a database migration caused a large outage by locking up a shared database and making it run out of connections.

Elastic vs Splunk [Detailed Comparison 2024]

Elasticsearch and Splunk are two leading solutions renowned for their capabilities in processing, analyzing, and visualizing large datasets in real-time. Both platforms have carved out significant roles in the fields of data analytics and log management, each offering unique features tailored to different needs. This article aims to provide a comprehensive comparison of Elasticsearch and Splunk, highlighting their strengths and weaknesses, and introducing Uptrace as a compelling alternative.
Sponsored Post

What's new in Avantra 24.2

It's my pleasure to announce the release of Avantra 24.2. As the second update of Avantra 24, it builds upon 24.1, which brought performance improvements and customer-requested bug fixes, and adds new innovations and enhancements to our Avantra platform. With over 300 changes in our development management system, Avantra 24.2 feels like a major release to us, and we have something new everywhere you look. Let's dive deeper into the new features.

Why Your Telemetry (Observability) Pipelines Need to be Responsive

At Mezmo, we consider Understand, Optimize, and Respond the three tenets that help control telemetry data and maximize the value derived from it. We have previously discussed data Understanding and Optimization in depth. This blog discusses the need for responsive pipelines and what it takes to design them.

How Network Observability Helps Lay the Foundation of Autonomous IT Operations

We often hear the term "observability" in the context of DevOps and how SREs use telemetry data. Collecting and analyzing this telemetry data is a vital first step to a successful autonomous IT operations strategy. Observability can help you find out about problems in your system you didn’t know you had—and before your users are impacted—by giving you new visibility that your monitoring systems don’t provide. But any observability initiative must also include network observability.

The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation

The CoPE is made to affect, meaning change, how things work. The disruption it produces is a feature, not a bug. That disruption pushes things away from a locally optimal, comfortable state that generates diminishing returns. It sets things on a course of exploration to find new terrains which may benefit it more—and for longer.

Making The Case for Continuous Observability

Software complexity grows exponentially, while developer efficiency grows far more slowly, and debugging often takes up 20-50% of development time. More complex, connected systems mean increased data flow at the edge and in the cloud. That leads to increased exposure to vulnerabilities, cyber threats, malfunctions, and bugs with risks that are hard to assess.

Transform and enrich your logs with Datadog Observability Pipelines

Today’s distributed IT infrastructure consists of many services, systems, and applications, each generating logs in different formats. These logs contain layers of important information used for data analytics, security monitoring, and application debugging. However, extracting valuable insights from raw logs is complex, requiring teams to first transform the logs into a well-known format for easier search and analysis.

Sustainable Computing in Observability with Kunal Nawale

Kunal Nawale, Founder and CEO of SigLens, presents on sustainable computing and observability. Understand the significant energy impact of data centers and how efficient observability can reduce both costs and carbon emissions. Learn about data storage optimization and how SigLens’s open-source solution offers a 90% cost reduction compared to traditional systems like Splunk and Elasticsearch.

Get granular LLM observability by instrumenting your LLM chains

The proliferation of managed LLM services like OpenAI, Amazon Bedrock, and Anthropic has introduced a wealth of possibilities for generative AI applications. Application engineers are increasingly creating chain-based architectures and using prompt engineering techniques to build LLM applications for their specific use cases.

Destroy on Friday: The Big Day A Chaos Engineering Experiment - Part 2

In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen. This is part two, where I go over the big day. How did our chaos engineering experiment go? Find out below!

Streamlining Debugging with Lightrun Snapshots: A Superior Alternative to Trace Logging

According to a recent study, failing tests alone cost the enterprise software market an astonishing $61 billion annually. This figure mirrors the vast number of resources devoted to rectifying software failures, translating into about 620 million developer hours lost each year. On average, engineers spend 13 hours to resolve a single software failure, a statistic that paints a stark picture of the current state of debugging efficiency.

Cribl's Blueprint for Secure Software Development.

What does it take to build software for the most security-demanding customers worldwide? At Cribl, building secure products is integral to our engineering identity. We have established a secure software development lifecycle that is both culturally and policy-driven, integrating product security tooling and processes into every architecture review, pull request, and release, whether major or minor.

What Makes for a 'Good' Pair Programming Session?

Software changes so rapidly that developing on the cutting edge of it cannot fall to a single person. When it comes to asynchronously disseminating information about projects, code comments, PR conversations, Slack, RFCs, and other investigatory documents do a wonderful job, but no amount of async communication replaces the magic of two brains bouncing ideas off of each other.

Unleashing the Power of Hybrid Cloud - Introducing Hybrid Observability in HPE GreenLake Flex Solutions

In today's fast-paced digital economy, businesses are constantly seeking innovative solutions to streamline their operations, enhance agility, and drive growth. As enterprise IT infrastructure environments get more distributed and complicated to meet evolving demands, the need for robust IT monitoring, management and automation becomes even more important.

Optimizing Database Performance with Honeycomb Relational Fields

Martin investigates: what database queries are taking the longest? Then he digs into the one taking the most time, and asks: What user-initiated requests trigger this query? This kind of question helps developers focus our efforts where they count. And it's possible in Honeycomb with Relational Fields. This is #observability during development, using #OpenTelemetry #tracing and Honeycomb.

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment - Part 1

We recently took a daring step to test and improve the reliability of the Honeycomb service: we abruptly destroyed one third of the infrastructure in our production environment using AWS’s Fault Injection Service. You might be wondering why the heck we did something so drastic. In this post, we’ll go over why we did it and how we made sure that it wouldn’t impact our service.

Embark on the Observability Journey

With the advent of byte code instrumentation (BCI) in 2008, application performance management took a giant leap in what is known as "inside-out monitoring," that is, monitoring from inside the application. Before that, application monitoring was largely limited to tracking CPU, memory, disk, and process availability. BCI offered new opportunities in terms of how applications could be monitored and what could be monitored from an application performance perspective.

Observability as Code Explained: Benefits & How to Get Started

Traditional monitoring has become insufficient for managing complex systems. Modern infrastructures consist of numerous interconnected services, and simply monitoring individual metrics and logs fails to provide a comprehensive view. This is where observability becomes crucial.

Mezmo Edge Explainer Video

Ensuring access to the right telemetry data, such as logs, metrics, events, and traces from all applications and infrastructure, is challenging in our distributed world. Teams struggle with various data management issues, such as security concerns, data egress costs, and compliance regulations that require keeping specific data within the enterprise. Mezmo Edge is a distributed telemetry pipeline that processes data securely in your environment based on your observability needs.

Control the Chaos: The Rise of Network Observability

Join Kentik's Greg Villain and Steve Meuse and discover how network observability empowers network operators to face the challenges of an ever-changing economy and focus on cost, performance, and reliability. Learn about network observability tools and methods, simplifying complex hybrid network data, the infrastructure decision framework, and how AI is powering the future of network monitoring.

Lightrun Product Updates: H1 2024

Throughout the first half of 2024, Lightrun has focused on developing a range of solutions and improvements aimed at enhancing developer observability and live debugging. These advancements help organizations significantly reduce their MTTR for complex issues while boosting developer productivity. Read more below about the main new features and the key product enhancements released in H1 2024!

Top 10 Best Monitoring Tools for IT Infrastructure in 2024

Efficient monitoring tools are crucial for maintaining the performance, security, and reliability of your infrastructure. This comprehensive guide covers the top 10 best monitoring tools for IT infrastructure, offering insights into their features, benefits, and use cases. We'll also provide a monitoring tools list and examples to help you choose the best solutions for your needs.

Confidently Shifting from Logs-centric to a Unified Trace-first Approach: Ritchie Bros. Journey to Modern Observability

Transitioning from a monolithic system to a cloud-native microservices environment, Ritchie Bros. sought to modernize their observability infrastructure to support the transition and fuel future growth. Ritchie Bros. has been a pioneering force in the auctioneering market for nearly 70 years, charting remarkable growth through a strategic mix of organic expansion and acquisitions.

Observability Dilemma: To SaaS or Not to SaaS? That is the Question!

In the ever-evolving IT landscape, the Observability Dilemma casts a strategic shadow: To SaaS or not to SaaS, a question being dealt with by many IT professionals today. As organizations grapple with the complexities of maintaining system health and performance, navigating change while staying secure, choosing between the allure of cloud-based services and the on-premises sanctuary becomes pivotal.

Future-proofing IT: Navigate tomorrow's challenges with full-stack observability ft. Aswim Panigrahi

In this episode of Server Room, we sit down with Aswim Panigrahi, technical evangelist at ManageEngine, to discuss the strategic use of full-stack visibility as a proactive approach to preparing IT infrastructures for the future.

Discover what your applications are really up to with Coroot

Modern applications can use a lot of external services. Some of those interactions are expected, others not so much. There could be many reasons for the unexpected ones, ranging from security vulnerabilities and malware to outdated code and reporting or statistics software that reports to its creator or a third party. These unexpected interactions can be a security risk and may also raise privacy concerns.

Instrumenting Python GIL with eBPF

Every Python developer has heard about the GIL (Global Interpreter Lock). This lock simplifies memory management and ensures thread safety, but it also limits the performance of multi-threaded, CPU-bound programs because threads can’t run Python code in parallel. Here is a great explanation of why Python requires the GIL by Python’s creator, Guido van Rossum: “Will Python ever remove the GIL?” (Lex Fridman Podcast Clips).
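
To make the limitation concrete, here is a minimal sketch (not taken from the original post) that times the same pure-Python, CPU-bound loop run sequentially and then across threads. The function names `count_down`, `time_sequential`, and `time_threaded`, as well as the workload size, are illustrative assumptions. On a standard CPython build with the GIL enabled, the threaded run is typically no faster than the sequential one; exact timings depend on the machine and interpreter version, and free-threaded builds may behave differently.

```python
import threading
import time


def count_down(n: int) -> None:
    """Pure-Python, CPU-bound busy loop (hypothetical workload)."""
    while n > 0:
        n -= 1


def time_sequential(n: int, workers: int) -> float:
    """Run the loop `workers` times back to back; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n: int, workers: int) -> float:
    """Run the loop in `workers` threads at once; return elapsed seconds."""
    threads = [
        threading.Thread(target=count_down, args=(n,)) for _ in range(workers)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, workers = 5_000_000, 4  # illustrative sizes, not from the post
    print(f"sequential: {time_sequential(n, workers):.2f}s")
    print(f"threaded:   {time_threaded(n, workers):.2f}s")
```

If the two timings come out close to each other, that is the GIL at work: only one thread executes Python bytecode at a time, so adding threads does not add CPU parallelism for this kind of loop.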

Staffing Up Your CoPE

Getting the right people working in the CoPE is crucial to success because these change agents must limber up the organization and promote the flexibility necessary to perform resilience. We’ll look for teammates who share enough in common to work well together, but who don’t necessarily perfectly overlap so that they can play off each other’s strengths.

Retail Observability | Softcat + Grafana

Grafana empowers retailers to deliver unmatched customer experiences, reduce costs, and optimize delivery with omnichannel observability. Innovate faster, increase agility, and watch your business thrive. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more. We also have plans for every use case.

How Logz.io Provides Trustworthy Observability through AI

The business of observability is all about data: what you’re observing in the data, how you’re visualizing it, what it indicates about the state of your environment, and how to address issues that may occur. Creating your own perspective for observability, and understanding what you’re seeing, can be difficult.

Why Every Engineering Team Should Embrace AWS Graviton4

Two years ago, we shared our experiences with adopting AWS Graviton3 and our enthusiasm for the future of AWS Graviton and Arm. Once again, we're privileged to share our experiences as a launch customer of the Amazon EC2 R8g instances powered by AWS Graviton4, the newest generation of AWS Graviton processors. This blog elaborates on our Graviton4 preview results, including detailed performance data. We've since scaled up our Graviton4 tests with no visible impact to our customers.

Intelligent Health Checks: one-click observability for reliability tests

Reliability testing and observability are similar in one important way: engineering teams know they should be doing it, but they’re not sure how to start, or they don’t have the right resources, or they need to focus on competing priorities like feature development and incident response. In an ideal world, reliability and observability would be automated processes that configure, monitor, and run themselves.

CI/CD observability: A rich, new opportunity for OpenTelemetry

Continuous integration and continuous deployment (CI/CD) are the backbone of modern software delivery, but there’s still limited visibility into their processes. Here’s how that’s changing with OpenTelemetry (OTel), and why those changes are so exciting.

Modern Observability in Action at the University of Oxford

The Bennett Institute for Applied Data Science at the University of Oxford is pioneering the better use of data, evidence, and digital tools in healthcare, policy, and beyond. The institute employs an open-source approach with its OpenSAFELY analytics platform, enabling high-impact research that yields actionable insights, drives innovation, and enhances lives globally.

Discover Financial Services cuts costs and accelerates data retrieval with Elastic Observability

Learn how Discover Financial Services helps its customers achieve a better financial future by partnering with Elastic. Discover utilizes Elastic Observability for its centralized logging platform. Users now have improved monitoring capabilities to help solve issues.

End-to-end SAP Observability with Elastic, Google Cloud, and Kyndryl: A deep dive

Tens of thousands of companies around the world, across almost all industries and from midsize to large enterprises, rely on robust, efficient, complex SAP systems to power their core operations. From sales to finance, and from warehouse management to production planning and execution, business continuity, revenue, and customer success depend heavily on processes running on enterprise resource planning (ERP) architectures.

5 Best Wi-Fi Heat Mapping Tools + Guide

Wi-Fi has become an essential component of our daily life, allowing for seamless connectivity between several devices. However, maintaining the best possible Wi-Fi performance and coverage may be difficult, particularly in complicated settings like huge stadiums, universities, and workplaces. Wi-Fi heat mapping is useful in this situation.

The Hater's Guide to Dealing with Generative AI

Generative AI is having a bit of a moment. Well, maybe more than just a bit. It’s an exciting time to be alive for a lot of people. But what if you see stories detailing a six-month-old AI firm with no revenue seeking a $2 billion valuation and feel something other than excitement in the pit of your stomach? Phillip Carter has an answer for you in his recent talk at Monitorama 2024. As he puts it, “you can keep being a hater, but you can also be super useful, too!”

Building an AI Assistant in Splunk Observability Cloud

Splunk Observability Cloud is a full-stack observability solution, combining purpose-built systems for application, infrastructure and end-user monitoring, pulled together by a common data model, in a unified interface. This provides essential end-to-end visibility across complex tech stacks and various data types, such as metrics, events, logs, and traces (MELT), as well as end-user sessions, database queries, stack traces and more.

Identify anomalies, outlier detection, forecasting: How Grafana Cloud uses AI/ML to make observability easier

At Grafana Labs, our No. 1 approach when building AI/ML tools is to enable humans (a.k.a. all of us!) to understand complex systems. In other words, we want to make observability still human, but less complicated. (Our second use case? Making social media more fun.) We believe that AI/ML tools in observability should work towards minimizing toil and the need for everyone in your organization to have the same deep domain knowledge about your increasingly complex stack.

Database Observability and Storage Insights

Storage monitoring involves discovering the estate, devices, and network interconnections. Key telemetry requirements include their states, performance metrics, and logs. As the complexity of the environment increases and storage reliability improves, the focus shifts. Understanding the layers above, such as file systems and databases, and their demand for storage services becomes crucial. This article delves into the detailed knowledge required to achieve effective observability.

Leveraging observability to improve digital resilience

With increasing competition and a digitizing landscape, small and medium enterprises (SMEs) in Australia are being forced to level up their game using AI and modernization. This means eventually relying on cloud and AI integration to ensure agility and responsiveness. The diversity of applications and the complexity of tech architecture pose challenges such as rising costs, security risks, and limited scalability.

What Developers Should Know about Observability

Peter is a serial entrepreneur and co-founder of Percona, FerretDB, and other tech companies. As a leading expert in open-source strategy and database optimization, Peter has applied his technical knowledge and entrepreneurial drive to contribute as a board member and advisor to several open-source startups. His insights into performance optimization and system reliability play a crucial role in shaping Coroot’s functionality.

Unlocking Smiles: HappyCo's Observability Success

With a diverse range of applications, HappyCo sought to advance their system investigations with a modern observability solution while embarking on an application refactor project. Since its start in 2011, HappyCo has experienced rapid growth through both organic expansion and strategic acquisitions. As a result, the company has a diverse range of applications for customers to smile about.

Optimizing observability costs with a DIY framework

Observability costs are exploding as businesses strive to deliver maximum customer satisfaction with high performance and 24/7 availability. Global annual spending on observability in 2024 is well over 2.4 billion USD and is expected to reach 4.1 billion USD by 2028. On an individual company basis, this is reflected by observability costs ranging from 10-30% of overall infrastructure spend. These costs will undoubtedly rise with digital environments expanding and becoming ever more complex.
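
As a rough back-of-the-envelope check on those figures, the sketch below computes the compound annual growth rate implied by going from 2.4 billion USD in 2024 to 4.1 billion USD in 2028. The helper name `implied_annual_growth` is hypothetical, and because the blurb says 2024 spending is "well over" 2.4 billion, treating 2.4 as the starting point is an assumption that, if anything, overstates the growth rate.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value in `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the blurb, in billions of USD; 2.4 is treated as a floor.
rate = implied_annual_growth(2.4, 4.1, 2028 - 2024)
print(f"Implied annual growth: {rate:.1%}")
```

Under these assumptions the implied growth works out to roughly 14% per year.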

Green Data: The Role of Observability in Shaping a Sustainable Future

Systems speak in data. Widespread digitization means systems communicate more than ever, while increasingly refined means of recording and interpreting their messages are revolutionizing IT management. Meanwhile, beyond the engine rooms of enterprises, our planet is trying to tell us something, too. In changing temperatures and rising sea levels, we see signs that our relationship with the natural world must change.

BindPlane Flight Plane June 2024

Learn how to make rollouts even better with Progressive rollouts in BindPlane. This video will show you how to create different stages for your agents and roll out configuration changes based on specific labels. About observIQ: observIQ brings clarity and control to our customers' existing observability chaos. How? Through an observability pipeline: a fast, powerful, and intuitive orchestration engine built for the modern observability team. Our product is designed to help teams significantly reduce cost, simplify collection, and standardize their observability data.

Overcoming Barriers to Achieving ZeroSec Observability

Achieving ZeroSec observability has long been the ultimate goal, yet it remains elusive despite countless hours and sleepless nights dedicated to the cause. A recent discussion with a client underscored the persistent challenges that many organizations continue to struggle with in this pursuit. They had all the right tools in place yet faced significant issues that prevented their applications from running smoothly.