Top 5 EdTech outages detected by StatusGator in May 2025

In May 2025, several EdTech platforms experienced service disruptions, impacting students, educators, and administrators. StatusGator’s Early Warning Signals feature once again provided timely alerts — often before the affected providers posted updates. Here are the top five EdTech outages detected by StatusGator in May.

Top 5 outages detected by StatusGator in May 2025

In May 2025, several widely used platforms experienced outages affecting both enterprise and consumer services. With StatusGator’s Early Warning Signals, users were alerted to disruptions ahead of official announcements, helping teams respond faster and reduce downtime impact. Here are five major outages StatusGator detected in May.

Traceparent: How OpenTelemetry Connects Your Microservices

In a microservices setup, tracking a single request across services quickly gets complex. One service calls another, then a third, and your logs don’t line up. The traceparent header carries context between services, so all parts of a request connect back to the start. For example, when a frontend sends a request to an API, which then calls a database service, traceparent links those calls in the trace. Without it, you’re left guessing how requests flow.
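
As a rough illustration of what the header carries, the sketch below parses a traceparent value in the W3C Trace Context layout (version, trace ID, parent span ID, flags) and derives the header a service might send on its next outgoing call. The function names and the sample IDs are assumptions made for this example; they are not part of any OpenTelemetry API, and real SDKs handle this propagation for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Split a traceparent header into its four fields, or return None if malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero trace IDs and parent IDs are treated as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def child_traceparent(header):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(header)
    if ctx is None:
        return None
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


# Hypothetical header received by the API from the frontend.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
# Header the API would attach when calling the database service.
outgoing = child_traceparent(incoming)
```

Because the trace ID is copied unchanged while only the span ID is replaced, every hop in the frontend, API, and database chain can be stitched back into one trace.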

How to Transform IT Operations with PowerFlow for ServiceNow

IT teams today face a flood of tickets, disconnected tools, and complex hybrid infrastructure. ScienceLogic PowerFlow for ServiceNow simplifies it all—automating workflows, enriching data, and accelerating resolution at scale. In this video, see how PowerFlow brings intelligent automation to transform your IT operations, with real-time ticket enrichment and seamless ServiceNow integration. It’s scalable, efficient, and built for today’s hybrid environments.

VictoriaMetrics Features & Community Call - May 2025

Join us for this new monthly call, where we'll discuss cool features that are either new or that we'd like to highlight to the user community. We'll also look at some of the questions that were asked by users that could be of interest to others. We'll talk about how to optimise data collected by default in the in-stack helm chart, including topics such as the cardinality explorer, stream aggregation, unused metrics, and more. We look forward to seeing you there!

A Leader, Once Again - Thanks to You

The demonstration of sustained excellence is ultimately more impressive and important than any individual breakthrough, though the former often goes less celebrated. This week, however, Nexthink is extremely proud to celebrate that we’ve once again been named a Leader in the Gartner Magic Quadrant for Digital Employee Experience (DEX) Tools. It’s a distinction we don’t take for granted, and one we’re honored to have maintained.

How to Pinpoint Root Cause in Real Time

When systems fail, it’s not just about knowing that something went wrong—it’s about understanding why it happened and pinpointing the root cause fast. ScienceLogic Skylar AI automatically analyzes massive volumes of data, detects patterns, and delivers clear, human-readable insights. The result? Your team knows exactly where to start, acts faster, and keeps issues from escalating.

Our latest Pingdom data improvements - to get more from your monitoring

At SquaredUp, we’re obsessed with making monitoring not just powerful, but a genuinely delightful experience for engineers and teams. When we first built our Pingdom plugin, the goal was simple: make website uptime and performance data easy to visualize alongside everything else you care about. But as our users pushed the boundaries—connecting more endpoints, demanding richer insights, and needing faster troubleshooting—we realized our plugin needed to keep up.

How to Detect Cloud Configuration Changes Before They Cost You

In this video, see how ScienceLogic helps IT teams take back control. By delivering real-time insight, intelligent policy recommendations, and automated enforcement, ScienceLogic keeps your cloud environment compliant, cost-efficient, and secure. From real-time change detection to automated remediation, you’ll get the visibility and control needed to move fast and stay ahead of disruption.

How to Block Chat Widgets During Playwright Tests (Drift, Intercom & More)

Chat widgets are great for customer support, but they can wreak havoc on your automated tests. These floating elements often interfere with Playwright tests by covering clickable buttons, triggering unexpected popups, or causing element selection issues. If you've ever had a test fail because a chat widget appeared at the wrong moment, you're not alone. This guide shows you exactly how to block popular chat widgets like Drift, Intercom, Zendesk, and others during your Playwright test runs.
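
One common way to do this is to abort the network requests that load the widget scripts. The sketch below shows the idea using Playwright's route interception; the hostname list is an assumption based on where these vendors have typically served their scripts, so check your own network tab before relying on it, and the helper names are hypothetical.

```python
from urllib.parse import urlparse

# Hostnames commonly associated with chat widgets. This list is an assumption:
# vendors can change where their scripts are served from.
CHAT_WIDGET_HOSTS = (
    "intercom.io",
    "intercomcdn.com",
    "drift.com",
    "driftt.com",
    "zdassets.com",
)


def is_chat_widget_request(url):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in CHAT_WIDGET_HOSTS)


def block_chat_widgets(route):
    """Route handler: abort chat-widget requests, let everything else through."""
    if is_chat_widget_request(route.request.url):
        route.abort()
    else:
        route.continue_()


# With Playwright for Python installed, register the handler before navigating:
#   page.route("**/*", block_chat_widgets)
#   page.goto("https://example.com")
```

Matching on the parsed hostname rather than a substring of the URL avoids accidentally blocking unrelated requests that merely mention a vendor name in their path or query string.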

E-Commerce Micro-Friction: The Conversion Killer You're Not Measuring (...But Should Be)

Friction density (the number of stacked micro-frictions in a single session) is the predictor of abandonment we don’t talk about enough. And guess what? It beats any single UX metric by 2× when it comes to predicting lost sales.

How to Automatically Detect Linux Configuration Drifts

When a quick Linux fix during an outage slips through unnoticed, it can silently break compliance and put your infrastructure at risk. In this video, we dive into how ScienceLogic helps IT and security teams detect and resolve these hidden issues automatically—before they become bigger problems. We demonstrate how ScienceLogic identifies unauthorized Linux configuration changes, flags policy violations, and restores compliance through intelligent automation. From configuration drift detection to enriched ServiceNow tickets and automated remediation, it’s all about eliminating the guesswork and staying ahead of risk.

Early Warning Signals: Now visible everywhere!

We’ve just rolled out an exciting enhancement to our Early Warning Signals system: Possible outages that we detect are now visible directly in your Admin board and on your public Status Page. Until now, we’ve notified StatusGator users about possible outages via email — and more recently, Slack — when we detected signs of an outage before any official acknowledgment. Now, these early warnings appear right where they’re most impactful.

Windows Error Logs: Your Guide to Simplified Debugging

When an application functions flawlessly in your environment but crashes unpredictably on a client’s Windows server, the root cause is often buried in system logs—logs many developers overlook. Windows maintains comprehensive error records that document crashes, failures, and system events with precise detail. These Windows error logs serve as an invaluable resource for diagnosing issues in production environments.

What's new in Grafana Metrics Drilldown: advanced filtering options, UI enhancements, and more

Grafana Metrics Drilldown offers a queryless experience for browsing Prometheus-compatible metrics. With Metrics Drilldown — which is part of our suite of Grafana Drilldown apps — you can quickly find related metrics with just a few simple clicks, no PromQL queries required.

Achieving FedRAMP Authorization: Driving Federal IT Efficiency and Security with ScienceLogic Government Cloud

We are thrilled to announce that ScienceLogic has achieved Federal Risk and Authorization Management Program (FedRAMP) Moderate authorization for the ScienceLogic Government Cloud. This milestone represents the culmination of our commitment to delivering secure, reliable, and efficient IT operations management solutions for government agencies.

How Auditd Logs Help Secure Linux Environments

If you manage a Linux server and notice something unusual, auditd logs can help you track exactly what’s happening. This built-in audit system records who accessed the system and what actions they performed. In this guide, we’ll cover setting up auditd, reading the logs, and using them to detect potential security issues early.

Five Creative Uses of Content Monitoring Software

Content monitoring software is designed to track changes on web pages over time—capturing additions, deletions, and modifications across everything from blog posts to landing pages and legal disclaimers. Traditionally, it’s used by businesses and organizations that need to keep tabs on important, frequently updated content—either for compliance, competitive intelligence, or performance reasons. Here are some of the most common use cases.

Leading analyst study reveals how resilience unlocks eCommerce growth

Customer expectations for seamless digital experiences are higher than ever, and any disruption in availability or performance can lead to abandoned purchases and millions in lost revenue. A new commissioned study conducted by Forrester Consulting on behalf of Catchpoint shows retail & eCommerce companies are struggling. To ensure seamless customer experiences, retail & eCommerce companies must employ comprehensive Internet Performance Monitoring or assume the risk of allowing millions in revenue to slip away.

Highlights from Google Cloud Next 2025

Google Cloud Next is the biggest event of the year for the Google Cloud community, showcasing the latest and greatest offerings from Google Cloud and hundreds of its partners. As a long-time Google Cloud partner and recipient of three Google Cloud Partner of the Year awards in 2025, Datadog was there in full force, delivering several speaking sessions and running a booth on the expo floor where we met with thousands of attendees. In case you missed it, don’t worry.

Syslog Implementation: Servers, Integration and Best Practices

Syslog is a fundamental protocol for collecting messages and event data from various devices and applications across a network. Think of it as a universal language that allows your servers, routers, firewalls, and software to send their operational insights to a central logging point. Born from Unix systems, Syslog has evolved to become the industry standard, forming the backbone of effective log management and providing a unified view of your infrastructure's activity.
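
To make the "universal language" idea concrete: each syslog message starts with a priority value in angle brackets, which encodes the facility and severity as facility × 8 + severity. The sketch below decodes that value; the function name is hypothetical and the sample line is an illustrative message in the traditional BSD syslog style.

```python
import re

# Severity names indexed by their numeric code (0 = most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Return (facility_number, severity_name) from a leading <PRI>, or None."""
    match = PRI_RE.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # 23 * 8 + 7 is the largest valid priority value
        return None
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


sample = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8"
print(decode_pri(sample))  # (4, 'crit'): facility 4 (auth), severity 2 (critical)
```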

Debugging Errors in Background Jobs

Debugging background jobs is one of those tasks that always sounds easier than it is—until you’re knee-deep in stack traces that offer no real clues. Background jobs love to run in isolated environments, cutting themselves off from all the helpful context you’d normally have. @nikolovlazar shows us how to debug these errors anyway—piecing together the missing context across systems so you can actually fix the problem instead of just guessing.

What's Inside InfluxDB 3.1: New Features for Security, Performance, and Visibility

InfluxDB 3.1 is now available for both Core and Enterprise editions, bringing significant improvements that make managing high-volume, high-velocity time series data even easier, faster, and more secure. InfluxDB 3 Core is the free, open source edition of InfluxDB 3—a high-speed, recent-data engine licensed under MIT and Apache 2. InfluxDB 3 Enterprise is the commercial version of Core, adding support for longer-term historical queries, high availability, enhanced security, and more.

Navigating the SSE Landscape: The 2025 Gartner Magic Quadrant

Having reviewed the 2025 Gartner Magic Quadrant for Security Service Edge (SSE), it is fair to say that it reflects a comprehensive evaluation of vendors delivering integrated, cloud-based security solutions. However, while such assessments provide valuable insights for those looking for full-stack adoption, real-world adoption may require deeper analysis and strategic planning.

Detect hallucinations in your RAG LLM applications with Datadog LLM Observability

Hallucinations occur when a large language model (LLM) confidently generates information that is false or unsupported. These responses can spread misinformation that jeopardizes safety, causes reputational damage, and erodes user trust. Augmented generation techniques, such as retrieval-augmented generation (RAG), aim to reduce hallucinations by providing LLMs with relevant context from verified sources and prompting the LLMs to cite these sources in their responses.

Kubernetes Logs: How to Collect and Use Them

If you’ve worked with Kubernetes, you know logs are essential for understanding what’s happening inside your clusters. However, unlike traditional servers, Kubernetes logs present their own unique challenges. Pods frequently start and stop, containers restart regularly, and logs stored locally can be lost quickly. Because of this, managing logs in Kubernetes requires a different approach.

Docker Container Lifecycle: Key States and Best Practices

You’ve probably run a lot of Docker containers, but do you know what happens behind the scenes? The Docker container lifecycle is the path a container follows from being created to running, stopping, and finally getting removed. Understanding these steps helps you figure out why a container might not start or when to restart it instead of creating a new one.

Introducing Session Health in Sentry (Now In Open Beta)

You push a release that touches the checkout flow. Now you’re glued to dashboards and checking Slack, hoping you didn’t introduce a regression that breaks the payment path. You can’t tell if you’ve just shipped a blocker that’s stalling every cart—or some edge case quietly making users bail.

Flying Your Network Blind? | Obkio

We created this video for every IT team still relying on guesswork to manage network performance. Here’s the reality: no monitoring = flying blind; no alerts = no prevention; no visibility = slow troubleshooting, false assumptions, and frustrated users. Even the best IT pros need the right tools, just like pilots need instruments. Have you ever thought about why “no complaints” isn’t the same as “no issues”, the hidden cost of poor visibility, or how skill only takes you so far without data to back it up?

Breaking Silos: Pairing InfluxDB 3 with Your Historian for Better Insights

Industrial systems constantly generate time series data—streams of time-stamped values like temperature, flow rate, vibration, or power load. This data powers real-time monitoring, performance tracking, and long-term forecasting across critical infrastructure, energy systems, and manufacturing environments.

ManageEngine Site24x7 monitoring actions are now available within ServiceDesk Plus On Demand

At ManageEngine, we're committed to empowering IT teams with tools that simplify operations and deliver effortless observability for all stakeholders. We're excited to announce that the Site24x7 extension for ManageEngine ServiceDesk Plus On Demand is now available on the ManageEngine Marketplace. This extension transforms ServiceDesk Plus On Demand from a passive ticketing tool into an active hub for IT infrastructure management.

User experience depends on more than just your website speed.

Even if your Core Web Vitals are flawless, user experience depends on so much more—CDNs, DNS, BGP, APIs, third-party services, and local ISPs all play a role in how users experience your page. If any part of that chain breaks, so does the user experience. And worse? Your teams are left in the dark, scrambling to find the root cause.

Kubernetes observability: How to enrich logs with GeoIP using the Kubernetes Monitoring Helm Chart

When your Kubernetes app suddenly has traffic spikes in a distant country, it can be difficult to determine why. Let’s say, for example, we have an e-commerce app that started to receive an unusual surge of visitors from Australia — something we never anticipated. We search for answers in our logs, but without geographic context, we don’t have the full insights we need.

Build Vega-Lite visualizations natively in Datadog with the Wildcard widget

Datadog dashboards provide a unified view of your applications, infrastructure, logs, and other observability data—making it easy to monitor health, investigate issues, and share insights across teams. While native Datadog widgets support a broad range of visualization types, some use cases call for more customized representations, particularly when you’re working with unconventional data formats, external sources, or specific transformations.

Elastic and AWS collaborate to bring GenAI to DevOps, security, and search

Today, we are happy to celebrate Elastic and AWS committing to a five-year strategic collaboration agreement (SCA). Our collaboration underscores the efforts of Elastic and AWS to provide you with increased speed and greater flexibility as you adopt generative AI technology.

SQL Server Security: Protecting Your Data From Threats

If your organization isn’t focused on data security, it’s time to make some changes, particularly if you rely on SQL Server to manage and store valuable information. Cyber threats, data breaches, and malicious attacks are on the rise—and they are constantly evolving. That’s why it’s essential to have robust security measures in place. SQL Server has several built-in security features, but you must take a proactive approach to protect your data.

AIOps benefits: 5 core ways agentic AI transforms IT

Your systems are getting faster. More complex. More distributed. But your tools are still waiting for something to go wrong before they do anything about it. That’s the real limitation of most AIOps platforms. They highlight issues. They suggest next steps. But they stop short of action—leaving your team to connect the dots, chase down context, and manually fix what broke. Agentic AIOps doesn’t wait. It acts.

Bring a Business Service Perspective to Your Network Monitoring

In recent years, network performance and business performance have become increasingly intertwined. Now, virtually every critical employee and customer service is in some way reliant upon network connectivity. When connectivity falters, those critical processes can be impaired or stopped completely. However, for too many teams, it can be difficult to knowledgeably determine how specific outages or issues actually affect a business service. For example, say an operator discovers a device is down.

The End of the Network Engineer as We Know It?

For decades, the enterprise network was a well-defined fortress and network engineers were its meticulous guardians. However, their visibility and control were largely confined within the perimeter of their organization's infrastructure. The cloud revolution and the ubiquity of SaaS applications have shattered these traditional boundaries. Today, for virtually every organization, the internet is the new enterprise network.

Introducing Logz.io Dashboards (Beta): Shaping the future of unified Observability with Open 360

We’re thrilled to announce the Beta launch of Logz.io Dashboards – a major step forward in how engineers and DevOps teams visualize and analyze their telemetry data. For the first time, Logz.io users can now create dashboards that bring together logs, metrics, and traces in a single unified view — making it easier than ever to monitor performance, detect issues, and troubleshoot incidents without switching tools or losing context. This launch is more than just a product update.

How to Add Performance Data Graphs into Your Icinga Instance

This is a guest blogpost by Markus Opolka from the Icinga Enterprise Partner NETWAYS. After forking the Grafana Module for Icinga Web last year, we started thinking about alternative ways to display Icinga performance data graphically in the web interface. Running a separate Grafana instance just to render graphs is a lot of overhead and adds operational complexity — no matter how much you like Grafana. Plus, installing the grafana-image-renderer isn’t always straightforward.

Introducing Netdata Insights

We’ve been thinking a lot about synthesis lately. Netdata already samples every metric every second at the edge. Engineers told us the remaining pain point was synthesis, the ability to pull hours or days or months of high‑resolution time‑series into a concise explanation they could hand to a teammate (or use themselves to debug faster).

SigNoz Community Edition now available with SSO (Google OAuth) and API Keys

One of the biggest asks from our open-source community has been to open-source our SSO support, which was part of our enterprise offering. Today, we’re thrilled to announce that support for SSO with Google OAuth is now part of our latest release. Latest version: v0.85.0 Not only that, we've also shipped another highly anticipated feature for our Community Edition: API Keys for comprehensive programmatic access to SigNoz.

Discover powerful insights with nested metric queries

To gain adequate visibility into your distributed applications, you need to observe those applications at different levels of granularity. This means that you need to be able to query collected telemetry data both at the level of the whole application and at the level of selected components. Thanks to the power of Datadog tagging, you can already do this by aggregating your metrics within any scope of your choosing.

A Fresh Look Without Moving the Cheese

After 12 years of faithful service, the TrackJS interface was starting to show its age. Not that it wasn’t working—it was still doing exactly what our customers needed it to do. But when you’re staring at Bootstrap styles from 2012 and a version of LESS that might be officially defunct, it’s probably time for a refresh.

Server Performance Metrics Explained

Server performance metrics help you figure out what’s going wrong, where your bottlenecks are, and how your system handles load. They give you the data to plan capacity, fix issues before they escalate, and build more reliable infrastructure. In this guide, we’ll go over the core metrics that matter, how to monitor them effectively, and the tools that can help along the way.

3 AIOps Trends for 2025 #aiops

As IT environments grow more complex, teams need smarter, faster ways to stay in control. In 2025, three trends are redefining how modern IT operations teams drive efficiency and resilience. Automation Everywhere: offload routine tasks with intelligent workflows. Predictive Everything: spot and resolve issues before they impact users. AI + Human Collaboration: empower teams with real-time, AI-driven insights.

Brand email with your logo

StatusGator supports custom email branding on our Enterprise plan and as an add-on to other plans, allowing your customers or end-users to get an email that has your organization’s logo and is sent from your organization’s email address. Previously, this email logo used the same image as your status page. Now, you can upload a custom logo to be used just for your emails. Enjoy improved branding by uploading a logo that fits the email perfectly.

Simplifying Observability: Streamlining Telemetry with a Centralized Pipeline

Modern applications generate a deluge of telemetry data—logs, metrics, and traces—that hold the key to understanding system performance and reliability. However, managing this data effectively is a growing challenge for DevOps teams. Raw telemetry can overwhelm teams with complexity and noise even when collected via robust standards like OpenTelemetry.

How to import Prometheus-style alerts and recording rules to Grafana-managed alerts and recording rules

Grafana Alerting has evolved dramatically since the legacy dashboard-alert days. Today, Grafana-managed alerts power enterprise-scale monitoring in Grafana Cloud and on-prem installations. And over the last two years, we’ve added RBAC, state history, versioning, and much more. At the same time, our own monitoring at Grafana Labs relies heavily on Prometheus-style alerts—a situation that’s not uncommon for our users, too.

From Alert to Fix in 10 minutes: How a Slow Query Took Down Placid.app

This is a guest post from Armin Ulrich, a fullstack developer and founder of placid.app. He also created the MadeWith* network where he shares his projects and allows other developers to share theirs. There are many things I would rather do at 9pm than track down a mission-critical bug, but sometimes you don’t have a choice. Let me tell you the story about a slow query that led to a cascading failure, and how it could have been worse.
Sponsored Post

Extreme automation and the SAP Cloud ERP journey

Cloud ERP arrives as the new holy grail of ERP architecture: a composable, flexible and scalable collection of core business services working together to meet enterprise ERP needs. Of course, getting there for a large enterprise with significant existing complexity across legacy SAP implementations isn't a trivial task. Much has been written about S/4HANA migration, but less explored are the benefits that automation solutions used for the regular operations of SAP can bring to migration projects. These solutions offer a number of accelerators and benefits to migration projects and SAP teams, so they are worth exploring.

How to Reduce Downtime: Keep Your Business Running Smoothly

Downtime refers to any period when your business operations are interrupted or unavailable due to technical issues. Whether it's caused by unscheduled downtime, like sudden system failures, or planned downtime for regular maintenance, it can significantly impact your business continuity. The effects of downtime can be severe, leading to financial losses, decreased productivity, and a damaged reputation.

The ROI of monitoring your Azure environment: Prevent surprises, control costs, boost uptime

Like many cloud providers, Azure offers services that scale with usage. However, unanticipated overutilization of Service Bus, Azure Functions, and SQL databases can incur additional costs. Managing these resources effectively is crucial for keeping billing predictable.

Cloud Cost Management & Trends in 2025: Strategies to Optimize Your Cloud Spend

Cloud computing has become the backbone of modern business operations, powering everything from day-to-day collaboration to large-scale digital transformation initiatives. As organizations deepen their reliance on cloud services, the financial stakes continue to grow. According to Gartner, global spending on public cloud services is projected to reach over $720 billion in 2025, a significant increase from nearly $600 billion in 2024.

Shedding Light on Kafka's Black Box Problem (with OpenTelemetry)

"All language is but a poor translation." — Franz Kafka This quote by Franz Kafka reminds me of the time when I used to look at metrics from “Apache Kafka” topics trying to figure out what was causing the huge lags and manually deleting the messages in certain partitions to get rid of polluted messages. Yep, pretty lost in translation. I wasn’t aware of the power of observability for a Kafka producer-topic-consumer system.

Graylog vs Loki: Key Differences and Use Cases

Logs are a key part of building and running software, but managing them can get complicated fast. As your apps grow and generate logs from many sources, choosing the right tool to store, search, and analyze those logs becomes important. Graylog and Loki are two popular options, each with a different way of handling logs. In this blog, we’ll break down the main differences between Graylog and Loki, how they work, and which types of projects they suit best.

An Easy and Practical Guide to CDN Monitoring

A CDN delivers your content around the world, making sure users get it quickly and reliably. When it slows down or goes offline, users notice right away. Good CDN monitoring gives your team the information needed to fix issues before they affect users. This guide explains the basics of CDN monitoring and shows practical ways to set it up.

Mission Impossible: Find out the Reasons Why Your Network Is Down (and How to Proactively Prevent Network Downtime)

Your mission, should you choose to accept it, is to prevent network downtime before it takes your business offline. The threat is real. One moment, your network is up. The next, calls drop, websites freeze, apps stall, and customers vanish. You hear the dreaded question echoing across departments: “Is the network down?” You’re not alone.

How to Choose an APM Solution: 5 Critical Questions for 2025

An APM solution, or Application Performance Monitoring tool, is a software application that helps businesses monitor and manage the performance and availability of software applications. APM tools gather data from systems, servers, databases, APIs, and end-user devices to provide deep insights into the root causes of performance issues. APM solutions have evolved far beyond basic monitoring.

Grafana Campfire - Hiring with AI and more about Grafana MCP (Grafana Community Call - May 2025)

In this Campfire community call, we will talk about what's new and what's next for AI in the observability space, and also discuss the Grafana MCP server, which provides access to your Grafana instance and the surrounding ecosystem. Join me (Usman), Matt Ryer, Carl Bergquist, and David Kaltschmidt for this exciting session. Special guests: Sarah Zinger, Cyril Tovena and Ben Sully.

NiCE Expands Microsoft SCOM Services with New Expert Training Options

NiCE IT Management Solutions is excited to announce expanded service offerings and professional training options for Microsoft System Center Operations Manager. In addition to our well-established consulting and monitoring solutions, we now offer custom and standard SCOM training programs tailored to varying skill levels and organizational needs. Our goal: empower IT teams to maximize performance, ensure stability, and deepen expertise in managing modern infrastructure.

Easy Way to Convert Wavefront Metrics Using OpenTelemetry

Once upon a time in the world of metrics, Wavefront was a pioneer. Before Prometheus took over and tools like OpenTelemetry unified tracing and metrics, Wavefront brought something novel to the table: human-readable metrics with real-time querying and tag-based dimensionality. In enterprise environments running VMware or early microservices, it offered a scalable way to understand a system's behavior. But as the telemetry landscape evolved, many systems that spoke Wavefront were left behind.

Ownership change of the ansible-collection-icinga to NETWAYS

NETWAYS has already taken a leading role in maintaining the Ansible Collection Icinga, contributing features and bug fixes, and now it’s official: The Ansible Collection Icinga is moving into the NETWAYS namespace (on GitHub and Ansible Galaxy). The people involved in the repository will remain largely the same.

What are Microservices? A Path to Scalability and Agility

If developing scalable, agile applications is a priority for your business, microservices may provide a compelling solution. But what are microservices exactly? The proper microservices definition refers to a modern architectural approach where an application is built as a collection of loosely coupled services. Each service is independent, self-contained, and designed around a specific business capability.

Why didn't my Playwright test capture video?

If you use Checkly, eventually you'll be looking at alerts about something failing, and wonder how to debug a failed check. For most of us, the first thing we want to see is the video of a failed check run. Sometimes, though, our check doesn’t capture video. This guide will cover three common reasons a video doesn’t show up on a check run. This advice is general for Playwright as well as those running Playwright tests on Checkly.

Inside the Observability Journey: Lessons from CarGurus, Nearform & More

Join us for a dynamic panel from Observability Sessions Boston where leaders from CarGurus, Nearform, and Grafana Labs share their real-world experiences with observability. In this candid discussion, David Frankel (CarGurus) and Joe Szodfridt (Nearform) delve into the challenges of implementing scalable observability practices, moving from centralized models to federated teams, and navigating cloud migration with a focus on performance and cost.

Surprised By Your AWS ELB Bill? Here's What Happened

On May 1st, AWS corrected a long-standing billing bug tied to Elastic Load Balancer (ELB) data transfers between Availability Zones (AZs) and regions. That fix triggered a noticeable increase in charges for many users, especially for those with high traffic volumes or distributed architectures. The problem wasn’t new usage; it was a silent correction to an old error.

VPC Log Format: Custom and Advanced Configurations

VPC Flow Logs come with a default format that gives you basic network traffic details. But you can tweak the format to capture exactly what you need. This can lower costs, speed up processing, and make your logs fit better with what you’re trying to monitor. If you want to improve security, keep an eye on performance, or save money, adjusting your VPC logs can make a big difference. Let’s take a look at some practical ways to customize your logs beyond the default settings.
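
As a minimal sketch of what a custom format means for downstream processing: the format string below is a hypothetical selection made for illustration (the field names are standard VPC Flow Logs fields, but the choice and order are an assumption, not taken from the article). The parser simply maps the space-separated values of one record onto those names.

```python
# Hypothetical custom format: a trimmed field selection chosen for illustration.
CUSTOM_FORMAT = "${srcaddr} ${dstaddr} ${dstport} ${protocol} ${bytes} ${action}"


def field_names(fmt: str) -> list[str]:
    """Turn '${srcaddr} ${dstaddr}' into ['srcaddr', 'dstaddr']."""
    return [token[2:-1] for token in fmt.split()]


def parse_record(line: str, fmt: str = CUSTOM_FORMAT) -> dict[str, str]:
    """Map one space-separated flow log record onto the format's field names."""
    names = field_names(fmt)
    values = line.split()
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))


# Sample record invented for this example.
record = parse_record("10.0.1.5 10.0.2.9 443 6 8421 ACCEPT")
print(record["dstport"], record["action"])
```

Because the parser derives its field names from the format string, trimming or reordering the format only requires changing `CUSTOM_FORMAT`.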

A Simple Guide to Monitoring and Optimizing Prometheus CPU Usage

Prometheus is supposed to help you monitor your stack, not become the thing you need to monitor. But if you’ve ever seen it spike in CPU and slow everything down, you know that’s not always the case. High Prometheus CPU usage usually shows up when you're scraping too many metrics, using expensive queries, or running with default configs that don’t fit your workload. This guide covers how to track Prometheus CPU usage, what typically causes it, and how to fix it.
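
One way to track this, sketched under assumptions: Prometheus exposes a `process_cpu_seconds_total` counter about its own process, and the average number of CPU cores in use over a window is the counter's increase divided by the elapsed time. The helper name and the sample values below are hypothetical, and the reset handling is a simplification of what a real rate calculation does.

```python
def cpu_cores_used(sample_a, sample_b):
    """Average CPU cores used between two (timestamp_s, cpu_seconds_total) samples."""
    (t0, v0), (t1, v1) = sample_a, sample_b
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # The counter went backwards, so the process restarted; count from zero.
        v0 = 0.0
    return (v1 - v0) / (t1 - t0)


# Two hypothetical scrapes taken 60 seconds apart.
cores = cpu_cores_used((0, 1200.0), (60, 1290.0))
print(cores)
```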

SAML authentication in Grafana Cloud: a guide for easy configuration

In my role as Senior Observability Architect here at Grafana Labs, one of the things I focus on is making sure customers are getting the most out of our products. Recently, I noticed a trend where customers were struggling to get SAML authentication configured properly. They were getting stuck on some of the steps needed to configure the users key pair values, which allows users to log in with the correct roles assigned in Grafana.

Harnessing Network Observability to Enhance Grid Resilience

Within the utility sector, a lot is changing. Utilities continue to pursue digital transformation, altering the way services are delivered and operations are managed. What hasn’t changed is the criticality of the services provided. These organizations deliver essential resources like natural gas, electricity, and water—services that we as consumers rely upon constantly for our comfort, sustenance, communications, and more.

Preparing for the Autonomous Future

Throughout this blog series, we’ve followed how AI reshapes network operations – from foundational data harmonization to real-time correlation, from contextual insights to agent-driven automation, and most recently, to conversational access through natural language interfaces. But we haven’t reached the final destination.

How to implement business observability

It sounds simple: You define metrics for success, you track them, and if they fail, you fix them. For decades, this was how businesses monitored their systems. However, a reactive monitoring approach, which alerts businesses about failures only after the issue has already impacted operations, became insufficient as digital architectures grew more complex.

Motadata AIOps - AI-Driven Network Monitoring Software

What positions Motadata AIOps as a standout among the premier network monitoring tools available in the market? In a crowded market of network monitoring tools, Motadata AIOps distinguishes itself through its intelligent and future-proof approach. The Network Observability tool leverages the power of AI to monitor your network and predict and prevent problems before they occur. This helps you achieve unmatched scalability for your growing network needs, while its open architecture and integration capabilities ensure a unified view of your entire IT environment.

Motadata AIOps | Monitoring Infrastructure Using Monitors & Monitor Settings

In the world of IT infrastructure management, having a real-time understanding of the health and performance of your systems is essential. Motadata AIOps introduces Monitors, a way to provide comprehensive insights into your IT environment, empowering you to proactively manage and optimize your infrastructure.

Real-time detection of BGP blackholing and prefix hijacks

Border Gateway Protocol (BGP) remains the backbone of inter-domain routing on the Internet, but its fundamental trust model leaves it vulnerable to misconfigurations, hijacks, and blackholing. When these issues occur, they often go undetected by the impacted networks—until users report degraded performance or service outages. This post walks through a real-world incident in which a legitimate traffic spike led to an upstream provider mistakenly blackholing a critical IP address.

Understand and manage your Datadog spend with Datadog cost data in Cloud Cost Management

As your organization scales its Datadog footprint, you want to understand what’s driving cost changes and promote cost awareness. But to take meaningful action, you need more than a monthly bill—you need real-time, contextualized cost data tied to services and teams. Without this visibility, it’s hard to assign ownership, prevent cost overruns, or identify which changes are affecting spend.

OpenTelemetry vs Micrometer: Here's How to Decide

In a distributed system, things break in unexpected ways. That’s why observability isn’t optional—it’s how you understand what’s going on under the hood. If you’re comparing tools to instrument your services, OpenTelemetry and Micrometer are two names you’ll run into. Both are used to collect metrics, but they take very different approaches—especially when it comes to flexibility, vendor support, and what you can do with the data.

Track the Right Elasticsearch Metrics Without the Noise

Elasticsearch does a lot right—it's fast, scalable, and makes searches feel simple. But when things slow down or break, figuring out what’s going on can be frustrating. Especially if you’re not keeping an eye on the right metrics. This guide covers Elasticsearch metrics that are worth tracking and how they help you keep your cluster healthy without data overload.

Common Issues with Grafana Login and How to Fix Them

Grafana is a popular choice for monitoring and visualizing metrics, but login issues can quickly block your access and slow you down. Forgot your password? Can’t get into the admin account? Problems after changing authentication settings? These are some of the most common hiccups—and they’re usually easy to fix. This guide covers the frequent login problems you might face and walks you through practical ways to resolve them.

How to Troubleshoot Faster with LM Logs

When an alert fires, your goal is clear: fix the problem, fast. But traditional troubleshooting rarely makes that easy. You’re immediately thrown into decision mode. All the while, the clock is ticking. The longer you’re stuck guessing what to do next, the longer your downtime drags on, and the more non-value-added engineering time you burn.

What is Digital Adoption? Strategies for 2025

In today’s digital-first workplace, it’s not enough to deploy new software. You need your teams to actually use it. That’s where digital adoption comes in. Digital adoption is the process by which individuals not only learn how to use digital tools but also integrate them into their day-to-day tasks in a way that enhances performance. True digital adoption means employees are using the right features, in the right context, to complete work with minimal friction and maximum confidence.

Get Better Visibility Into App Hangs On Apple Devices

App hangs are the worst kind of bug: they don’t crash, they don’t log, and unless you're actively profiling, good luck catching them in the debugger. Maybe the main thread is blocked because it’s decoding a massive image with UIImage(data:). Maybe a background task is holding a lock or waiting on a DispatchGroup that never finishes. Maybe an async flow is stuck waiting on a continuation that never resumes.

Using the OpenTelemetry Operator to boost your observability

If you’ve ever wrangled sidecars or sprinkled instrumentation code just to get basic trace data, you know the setup overhead isn’t always worth the payoff. But what if it was… just easier? That’s where the OpenTelemetry Operator for Kubernetes steps in… and it plays great with Coralogix out of the box!

Observability 2.0 in the Real World: Lessons from SimpliSafe's Engineering Journey

In this candid and insightful talk from Observability Sessions Boston, Laban Eilers, a platform engineer at SimpliSafe, takes us on a practical deep dive into the evolution of observability—from the traditional “three pillars” model to the emerging promise of Observability 2.0.

Set Up Tracing for a Ruby on Rails Application in AppSignal

In this guide, we'll harness AppSignal to detect, diagnose, and remove performance bottlenecks and employ proper tracing in a Ruby on Rails application. From setting up tracing to capturing errors and logging, we’ve got you covered. We'll ensure our application runs smoother than ever, even under the heaviest loads! But first, let's quickly touch on how to define tracing and its benefits.

Enhancing workflow efficiency with Elasticsearch and Red Hat OpenShift AI

Elastic collaborates with Red Hat on the validated pattern to enhance financial analyst workflows with RAG-powered search. We’re excited to share that Elastic and Red Hat have partnered to create validated patterns that integrate Elasticsearch’s generative AI (GenAI) and vector search capabilities with Red Hat OpenShift AI. This integration can run on accelerated hardware on-prem or in IBM Cloud to power retrieval augmented generation (RAG) solutions.

Sneak Peek: MetricFire's New Logging Tool for Scalable, Open-Source Observability

Take a first look at MetricFire’s brand-new logging tool — designed to simplify log ingestion, storage, and visualization using open-source components like Loki, Python, Telegraf and Grok. Collect logs, search across services, and correlate them with your metrics — all inside your existing Hosted Graphite environment. Whether you're an SRE, DevOps engineer, or running logs on a budget, this sneak peek reveals how MetricFire is evolving toward full observability.

What is Amazon Inspector? Monitoring and Alerting with Amazon Inspector

Amazon Inspector is an automated security assessment service that scans AWS workloads for vulnerabilities, misconfigurations, unintended network exposure and compliance risks, helping organizations enhance cloud security, detect threats, and meet regulatory requirements (such as ISO/IEC 27001, HIPAA, NIS 2 and SOC 2 Type 2) in real time. Amazon Inspector discovers and scans Amazon EC2 instances, container images in Amazon ECR (Elastic Container Registry), and Lambda functions.

Is There an Existential Crisis in Network Observability?

We've all been there. Users report that applications are slow, calls are dropping, or that "the internet is broken." Yet, a glance at the network dashboards shows a sea of green—latency looks acceptable, packet loss is minimal, and bandwidth seems fine. This common scenario highlights a fundamental challenge in network observability: the perceived disconnect between the technical measurements we gather and the actual experience of the people using our digital services.

Early Warning Signals now available in Slack

We’re excited to announce that Early Warning Signals are now available in Slack! Early Warning Signals help you detect service disruptions before they’re officially reported. Now, these critical notifications will show up directly in your Slack workspace, keeping your team in the loop without having to check your email.

Top 13 Fluentd Alternatives 2025

Fluentd is popular for its flexibility and extensive plugin support, making it easy to collect, process, and forward logs from many different sources. However, as environments scale and observability needs evolve, teams often seek alternatives that offer lower resource usage, easier configuration, broader telemetry support, or tighter integration with their existing toolchains.

Logz.io AI Agents: Transforming Observability Through Intelligent Automation

Let’s be honest. AI features can sound cool on paper, but too many tools overpromise and underdeliver. At Logz.io, we didn’t want to build “yet another AI chatbot.” We wanted to create something our engineers and yours would actually use when incidents hit, logs explode, or someone asks, “What just happened to production?” Here’s how our AI Agent evolved from a basic chat interface to an incident-resolving, log-analyzing, doc-digging, context-aware assistant.

A New Era of Efficiency: Leveraging AI, Data, and Modernization to Improve Public Services

Greg Reeder from Datadog talks with Martha Dorris, a leader in government customer experience, about how agencies can drive efficiency using AI, real-time data, and observability. They highlight CX wins at the State Department, IRS, and CBP—showing how smarter monitoring and design improve services, reduce costs, and strengthen citizen trust.

Observability vs Monitoring: Enhancing, Not Replacing

In the dynamic world of IT operations, a common misconception has emerged: Observability vs Monitoring is often framed as a battle where one replaces the other. At Icinga, where open-source monitoring is our expertise, we aim to clarify this misunderstanding. Observability doesn’t supplant monitoring—it complements and enhances it. The term “Observability” has become a buzzword in the tech industry, often touted as the modern solution to outdated, static monitoring practices.

Digital Noise Cancellation: What Gigamon Can Teach Us About Listening to the Right Signals

When I’m on the train to work in the morning, I always reach for my noise-cancelling headphones. Not because the world is too loud, but because I want to hear what matters. It’s a small act of filtering signal from noise. And this got me thinking that, increasingly, that same mindset is becoming essential in how we design and manage digital infrastructure. There’s no shortage of data. In fact, there’s too much of it.

5 Critical Steps in the Effective Change Management Process. Guide + Best Practices

Change is constant, but without a structured approach, it can lead to confusion, resistance, and costly disruptions. A well-planned change management process ensures transitions happen smoothly, minimizing risks while keeping teams aligned and operations running efficiently. Whether adopting new technology, restructuring teams, or refining business strategies, organizations that manage change effectively turn challenges into opportunities for growth.

Celebrating 14K Stars on GitHub: Spring Update

Seeing that VictoriaMetrics products are this popular with engineers worldwide is fantastic: Just a little over a year ago, we hit 10K stars, and with the adoption of VictoriaLogs, the star count has now gone beyond 14K. We don’t take these GitHub Stars milestones for granted: It’s amazing to see these stats grow organically thanks to the community of users out there who use our products. Thank you so much!

Grafana Cloud updates: New observability as code tools, Grafana Drilldown enhancements, and more

We consistently roll out helpful updates and fun features in Grafana Cloud, our fully managed observability platform powered by the open source Grafana LGTM Stack: Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics. With GrafanaCON 2025 — and the release of Grafana 12 — earlier this month, there are a ton of Grafana Cloud updates to share.

Supercharge Telemetry Pipelines: Introducing Sources and Destinations in Cribl Packs

Cribl Packs have always provided a powerful way to package and share configurations across Cribl Stream environments. From pipelines to lookups, knowledge objects to functions—Packs make telemetry pipelines simple and portable. Now, we’re excited to announce a game changing expansion: Sources and Destinations can now be included in Cribl Packs!

Application Performance Monitoring Guide: Strategies, Best Practices, and Tools

With the introduction of cloud services and microservices, applications have become more complicated due to their increased layers of complexity and distributed architecture. While microservices clearly offer speed, they also make things harder for the developers and operations teams. These teams need to plan for the reliable and efficient performance of such applications. To combat these challenges, application performance monitoring (APM) has surfaced as an indispensable discipline.

Mastering Heroku Monitoring in 2025: Best Practices for Optimal Application Performance

In today's fast-paced digital landscape, ensuring the reliability and performance of your applications is paramount. Heroku, a cloud-based Platform-as-a-Service (PaaS), simplifies application deployment and scaling. However, to fully leverage Heroku's capabilities, effective monitoring is essential. This guide delves into best practices for monitoring Heroku applications, providing context, practical steps, and unique insights to enhance your observability strategy.

.NET Logging with Serilog and OpenTelemetry

Debugging modern .NET apps isn’t as simple as scanning logs anymore. With services spread out and systems growing more complex, it's easy to miss the bigger picture. Serilog gives you clean, structured logs. OpenTelemetry brings in traces and metrics to connect the dots. This guide covers how to wire up Serilog with OpenTelemetry, send logs to traces, and build an observability setup that helps you troubleshoot, without digging through disconnected logs for hours.
Sponsored Post

Hidden Risks in Linux Power Monitoring - And How to Fix Them

In today's enterprise IT landscape, Linux on IBM Power Systems plays a crucial role in powering mission-critical workloads. Industries such as finance, healthcare, telecommunications, and manufacturing rely on IBM Power's scalability, performance, and security to handle large-scale data processing, AI-driven analytics, and high-performance computing. As these environments continue to evolve, ensuring peak system performance and reliability is more important than ever.

How we use RUM to make design decisions that enhance user experience

Before we started using Datadog Real User Monitoring (RUM), we relied on frontend logging to gather data about the user experience. Logs gave us some helpful information about exceptions and errors but didn't provide any insight into issues directly related to the user’s perspective.

10 Best Compliance Monitoring Tools for 2025

In 2025, the role of compliance officers and risk managers has never been more complex—or more critical. New regulatory requirements, AI-generated content, and increasingly sophisticated cyber threats have dramatically raised the stakes. Here is just a small set of the pressing challenges facing compliance monitoring officials this year.

Finding the right Cisco Prime replacement: A guide to seamless network configuration management transition

With Cisco Prime Infrastructure approaching its EOL and EOS in September 2025, network administrators are at a crossroads. The transition away from this long-standing network configuration management tool necessitates a strategic evaluation of alternatives that align with organizational needs and budgets.

OpenTelemetry with Prometheus: better integration through resource attribute promotion

With the 3.0 release, Prometheus firmly established itself as the leading metrics database for OpenTelemetry. A lot of work has gone into integrating the two open source projects, including a major Prometheus enhancement we’re really excited about: resource attribute promotion.

Customize your incident response with new features in Grafana Cloud IRM

No matter where or how you work, we all have the same goal when an incident occurs: to get it resolved effectively and efficiently—and as quickly as possible. However, the way we achieve that goal isn’t always the same. We understand that different organizations operate differently, so you need flexibility from your IRM tooling.

Using Website Change Monitoring Software in the Age of AI Content

The rise of artificial intelligence has revolutionized the way businesses create and manage digital content. As of 2024, over 45% of marketing teams are actively using generative AI tools like ChatGPT, Claude, and Jasper to create website copy, blog posts, product descriptions, and more (Salesforce). Meanwhile, Gartner predicts that by 2026, 80% of content on the internet will be AI-generated.

How to Monitor Website Performance Smarter and Faster

Is your website really performing the way your users expect? In today’s digital world, even small slowdowns can mean lost revenue and damaged brand trust. That’s where ScienceLogic comes in. This video shows how ScienceLogic’s website monitoring gives you real-time and historical visibility across regions and infrastructure. From synthetic transactions to full-stack observability, you’ll see how to spot performance issues early, validate autoscaling, and ensure fast, reliable digital experiences.

Breaking the Cycle: How Intelligent Automation Frees IT to Drive Innovation

For decades, enterprise IT teams have operated in a state of controlled chaos. Pressured to keep digital lights on, these teams have spent far too much time buried in logs, swatting away alerts, and fighting fires one incident at a time. The familiar mantra—“do more with less”—has translated into a culture of reactive operations, where innovation takes a backseat to survival.

Optimize cross-platform mobile apps with Datadog RUM and Kotlin Multiplatform support

Mobile developers are increasingly adopting Kotlin Multiplatform to share business logic across iOS and Android. While Kotlin Multiplatform reduces duplication of code-writing efforts, it also introduces blind spots. Developers often lack real-time visibility into how shared code performs across platforms, making it harder to troubleshoot issues and monitor user experience.

Introducing the Datadog Developer Hub

Finding the right integrations, libraries, and open source tooling to extend a product has long been a challenge for developers. While Datadog has a vast offering of monitoring and observability solutions, many teams need to customize their setup in some way—whether by extending the Datadog Agent, integrating with third-party services, or using SDKs to interact with the Datadog API.

Monitoring AI Proxies to optimize performance and costs

Businesses deploying LLM workloads increasingly rely on LLM proxies (also known as LLM gateways) to simplify model integration and governance. Proxies provide a centralized interface across LLM providers, govern model access and usage, and apply compliance safeguards for smoother operations and reduced complexity—making LLM usage more consistent and scalable.

Turning Network Telemetry into Network Intelligence

By applying data engineering and machine learning to raw network telemetry, it’s possible to surface insights that would otherwise go unnoticed. Learn how this approach helps teams detect anomalies in real time, forecast capacity needs, and automate responses across complex, multi-domain environments.

Hybrid Cloud Monitoring: A Comprehensive Guide to Strategies, Best Practices, and Tools

Modern infrastructures are no longer confined to on-premises servers alone. Instead, they span cloud environments, containers, microservices, and globally distributed systems. This landscape, known as a hybrid cloud environment, has become the new norm for organizations, primarily because it offers the scalability of the cloud and ownership over specific elements afforded by an on-premises setup.

Want AI to be better at debugging? It's all about context

More code is being shipped today than ever before, accelerated by AI-powered code gen tools. We’re in a golden age for builders. But here’s the thing: software still breaks in production. According to a recent study by Microsoft, AI models struggle to debug software. It’s because most of these code gen tools lack the one thing every good developer relies on: context. To debug anything, you need context. Having AI tools doesn't change that.

Rollbar and ilert: Real-time error monitoring meets smart incident response

We’re excited to share that Rollbar is now part of the ilert integration catalog! This new technical partnership allows software teams to detect application errors in real time with Rollbar and instantly respond using ilert’s powerful alerting and incident management features. What is Rollbar? Rollbar is a comprehensive, real-time error monitoring and debugging platform designed to help development teams detect, diagnose, and resolve issues faster—before they impact users.

Forecasting with InfluxDB 3 and HuggingFace

Machine learning models must do more than make accurate predictions; they also need to adapt as the world around them changes. In real-world systems, data distributions shift due to seasonality, equipment wear, user behavior changes, or other external forces. If your models can’t keep up, the result is poor predictions. This can lead to outages, inefficiencies, or missed opportunities. That’s why forecasting systems need to be monitored and resilient, not just accurate.

Logs in Sentry: Now in Open Beta

You’re looking at an error in Sentry—a failed payment in your Flask backend or an unexpected null in your Node API. You’ve got the stack trace. The request details. Even the full trace. What you don’t have: the logs your app emitted right before everything went sideways. With Sentry Logs (now in open beta), you can send application logs straight to Sentry and see them automatically connected to the errors and traces you already use.

Bringing Custom Crash Responses to Unreal Engine

Show a customized, crash-specific message when your game crashes. Locked in an intense battle, hanging on for dear life, on the verge of nigh-impossible victory and then… boom! Positively, absolutely, unquestionably, no one wants a crash to interrupt their favorite game. Crashes are a frustrating yet inevitable part of gaming, and the only thing worse than being on the receiving end of a crash is being on the receiving end of the same crash repeatedly.

From Cost Centre to Compounding Advantage

Most teams still treat bugs like little fires to put out. A ticket gets logged. Someone investigates. A fix gets pushed. Then it’s onto the next one. But here’s the thing nobody tells you: Every bug is a chance to get smarter. And in 2025, the best teams aren’t the ones logging the fewest bugs. They’re the ones learning the most from every bug they fix.

Top 11 Application Logging Tools for DevOps Engineers in 2025

When something breaks in production, logs are usually where you start. They help you figure out what happened, where, and why. But with microservices architecture, logging isn't simple anymore. In a traditional monolithic application, logs live in one place. With microservices, they're scattered across multiple services, containers, and sometimes even data centers. What used to be a simple grep command now feels like solving a mystery without most of the clues.

Grafana Tempo vs Jaeger: Key Features, Differences, and When to Use Each

Both Grafana Tempo and Jaeger are distributed tracing tools designed for modern microservice architectures. Jaeger, released as an open-source project by Uber in 2015, has matured into a graduated CNCF project. Tempo, announced by Grafana Labs in October 2020, is a newer entrant focused on high-volume tracing with a unique storage architecture. Before comparing these tools in detail, let's quickly review what distributed tracing is and why it matters.

Is SCOM dead? Not even close - It has just evolved

System Center Operations Manager (SCOM) is far from dead. While a growing number of monitoring alternatives have emerged in recent years, SCOM in 2025 remains a critical tool, especially for organizations running hybrid environments, thanks to its stateful, object-oriented monitoring model and a rapidly evolving ecosystem of modern Management Packs (MPs).

IT Performance Challenges: Why They Persist and How to Solve Them for Good

IT Ops Problem Solver Series – Part 2: This article is a summary of a full report in our IT Ops Problem Solver Series. In this series, we’ll tackle the biggest problems facing IT Ops leaders and explore how some of Galileo’s clients are addressing them. In this part of the series, we delve into IT performance challenges and how to address them effectively.

Introducing Native Mobile Support in Honeycomb for Frontend Observability

You shipped your latest release. You tested it on emulators, QA devices, and the latest OS versions. But now it’s live and running on thousands or millions of mobile devices, across a jungle of screen sizes, hardware specs, OS versions, and network conditions. A user reports a crash on an old Samsung device over 3G. Someone else complains the app feels “sluggish” after updating. You dig through logs. Rebuild test cases. Ping the backend team. Try to reproduce. Yet, still no answers.

Learning from LFX Mentorship @ CNCF - Jaeger

By Hariom Gupta. Starting this journey was both exciting and fulfilling, and now, here I am at the finish line, having successfully completed the LFX Mentorship Program and reflecting on the experience through this blog. The past three months have been incredible, surpassing my expectations in so many ways.

SigNoz Launch Week 4.0 - OpenTelemetry Powered Innovations That Redefine Observability

OpenTelemetry is rapidly becoming the backbone of modern observability, but true innovation happens when you build directly on its latest capabilities. For Launch Week 4.0, we’re excited to showcase five powerful features; each crafted to help you get more value from your telemetry, make debugging faster, and deliver a unified observability experience. Here’s a quick look at what’s new, why it matters, and how SigNoz is pushing the boundaries of what’s possible with OTel.

Evaluating Synthetic Monitoring Platforms: What to Look for in 2025

Synthetic monitoring simulates user interactions with applications to proactively identify performance issues before they impact real users. Modern distributed systems require sophisticated monitoring capabilities to effectively test microservices, APIs, and complex user journeys across diverse environments. This article provides a framework to evaluate synthetic monitoring platforms in 2025.

Azure Monitor offers Grafana dashboards natively for immediate real time operational monitoring

The Grafanaverse just got a little bit bigger. Today at its annual Build conference, Microsoft introduced Azure Monitor dashboards with Grafana, a new service that provides Azure users with Grafana dashboards natively integrated in the Azure Portal at no additional cost and with little administrative overhead required.

Guide to Monitoring Apache Flink Using OpenTelemetry and MetricFire

Apache Flink is an open-source, distributed stream processing engine built for real-time, high-throughput data pipelines. It excels at processing continuous data streams with low latency, making it a great fit for use cases like fraud detection, log analytics, real-time dashboards, personalized recommendations, and IoT telemetry.

AI's Unrealized Potential: Honeycomb and DORA on Smarter, More Reliable Development with LLMs

Charity Majors, CTO and Co-founder at Honeycomb, and Phillip Carter, Principal Product Manager at Honeycomb, recently hosted a webinar with DORA's Nathen Harvey on AI's unrealized potential. As part of this, we created a 3-minute highlight reel of the webinar that you can watch.

Why a No-Index Observability Architecture is Essential

When was the last time you asked about the architecture behind your observability provider? For most IT professionals, whether in development, operations, or security, it’s not a question that naturally comes up. Yet this architectural detail could be the difference between insight at scale and runaway costs. People are drawn to the features, the shiny things. They promise to unlock insight, drive faster response times, and tighten security.

Getting Started with SolarWinds Orion Dashboards

SolarWinds is a popular IT infrastructure monitoring tool deployed on-prem, most well-known for its network and server monitoring capabilities. While it offers rich telemetry, it’s easy to miss the bigger picture. SquaredUp turns this complex monitoring data into clear, shareable dashboards that make it easier to spot trends, catch issues early, and keep everyone on the same page.

Tracing Funnels - Define funnels between spans | SigNoz Launch Week 4.0 Day 5

Build funnels directly on your traces and get instant answers to questions like: What fraction of spans made it from event A to event B? Between which spans are most requests failing? What is the latency between key spans? Traditional observability tools let you inspect traces and spans, but they can’t aggregate or analyze how requests flow across multiple services or stages in your system. In asynchronous, distributed architectures, the root span rarely tells the full story, and there’s no way to measure conversion, drop-off, or latency between arbitrary steps across all traces.
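
The funnel questions above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SigNoz implements it: the `Span` record, its field names, and the `funnel` helper are hypothetical, and the sketch counts conversion per trace (a trace "enters" when it contains step A and "converts" when step B starts at or after step A).

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record used only for this sketch."""
    trace_id: str
    name: str
    start_ms: float  # span start time in milliseconds


def funnel(spans, step_a, step_b):
    """Return (conversion, median_latency_ms) from step_a to step_b across traces."""
    # Earliest start time of each span name within each trace.
    first_seen = {}
    for s in spans:
        key = (s.trace_id, s.name)
        if key not in first_seen or s.start_ms < first_seen[key]:
            first_seen[key] = s.start_ms

    # Traces that reached step A at all.
    entered = [trace_id for (trace_id, name) in first_seen if name == step_a]

    # For traces that also reached step B afterwards, record the gap.
    latencies = []
    for trace_id in entered:
        a = first_seen[(trace_id, step_a)]
        b = first_seen.get((trace_id, step_b))
        if b is not None and b >= a:
            latencies.append(b - a)

    conversion = len(latencies) / len(entered) if entered else 0.0
    return conversion, (median(latencies) if latencies else None)
```

Calling `funnel(spans, "checkout", "payment")` on three traces where two reach `payment` would report a conversion of two thirds, plus the median gap between the two steps; drop-off is simply one minus the conversion.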

Future-Proof Your MariaDB-Based Services

We’re excited to announce the release of the NiCE MariaDB on Linux Management Pack, designed to deliver advanced monitoring and performance insights for organizations running MariaDB on Linux infrastructure. As MariaDB continues to power business-critical applications across industries, visibility into its performance, availability, and health becomes essential.

SOC 2 Type 1 Compliance: Netdata is committed to Security and Trust

We are pleased to announce that Netdata has successfully achieved SOC 2 Type 1 attestation! Following an independent examination performed by AssuranceLab CPAs LLC, the report confirms that—as of April 25, 2025—the design of Netdata’s controls meets the Security, Availability, and Confidentiality Trust Services Criteria defined by the AICPA. At Netdata, the security and integrity of the monitoring data our users entrust to us are paramount.

Synthetic Testing Examples: User Flow Testing, APIs Validation, Custom Metrics, Log Ingestion, and More

Starting from scratch with synthetic testing of your web properties and APIs can be difficult. Questions like “what should we be testing?” will very quickly become exercises in figuring out “how can we actually do that?” which may involve sifting through various elements of the DOM or JSON responses. But there are shortcuts to synthetic testing mastery!

Create a status page for your production service in 5 minutes

“When are we going to tell users about this?” By the time your incident response team asks this question, it’s already too late. During an outage, communicating about downtime with your user base has three main drawbacks. Instead, it’s better to create a status page that automatically shares the status of all your services in a format that users can easily understand. You’ll build trust with your users as you proactively share service status, lessening the perceived impact of incidents.

Monitoring Oracle Cloud Load Balancer: Unlock peak performance with Applications Manager

Imagine you’re running a popular online learning platform that experiences a surge in traffic during peak hours, right before exams. Students worldwide are logging in simultaneously, watching videos, submitting assignments, and taking tests. If your Oracle Cloud Load Balancer isn’t distributing traffic efficiently or back-end servers are struggling to keep up, students could face slow loading times or service outages.

A Mindset Shift: Making Observability Integral to DevOps Practices: Datev & OpenTelemetry | Grafana

In the evolving landscape of DevOps, observability is no longer optional—it’s a fundamental pillar of success. During this session, Gunter from Datev explores the critical mindset shift required to make observability an integral part of DevOps practices.

The Control Plane Highway: Networking's Hidden Infrastructure

When we discuss networks, we typically envision data packets racing along physical wires like vehicles on a highway. But beneath this visible traffic flows another critical pathway that few recognize: the control plane highway. This unseen infrastructure, where routing information flows between devices, makes the data highway possible. Before user data can flow, millions of paths must be established, creating a parallel network of equally vital importance.

Enhance Security with SAML: Pandora FMS Now Supports Azure Entra ID

In modern enterprise environments, access management is key to ensuring security and regulatory compliance (ENS, ISO 27001, NIS2, etc.). That’s why Pandora FMS has added support for Azure Entra ID, enabling authentication through SAML (Security Assertion Markup Language). With this integration, we provide simplified and secure access to our platform using Single Sign-On (SSO).

Auto Scaling of Kubernetes Workloads Using Custom Application Metrics

Orchestration platforms such as Kubernetes and OpenShift help customers reduce costs by enabling on-demand, scalable compute resources. Customers can manually scale out and scale in their Kubernetes compute resources as needed. Autoscaling is the process of automatically adjusting compute resources to meet a system's performance requirements. As workloads grow, systems require additional resources to sustain performance and handle increasing demand.
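
As a rough illustration of metric-driven autoscaling, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance. The function name, the replica bounds, and the choice of metric are assumptions made for this example; the excerpt does not say which mechanism the article itself uses.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional scaling rule in the style of the Kubernetes HPA.

    current_metric and target_metric are per-pod averages of a custom
    application metric (for example, requests per second per pod).
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric
    # Avoid flapping: do nothing when the metric is close enough to target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, desired))
```

With 4 pods each handling 150 requests per second against a target of 100, the rule asks for 6 pods; at 105 requests per second the 10% tolerance leaves the replica count unchanged.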

Level Up Your Network Visibility: DX NetOps Topology is Now Generally Available

The wait is finally over! We are thrilled to announce that DX NetOps 24.3.9 marks the official general availability (GA) of DX NetOps Topology, a key milestone in our network observability journey. After a successful early access program with many customer deployments, we are excited to bring this highly anticipated solution to the broader community. DX NetOps Topology is designed to provide you with the insights and operational efficiency needed to manage both traditional and software-defined networks.

JVM Metrics: A Complete Guide for Performance Monitoring

Your Java app slows down during peak load. A microservice crashes, but logs aren’t helpful. These aren’t rare events—they’re common signs something’s off inside the JVM. For Java developers and DevOps teams, JVM metrics offer clues to what’s going on. This blog covers the key metrics to track, what they tell you, and how to use them to troubleshoot performance issues in a practical, no-nonsense way.

Best Practices to Ensure Effective Downtime Communication

When systems go down, users don't just lose access, they lose trust if they're left in the dark. That's why having a clear plan for downtime communication matters just as much as restoring service. Whether you're managing a cloud platform, SaaS tool, or any digital service, how you respond during a disruption can shape your reputation long after the issue is resolved. While downtime is inevitable, confusion and frustration don't have to be.

Database monitoring in Financial Services: why this high-stakes sector requires a scalable, more comprehensive solution

IT and data teams in Financial Services must meet the more exacting demands for data integrity, compliance, performance, high availability and security that are expected in the sector. These demands require a dedicated, comprehensive, and scalable monitoring solution to help teams succeed in this high-stakes environment.

Transforming Observability: Simpler, Smarter, and More Affordable Data Control

At Mezmo, we’ve always believed that observability should empower innovation, not hold it back with complexity and unpredictable costs. However, as organizations scale and data volumes continue to explode, the old ways of managing telemetry data aren’t sustainable.

Tracing Funnels - Define funnels b/w spans in your distributed systems

Distributed tracing has long been the go-to for understanding the performance of microservices and asynchronous systems. But as systems grow in complexity, simply viewing individual traces and spans isn’t enough. Teams need to answer questions about how requests flow across many traces, which individual trace views cannot do. SigNoz Tracing Funnels is here to change that, bringing the clarity of product-analytics-style funnel analysis to backend traces, and doing so in a way that’s never been available before.

Linux Security Logs: Complete Guide for DevOps and SysAdmins

Security logs are the quiet sentinels of your Linux systems, recording critical information that can mean the difference between detecting an intrusion and discovering a breach months too late. For most DevOps professionals and system administrators, these logs contain valuable insights that often go untapped. While they're essential for compliance, their real value lies in providing visibility into your system's security posture and operational health.

Prometheus vs Zabbix: A Hands-On Technical Comparison and a Modern Alternative

When choosing a monitoring tool, two popular names often come up, Prometheus and Zabbix. Both are powerful and widely adopted but come with different approaches and learning curves. Prometheus is favored in cloud-native environments for its time-series data model and flexibility, while Zabbix has long served traditional IT infrastructures with its rich agent-based monitoring. But what if you are looking for a simpler, more unified solution?

You, Me, and BugSplat's MCP

Let's face it - from an experienced developer's perspective, most software trends are, put lightly, incredibly annoying. The last thing a grizzled, old, technical wizard wants to hear is some half-brained junior developer telling them to switch their SQL server to MongoDB, replace the PHP EC2 with serverless Python, or rewrite their entire front-end with HTMX. The hype-train is so intense that even watching TV feels risky, as you might see something as absurd as an ad for AI toothpaste.

Tracealyzer Was Just the Beginning

If you’ve been building embedded systems for a while, chances are you know Percepio for Tracealyzer. And we’re proud of that. For over a decade, Tracealyzer has been helping engineers visualize and solve complex RTOS issues faster, with over 30 ways to slice and understand system behavior. But in 2025, embedded systems demand more. They’re always on. Always connected. And increasingly, always business-critical.

CI/CD Observability Powered by OpenTelemetry | SigNoz Launch Week 4.0 Day 4

Tired of guessing why your releases stall, which PRs are stuck, or where flaky tests are wasting your team’s time? Most teams obsess over production monitoring, but what about the bottlenecks that often hide in the CI/CD pipeline, slowing delivery, draining productivity, and introducing risk before code ever ships? With CI/CD Observability, you can stop flying blind in your delivery process and make every release faster, more reliable, and fully transparent!

State of the Observability Databases with Dee Kitchen (Grafana Office Hours #30)

In this Grafana Office Hours, we talk about the state of observability databases (Grafana Loki, Mimir, Tempo, and Pyroscope) and where they're going. We talk about current and upcoming architectural changes in all four, how we're making them more performant, how compatible they are with OpenTelemetry, and what we're working on next for each database. In this conversation are Dee Kitchen (VP of Engineering for Databases) and Senior Developer Advocates Jay Clifford and Nicole van der Hoeven.

Monitoring your MCP Server in Production (with Sentry)

So you're building an MCP server for your project or service, to allow AI chatbots and agents to interact with it? Great! You've decided to build it using Cloudflare Workers, have written the code, shipped it, and the first users are getting onboard: you're officially running it in production. That's when problems start. I'm not here to dissuade you from shooting your shot, but let's make sure you've got your bases covered in production when something inevitably goes wrong.

CI/CD Observability Powered by OpenTelemetry

Modern engineering teams spend a lot of time and resources in setting up monitoring of their production systems - tracking uptime, catching errors, and responding to incidents before customers ever notice. But what about the journey before code reaches production? For most teams, observing the CI/CD pipeline is either an afterthought or completely overlooked. While we recognize its importance, do we truly understand how well our CI/CD process is functioning?

Understanding Your App's Health With Core Mobile Vitals

Mobile apps are a little different from services run on servers. You build your mobile app, you ship it off to the world, and then it gets run by the end user on their own machine. If your app is running poorly on some percentage of users’ devices, you may never know. That’s where observability comes in. There are certain important metrics that every mobile app has in common.

Top 5 Benefits of a Status Page Aggregator

According to the 2024 State of SaaSOps report, organizations now use an average of 112 SaaS applications. That’s 112 potential points of failure. Manually checking or subscribing to each of those status pages is not scalable. Even small teams often rely on 30+ services spanning infrastructure, communication, payments, and security. A status page aggregator like StatusGator consolidates service statuses from hundreds or even thousands of providers into a single, unified view.

AWS Lambda's INIT billing update: What's changing and why it matters for your cloud costs

Starting on Aug. 1, 2025, AWS will bill for the initialization (INIT) phase of Lambda functions, bringing a key change to how you are charged for serverless workloads. This billing update will impact functions using managed runtimes with ZIP archive packaging, which previously excluded the INIT phase from the billed duration. For teams that rely heavily on AWS Lambda, this is a small but significant change. The INIT phase, while short, could introduce costs that were previously invisible.
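
To get a feel for the size of the change, the arithmetic can be sketched as below. The function names are made up for this example, and the price per GB-second is passed in as a parameter because rates vary by region and architecture; the value used in the usage note is illustrative only. The sketch assumes duration is billed rounded up to the nearest millisecond.

```python
import math


def billed_duration_ms(init_ms, invoke_ms, bill_init):
    """Billed duration for one cold-start invocation, rounded up to 1 ms.

    bill_init=False models the old behaviour for ZIP-packaged managed
    runtimes (INIT excluded); bill_init=True models the new behaviour.
    """
    total = invoke_ms + (init_ms if bill_init else 0.0)
    return math.ceil(total)


def invocation_cost(duration_ms, memory_mb, price_per_gb_second):
    """Duration charge for one invocation: seconds x GB x price."""
    return (duration_ms / 1000.0) * (memory_mb / 1024.0) * price_per_gb_second
```

For a function with a 400 ms INIT phase and a 100 ms handler, the billed duration of a cold start goes from 100 ms to 500 ms. At an assumed rate of 0.0000166667 USD per GB-second and 1024 MB of memory, that is still a tiny amount per invocation, so the overall impact depends mostly on how often cold starts occur.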

The Datadog Agent: Why it's essential for monitoring your infrastructure and applications with Datadog

If you’re a Datadog customer, you’re likely using our platform to gain visibility into your infrastructure and applications and to troubleshoot using logs, metrics, and traces when issues arise. To support these efforts, you’ll want access to the most granular telemetry signals and intuitive workflows that streamline your investigation.

3 ways to drive software delivery success with Datadog DORA Metrics

Delivering software quickly and reliably is the main focus of modern DevOps. But to improve your delivery performance, you need to understand it, and that starts with measurement. Teams primarily measure performance in this area by using DORA metrics: deployment frequency, change lead time, change failure rate, and time to restore service. These metrics help teams understand trends in their software delivery practices in quantifiable terms that they can track and improve over time.
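
As a minimal sketch of how these four metrics can be derived from raw deployment records (independent of any Datadog API), consider the following. The `Deployment` record and its fields are hypothetical, and the definitions are deliberately simplified: lead time is commit-to-deploy, and time to restore is measured from a failed deployment to its recovery.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record used only for this sketch."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def _hours(delta):
    return delta.total_seconds() / 3600.0


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics over a window of deployments."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [_hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [_hours(d.restored_at - d.deployed_at)
                for d in failures if d.restored_at is not None]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restores) if restores else None,
    }
```

Medians are used rather than means so that a single unusually slow change or long outage does not dominate the trend.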

Unify your FinOps and engineering workflows in Datadog Cloud Cost Management

As your applications scale across cloud and SaaS providers, allocating costs and optimizing workloads become increasingly important—and challenging. Without access to cost data in their daily workflows, engineering teams can’t easily understand the cost of their resources and identify where they can reduce their spend. And while FinOps teams have access to cost data, they often review this information in silos.

Improve user access and admin controls with the latest platform updates from Sumo Logic

By centralizing your mission-critical logs, metrics, traces, and events from all of your systems into one platform, Sumo Logic enables teams across development, security, and operations to operate from a single source of truth. While this unified approach is crucial for fast issue identification and minimizing downtime from infrastructure failures or security breaches, not everyone on your team needs access to every bit of data.

7 Best Network Configuration Management Tools

If you want a secure, efficient, and compliant network, network configuration management is a must. Whether managing a small network or being responsible for a large enterprise system, having the right solution can make all the difference. Network configuration management tools provide valuable insights into devices on your network, and they can help quickly restore previous configurations in the event of a failure, misconfiguration, or security incident. What is network configuration management?

Debugging Microservices

Debugging microservices is tough, especially when you're juggling multiple services and relying only on logs. This video cuts through the complexity by showing you how to implement distributed tracing using Sentry. You'll see a practical demonstration in a food ordering app (built with React and Go) of how tracing can give you a clear view of your entire request flow, from the initial button click to the final operation across all your services.

vmalert - Maximize Your Monitoring - Tech Talk #5

This time, we're diving into a critical component for operational excellence: vmalert. Effective alerting is the backbone of proactive monitoring, enabling teams to detect and respond to issues swiftly before they impact users. But setting up truly effective alerting – alerts that are reliable, actionable, and low-noise – requires understanding the tools and best practices.

Getting started with ServiceNow dashboards

ServiceNow is a cloud-based platform that streamlines IT service management, operations, and various business workflows across organizations. Dashboards in ServiceNow can play a valuable role by offering a clear view of key metrics, trends, and performance indicators. While the ServiceNow portal includes built-in dashboards, they often fail to provide a fuller picture of the impact of incidents in context with other key metrics from external tools.

CI/CD Observability Powered by OpenTelemetry and SigNoz

Most teams have strong monitoring for production, but what about the journey before your code gets deployed? The CI/CD pipeline is where bottlenecks, flaky tests, and process gaps silently waste your team’s time. Until now, this part of the workflow has mostly been a black box. We’re excited to announce CI/CD Observability in SigNoz - a new way to track, analyze, and improve your software delivery process, powered by OpenTelemetry.

8 Network Statistics IT Pros Should Know to Understand and Optimize Network Performance

Slow Zoom calls, dropped VPN connections, and lagging applications sound familiar? These common network frustrations often stem from underlying performance issues that could be diagnosed and resolved with the right data. For IT professionals, raw network metrics alone aren’t enough. To truly optimize performance, you need network statistics: aggregated, analyzed, and interpreted insights that turn numbers into actionable decisions.

Visualize Amazon Aurora, Zendesk, and more: What's new in Grafana data sources

One of our biggest goals at Grafana Labs is to help you unify and derive value from your data, regardless of where that data lives. As a result, we’re fully committed to making Grafana an open, composable, and extensible observability platform. Last week at GrafanaCON 2025, where we celebrated the launch of Grafana 12, we highlighted one of the key ways we deliver on this promise of openness and extensibility: our broad ecosystem of Grafana data sources.

Introducing SCIM provisioning in Grafana: Enterprise-grade user management made simple

We’re excited to share that SCIM provisioning is available in public preview for Grafana Enterprise and Grafana Cloud Advanced! This powerful feature, introduced last week at GrafanaCON 2025 as part of the Grafana 12 release, transforms how organizations manage users and teams in Grafana, bringing automated user lifecycle management and enhanced security to your observability platform.

Third party API Monitoring powered by OpenTelemetry semantics

In today’s cloud-native world, third-party APIs are everywhere. Payments, notifications, search, AI, analytics: modern applications are built on a web of external services. But what happens when one of those APIs slows down, starts throwing errors, or gets rate-limited? Suddenly, your users are facing issues, and you’re stuck asking whether the problem lies in your own code or in a service you don’t control.

AI at the Edge: Why Smart Data Placement is the Key to Unlocking Its Power

As organizations increasingly deploy AI solutions, I am seeing more and more that the strategic placement of data, particularly at the edge, is becoming paramount to unlocking AI’s full potential. This is a view shared by our partners at Riverbed, as highlighted in a recent white paper, Accelerating AI and Data Movement at the Edge. Edge computing enables businesses to perform complex operations at production sites by positioning compute resources nearer to users and operations.

Third party API Monitoring Powered by OTel Semantic Conventions | SigNoz Launch Week 4.0 Day 3

Is it the third-party API or my code? Your service suddenly slows down, or errors spike, and you’re stuck guessing if it’s your own logic or an external API you don’t control. We’ve seen this pain across teams: dashboards don’t tell you which vendor or endpoint is the culprit, and debugging turns into a maze of guesswork. Rate limiting, vendor errors, or integration issues often slip through until users complain.

Introduction To Browser Checks | Grafana Cloud Synthetic Monitoring

Learn how to set up browser checks using Grafana Cloud Synthetic Monitoring. In this video, we walk through how to create a browser check and analyze test results. Browser checks simulate real user interactions to track critical workflows and catch issues early.

Tracing Funnels - Define funnels b/w spans in your distributed system

Build funnels directly on your traces and get instant answers to questions like: What fraction of spans made it from event A to event B? Between which spans are most requests failing? What is the latency between key spans? Traditional observability tools let you inspect traces and spans, but they can’t aggregate or analyze how requests flow across multiple services or stages in your system. In asynchronous, distributed architectures, the root span rarely tells the full story, and there’s no way to measure conversion, drop-off, or latency between arbitrary steps across all traces.

IIS server: Uses, benefits, and challenges

Internet Information Services, commonly referred to as IIS, is Microsoft's web server software. It is built to host websites, applications, and services for Windows systems. If you are considering IIS for hosting your website or applications, let us take you through the basics of IIS, its benefits over the other options available, and the common pitfalls organizations face when they opt for IIS. IIS has evolved a lot since its GA release in 1995.

Introducing Metrics Explorer | SigNoz Launch Week 4.0 Day 2

Ever tried to build a metrics dashboard and thought, “Wait, what metrics am I actually sending?” We heard this from users again and again, so we built Metrics Explorer. For the first time, you get a real-time, interactive view of every metric coming into your system. Whether you’re onboarding a new integration, debugging an alert, or just exploring your data, Metrics Explorer makes it easy to understand and work with your metrics: no more guesswork, just clarity.

Level Up Your Confidence & Problem-Solving Skills! | How Gaming Boosts Workplace Success

Gamers learn more than just strategies—they learn self-confidence and resilience. Success in-game translates to success at work, with a new mindset: there’s always a way to solve the problem, even if it means leveling up your skills!

Investment Trends in Infrastructure Monitoring Market: What Users Should Know

In recent months, the IT monitoring landscape has seen notable investment activity. These developments are part of a larger trend, with investment firms showing growing interest in the IT monitoring sector. Although such activity isn’t unusual in the tech industry, the current intensity and frequency indicate that IT monitoring is emerging as a key area for growth-oriented strategies.

Contextual Observability: Using Tagging and Metadata To Unlock Actionable Insights

Observability isn’t about collecting more telemetry — it’s about making that telemetry data meaningful. Contextual observability transforms raw telemetry into actionable insights by enriching it with consistent tagging and metadata. Without context, telemetry data remains fragmented, troubleshooting slows, and aligning with business priorities is nearly impossible.

Baseline configuration management: Why it's critical for network stability

Imagine this: You've onboarded 30 new switches, 15 firewalls, and 20 routers into your network. You assume they all follow company policy. But months later, half of them are misconfigured, a few are running vulnerable firmware, and one rogue device is exposing ports it shouldn't. That’s not poor luck—that’s poor baseline configuration.
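
The kind of drift described above can be caught by comparing each device's running settings against the approved baseline. The sketch below assumes configurations have already been parsed into flat key/value dictionaries; the `config_drift` helper and the example keys are hypothetical and not tied to any particular product.

```python
def config_drift(baseline, running):
    """Return {setting: (expected, actual)} for every setting that differs.

    A value of None means the setting is missing on that side, so the
    result covers changed, missing, and unexpected extra settings alike.
    """
    drift = {}
    for key in sorted(baseline.keys() | running.keys()):
        expected = baseline.get(key)
        actual = running.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift
```

For instance, comparing a baseline of `{"firmware": "1.4.2", "telnet": "disabled"}` with a device reporting `{"firmware": "1.3.0", "telnet": "disabled", "port_8080": "open"}` flags the outdated firmware and the unexpected open port, while an empty result means the device still matches the baseline.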

Leading analyst firm reveals the real cost of internet disruptions

‘Without the internet, Digital Experiences do not exist,’ begins Increase Revenue and Improve Customer Experience with Internet Performance Monitoring, a study commissioned by Catchpoint to quantify the financial damage from internet outages. At first glance, that might seem painfully obvious—like pointing out that water is wet. But pause for a moment and consider: the digital experience today isn't just an aspect of business; it is the business. Suddenly, the stakes feel very different.

Comprehensive Guide to Developing and Deploying a Python API with Docker and Kubernetes (Part I)

In the evolving landscape of software development, containerization and orchestration have become pivotal. Docker and Kubernetes stand at the forefront of this transformation, offering scalable and efficient solutions for application deployment. This guide provides a detailed walkthrough on developing a Python API, containerizing it with Docker, and deploying it using Kubernetes, ensuring a robust and production-ready application.

Workshop: Mobile App Monitoring Platforms Don't Have To Be Noisy

Debugging mobile apps shouldn’t mean drowning in alerts or spelunking through logs just to figure out why your app stuttered or froze. Most tools flood you with noise and leave you guessing. In this workshop, we’ll show you how to use Sentry to cut through the noise and zero in on what actually matters—whether it’s jank from blocked main threads, ANRs in production, dropped frames during scroll, or regressions that somehow made it to production.

Your incident response plan is obsolete-unless it includes agentic AIOps

Why are we still handling IT incident response like it’s 2014? Every day, ITOps teams are flooded with alerts, spread thin across hybrid systems, and stuck trying to stitch together visibility from solutions that don’t talk to each other. The incidents keep coming, but the tools aren’t getting smarter—and the humans are burned out. Even with best practices in place, response is often slow, inconsistent, and reactive. You chase symptoms instead of solving problems.

How to easily connect Prometheus to Grafana Cloud

Prometheus is one of the most popular open source monitoring tools due to its powerful flexibility for collecting time series metrics. But raw metrics aren’t always helpful on their own. That’s where Grafana Cloud comes in. By connecting Prometheus to Grafana Cloud, you get rich visualizations, alerts, and dashboards that make your data actionable without having to manage any additional infrastructure.

Reality Bites: 7 Key Disadvantages of Real User Monitoring

Real estate professionals have said for years that the three most important factors about a property are location, location, and location. Well, for organizations with a web presence — which these days is the vast majority, and 100% of e-commerce companies — the three most important factors about their site are visitor experience, visitor experience, and (let’s all say it together!) visitor experience.

Tracing Just Got a Whole Lot More Useful: Search, Visualize, and Alert with Sentry's new Query Engine

For a while, tracing in Sentry was... fine. You could open up a slow transaction, poke around, find the N+1, and feel like a hero. But if you wanted to answer more complex questions - like why your payment API was getting slower in Europe, or which CDN was silently tanking your image loads - things got harder. We didn't really build it to help with answering broad questions.

5 Must-Have Python Plugins for InfluxDB 3 Core & Enterprise

InfluxDB 3 is our latest time series database built for real-time analytics and high-volume data. Its Python Processing Engine lets developers run custom scripts known as plugins to process data, trigger alerts, or integrate with external systems via HTTP web requests. To demonstrate what’s possible, we’ve developed several plugins, all of which are available in the influxdb3_plugins GitHub repository. This public repo is open for anyone to use, modify, and contribute to.

An Introduction to Ecto for Elixir Monitoring with AppSignal

Database performance can make or break your Elixir application. While Ecto provides a powerful toolkit for database interactions, understanding how these operations perform in production is critical. Whether you're dealing with slow queries, connection pool issues, or mysterious N+1 problems, the ability to effectively monitor and optimize your database operations can be the difference between a sluggish application and one that delights your users.

How to Monitor PowerShell Activity and Detect PowerShell Exploitation Vulnerabilities

Why should you monitor PowerShell? PowerShell is a powerful scripting tool that can automate tasks and manage systems, but its capabilities and flexibility also make it a prime target for exploitation by cyber attackers. Implementing a robust, automated PowerShell monitoring solution is now essential to detect and prevent exploitation attacks before they compromise your systems.

Metrics Explorer - Search, Query, and Analyze all your Metrics at one place

If you’ve ever found yourself staring at a dashboard dropdown, wondering, “What metrics am I even sending to my observability tool?”, you’re not alone. For most engineering teams, answering even the most basic telemetry questions is about as hard as catching a Mewtwo: frustratingly elusive and way more complicated than it should be. We built Metrics Explorer to finally answer all of these questions instantly, and in one place.

Community and Collaboration: Lessons from Gaming - SolarWinds TechPod 098

This episode explores the intersection of gaming, particularly MMOs, and its impact on workplace dynamics. Dr. Melika Shirmohammadi and Mostafa Ayoobzadeh join hosts Chrystal Taylor and Sean Sebring to discuss their motivations for studying the positive aspects of gaming, the skills that can be transferred to professional environments, and the social stigma surrounding gaming as a hobby.

Debug Logs and Analyze Trends with Log Data Rehydration

Everyone in your organization needs logs to perform the critical functions of their job. Developers need them to debug their applications, security engineers need them to respond to incidents, and support engineers need them to help customers troubleshoot issues. These various use cases create general requirements for enriched log data, often including accessing insights from outside typical retention windows.

Users are complaining, but your internal monitoring is showing green across the board?

Chances are the issue is somewhere between you and your users. To deliver seamless digital experiences, you need to monitor the entire Internet Stack. From DNS and BGP to CDNs and third-party services, Internet Performance Monitoring (IPM) helps you find and fix what traditional tools can’t see.

Key metrics for monitoring Airflow

Airflow is a popular open source platform that enables users to author, schedule, and monitor workflows programmatically. Airflow helps teams run complex pipelines that require task orchestration, dependency management, and efficient scheduling across many different tools. It’s particularly useful for creating data processing pipelines, orchestrating task-based workflows such as machine learning (ML) training, and running cloud services.

Microsoft Outlook rolls out stricter email authentication requirements for high-volume senders to enhance security

Microsoft Outlook.com (which includes hotmail.com, live.com, and outlook.com) is implementing new email authentication procedures in an attempt to improve email security and preserve customer confidence. These modifications, which came into effect on May 5, 2025, are intended especially for high-volume senders, or those who send more than 5,000 emails every day.

Unifying OpenTelemetry & Datadog | #Observability #OpenTelemetry #datadog

Previously, teams had to choose between adopting the OpenTelemetry Collector’s capabilities and fully leveraging our advanced features. On This Month in Datadog, we’re spotlighting our OTel Collector distribution, which unifies OTel and Datadog. Check out the link in our bio to watch the new episode.

Track GitHub Copilot Usage with Datadog #GitHubCopilot #Datadog #DevTools

Easily track GitHub Copilot usage across your organization with our new integration. On This Month in Datadog, we’re covering this integration, Datadog CoTerm, and the new Optimization page in Datadog Real User Monitoring. Check out the link in our bio to watch the new episode.

Deep Temporal Observability | SigNoz Launch Week 4.0 Day 1

If Temporal powers your business-critical workflows, you know how tough it is to get real visibility into what’s happening under the hood. Most tools only show basic Prometheus metrics, leaving you guessing about bottlenecks, failures, and performance issues. Join us for a live demo of SigNoz’s industry-first Temporal integration. Whether you’re running Temporal in production or just exploring workflow orchestration, this session will show you how to move from “just metrics” to true, unified observability.

Angular OpenTelemetry Setup and Troubleshooting

Implementing observability in Angular applications presents unique challenges. Understanding how users experience your application and identifying performance bottlenecks requires specialized tools and approaches. This guide covers implementing OpenTelemetry in Angular applications, with practical code examples for instrumentation, data collection, and integration with observability backends.

Ubuntu Cron Logs: A Complete Guide for Engineers

Troubleshooting failed cron jobs without proper logging can be frustrating. Ubuntu cron logs record the execution of scheduled tasks, helping you identify what's working and what isn't. This guide covers what engineers need to know about Ubuntu cron logs – from finding them to analyzing their contents and setting up effective monitoring solutions.
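
As a rough illustration of analyzing cron log contents, the sketch below pulls cron job executions out of syslog-style lines. It assumes cron messages are written to the system log (on Ubuntu this is typically /var/log/syslog) in the traditional "Mon DD HH:MM:SS host CRON[pid]: (user) CMD (command)" layout; systems configured with a different timestamp format would need an adjusted pattern. The names `CronRun` and `parse_cron_runs` are hypothetical, chosen for this example only.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed layout (traditional syslog format), e.g.
# "May 12 03:00:01 web01 CRON[12345]: (root) CMD (/usr/local/bin/backup.sh)"
CRON_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s+(?P<host>\S+)\s+"
    r"CRON\[(?P<pid>\d+)\]:\s+\((?P<user>[^)]+)\)\s+CMD\s+\((?P<command>.*)\)\s*$"
)


class CronRun(NamedTuple):
    """One cron job execution extracted from a log line (hypothetical type)."""
    timestamp: str
    host: str
    user: str
    command: str


def parse_cron_runs(lines: Iterable[str]) -> Iterator[CronRun]:
    """Yield a CronRun for every line that records a cron command execution."""
    for line in lines:
        match = CRON_LINE.match(line)
        if match:
            yield CronRun(
                match["timestamp"], match["host"], match["user"], match["command"]
            )


if __name__ == "__main__":
    sample = [
        "May 12 03:00:01 web01 CRON[12345]: (root) CMD (/usr/local/bin/backup.sh)",
        "May 12 03:00:02 web01 sshd[99]: Accepted publickey for deploy",
    ]
    for run in parse_cron_runs(sample):
        print(run.timestamp, run.user, run.command)
```

To run this against a real log, pass an open file object (for example, one opened on /var/log/syslog) to `parse_cron_runs` instead of the sample list.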

OpenAI's 'AI in the Enterprise' Report: A Must-Read - But One Crucial Piece Is Missing

We are standing at the threshold of one of the most transformative technological shifts in modern enterprise history. AI is no longer on the horizon – it’s here, it’s powerful, and it’s already reshaping the way businesses think about productivity, creativity, and competitive advantage. OpenAI’s recent report, ‘AI in the Enterprise‘, offers a concise and thoughtful roadmap for leaders seeking to implement AI within their organizations.

IBM's AI Just Replaced 94% of HR functions - What's Stopping You?

At IBM’s Think conference this week, the company made a bold announcement: 94% of its HR functions are now handled by AI, a shift they claim will generate $3.5 billion in savings over the next two years. These are staggering numbers. And while the cynic in me can’t ignore that this announcement was made at what is, effectively, a sales conference – especially one that coincided with the launch of IBM’s AI Agent Store – the scale of those numbers deserves attention.

Business Process Automation, Explained

Business process automation no longer sits on the sidelines. What was once an emerging technology is now the engine behind modern business operations. In fact, around 60% of companies already use automation tools in their workflows, according to Duke University. This is not just companies — developers are also contributing to this shift by adopting low-code, no-code, and digital process automation platforms. These new tools remove barriers that once slowed innovation.

Metrics Explorer - Search, Query, and Analyze all your Metrics at one place

Ever tried to build a metrics dashboard and thought, “Wait, what metrics am I actually sending?” We heard this from users again and again, so we built Metrics Explorer. For the first time, you get a real-time, interactive view of every metric coming into your system. Whether you’re onboarding a new integration, debugging an alert, or just exploring your data, Metrics Explorer makes it easy to understand and work with your metrics: no more guesswork, just clarity.

Deep Temporal Observability - Correlate Metrics with Logs & Traces

Temporal lets you orchestrate complex, reliable workflows, but when something breaks or slows down, the built-in dashboards only give you a list of events and some basic filters. You can see what happened and filter by attributes like workflow type or namespace, but you can't drill deeper. There's no way to jump straight from a metric spike to the exact trace or log line you care about.

Gotta Go Slow

The last few months have been wild. Some of the busiest of my life, actually. For context: I’m Canadian, and all of this happened during the continued threats of annexation. All this to say, it’s been rough. I anticipated this would be a challenging time and that I would be exhausted. So, the plan became: do all the demanding things, take my sabbatical in May, and use April as an ‘in-between’ period with a bit less pressure.

We Saw That IT Outage Coming-And Stopped It: Why AIOps Deployment is a Game-Changer

It’s 3:12 AM. Somewhere in a company’s global cloud infrastructure, a server cluster begins to show unusual read/write patterns. Traditionally, IT teams wouldn’t notice until dashboards light up with red alerts—often too late to avoid an outage that costs thousands, even millions, in lost revenue and trust. But this time, it’s different.

Managing monthly reports with the API

On the first of every month we generate an extensive PDF report for every site. This report contains a summary of all check results for the month and is a snapshot available to you and your team via email and the Oh Dear dashboard. We keep the report history so each month can be viewed in a browser or downloaded as a PDF. This report can also be emailed to any email address - not just team members - perfect for keeping your customers informed.

Building a Culture of Observability Through Ownership

There’s a problem in engineering culture that we don’t talk about enough: observability is an afterthought. It’s treated as tooling, not thinking. As a checkbox, not a habit. And that mindset gap creates real consequences: longer outages, frustrated teams and massive business costs. Atlassian’s Incident Management for High-Velocity Teams overview cites a 2014 Gartner study that put the average cost of IT downtime at $5,600 per minute.

Securing IoT Devices with Firewall Monitoring: A Comprehensive Guide

The proliferation of Internet of Things (IoT) devices has transformed various sectors, offering enhanced efficiency and connectivity. However, this expansion also introduces significant security challenges. Implementing robust firewall monitoring is essential to protect these devices and the networks they inhabit.

Splunk Observability Cloud's AI Assistant in Action | Practical Examples | Part 2

In this video, we'll explore practical ways to utilize the AI Assistant in Splunk Observability Cloud. Through real-world scenarios, learn how the AI Assistant can help you interpret metrics, contextualize data, onboard new team members to your organization, and automate tasks via the Splunk Observability Cloud API. AI Assistant in Splunk Observability Cloud enhances observability by providing actionable insights and streamlining workflows.

Real-Time Monitoring Solutions for Modern Web Applications

Web applications have evolved from simple static sites into complex distributed systems spanning multiple servers, services, and geographical locations. This evolution has created new challenges for monitoring these applications effectively. Today's web stacks require comprehensive visibility across all layers to ensure optimal performance and reliability.

What Is an API Outage? Why It Happens and How to Avoid It

APIs are a big part of how modern applications or services work. They act as bridges, allowing systems to talk to each other and share data. Whether it's logging into an app or making an online payment, an application programming interface helps make that process smooth. But what happens when an API suddenly stops working? Even a short outage can cause a disruption. It can break features, delay operations, and impact users and businesses alike.

SQL analytics - unified querying across any API

SQL is just for querying relational data, right? Well, not necessarily! With our SQL Analytics feature, you can run SQL queries over all types of data from all kinds of backend stores. This gives you incredible flexibility and power – you can even combine different types of entity (e.g. a pull request and a pipeline run) in a single query. Equally, I could have datasets with job tickets from Jira, ServiceNow and Zendesk and combine them in a single query.

From Logs to Metrics Part 2: Building an Open-Source Logs-to-Graphite Pipeline

Monitoring doesn't always need to be complex. In this guide, we'll show you how to transform some raw logs into usable metrics using a lightweight, open-source setup. We'll also use the Telegraf agent to convert logs into Graphite metrics that you can easily visualize and alert on. This is ideal for system admins, DevOps beginners, or anyone interested in building more innovative monitoring pipelines from scratch.

Third party API Monitoring Powered by OpenTelemetry Semantics

Is it the third-party API or my code? Your service suddenly slows down, or errors spike, and you’re stuck guessing if it’s your own logic or an external API you don’t control. We’ve seen this pain across teams: dashboards don’t tell you which vendor or endpoint is the culprit, and debugging turns into a maze of guesswork. Rate limiting, vendor errors, or integration issues often slip through until users complain.

VictoriaMetrics Components: Getting Started

This article introduces the key components of VictoriaMetrics and explains how they work together as part of a complete monitoring system. VictoriaMetrics is a top-tier monitoring solution known for its speed and low-resource consumption. It includes components for monitoring, alerting, data visualization, querying, scraping, incremental backups, and more.

How to benchmark Elasticsearch performance with ingest pipelines and your own logs

When setting up an Elasticsearch cluster, one of the most common use cases is to ingest and search through logs. This blog post focuses on getting a benchmark that will tell you how well your cluster will handle your workload. It allows you to create a reproducible environment for testing things out. Do you want to change the mapping of something, drop some fields, alter the ingest pipeline?

CloudWatch vs OpenTelemetry: Choosing What Fits Your Stack

Choosing the right observability setup isn’t just a checkbox—it affects how quickly you can detect issues, debug them, and keep your systems reliable. CloudWatch and OpenTelemetry take different paths to that goal: one is a managed service tightly coupled with AWS, the other a flexible, open-source framework that's becoming a go-to in modern monitoring stacks.

Windows Monitoring with Sysmon: Practical Guide and Configuration

One might think that, considering how effective some companies are at logging everything we do to serve us ads, they’d at least apply that to help us understand what’s happening on our systems and monitor their performance and security. But in the case of Windows, traditional logs fall short — and that’s where the importance of Sysmon comes in. Sysmon is a Windows service that logs operating system activity into the event log.

Establishing SD-WAN Observability to Fuel SASE Success

For today’s enterprises, ensuring optimized network connectivity and robust network security represent key imperatives. Given that, it makes sense that there’s rapidly growing use of solutions like secure access service edge (SASE). In fact, the SASE market is expected to grow to $5.9 billion by 2028. SASE delivers converged network and security capabilities. SASE is a cloud-based offering that is primarily delivered on an as-a-service basis.

Process Monitoring - Huge Value from a Quick Task

DX Unified Infrastructure Management (DX UIM) from Broadcom is a comprehensive solution for monitoring an organization’s entire IT infrastructure. The product provides IT administrators and operations teams with a centralized view of their infrastructure to ensure availability and performance of servers, network devices, storage systems, virtualization environments, applications, and cloud services.

Making Network Intelligence Accessible to Everyone

For years, network operations have relied on complex query languages that demand specialized knowledge. Extracting insights from network data often meant writing intricate commands in formats like SQL, a skill reserved for seasoned IT professionals. But what if anyone, regardless of expertise, could ask a simple question and get immediate, accurate answers from their network?

This Month in Datadog: OpenTelemetry Collector distribution, GitHub Copilot integration, and more

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we put the Spotlight on the Datadog Distribution of the OpenTelemetry Collector.

An ultimate step-by-step guide on Checkmk Cloud Monitoring

Checkmk launched Checkmk Cloud (SaaS) in February 2025, which is a fully managed, cloud-based version of their monitoring technology. This solution, designed for ease of use, allows enterprises to start monitoring their IT infrastructure with no installation, maintenance, or manual upgrades required. The SaaS version is compatible with both cloud-based and on-premises systems, bringing them together under a single, straightforward platform.

Internet Latency: What Is It, How to Measure It, and How to Improve It

Internet latency, the often-overlooked delay between sending and receiving data, can mean the difference between a flawless video conference and a frustrating, glitchy mess. Measured in milliseconds (ms), these microscopic delays accumulate, creating tangible performance issues across all online activities.
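
One simple way to put a number on latency is to time a TCP handshake. The sketch below is a minimal example under stated assumptions: it measures only connection setup time (not full request latency), and the helper names `tcp_connect_latency_ms` and `summarize_latency` are hypothetical, introduced here for illustration.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Return the time taken to open a TCP connection, in milliseconds."""
    start = time.perf_counter()
    # create_connection completes the TCP handshake; the socket is closed
    # as soon as the elapsed time has been recorded.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


def summarize_latency(samples_ms: list[float]) -> dict[str, float]:
    """Reduce a list of millisecond samples to min, median, and max."""
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
    }
```

A single sample is noisy, so in practice you would call `tcp_connect_latency_ms` several times against the same host and port and pass the results to `summarize_latency`.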

OpenTelemetry PHP: A Detailed Implementation Guide

Monitoring complex PHP applications can be challenging. When systems span multiple services and environments, traditional logging approaches often fall short. OpenTelemetry offers a solution - an open-source, vendor-neutral framework that standardizes how we collect and export telemetry data. This guide covers practical implementation steps for DevOps engineers working with PHP applications.

The Best Open-Source Dashboard Tools for 2025: Expert Guide to Choosing the Right One

In today’s digital operations, dashboards aren’t just nice-to-haves; they’re essential. Teams across engineering, product, operations, and business intelligence rely on real-time data visibility to monitor systems, analyze trends, and catch anomalies before they escalate. For many organizations, open-source dashboard tools offer the best combination of flexibility, transparency, and cost-efficiency.

Reducing MTTR with Cloud Pathfinder

Learn how to quickly identify and resolve cloud connectivity issues with Kentik’s Cloud Pathfinder. We demonstrate how Cloud Pathfinder simplifies troubleshooting by automatically mapping cloud network paths, pinpointing misconfigured security rules or incorrect routes, and providing actionable insights powered by integrated AI analysis. Reduce mean time to resolution (MTTR) and gain instant visibility into your cloud infrastructure with Kentik.

The Azure Metrics That Actually Reduce Cloud Costs

This is the fourth blog in our Azure Monitoring series, and this time, we’re digging into cost efficiency. Azure makes it easy to scale, but just as easy to overspend. Idle VMs, forgotten disks, and silent data transfer fees add up fast. The result is budget overruns that catch teams off guard and force reactive cuts. This blog breaks down the Azure metrics that actually help you reduce waste, improve visibility, and keep cloud spend aligned with business priorities. Missed our earlier posts?

Introducing Coralogix Continuous Profiling

Debug faster, improve application performance, and lower your cloud costs - without slowing down production. Traditional profiling solutions come with a heavy price: added latency, excessive resource consumption, and performance degradation. At Coralogix, we’re changing the game with Continuous Profiling, the first of its kind to offer real-time, kernel-level visibility into application performance without any code changes or production impact.

Agentic AI in financial services: The rise of autonomous intelligence

Agentic AI is coming to financial services. Elastic provides the data foundation and tools to make it work. In a recent talk at Stanford University, Jamie Dimon, chairman and CEO of JPMorganChase, addressed the firm’s use of AI and ended by mentioning that agentic AI was the next frontier of AI at the firm, implying it wasn’t ready to be deployed yet. Let’s break down why that may be the case and what the financial services industry can do to become more comfortable with agentic AI.

Unleash SaaS Data With the Webhookevent Receiver

There are many vendors, Honeycomb included, where actions on the application can emit a web request that goes to another service for coordination or tracking purposes. Many vendors have pre-built integrations, but some have a fallback that says “Custom Webhook” or similar. If you’re looking to create a full picture of your request flow, you would want these other services to show up in your trace waterfall.

Grafana Cloud Migration Assistant: from self-hosted to the cloud in minutes

Moving your existing Grafana instance to Grafana Cloud just got dramatically simpler. Today, we’re excited to announce the general availability of the Grafana Cloud Migration Assistant, a powerful yet intuitive tool designed to streamline your migration journey. Traditionally, migrating from Grafana OSS or Grafana Enterprise to Grafana Cloud required technical expertise with Grafana’s HTTP API or command-line tools like Grizzly.

Grafana Alloy at 1: What's new and what's next for our OpenTelemetry Collector distribution

It’s been a year since we launched Grafana Alloy, our OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles. OpenTelemetry is quickly becoming an industry standard for telemetry collection, processing, and delivery, and we’re committed to making Alloy the best possible collector for telemetry data, whether you’re using it with Grafana Cloud or not.

Track MongoDB Performance Metrics Without the Noise

When your MongoDB database slows down, it affects your entire application stack. Performance issues can range from minor inconveniences to major outages, making a solid understanding of MongoDB metrics essential for any DevOps engineer. This guide covers the key performance metrics you need to monitor in MongoDB, how to interpret what you're seeing, and practical steps to resolve common issues.

The Complete Guide to Observing RabbitMQ

Message queues quietly power a lot of what happens behind the scenes in distributed systems. RabbitMQ is no exception—when it’s working, you don’t notice it. But when it’s not, things break in ways that are hard to trace. This guide walks through what you need to monitor in RabbitMQ, how to set it up, and how to troubleshoot when things go wrong—so you’re not stuck guessing when messages go missing.

How to Control and Optimize Azure Costs Without Losing Visibility

This is the ninth post in our Azure Monitoring series, and it’s all about taking control of your cloud costs without losing visibility. We’ll unpack why Azure bills tend to spiral, where native tools fall short, and what it really takes to cut spend while keeping performance on point. You’ll walk away with practical ways to spot waste early, act fast, and stay ahead of surprise invoices. Missed the earlier posts? You can catch up anytime.

What You Didn't See During the GrafanaCON 2025 Keynote Livestream...

Our GrafanaCON co-chairs take you on a backstage tour through GrafanaCON 2025 Day 1 — sneak peeks, activities, and the conference magic. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

Cloud quotas: How to make cloud management easy

In the past, a cloud architect's pain point was usually deciding between two options. To tackle this confusion, major cloud service providers (CSPs) launched quotas, with each of the three major public CSPs using its own terminology for them. The main ingredient of a well-oiled cloud setup, and one that significantly impacts cloud operations, is understanding and managing cloud quotas, also known as service quotas.

Top 3 tools for DORA metrics reporting: SquaredUp vs Power BI vs Jira

What is it that makes a high-performing software engineering team successful? This was the challenge undertaken by the DevOps Research and Assessment (DORA) team around 2015, who created a set of metrics that could provide a reliable, data-driven way to measure and improve software delivery performance.

Meta-monitoring Loki (Loki Community Call May 2025)

In this Loki Community Call, we talk about the need for meta-monitoring Loki: why Loki needs to be monitored, what to watch out for, and how to do it. We talk about different ways to get information from Loki that allow you to make it reliable, consistent, and performant, including a Helm chart to deploy a meta-monitoring stack on Kubernetes. We discuss the Loki mixin for Grafana and how to use it to visualize data about Loki. On the call are Jay Clifford, Nicole van der Hoeven, and Dylan Guedes from Grafana Labs.

What Is a Network Assessment, and What Is a Network Audit?

These days, networks are larger and more complex than ever. It’s all too easy to fall short when managing performance, security, and compliance. That’s where network assessments and network audits can help. Both network assessments and network audits can give you a more comprehensive understanding of your network and its current strengths, weaknesses, and threats. As a result, you can quickly identify and resolve issues.

It's not just about fixing problems, it's about detecting them before they escalate.

IT teams can’t solve what they can’t see. Undetected issues impacting end users lead to lost revenue, brand reputation damage, and frustrated customers. That’s why proactive monitoring is critical. By simulating end-user experiences, you catch small issues before they snowball into major incidents—saving time, money, and operational headaches.

The Power of Great Design: Introducing the Enhanced Administrative UI for InfluxDB Cloud Dedicated

Managing your InfluxDB Cloud Dedicated environment just got easier. We’ve introduced an admin UI to streamline everyday tasks, so you can spend less time navigating settings and more time working with your data. The update is built for speed and usability. Whether you’re creating tables, managing tokens, or checking database status, the new UI helps you move faster. This update is all about reducing friction for developers and teams managing time series infrastructure at scale.

What Is Snort, How It Works, and Its Integration with SIEM for Cybersecurity

You can’t defend against what you can’t see. That’s why the first essential requirement in cybersecurity is to know everything happening in your systems. To achieve this, we implement an IDS (Intrusion Detection System)—a solution that tirelessly monitors every corner of your network like the Eye of Sauron, instantly alerting you to breach attempts and suspicious behavior. Among IDS options, Snort stands out as one of the most popular.

Network Stress Testing: What It Is & How to Run One

You’ve optimized your QoS settings, fine-tuned your firewall, and even upgraded your bandwidth, but what happens when your network gets hit with 10x the normal traffic? Will it hold up, or will it buckle under the pressure, leaving your users staring at spinning wheels and timeout errors? If you’re an IT pro, you know outages don’t happen during idle hours. They strike when traffic spikes.

Laravel just works. Now your performance monitoring does too.

You remember that first time spinning up a Laravel app? Routes, auth, ORM, queues, all wired up without much effort. It’s one of the reasons Laravel feels productive out of the box. But when something starts slowing down, an Eloquent query drags, a job takes forever, or cache misses creep up, it’s not always obvious where to look. Laravel gives you the tools, but connecting the dots between them is usually on you.

Kubernetes Alerting That Won't Burn You Out

Kubernetes production environments require robust alerting to catch problems before they impact users. While monitoring shows system state, proper alerting tells you when something needs attention. This guide outlines 15 key Kubernetes alerts that help DevOps teams avoid outages and minimize downtime. For each alert, we provide implementation guidance and troubleshooting steps to resolve common issues quickly.

Splunk Observability Cloud's AI Assistant in Action | Practical Examples | Part 1

In this video, we’ll provide practical, real-time examples demonstrating how to effectively use the AI Assistant in Splunk Observability Cloud. You'll learn how the AI Assistant can quickly identify unknown issues in your environment, perform detailed root cause analysis, analyze service performance and deployment impacts, and even help manage infrastructure costs and compliance.

Google's Agent-to-Agent (A2A) Protocol is here-Now Let's Make it Observable

Can your AI tools really work together, or are they still stuck in silos? With Google’s new Agent-to-Agent (A2A) protocol, the days of isolated AI agents are numbered. This emerging standard lets specialized agents communicate, delegate, and collaborate—unlocking a new era of modular, scalable AI systems. Here’s how A2A could transform your workflows, and why making it observable is just as important as making it possible.

Here are 10 ways to prevent website downtime

Every minute of website downtime costs large organizations an average of $9,000. That’s over half a million dollars every hour, damn. And that’s just the average. If your organization heavily relies on your website to do business, that cost can increase even further. Needless to say, preventing website downtime is a top priority.
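
As a quick check on the per-hour figure, the arithmetic can be written out, assuming the $9,000-per-minute average quoted above. The function name `downtime_cost_usd` is hypothetical and only serves this illustration.

```python
COST_PER_MINUTE_USD = 9_000  # average per-minute cost quoted in the text
MINUTES_PER_HOUR = 60


def downtime_cost_usd(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate the cost of an outage that lasts the given number of minutes."""
    return minutes * cost_per_minute


hourly = downtime_cost_usd(MINUTES_PER_HOUR)
print(f"${hourly:,.0f} per hour")  # prints "$540,000 per hour"
```

The same function can be used to estimate shorter incidents, for example `downtime_cost_usd(15)` for a fifteen-minute outage.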

Essential Python Monitoring Techniques You Need to Know

Python powers critical applications across countless organizations, from data processing pipelines to web services that handle millions of requests. While Python's readability and extensive ecosystem make it a developer favorite, its performance characteristics require thoughtful monitoring. As systems grow in complexity, understanding what's happening inside your Python applications becomes increasingly important.

Grafana 12 release: observability as code, dynamic dashboards, new Grafana Alerting tools, and more

Grafana 12 is here! During the opening keynote of GrafanaCON 2025, we unveiled dozens of new reasons to fall in love with everyone’s favorite dashboarding and visualization tool—especially if your job is to keep teams, services, and, of course, a whole lot of beautiful Grafana dashboards organized. Grafana 12.0: Download now!

GrafanaCON 2025: A guide to all the announcements from Grafana Labs

GrafanaCON 2025 is in full swing in Seattle, where members of our open source community have gathered to explore the latest updates to Grafana Labs’ OSS projects, share their inspiring use cases, and build lasting connections at our biggest community event yet.

A Detailed Guide on Docker Container Performance Metrics

Docker containers isolate application environments, making performance monitoring essential for visibility and stability — especially at scale. To manage production effectively, teams need clear insights into resource usage, bottlenecks, and failure points. This guide covers key Docker metrics, how to collect them, and how to use that data to keep your containerized systems running smoothly.

Optimising OpenTelemetry Pipelines to Cut Observability Costs and Data Noise

Fat bills from observability vendors and tons of not-so-insightful telemetry data have turned out to be a very common issue today. This often leaves teams having to explain the lack of clear ROI, despite the growing costs. If you’re using OpenTelemetry to record your observability data, there are some practical methods you can apply to keep those costs from piling up.

Modern Logging, Smarter Pricing: Why Graylog's Consumption Model Just Makes Sense

In the world of log management and security analytics, one thing is abundantly clear: data volumes fluctuate. Yet most pricing models haven’t caught up. Traditional ingest-based licensing models force organizations to size their license needs based on a worst-case capacity scenario (the “high-water mark”), regardless of whether those spikes are rare or expected.

Observability Best Practices: Balancing Sustainability and Cost in a Data-Driven World

Imagine this: Your IT team has invested in cutting-edge observability tools to keep systems running smoothly. But does that imply you are following observability best practices? As your business grows, so does the flood of logs, traces, and metrics—along with a skyrocketing cloud bill. What started as a way to gain better visibility is now a major expense, and suddenly, you’re asking: Are we paying too much for too little value? This challenge is becoming all too common.

How to decide between cloud and on-premise monitoring

Application performance monitoring systems tend to be available in two modes: on-premise and cloud-based SaaS. Which is the "right" choice? Well, it depends on your situation, but overall cloud-based SaaS offerings have significant benefits when compared to on-premise. However, it's not always so simple. The right selection depends on the facts on the ground. Using my experience working for a large-scale cloud solutions department, I've put together some key things you'll want to consider before you make a decision, starting with some benefits and challenges.

IT Monitoring News | May '25 Edition

Welcome to the May edition of the NiCE bi-monthly monitoring news! As we move further into the year, we’re here with a fresh roundup of updates, insights, and resources from the world of IT monitoring. Whether you’re looking to stay informed, fine-tune your tools, or catch up on what’s new, this edition has you covered. Enjoy the read!

Easiest Way to Monitor Loki Performance With Telegraf

Loki is a powerful, scalable log aggregation system designed by Grafana to efficiently collect, store, and query logs. It’s often deployed alongside Prometheus as part of modern observability stacks. Loki’s design emphasizes cost-effective storage by indexing only metadata, which makes it a great choice for high-volume environments. But while Loki excels at log ingestion and indexing, many teams overlook the critical task of monitoring Loki itself.

AppSignal Closes $22 Million Growth Investment

In August 2012, we set out to build a fairly priced, developer-centric APM and logging platform that we’d love to use ourselves. Soon after, AppSignal was born and quickly gained traction. Despite operating in a competitive market, we’ve become a household name among many of our peers, now serving over 2,000 organizations across six continents.

We built AI-powered Root Cause Analysis that actually works

Figuring out why things break still sucks. We’ve got all the data: metrics, logs, traces, but getting to the actual root cause still takes way too long. Observability tools show us everything, but they don’t really tell us what’s wrong. So why do we even need to automate root cause analysis? First, time. Outages are expensive. And if your system has hundreds or thousands of services, digging through everything by hand just takes way too long.

Prometheus native histograms in Grafana Cloud: More precise, easier to use, and better compatibility

Histograms help you monitor and visualize the distribution of values for key metrics, such as response times or request sizes of a service. They’re frequently used to gain insights into data patterns, anomalies, and trends, making them an important tool for observability.
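
To make the idea of a distribution of values concrete, here is a minimal sketch of a fixed-bucket, cumulative histogram over response times. It is a generic illustration, not the Prometheus native histogram format the article covers; the function name `cumulative_histogram`, the bucket bounds, and the sample data are all assumptions made up for this example.

```python
from bisect import bisect_left

def cumulative_histogram(values, upper_bounds):
    """Count how many values fall at or below each bucket upper bound."""
    bounds = sorted(upper_bounds)
    counts = [0] * (len(bounds) + 1)  # final slot acts as the +Inf bucket
    for v in values:
        # bisect_left returns the index of the first bound >= v,
        # i.e. the smallest bucket whose upper bound still contains v
        counts[bisect_left(bounds, v)] += 1
    # Turn per-bucket counts into running (cumulative) totals
    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running)
    labels = [str(b) for b in bounds] + ["+Inf"]
    return dict(zip(labels, cumulative))

# Hypothetical response times in seconds
response_times = [0.03, 0.12, 0.25, 0.4, 0.9, 2.5]
print(cumulative_histogram(response_times, [0.1, 0.5, 1.0]))
# {'0.1': 1, '0.5': 4, '1.0': 5, '+Inf': 6}
```

Each count includes every observation at or below that bound, so the last bucket always equals the total number of observations.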

Maintaining Effective IT Infrastructure Monitoring in the Public Sector

Public sector organizations have needs very different from their commercial counterparts. Cybercriminals go after public sector organizations because they hold confidential, often classified, information: the exact data state-sponsored and other criminal groups salivate over. Funded by tax payments, these organizations serve and answer to the public. Progress WhatsUp Gold offers ample out-of-the-box monitoring features, helping you monitor more of what matters to your organization.

SQL Server Observability: Monitoring, Troubleshooting, and Best Practices

For DevOps teams managing mission-critical databases, SQL Server observability is a fundamental capability that provides comprehensive insight into database performance and health. Effective observability practices enable teams to identify potential issues before they impact end users and provide the context necessary to resolve problems efficiently. SQL Server observability involves collecting and analyzing metrics, logs, and traces to build a complete picture of database behavior.

The Definitive Guide to OpenTelemetry Exporters for High-Performance Monitoring

In modern distributed architectures, observability has shifted from optional to necessary. OpenTelemetry has emerged as the standard framework for telemetry data collection, with exporters serving as the critical bridge to your backend monitoring systems. For developers at any stage—those new to observability practices or those refining existing monitoring setups—a solid grasp of OpenTelemetry exporters will significantly reduce debugging time and improve system visibility.

Emerge Tools is now a part of Sentry

Today I'm thrilled to announce that Emerge Tools is joining Sentry. Emerge builds best-in-class mobile tooling trusted by some of the most important brands in the world. You’ve probably seen the work of the team through their relentless efforts to improve mobile builds, efforts we’ve always admired here at Sentry. It was no surprise that when we finally met Emerge founders Josh and Noah we found that we shared a similar view of the world and hit it off instantly.

WhatsUp Gold IT Infrastructure Monitoring: Overview, Developments & New Features

Join our exclusive webinar to see how WhatsUp Gold can take your IT monitoring to the next level. We’ll show you the latest features and notable developments from the past two years, and give an exclusive preview of what’s coming, including the powerful Network Traffic Analysis Plus (NTA+).

What Can SCORCH, SCSM & SCVMM Do for SCOM? Find Out at Our Expert Session!

SCOMathon 2025 | Panel Session by Axians, NiCE, and Kelverion. Are you making the most of Microsoft System Center 2025? Join us for a power-packed expert discussion where we break down how SCOM, SCORCH, SCSM, and SCVMM work together to supercharge your IT operations!

Mastering Network Configuration for Stability and Security

Your network is the central nervous system of your business. Its performance, reliability, and security have a direct impact on your organization’s operations, revenue, and reputation. Yet, lurking within this critical infrastructure is a common source of disruption and risk: network configuration changes.

Stop Playing IT Whack-a-Mole: The Smarter Way to Prevent Outages Before They Happen

The challenges facing IT operations teams today are bigger than ever before. Hybrid cloud adoption, sprawling infrastructure, the explosive growth of telemetry data, and the accelerating pace of digital business have pushed traditional monitoring approaches to their breaking point. Yet for many organizations, the operational model remains stubbornly reactive: a never-ending game of IT whack-a-mole, where teams are trapped responding to incidents instead of preventing them.

Bring third-party incidents into Better Stack

Incidents in cloud and SaaS tools block users just as hard as faults in your own code. The fix comes faster when the same on-call queue covers both. IsDown now plugs straight into Better Stack through a native API connection. Every outage that IsDown detects shows up as an incident in Better Stack, follows your existing escalation rules, and clears automatically once the vendor recovers.

Agentic AIOps: Why Agent-Driven Solutions Are Defining the Future of IT Operations

AIOps is overdue for reinvention. The last decade promised faster resolution and smarter alerts—but most tools are still built on outdated assumptions: linear workflows and deterministic rules. Now, a new model is emerging. Not reactive. Not rule-based. Agentic. Agentic AIOps is about taking action. Products like LogicMonitor’s Edwin AI go beyond recommendations—they correlate, decide, and remediate in real time.

Logz.io Integration for AWS and Kubernetes Observability

Ever feel like you’re flying blind in your AWS environment? You’re not alone. In the sprawling universe of microservices, containers, and serverless functions, trying to troubleshoot without proper observability is like trying to find a bug in a datacenter… with the lights off… while wearing sunglasses.

Reporting CSP Errors in Honeycomb With the OpenTelemetry Collector

The HTTP Content-Security-Policy response header is used to control how the browser is allowed to load various content types. It is used to control which URLs, fonts, images, scripts, and more can be loaded onto the page. It’s a great defense against XSS (cross-site scripting), clickjacking, and cross-site vulnerabilities. The header can also specify a URL that will be used to send reports on violations of these properties.
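
As a rough illustration of what such a header looks like, the sketch below assembles a Content-Security-Policy value with a reporting endpoint. The helper `build_csp_header`, the example policy, and the `/csp-reports` path are hypothetical. `report-uri` is a real CSP directive, although newer browsers also support `report-to`; which one the article uses is not stated here.

```python
def build_csp_header(directives, report_uri=None):
    """Join CSP directives into a single header value."""
    # Each directive is "name source1 source2 ..."
    parts = [f"{name} {' '.join(sources)}" for name, sources in directives.items()]
    if report_uri:
        # Ask the browser to send violation reports to this URL
        parts.append(f"report-uri {report_uri}")
    return "; ".join(parts)

# Hypothetical policy: only same-origin content, plus one image host
policy = {
    "default-src": ["'self'"],
    "img-src": ["'self'", "https://images.example.com"],
    "script-src": ["'self'"],
}
header_value = build_csp_header(policy, report_uri="/csp-reports")
print("Content-Security-Policy:", header_value)
```

The reporting endpoint is whatever service receives the browser's violation reports; in the article's setup that role is played by the OpenTelemetry Collector.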

How Docker Logging Drivers Work

Troubleshooting containerized applications can quickly become complex when logs are scattered across multiple systems. Most DevOps teams face this challenge daily—what starts as a simple container deployment often evolves into a complex logging puzzle. This guide explores Docker logging drivers in depth, covering configuration options, best practices, and practical solutions.

React Logging: How to Implement It Right and Debug Faster

React logging is the practice of recording relevant information about your application's behavior during runtime. Unlike traditional server-side logging, React logging happens in the browser and focuses on frontend concerns: component lifecycle events, state changes, user interactions, performance metrics, and network requests. Effective logging creates breadcrumbs that help you understand application flow and quickly pinpoint problems.

Monitoring the Impossible & Other Use Cases - Webinar by 2Steps Tech with David Dick (co-founder)

2Steps is changing the landscape of proactive monitoring. Now, in this lunch-and-learn, you get a deeper dive on the platform and how organisations are using it for previously unsolved problems. Observability professionals have described 2Steps by saying, “There is no better way to do it,” “It’s incredibly valuable,” “Nothing can really compare,” and “The only reason lots of businesses aren't doing this already is they simply don't know about it.”

Unlocking the Power of LLMs and AI Agents for Network Automation

Artificial intelligence is reshaping how enterprises manage and secure their networks, but not all AI is created equal, and not all Large Language Models (LLMs) are ready for the job. While tools like ChatGPT and Google Gemini are transforming communication and productivity, applying general-purpose LLMs to something as specialized and high-stakes as network operations is an entirely different challenge. Networks are dynamic, complex, and context-heavy.

Kubernetes Monitoring in 2025: The Complete Guide to Cluster Visibility

Modern cloud-native applications rely on Kubernetes as their leading container orchestration platform. By 2025, Kubernetes adoption has grown to the point where it runs vital enterprise systems in sectors such as financial technology and healthcare. As Kubernetes environments grow more complex and dynamic, monitoring has become an essential strategic practice.

Grafana Alerting Overview Plus New Features Coming to Grafana 12 | Grafana Labs

In this walkthrough, Grafana’s Ryan Kehoe dives into the biggest improvements designed to help teams create, manage, and route alerts with less friction and more power. Whether you're wrangling multi-source queries or managing alerts across large environments, these updates are for you.

Cribl Edge: Unify Telemetry Collection | Lightboard Demo

Cribl Edge is a vendor-neutral, intelligent agent designed for the variety and scale of today’s modern architectures. With a unified telemetry collection system, you can have hundreds of thousands of agents at your fingertips to automatically discover and collect data from your Windows, Linux, and Kubernetes environments. Featuring a rich UI, centralized fleet management, and seamless upgrades, it’s time to transform your agent management.

Getting started with Jenkins dashboards

Jenkins is an open-source automation server widely used for continuous integration and continuous delivery (CI/CD), enabling developers to automate the building, testing, and deployment of software projects. Jenkins benefits from a good visualization layer, which provides real-time visibility into pipeline performance, build statuses, test results, and deployment progress.

How Quick User Tests Help Us Make Better UI Decisions in Icinga Web

Designing user interfaces for Icinga Web is always a bit of a balancing act. Once we’ve worked through all the technical and conceptual details of a new feature, it can be tough to step back and see things from a fresh user’s point of view. We as developers know too much — and that makes it hard to guess how others will understand what we’ve built.

Easily Query Multiple Metrics in Prometheus

In monitoring setups, working with a single metric rarely tells the complete story. The real power of Prometheus lies in its ability to query multiple metrics simultaneously, creating connections between different data points that reveal the true state of your systems. This guide will walk you through everything you need to know about crafting effective multi-metric queries in Prometheus – from basic concepts to advanced techniques that will help you monitor and troubleshoot your infrastructure.

Observe VMware vCenter Cluster and Cloud with Confidence: Achieve Full Stack Observability with DX Operational Observability (DX O2)

As enterprises continue their cloud and container journeys as part of modernization efforts, they are realizing “hybrid reality” is here to stay. For many, moving all services to clouds or containers is not a viable option. As a result, at least some services will be required to remain on premises. This presents unique challenges and ongoing complexity for monitoring and observability.

Apache Logs Explained: A Guide for Effective Troubleshooting

Apache logs are a critical tool for monitoring your web server, but they can often feel overwhelming. For DevOps teams, understanding these logs is essential for diagnosing issues and maintaining system reliability. In this guide, we'll explore the setup and analysis of Apache logs, offering practical tips to help you make sense of them and use them effectively for troubleshooting and optimization.
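
For a sense of what analyzing an access log involves, here is a minimal sketch that parses one line in Apache's Common Log Format using a regular expression. It assumes the default Common Log Format; a custom `LogFormat` would need a different pattern. The names `CLF_PATTERN` and `parse_access_line` are made up for this example.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Apache writes "-" when no response body was sent
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_access_line(sample)
print(entry["host"], entry["status"], entry["request"])
```

Once lines are parsed into fields like this, filtering by status code or counting requests per host becomes straightforward.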

A Practical Guide to Monitoring Ubuntu Servers

Running Ubuntu servers without proper monitoring can lead to unexpected issues. For DevOps engineers and SREs, effective tracking is crucial for maintaining system health and performance. This guide covers everything you need to know about monitoring Ubuntu servers, from the basics to advanced strategies, helping you keep your systems running smoothly, whether you manage a single server or a large fleet.

Monitoring & Debugging a Checkout Flow in Flask & React

When your checkout flow breaks, customers disappear faster than most ‘cutting-edge’ JS metaframeworks. Thankfully, setting up observability for your critical paths—like a customer checkout—is painless with Sentry. Let's walk through how we instrumented, monitored, and fixed a major issue, with minimal effort.

Mission-Critical Visibility: How Observability Empowers the DoD

Tech is entering another wave of innovation with AI. With accelerated innovation comes increased complexity in already disparate environments. For Defense, those complexities are compounded by the need to maintain and operate mission critical infrastructure with highly sensitive data in air-gapped environments, often running on custom digital systems and applications. Accelerating the speed of innovation with leading technology is key for the military to maintain its competitive edge.

Redoing My Progress WhatsUp Gold Home Lab with Proxmox: A Journey of Failover, Backup and Recovery

Greetings, tech enthusiasts! I hope you’re all doing well. Today, I’m thrilled to share the story of my recent adventure in rearchitecting my home lab with Proxmox. This journey has been a rollercoaster of unexpected challenges, valuable lessons, and rewarding successes. I built a resilient and efficient setup that exceeded my initial expectations by leveraging modern virtualization and storage technologies.

Top 5 EdTech outages detected by StatusGator in April 2025

In April 2025, leading EdTech platforms experienced outages that impacted students, educators, and administrators worldwide. StatusGator’s Early Warning Signals played a key role in identifying and reporting issues before official sources did, enabling schools and institutions to respond swiftly. These real-time alerts helped reduce disruption during critical learning and administrative operations. Here are the top five EdTech outages detected by StatusGator in April.

Why does no one talk about querying across signals in observability?

In today’s complex distributed systems, observability has evolved from a nice-to-have feature to a mission-critical engineering discipline. Engineering teams across organizations depend on robust observability to maintain system reliability and quickly diagnose issues when they inevitably arise. However, current observability tooling significantly lags behind user expectations by failing to support a critical capability: querying across different telemetry signals.

Top 5 outages detected by StatusGator in April 2025

In April 2025, several major services faced outages that disrupted businesses and users globally. StatusGator provided early detection and real-time updates, helping users stay informed before official announcements. With its Early Warning Signals feature, StatusGator alerted users to potential disruptions even before the affected services acknowledged the issues—giving users a critical edge in responding to outages. Here are the top five outages detected by StatusGator in April.

Simulate Real User Workflows | Introduction to Grafana Cloud Synthetic Monitoring

Just because your app is up doesn’t mean it’s working. Behind the scenes, users could be facing failed checkouts, broken workflows, or slow page loads — and you may not know until it’s too late. In this video, we’ll show you how Grafana Cloud Synthetic Monitoring helps you proactively simulate real user behavior and monitor the performance of your critical user flows, websites, and APIs from locations around the world — so you can catch issues before your users do.

Azure DevOps agent pools: diving deeper

Most of the time the build and deployment pipelines we create will run on compute provided by the Azure DevOps cloud and the only decision we need to make is whether to select a Windows or Linux Agent. Sometimes though, the specification for the VM that Azure DevOps spins up may not be right for our needs. We may need more memory or a particular OS version. This is when custom agents and Agent Pools come into play.

ScienceLogic Named a Leader in AIOps: Paving the Way for Autonomous IT Operations

The challenges plaguing IT operations are not new. The exponential growth of hybrid and multi-cloud environments, increasing data volume, complexity, and accelerating pace of change have made traditional approaches to IT operations unsustainable.

Dynamic Demands, Dynamic Solutions: IT's Role in the Next AI Workflow Evolution

I have just finished reviewing the Microsoft Work Trend Index Annual Report for 2025, which offers fascinating insights into the next wave of organizational evolution. I am particularly excited about the section titled ‘Journey to the Frontier Firm’ and what is possible in phase three, where employees will harness the power of multiple AI agents, creating an ‘agentic swarm’ capable of executing tasks at a scale and speed previously unimaginable.

Top Microsoft Teams Metrics: How to Measure & Improve Call Quality

As an IT professional, you know that Microsoft Teams is only as good as the network it runs on. Poor call quality (choppy audio, frozen video, or sudden disconnections) can disrupt productivity and frustrate users. But how do you pinpoint the root cause? The answer lies in monitoring Microsoft Teams performance metrics.

Monitor the full end-user experience: k6 browser checks in Synthetic Monitoring are generally available

We continue to evolve Grafana Cloud Synthetic Monitoring to help you simulate even the most complex transactions and user journeys, and proactively monitor the performance of your web applications and APIs. In line with this effort, we’re excited to share that k6 browser checks in Synthetic Monitoring are now generally available.