Operations | Monitoring | ITSM | DevOps | Cloud

Get Real-Time Third-Party Service Outage Alerts in Slack with StatusGator

When your team relies on multiple SaaS tools, even a small outage in a third-party service can disrupt workflows, slow down projects, and frustrate customers. Knowing about issues the moment they happen, and even before they’re officially reported, makes all the difference. That’s where StatusGator’s Slack integration comes in. With StatusGator, you can receive real-time service status alerts in Slack.

Get Third-Party Outage Alerts in Microsoft Teams with StatusGator

When your company depends on dozens of SaaS tools, such as AWS, Atlassian, Zoom, or Microsoft 365, any cloud outage can ripple through your entire operation. The faster your team learns about an external service disruption, the faster you can respond. With StatusGator’s Microsoft Teams integration, your team can receive real-time third-party outage alerts in Microsoft Teams. The service also includes Early Warning Signals that detect potential issues before providers officially announce them.
Sponsored Post

8 Challenges of Microservices and Serverless Log Management

As organizations increasingly adopt serverless architectures and embrace the benefits of microservices, managing logs in this dynamic environment presents unique challenges. In this blog, we're taking a closer look at the differences between serverless and traditional log management, as well as 8 challenges associated with log management for serverless microservices.

MSP Dashboards That Deliver: HaloPSA + SquaredUp in Action

In this webinar, HaloPSA and SquaredUp come together to show MSPs how to unlock real-time visibility, streamline reporting, and deliver dashboards that drive client satisfaction and operational efficiency. In this session, you’ll hear directly from both the Halo and SquaredUp teams. Whether you're already using HaloPSA or exploring smarter ways to surface your service desk data, this webinar will give you the insights and tools to take your MSP reporting to the next level.

Tech Talk - Leveraging Automated Threat Analysis Across the Splunk Ecosystem

Find out how Splunk Attack Analyzer can help you quickly and efficiently investigate potential malware and phishing incidents by automatically tracking each stage of complex attack chains and expediting your response efforts. Hear directly from Product Manager Aditya Raj as he demonstrates how to combine Splunk Attack Analyzer with Splunk Enterprise Security and Splunk SOAR for even greater threat detection and response power.

4 Common OpenTelemetry Challenges and How Site24x7 Helps Overcome Them

OpenTelemetry (OTel) is transforming observability by standardizing and unifying how telemetry data such as metrics, logs, and traces are collected from distributed systems. However, while it unlocks new opportunities for monitoring and troubleshooting, adopting and operating OpenTelemetry comes with real-world challenges. Here’s what you need to know about these limitations, and how Site24x7 provides a holistic, simplified observability solution for your organization.

Store and search logs at petabyte scale in your own infrastructure with Datadog CloudPrem

As AI workloads and cloud-native applications expand, organizations are generating more log data than ever. Each service, container, and model inference produces continuous telemetry that must be stored, secured, and analyzed. As telemetry grows more complex, teams must balance full visibility with new retention and residency needs.

Automating your synthetic test infrastructure with Datadog Synthetic Monitoring and Terraform

Testing ecosystems contain massive amounts of data, including outlined test scenarios, prerequisite configurations, and the tests themselves. As a result, these ecosystems are prone to data sprawl. This makes it difficult to prevent configuration drift and quickly spin up new tests, especially at the frequency needed to support a fast-growing application. Teams can handle these challenges by treating their tests as part of their application infrastructure.

Store and search logs at petabyte scale in your own infrastructure with Datadog BYOC Logs

As AI workloads and cloud-native applications expand, organizations are generating more log data than ever. Each service, container, and model inference produces continuous telemetry that must be stored, secured, and analyzed. As telemetry grows more complex, teams must balance full visibility with new retention and residency needs.

The Hidden Cost of "Modernization": When Upgrades Become Extortion

Across the IT and observability landscape, enterprise leaders are facing a troubling pattern. A trusted vendor announces a “modernization initiative,” often following a major acquisition or a shift in ownership. Overnight, pricing structures change, license models disappear, and long-time customers are pressured into multi-year bundles under the banner of innovation. What’s being framed as progress often feels more like pressure.

Network Path Monitoring: How to Monitor Network Paths

Your users are complaining about slow application performance. Your monitoring dashboard shows all devices are green, routers operational, switches functioning, and bandwidth utilization is normal. Yet something is clearly wrong. The problem isn't your equipment; it's the path between your users and their destinations. This is where network path monitoring comes in.

Sumo Logic Dojo AI overview

Stop the firefighting and get instant answers with Sumo Logic Dojo AI. When a security incident hits, you risk losing money and time as you wait for investigations and troubleshooting. Discover how Dojo AI agents simplify investigations by surfacing potential threats, providing actionable insights, and guiding you to the root cause faster using natural language.

Introducing the Splunk Technology Add-on for Ollama: Illuminating Shadow AI Deployments

Without strong visibility and governance, local LLMs risk replicating the fragmented, unsupervised sprawl once seen in shadow IT. That sprawl complicates security postures and makes it difficult for organizations to ensure proper oversight and compliance as these powerful AI tools become embedded in daily workflows. To address this challenge, the Splunk Threat Research Team has released the Splunk Technology Add-on for Ollama, which provides comprehensive monitoring and observability capabilities specifically designed for local LLM deployments.

Sumo Logic Academy - Training and Certification Overview

In 2025, Sumo Logic revamped its education and certification program, introducing industry-aligned assessments, digital badges, and many free training offerings, including industry-leading free instructor-led classrooms and interactive hands-on labs. This video walks through all Sumo Logic Academy program offerings.

From KubeCon EU to KubeCon NA: Bindplane's OpenTelemetry Contributions and Highlights (Mar-Oct 2025)

Bindplane engineers have stayed deeply involved in the OpenTelemetry community this summer. With KubeCon+CloudNativeCon North America in Atlanta coming up, I wanted to dive into all the work that has been done and give the engineers a well-deserved shoutout. Here’s what we built, fixed, and contributed since KubeCon+CloudNativeCon Europe in London this March.

What's New in InfluxDB 3.6: Ask AI, Simple Quick Start, and Smarter Automation

InfluxDB 3.6 is now available for both Core and Enterprise. This release introduces the 1.4 update to InfluxDB 3 Explorer, featuring the beta launch of Ask AI, along with new capabilities for simple startup and expanded functionality in the Processing Engine. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2. InfluxDB 3 Enterprise extends Core with long-term data retention, clustering, fine-grained security, and management capabilities.

SQL expressions in Grafana: Combine and manipulate data from multiple sources

One of Grafana’s greatest strengths is its ability to provide a consistent monitoring experience for all your data sources. But not everyone wants to go through the process of transforming that data and setting up a data warehouse to make that happen, especially for complex analyses.

Energy-Efficient Computing: How To Cut Costs and Scale Sustainably in 2026

With AI at the center of technology and innovation today, energy-efficient computing is quietly becoming one of the most urgent challenges. In this article, we will discuss what makes energy-efficient computing relevant for your organization, especially when modern resource-intensive AI workloads play an important role in driving your business operations and services.

Observability 2025 Decoded: What the DZone Report Means for SLO-Driven Ops

DZone’s 2025 Intelligent Observability Trend Report captures a real inflection point: teams are shifting from “more data” to outcome-driven practices that improve resilience and accountability. The survey was gathered between August 28 and September 25, 2025, from a global pool of developers, architects, and IT professionals.

Prometheus native histograms in Grafana Cloud: Get more precision from your Grafana visualizations

In May, we announced the public preview of Prometheus native histograms in Grafana Cloud, unlocking greater precision, ease of use, and compatibility for analyzing latency, duration, and other distributions. Since then, we’ve seen incredible adoption across industries—from financial services companies to e-commerce platforms. Last week, during PromCon EU 2025, the Prometheus developers announced that native histograms are now stable, after three years of intense testing and improvements.

This Halloween, the Scariest Monsters Are in Your Network

In the spirit of Halloween, let's talk about monsters. Not the kind that hide under your bed, but the ones that live inside your network infrastructure. For those responsible for keeping the lights on, these creatures aren't fictional; they are a daily reality. Your environment can feel like an episode from the Real Ghostbusters, teeming with things that snarl, bite, and cause chaos at the worst possible moments. Forget silver bullets; trying to fight them one by one is a losing battle.

How to Monitor Kubernetes With Grafana OSS or Grafana Cloud | Ask the Experts | Grafana Labs

Wondering how to monitor Kubernetes with Grafana? In this “Ask the Experts” episode, Coleman walks through the easiest setup — from Grafana Cloud’s built-in Kubernetes Monitoring plugin to open source Grafana with Mixins and Helm. One command, and your cluster data comes alive.

APM for Banks and Fintech: Ensuring Stability in High-Transaction Apps

The financial services industry is undergoing a major transformation. According to the McKinsey & Company 2025 Global Payments Report, digital payments continue to dominate, generating approximately $2.5 trillion in revenue from around $2.0 quadrillion in value flows across 3.6 trillion transactions worldwide. In another survey, conducted by JP Morgan, more than 30 percent of financial professionals reported that faster payments are having a positive impact on their organizations in 2025.

Getting started with Site24x7 alert management

Struggling with alert overload or missed notifications? Learn how Site24x7 helps you manage alerts effectively, from setting thresholds and tracking key metrics to routing notifications, automating actions, and leveraging AI-powered Zia thresholds. Follow a real-world DevOps scenario to see how your team can respond faster, smarter, and more efficiently.

Tech Talk - Observability Unlocked: Kubernetes Monitoring with Splunk Observability Cloud

In this Tech Talk, discover how they’re leveraging Splunk Infrastructure Monitoring (IM) to supercharge their Kubernetes operations, detect issues within minutes, and resolve them 90% faster — all while optimizing and scaling like pros.

Migrating from Librato to Hosted Graphite on Heroku - Full Tutorial

Librato on Heroku is being sunsetted, so what's next? In this tutorial, we walk through: why Hosted Graphite by MetricFire is the best upgrade from Librato on Heroku; a step-by-step migration that moves your Heroku dyno, router, Postgres, Redis & custom metrics into Hosted Graphite; and a side-by-side comparison of metrics ingestion, dashboards, alerts, and integrations.

Deploying Loki on Kubernetes via Helm (Loki Community Call - October 2025)

This Loki Community Call is about deploying Loki on Kubernetes via Helm charts. We talk about why you might want to use Helm to deploy on Kubernetes, best practices for deployment, and which Helm chart you should use! We are Jay Clifford and Nicole van der Hoeven, Developer Advocates at Grafana Labs, and we have invited Grafana Champion and Loki Helm Maintainer Jan-Otto Kröpke, Principal Cloud Architect at QualityOperations GmbH, to talk about the state of the Loki Helm Chart.

From Maintenance to Monitoring: Digital Tools for Better AC Management

The traditional approach to AC upkeep relied heavily on reactive maintenance, where technicians addressed issues only after they occurred. The digital age has ushered in an array of tools that make routine maintenance easy and employ advanced technologies for proactive monitoring. By integrating these tools into AC management practices, homeowners and businesses can improve system efficiency, reduce energy costs, and extend the lifespan of their equipment.

FinOps for Hybrid IT: Extending Visibility Beyond the Cloud

Controlling IT spend used to mean managing cloud invoices. Today, it’s far more complex. Modern enterprises run workloads across multiple platforms — cloud, virtualized, and on-premises — each with its own cost structures and dependencies. That’s why FinOps for hybrid IT has become essential. Extending FinOps principles beyond cloud services enables organizations to see how every part of the infrastructure contributes to cost, efficiency, and business value.

Azure status integration is here!

We’re thrilled to announce the launch of our Azure status integration, bringing Microsoft Azure’s real-time service health and incident data directly into your StatusGator dashboard and status page. With this new integration, StatusGator automatically imports Azure outages and service status updates from your Azure subscription — giving you a complete, centralized view of your cloud infrastructure alongside every other service you monitor.

Streamline IT Event Management

Managing Microsoft SCOM events can often feel overwhelming, with countless alerts and notifications making it difficult to identify what truly matters. Working with our partner Kelverion, we are excited to introduce a Routing & Remediation solution for SCOrch that helps your teams focus on the issues that count, while automating the rest.

Playwright Check Suites Are Now GA - But What Does That Mean For You?

There are only a few companies that successfully invest in actively monitoring real user flows in production. I’ve been puzzled by the state of the art for many years, because I’m an anxious developer who always needs to know that production is “all right”. How can it be okay for all of us to wait for error logs, thrown exceptions, or customer complaints to learn about production issues?

5 Best Practices for Incorporating AI Into Your Team

Honeycomb’s Jessica Kerr and Fred Hebert recently hosted a webinar with Courtney Nash of The VOID where they dug into one of the biggest questions in tech right now: How do we build systems (and teams) that actually learn with AI, not just use it? The conversation was surprisingly optimistic about what happens when we stop treating AI as a productivity tool and start seeing it as a teammate. You can watch the full webinar here, or read on below for a quick recap.

SolarWinds Day Keynote: The Future of IT Has Arrived

Get a front-row seat to the future of IT operations — powered by AI, automation, and full-stack observability. During the SolarWinds Day Keynote, on October 8, we unveiled our latest innovations in AI and automation for predictive insights and end-to-end visibility. Discover how these advancements empower IT teams to optimize performance, streamline ITSM, and simplify database management—all without disrupting existing systems.

Find and Fix Fastify Slowdowns with AppSignal for Node.js

In part one of this series, we set up basic performance monitoring for our Fastify application using AppSignal and explored key performance indicators. Now that we have our monitoring foundation in place, it's time to leverage these insights to actively improve application performance. You'll learn how to detect performance regressions, find optimization opportunities, and implement custom instrumentation with OpenTelemetry.

Sidecar or Agent for OpenTelemetry: How to Decide

Getting telemetry out of a distributed system isn’t the hard part. Getting it out cleanly, without noise, drop-offs, or odd performance side-effects — that’s where things get interesting. Before you worry about processors or storage costs, you need a clear plan for where the OTel Collector should run. Most teams narrow this down to two options: a sidecar that sits next to each service, or a node-level agent that handles data for everything running on the node. Both patterns are solid.

APM in 2026: The New Standard for Business Reliability and Growth

Global IT spending is expected to reach a record $6.08 trillion by 2026, with software investments growing by 15.2%. This shows how critical application performance has become for businesses today. For almost 80% of companies, even one hour of downtime can cost more than $300,000. In a world where every digital experience affects your revenue and brand reputation, keeping your applications performing well is no longer optional.

Datadog named Leader in 2025 Gartner Magic Quadrant for Digital Experience Monitoring

We are thrilled to announce that, for the second consecutive year, Datadog has been named a Leader in the 2025 Gartner Magic Quadrant for Digital Experience Monitoring. We believe that this recognition reflects our continued focus on helping customers observe, secure, and act on everything that matters across their technology stack.

Rechain improves performance visibility and gets 4x faster issue resolution with Scout Monitoring

Rechain is a SaaS Product Lifecycle Management (PLM) platform for fashion brands, built with Ruby on Rails, that helps modern apparel teams manage design, production, and supply chain workflows from one intuitive, cloud-based solution.

From Error to Insight: Our Brand Refresh

Software teams do their best work when they can move quickly without losing control. That reality has shaped how our product has evolved, and it needed to shape how our brand shows up too. Our refresh is not a new coat of paint. It is an honest reflection of what Rollbar is today and where we are going: a code-first observability platform that helps builders understand what is happening in their code and why, so every release is better than the last.

Product Update - Turn Off Alerts, Use Microsoft Teams, and Custom Domains

Over the last few months IncidentHub has added several new features to make it easier to fine tune your alerts. IncidentHub now also integrates with Microsoft Teams and supports custom domains for your public status pages. Let's take a comprehensive look at what's new.

Whose Fault Is It When the Cloud Fails? Does It Matter?

On Monday, October 20th, a significant portion of the digital services we use every day became inaccessible. For hours, banking, communication, and entertainment applications were unavailable. The root cause was later identified as a major outage within Amazon Web Services (AWS), the infrastructure that powers a vast number of online services. The initial response for any business affected by such an event is a frantic effort to diagnose the problem. Is it our application? Is our network down?

Your Root Cause Analysis is Flawed by Design

There’s a nagging feeling of déjà vu that haunts every network operations leader. You invest significant time and resources to resolve a major performance issue. Your best engineers isolate a culprit—a misbehaving load balancer, perhaps—and after a frantic effort, service is restored. You close the ticket, confident the problem is solved. Then, two weeks later, it’s back.

Top 4 Inefficiencies For Dev Teams Resolving Issues

Every hour developers spend troubleshooting is an hour they’re not building features, innovating, or delivering value to customers. Yet in most organizations, issue management and debugging remains one of the biggest drains on productivity and release velocity. That frustration is exactly what led our founders, themselves developers, to create Lightrun.

How Generative AI is shaping the future of enterprise applications

The next golden age of artificial intelligence has arrived, but the path forward is far from certain. Technology leaders are presented with a tremendous opportunity to revolutionize their business — that is, if they can find a way to tap into the full potential of their organization's data. In Episode 4 of Elastic's new limited series, Generation AI, Elastic's Sr. Director, Enterprise Applications, Jay Shah, shares how he believes generative AI will shape the future of enterprise applications.

Sliding Through Log-Time Space

This post kicks off a new series written by the Graylog Development Team. In these updates, we’ll highlight the features and fixes that make daily work in Graylog smoother. We want to show the work we care so much about and present the challenges we faced and overcame. Today, we’re starting with one of those minor but functional enhancements: Graylog time-range stepping.

Redefining Frontend Observability with Datadog RUM

Discover how Datadog is redefining frontend observability with Real User Monitoring (RUM). In this demo, see how RUM helps teams detect, investigate, and resolve frontend issues that directly impact user experience and business outcomes. With RUM Without Limits, you get full visibility into every user session, giving you an accurate and comprehensive view of your users’ experiences. Monitor performance, track errors, and understand how your application behaves in real time.

Scaling Java Web Applications: Choosing Between Microsoft Windows and Linux OS

Java is one of the most widely used platforms for supporting web applications. According to RedMonk and TIOBE rankings, Java has consistently remained in the top 4 most popular programming languages worldwide, with millions of developers actively using it. Industry-standard application servers such as WebLogic, WebSphere, Tomcat, and JBoss all run on Java and power a large share of enterprise workloads and Java web applications.

Why Brand Monitoring is Essential in the Age of Programmatic AI

Recently, crowdfunding giant GoFundMe made headlines after (among other things) automatically generating some 1.4 million donation pages for U.S. 501(c)(3) nonprofit organisations. These pages were created without prior consent, using publicly available IRS data and partner-feeds. According to reports, many nonprofits discovered these pages only when alerted by a donor or curious patron; they had no advance knowledge and had to manually un-publish or “claim” the page.

Monitor the Performance of Your Ecto for Elixir App with AppSignal

In part one of this series, we learned how to implement batch updates and advanced inserts in Ecto to dramatically improve database performance. But implementing these optimizations is only the first step. Ensuring they continue to work effectively in production requires professional monitoring and observability. This guide will show you how to use AppSignal for Elixir to monitor your Ecto application's performance when dealing with batch data operations.

Reality Bytes: The DEX Equation (Productivity Savings + Nexthink's $3bn Milestone)

This week, Tim and Tom are joined by RB regular and DEX Hub editor Sean Malvey to unpack Nexthink’s first Workplace Productivity Report, “Cracking the DEX Equation.” Drawing on data from nine million endpoints, the report quantifies the real productivity impact of digital employee experience — revealing where enterprises lose nearly half a million hours a year to poor DEX, and how small score gains deliver measurable ROI.

Redefining NetOps: Agent Systems and Practical AI from the ONUG AI in Networking Summit

AI in networking isn’t theoretical anymore. It’s here, reshaping how we operate. At the ONUG AI Networking Summit, we saw firsthand how agent systems are moving from hype to hands-on reality, from secure automation to data-driven root cause analysis. The future of NetOps isn’t dashboards and tickets — it’s intelligent agents, observability, and measurable business outcomes.

Transform and Migrate Logs with Datadog Custom Processor

See how Datadog’s new Custom Processor in Observability Pipelines helps you transform and migrate logs from platforms like Splunk and Sumo Logic with precision and control. This demo walks through real examples of using VRL (Vector Remap Language) to enrich log data, rewrite timestamps, apply quotas, and securely process archives.

Faster, more collaborative data exploration: Introducing saved queries in Grafana Cloud

Writing queries is one of Grafana’s most powerful features, but it can also be one of the most time-consuming. Whether you’re exploring logs or building new dashboards, you often find yourself and your team rewriting the same queries over and over again. This is why we rolled out saved queries, a feature that makes it easy for everyone on your team to save, share, and reuse queries, eliminating the need to start from scratch each time.

LogicMonitor Is FedRAMP Moderate Authorized: How We Support Federal IT

Federal agencies need observability that doesn’t create new compliance problems. Today, that’s possible. LogicMonitor Envision is now FedRAMP Moderate Authorized with a formal Authorization to Operate (ATO). That means unified, AI-powered visibility across your hybrid infrastructure—on-prem, AWS GovCloud, Azure Government, and edge—without starting your security review from scratch.

OTel Updates: Consistent Probability Sampling Fixes Fragmented Traces

You're sampling 1% of traces in production. A payment request fails at 3 AM. Logs show an error in order-service, but the full picture isn't there because different services made different sampling decisions. order-service kept the trace; payment-service didn't. So you end up checking logs and timestamps across a few services to piece things together. This happens because the usual probability sampling approach makes a separate choice at each service boundary.
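
The difference between per-service and consistent decisions can be illustrated with a small sketch. This is a simplified illustration, not the algorithm from the OpenTelemetry specification: the function names, the SHA-256 hashing scheme, and the trace ID format are assumptions made for this example only.

```python
import hashlib
import random

SAMPLE_RATE = 0.01  # the 1% rate from the scenario above


def independent_decision(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    """Each service flips its own coin, so two services can disagree on one trace."""
    return rng.random() < rate


def consistent_decision(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Derive the decision from the trace ID, so every service reaches the same answer.

    Hypothetical scheme: hash the trace ID and map it to a number in [0, 1).
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < rate


def decisions_for(trace_id: str, services: list) -> dict:
    """Return the keep/drop decision each service would make for one trace."""
    return {name: consistent_decision(trace_id) for name in services}


if __name__ == "__main__":
    # order-service and payment-service always agree, because the decision
    # depends only on the trace ID and the shared sample rate.
    print(decisions_for("4bf92f3577b34da6a3ce929d0e0e4736",
                        ["order-service", "payment-service"]))
```

Because the decision is a pure function of the trace ID and the rate, a trace is either kept by every service or dropped by every service, which avoids the fragmented traces described above.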

You Came Looking for Restorepoint. You're in the Right Place.

Restorepoint earned recognition for solving one of the hardest and most important challenges in network operations: keeping every device configuration backed up, verified, and ready to recover. It gave teams confidence that their infrastructure was consistent, compliant, and protected from the smallest misstep. ScienceLogic acquired Restorepoint in 2021 to build on that strength and extend it.

Artificial Intelligence as a Service (AIaaS): What is Cloud AI & How Does it Work?

Today, organizations looking to build AI products and services using large language models (LLMs), agentic AI, and generative AI often start by investing in artificial intelligence as a service (AIaaS), also known as cloud AI. AIaaS provides a scalable, flexible, and cost-effective way for businesses of all sizes to access advanced AI technologies without the need for extensive in-house expertise or infrastructure.

Webinar Snippet: How to Use Obkio's Visual Traceroutes

Obkio recently launched the brand new Visual Traceroute Tool. In this snippet from the feature release webinar, Solution Architect, Sam, demonstrates how to use Obkio's integrated Visual Traceroutes in Monitoring Sessions. Watch as he walks through the timeline feature, hop-by-hop visualization, and path change detection, all designed to help you identify exactly where and when network issues occur.

Webinar Snippet: How to Add Obkio's Visual Traceroutes in A Dashboard

Obkio recently launched the brand-new Visual Traceroute Tool. In this snippet from the feature release webinar, Solution Architect, Sam, shows you how to integrate Visual Traceroutes directly into your custom dashboards for even more powerful network visibility. See how easy it is to add traceroute widgets to your monitoring dashboards, giving your team instant access to hop-by-hop analysis alongside your other key graphs and network metrics.

Webinar Snippet: How to Use Obkio's Network Map (Visual Traceroutes)

Obkio recently launched the brand new Visual Traceroute Tool. In this snippet from our feature release webinar, Solution Architect, Sam, dives into Network Maps and shows you how to read and interpret the visual data to quickly identify network bottlenecks and performance issues. Learn how to leverage the map view to understand your network topology, spot problematic hops, and make faster troubleshooting decisions with visual context.

Why Email Servers Get Blacklisted?

An email server gets blacklisted when it's identified as a potential source of spam, malware, or suspicious activity. Blacklists use automated systems and user reports to flag servers that violate mailing or security standards. Once listed, legitimate messages may bounce, land in spam folders, or never reach recipients at all. Understanding why this happens is essential to prevent future listings and protect the sender's reputation.

Datadog vs Grafana (2025) - Costs, Use Cases, and Key Differences

When engineering teams evaluate observability tools, the "Datadog vs. Grafana" debate is one of the most common. The choice is difficult because they represent two fundamentally different philosophies. Datadog is a comprehensive, all-in-one, managed SaaS platform. It offers a "buy" solution where you get a unified experience for metrics, logs, and traces out of the box. Grafana is an open-source, highly flexible visualization layer.
Sponsored Post

Avantra + Ansible: Better Together for Enterprise SAP Automation

Enterprises trust Ansible for fast, reliable infrastructure automation, including Terraform for automated cloud provisioning. Many organizations using Ansible leverage Ansible SAP playbooks for SAP infrastructure automation. Avantra extends the scope of SAP operations using Ansible, adding observability, ITSM and ALM solution integration, and orchestration across the SAP estate. Avantra and Ansible together provide a closed-loop solution where monitoring, automation, and proof of outcome live in one place across on-premises, hyperscaler, and private cloud ERP implementations.

Observability Masterclass | AI-Driven Observability for Enhanced System Performance

Tuesday, October 28, 10:00 - 11:00am CDT. In today’s relentless digital world, achieving peak system performance isn’t just a goal; it’s mission-critical. Join SolarWinds and GigaOm for an electrifying webcast featuring renowned Observability authority Jon Collins, VP of Engagement and Field CTO at GigaOm.

Shadow AI on Trial: The Phantom Threat to Compliance

Every law firm I meet can explain its information security policy in minutes. Far fewer can tell me which AI tools their staff actually used last week, and what data those tools touched. That gap is where Shadow AI, the unsanctioned, unmonitored use of generative AI, slips in. It promises speed, but it quietly creates exposure: confidentiality breaches, weak auditability, and a risk to governance when the regulator (or a client’s GC) asks hard questions.

A Visionary in the 2025 Gartner Magic Quadrant for DEM

ITRS has been named a Visionary in the 2025 Gartner Magic Quadrant for Digital Experience Monitoring — for the second year running. We’ve spent another year working alongside you to solve real problems, scale what works, and prepare for what’s next. If you’re delivering resilient, high-performing digital experiences while meeting evolving compliance demands, our direction is shaped by your needs.

Query Distinct Tag Values in Under 30ms with the InfluxDB 3 Distinct Value Cache

The Distinct Value Cache (DVC) available with InfluxDB 3 Core and InfluxDB 3 Enterprise lets you cache distinct values of specific columns and query those values in under 30ms. The DVC is an in-memory cache that stores distinct values of one or more columns in a table. It is typically used to cache distinct tag values, but you can also cache distinct field values.

Integration & Data Ingestion: Strengthening AIOps Observability

Large enterprises face the challenge of managing high-volume, very diverse data streams that span both legacy and modern, digital systems and applications. To gain timely, accurate insight across this kind of complexity, IT teams need observability platforms that can do more than just monitor - they must also unify, contextualize and enrich data so teams can act effectively to protect the availability of the services their customers rely on.

Streams: Elastic's New AI That Turns Log Chaos into Clarity

Elastic just made every SRE’s life easier. With the new Elastic Streams, AI automatically organizes, structures, and analyzes billions of logs, helping you find issues, detect anomalies, and fix problems in minutes, not hours. See how Elastic’s deep generative AI core turns chaos into clarity for Site Reliability Engineers and developers worldwide.

External Request Monitoring: The Silent Pillar Every APM Needs

The global market for application performance monitoring (APM) is growing fast. Market research shows the industry is expected to rise from about USD 7.52 billion in 2023 to nearly USD 19.62 billion by 2030, with a compound annual growth rate (CAGR) of around 15.1%. This rapid expansion reflects how digital transformation, hybrid cloud adoption, and third-party integrations are reshaping performance monitoring needs. It’s no longer enough to track just internal code paths and database queries.
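
The growth figures quoted above can be sanity-checked with the standard compound annual growth rate formula. The sketch below is only an illustration: the function name `cagr` is ours, and it assumes seven compounding years between 2023 and 2030, which the blurb does not state explicitly.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.15 means 15%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the blurb: USD 7.52 billion (2023) to USD 19.62 billion (2030).
# Assumption: 2030 - 2023 = 7 compounding periods.
rate = cagr(7.52, 19.62, 2030 - 2023)
print(f"Implied CAGR over 7 years: {rate:.1%}")
```

Under that assumption the implied rate lands a little under 15%, in the same range as the quoted "around 15.1%". The exact figure depends on which years the underlying report counts as compounding periods.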

What is the Role of IT Ops? Key Responsibilities Explained

The IT ops role serves as the backbone of modern technology infrastructure, ensuring systems run smoothly, securely, and efficiently. IT operations teams manage everything from server maintenance to incident response, making them essential for business continuity. Understanding what IT operations professionals do helps organizations build stronger technical teams and improve their infrastructure management.

Don't count integrations, count dashboards and alerts

Vendors often compete by saying how many extensions or quick start packs they have. The implicit promise is: more integrations equals better observability. But that misses the point. What really matters is the quality and coverage of dashboards and alerts that you actually use to maintain system health, prevent outages and improve user experience. At Coralogix we believe that what you do with integrations is far more important than how many you have.

AI monitoring is coming to Oh Dear

Would you know if your checkout form stopped working overnight? Or if a recent deploy broke your login flow? Traditional monitoring can't catch these issues - it only tells you if your site is up, not if it actually works. AI monitoring lets you describe what should work in plain English, and we'll test it like a real user would - clicking buttons, filling forms, checking content. No scripts to maintain, no complex setup.

Now in the API: History, Custom Monitors, and Subscribers

Last month, we introduced the StatusGator API v3, a complete overhaul of our API designed to give developers more flexibility, an improved data model, and deeper integration options for monitoring the status of hundreds of services. Today, we’re excited to share three major additions to v3: the Board History API, Custom Monitors API, and Status Page Subscribers API.

WebGL Application Monitoring: 3D Worlds, Games & Spaces

WebGL has turned the browser into a real-time 3D engine. The same technology behind console-quality games now powers design platforms, architectural walkthroughs, and virtual conference spaces—all without a single plugin. These 3D experiences blur the line between web and desktop, blending high-fidelity rendering with persistent interactivity and complex real-time data streams. But with that complexity comes a new operational challenge: how do you monitor it?

Top tips for smoother IT incident management

Top tips is a weekly column where we highlight what’s trending in the tech world and share ways to stay ahead. This week, we’re talking about something every IT team knows too well—incidents. Whether it’s a sudden server crash, a network outage, or a system slowdown right before an important client call, incidents always seem to strike at the worst possible time. No matter how strong your IT setup is, issues are bound to happen.

DNS Outages Expose Hidden Risks. Edwin AI Finds Them Faster.

The recent AWS outage exposed how fragile the internet remains. Amazon traced the hours-long disruption to a DNS error—a small failure with massive reach. For most organizations, DNS operates quietly in the background. When it fails, every digital service connected to it stops. One of LogicMonitor’s valued customers, IG Group, faced a similar event less than ten hours after enabling Edwin AI.

How to Use the Power BI Desktop InfluxDB 3 ODBC Connector

The challenge of storing, processing, and alerting on your time series data is only part of the battle when it comes to deriving value from time-stamped data. While InfluxDB 3 addresses those hurdles with the database and Python processing engine, data analytics teams still need to be able to visualize their data and build dashboards to complete the time series story.

[Workshop] Fixing Your Frontend: Performance Monitoring Best Practices

The holiday season is here. Is your frontend ready for the traffic spike, or are you preparing for a debugging nightmare? In this live, hands-on workshop, we'll dive into the best practices for modern error and performance monitoring in Sentry. We’ll cover instrumenting Sentry and alert rules to surface and fix critical errors fast, and optimizing site performance using Web Vitals like TTFB and LCP.

Why Your APM Needs Observability - Metrics, Logs, and Traces Explained

Modern software applications are increasingly complex. Microservices, cloud infrastructure, and distributed architectures make it challenging for developers, DevOps engineers, and SREs to maintain high performance and a seamless user experience. Traditional Application Performance Monitoring (APM) provides critical insights into how applications perform, but alone, it often leaves blind spots when it comes to diagnosing issues or understanding the full system behavior.

Auvik Named a Leader Across G2's Fall 2025 Reports for Network Management

In G2’s Fall 2025 reports, Auvik earned top recognition as a leader in network management tools across small-business, mid-market, and enterprise categories. IT professionals rated Auvik highly for implementation, usability, results, relationship, and overall Grid® performance, reflecting one thing above all: real-world trust from the IT professionals who use Auvik every day.

Meet Olly - The Coralogix AI Observability Agent (Demo)

Olly is Coralogix’s AI-native observability agent that makes observability data fast, accessible, and actionable—for everyone. Traditionally, teams have spent valuable time piecing together dashboards and writing queries to troubleshoot issues. Olly changes that by letting you ask real questions in natural language and delivering instant, intelligent answers from across your logs, metrics, and traces.

OpenTelemetry Spans Explained: Deconstructing Distributed Tracing

In a microservices architecture, a single user request can pass through multiple services before completing. When performance drops or an error occurs, tracing that journey is the only way to locate the source. Distributed tracing provides that visibility. At its core are OpenTelemetry Spans — units of work that capture what each service does during a request.

Introducing The Next Phase Of Synthetic Monitoring: Playwright Check Suites

We've been running Playwright in production since the beginning. Today, we're going all in. When we first launched Browser Checks with Playwright support, we proved something critical: the most popular test automation framework since Selenium isn't just for testing—it's the foundation of modern production monitoring. But that was just the beginning. Today, we're announcing Playwright Check Suites—our bet on the future of monitoring and the most significant evolution in Checkly's history.

Enhanced Flexibility and Security Monitoring - New in DataStream

This update delivers significant advances in operational flexibility and security monitoring capabilities. It addresses the evolving needs of security teams across diverse deployment environments, from air-gapped networks to those prioritizing automation and simplicity, while expanding integration options and improving visibility into data flows.

Why do you only use Playwright for pre-release testing and not for production monitoring, too?

We've been running Playwright in production for years. Today, we, at Checkly, are going all in with Playwright Check Suites. Playwright Check Suites is our latest step towards uniting testing and monitoring into a single workflow. It's our biggest advancement yet! Here's why this matters: We're not adapting Playwright anymore. We're running it natively in production with full `playwright.config` support, complete custom dependency control, and support for every tag, spec, or configuration.

How to solve authentication failures when you have an Azure setup

It is not just your business. Enterprises worldwide face recurring technical issues related to authentication failures and access problems. These errors often pop up in scenarios involving service connection setups, pod start failures, or integration issues. Most of the time, they indicate failed deployments, pods failing to pull images, or intermittent authentication and access errors.

How to Replace Synthetics with the httpcheck Receiver

A 200 OK doesn't always mean everything is okay. You've probably seen it: your health check endpoint returns success, but your users are staring at an error page. Maybe the database connection pool is exhausted, or a critical downstream service is timing out, but your API dutifully returns 200 because technically it responded. This is the reality of monitoring HTTP endpoints in production—status codes alone don't tell the whole story.
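
To make the "200 is not enough" point concrete, here is a minimal sketch of a check that looks past the status code. It is not the httpcheck receiver itself. The JSON payload shape (a `status` field plus a `dependencies` map) and the function name `evaluate_health` are hypothetical, chosen only for illustration.

```python
import json


def evaluate_health(status_code: int, body: str) -> tuple[bool, str]:
    """Return (healthy, reason) from an HTTP status code and response body."""
    if status_code != 200:
        return False, f"unexpected status {status_code}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    if payload.get("status") != "ok":
        return False, f"reported status is {payload.get('status')!r}"
    # Hypothetical field: {"database": "up", "billing": "down", ...}
    deps = payload.get("dependencies", {})
    if not isinstance(deps, dict):
        return False, "dependencies field is not an object"
    failing = sorted(name for name, state in deps.items() if state != "up")
    if failing:
        return False, "dependencies not up: " + ", ".join(failing)
    return True, "ok"


# A 200 response that still describes a broken dependency is reported as unhealthy.
print(evaluate_health(200, '{"status": "ok", "dependencies": {"database": "down"}}'))
```

Keeping the evaluation separate from the HTTP call makes the rule easy to unit test. Whatever client or collector fetches the endpoint only needs to pass in the status code and body.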

10 Proven APM Best Practices to Reducing Latency and Improving Response Time

Speed defines user loyalty. Recent market research indicates that organizations adopting advanced application performance monitoring (APM) tools are achieving measurable gains in user engagement, retention, and revenue. “A 2025 performance study found that businesses tracking latency and response time proactively reduced customer churn by up to 30%.” As applications expand across distributed architectures, microservices, and cloud environments, performance gaps become harder to diagnose.

Top 11 Ruby APM Tools for 2025: A Performance-Driven Selection

Observability has become a core part of running Ruby applications at scale. Knowing how your app performs — from request latency to background job execution — helps catch slowdowns early and improve reliability. This blog walks through some of the most useful APM tools for Ruby in 2025. Each section highlights what the tool does well, where it fits best, and what kind of visibility it brings to your application's performance.

Unpacking the Elements of Site Uptime (by way of Jeopardy!)

Picture this: you’ve achieved your second lifelong dream of being a contestant on Jeopardy! Now it’s time for the fateful “final answer.” The good news? You’ve got a comfortable lead over your fellow contestants, and a correct response means eternal bragging rights. The bad news? Miss this one, and everyone — your family, coworkers, dentist, mechanic — will remind you of it forever. The lights dim. The audience holds its breath.

Declarative Configuration in OTel (Grafana OpenTelemetry Community Call #1)

We’re kicking off a brand-new Grafana OpenTelemetry Community Call! Join us as we dive into getting observability into your apps and infrastructure with Grafana, powered by OpenTelemetry. In this session, we’ll dive into Declarative Config — the new way to make OpenTelemetry onboarding simple and powerful. Instead of juggling environment variables or boilerplate in your startup code, declarative config gives you a clean, language-agnostic approach that works across SDKs and unlocks future possibilities like remote configuration. Join us with Marylia Gutierrez (OTel JavaScript approver & core contributor) to explore.

How Atlassian built a smarter observability system with Grafana and OpenTelemetry

Discover how Atlassian built OpsDeck, an observability platform powered by Grafana, to automate incident detection, improve response time, and reduce troubleshooting from one hour to under a minute. Hear how the Observability Insights team scaled OpenTelemetry, broke silos, and built smarter workflows for both engineers and support.

Demystifying WMI Permissions

Network administrators are always seeking a deeper understanding of their Windows-based environments. Windows Management Instrumentation (WMI) enables their network monitoring tools to access system information, manage configurations and automate tasks. It plays a vital role in network monitoring by providing a standardized interface for querying and controlling system components. A complex set of permissions governs WMI access.

Kubernetes monitoring & observability trends 2026 | Future of Kubernetes observability

Kubernetes continues to dominate as the container orchestration standard, but the way we monitor and observe clusters is rapidly evolving. As we head into 2026, Kubernetes monitoring is moving toward actionable insights, cost-aware observability, and security-first approaches. This blog dives deep into what engineers, architects, and platform teams should watch for in the year ahead — with real-world examples for context.

Clarity in the Dojo: The power of the Summary Agent

In the dojo, not every role is about throwing punches. Some roles are about awareness, the unmistakable voice that tells the fighter when to move, where the strike is coming from, and why the opponent matters. That’s the role of the Summary Agent in Sumo Logic Dojo AI. Unlike a traditional agent, it doesn’t launch queries or carry out actions on its own. Its purpose is to narrate, not act. In doing so, it becomes the foundation for every other decision in the dojo.

Get organized, actionable insights from complex test environments with Datadog Test Suites

Modern teams often run hundreds of synthetic tests across multiple services, environments, and user journeys. While these tests provide deep visibility, managing them as a flat list can quickly become overwhelming, especially as organizations scale and teams specialize.

The next evolution of WebPageTest has arrived, and it's a game-changer

Now fully integrated into Catchpoint’s Internet Performance Monitoring (IPM) platform, WebPageTest is no longer just a testing tool; it’s your full-stack performance command center. From AI-powered insights to automation and Smartboards, the new WebPageTest gives digital experience teams everything they need to move beyond page speed and master end-to-end performance. Test smarter, detect faster, and optimize every layer of performance with a unified, AI-powered platform built for experts.

Grafana and Grafana Cloud release cycle: An end-of-year update

With the end of the year fast approaching, we want to let you know about some important dates for our upcoming release freezes. Our annual release freeze helps ensure stability for everyone during the holiday season, which is a critical time for many of our customers. This pause helps us protect our on-call teams and maintain a smooth experience for you.

AI Agent for Cloud Cost Optimization: From Blind Spots to Smarter Spend

Cloud has become the backbone of digital enterprises, but managing its cost footprint is proving increasingly difficult. With multiple providers, diverse pricing models, and ever-changing workloads, organizations often find themselves facing spend leakage and unanticipated overruns. The stakes are high—not only in terms of IT budgets but also in ensuring cloud resources deliver maximum business value.

How to bridge speed and quality in experiments through unified data

Metrics are fundamental to experimentation for two reasons: They set the basis for evaluating ideas and interventions, and they can suggest where to look next. As such, many teams collect a wide variety of metrics, from application performance data to revenue trends. However, doing so often means manually knitting together data from multiple sources and formats. Even then, data silos can make it challenging to understand the full impact of experimental changes. In this post, we’ll explore.

The Network Engineers You Can't Hire? They Already Work for You

In my conversations about managing large, complex networks, one topic is now constant. The issue isn't budgets or new technology; it's about personnel. Specifically, it's the increasing difficulty of finding and retaining skilled professionals. If you are feeling this pressure, you are not alone. The search for technical talent is a universal challenge.

What's New in Network Observability for Fall 2025

As your partner in network observability, we’ve worked together to help you manage an increasingly complex digital landscape. You’ve built a powerful monitoring foundation, but the pace of change doesn’t slow down. Your network continues to expand across hybrid clouds and multi-vendor SD-WAN, and the demands on your team grow with it.

Datadog Cloud Cost Management: Make cost a key metric for engineers

See how Datadog Cloud Cost Management puts cost and efficiency KPIs directly in front of engineers in their daily workflows. In this short demo, you’ll learn how to: Datadog unifies cost, performance, and business metrics in one platform, so FinOps, engineering, and finance teams can make cost-aware decisions together.

5 Log Management Best Practices for Your Organization

At Logz.io, we speak with hundreds of companies every month. One thing is consistent across the board: everyone ships logs. But the challenges are equally common: What are the best practices for logging? How do we reduce noise? How should we architect our logs to make them truly useful? The reality is that logs are noisy for everyone. The best time to standardize your logging practices is when you write your first line of code—though that rarely happens. The second-best time is now.

Grafana Tempo 2.9 release: MCP server support, TraceQL metrics sampling, and more

Grafana Tempo 2.9 is now available, delivering MCP server support, TraceQL performance improvements, and more. Watch the video below to see the Tempo MCP server in action and learn how to speed up TraceQL metrics queries, or continue reading to get a quick overview of these and other updates. The Grafana Tempo 2.9 release notes and changelog provide more in-depth details and include all of the changes that came with this release.

Two Factors, Double Security?

“Please enter the code we just sent you.” – most people have seen this message when logging into an online service. Two-Factor Authentication (2FA) is no longer reserved for banks or enterprises. It’s now common in email, social media, and shopping accounts. The idea is simple: in addition to a password, you need a second factor so that attackers can’t break in with just one piece of information. But what methods are actually used – and how secure are they really?

Your network isn't infrastructure anymore. It's a product.

In my last blog, I discussed a common problem: metrics like mean time to resolution (MTTR) mean nothing to business leaders. Celebrating a faster fix for an outage that still cost the company thousands in lost sales is a conversation that goes nowhere. You might as well be speaking a different language.

We've refreshed and expanded the StatusGator Help Center

We’re excited to share a major update to the StatusGator Help Center — redesigned to make finding answers and learning new features faster and easier than ever. We’ve reorganized our documentation, added new guides, and improved formatting so you can navigate with ease — whether you’re just getting started or managing advanced integrations.

Latency & Leadership with Mehdi Daoudi

Leadership is about more than telling people what to do. It’s about inspiring belief in your vision for the future. Sometimes there’s a delay between the time you share the vision and when the rest of the team “gets it”. The Latency & Leadership series hopes to shorten that lag time by creating a platform for leaders in the tech space to share their ideas, their passion, and their vision.

Elastic recognized as a finalist for Innovation in Customer Portals in 2025 TSIA STAR Awards

We are proud to announce that Elastic has been named a finalist by the Technology & Service Industry Association (TSIA) in the 2025 STAR Awards program for Innovation in Customer Portals that Improve Digital Customer Experience. This award recognizes Elastic’s ability to embrace AI innovations to enhance our digital customer experience.
Sponsored Post

Hidden Cost of Siloed Monitoring Tools

In today's complex IT environments, organizations often rely on a patchwork of specialized monitoring tools. One platform might monitor databases, another cloud workloads, a third enterprise applications, and yet another the infrastructure itself. While each tool addresses a specific need, this fragmented approach introduces hidden costs that can undermine operational efficiency, inflate budgets, and slow response times when critical incidents occur.

The Hidden Risk of DNS - Lessons from the AWS Outage & Why You Need DNS Spy Monitoring NOW

On October 20, 2025, much of the internet came to a halt. Apps wouldn’t load. Payments failed. Cloud dashboards went dark. From Fortnite to Alexa, Snapchat, and countless business platforms, users across the world were suddenly offline — all because DNS broke inside Amazon Web Services’ (AWS) US-East-1 region.

Amazon Isn't Eating Its Own DNS Dog Food

On October 19-20, 2025, Amazon Web Services (AWS) experienced a significant outage (AWS status) affecting its US-EAST-1 region in northern Virginia. The root cause was DNS resolution failures for DynamoDB’s API endpoints, which cascaded across AWS’s interconnected services, disrupting major platforms including Snapchat, McDonald’s, Disney+, Roblox, Coinbase, Reddit, and Amazon’s own services.

How WWT Proves the Value of Agentic AIOps with LogicMonitor's Edwin AI

Agentic AI has entered day-to-day operations. Systems with the ability to act, learn, and adjust are already cutting noise, speeding remediation, and giving engineers time back for work that moves the business. In a recent webinar, Karthik SJ, General Manager, AI at LogicMonitor, and Mike Cervasio, Global Practice Manager, AIOps at World Wide Technology, explored what makes this new phase of AIOps actionable.

Live in Boston: Data, DEX, and a Few Fist Fights @ Nexthink Experience

Tim and Tom host another special live edition of The DEX Show, this time from the Omni Boston Hotel, recorded during last week’s Experience Boston. Joined by Christina Lahr (Bayer), James Krick (Campbell’s), and Ryan Way (Warburg Pincus), the hosts dig into more real-world stories of data-led IT excellence, once again in-person. In between, listeners can learn a few unexpected facts about Tim — has he ever been in a fist fight, starred in a play, or been thrown out of a bar? Listen now to find out...

What Is an Email Blacklist?

An email blacklist is a database that lists IP addresses or domains suspected of sending spam or malicious emails. Mail servers use these lists to decide whether to deliver or reject incoming messages. Understanding how blacklists work is essential for keeping your messages deliverable and your domain reputation intact.
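
Many IP blacklists are published over DNS (often called DNSBLs): a mail server reverses the octets of the sending IP, appends the list's zone, and treats any answer as "listed". The sketch below illustrates that lookup convention. The zone `dnsbl.example.org` is a placeholder, not a real list, and the helper names are ours.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for anything but IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """True if the query name resolves; any resolution failure is treated as not listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# 192.0.2.10 is a documentation address; the zone is a hypothetical placeholder.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

In practice each list documents its own zone, return codes, and usage limits, so a real check should follow the specific provider's guidance rather than this generic pattern.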

Introducing Updog.ai: Real-time provider status from Datadog

When external SaaS providers or cloud services degrade or go down, engineers often find themselves wondering if the issue they're encountering is local or more widespread. The answers they find are usually slow to surface, limited in detail, or entirely dependent on the provider's updates. Vendor-controlled status pages and third-party aggregators don’t provide the timely, independent visibility that's necessary to quickly and accurately identify the root cause of slowdowns.

What is Open Telemetry? The Future Is Here

Watch SolarWinds tech evangelist, Sascha Giese, dive into OpenTelemetry (OTel) and explain why a vendor-agnostic standard is the future of observability and application performance monitoring (APM). If you’ve ever wondered what OpenTelemetry is, Sascha’s presentation is a great place to start, or to dive back into the topic.

Optimize HPC jobs and cluster utilization with Datadog

High-performance computing (HPC) environments support some of the most critical workloads in the world—from asset pricing models in financial institutions to molecular simulations in drug discovery. These workloads often span hundreds of thousands of cores, depend on specialized infrastructure such as GPUs, and run for extended periods. As a result, performance and efficiency are critical.

Detect and map third-party outages with Datadog External Provider Status

Modern applications depend on dozens of external cloud platforms, APIs, and SaaS services to function. But when those providers experience issues, engineers often spend valuable time asking a basic question: Is the problem with us or with them? Provider-maintained status pages are often slow to update, leaving teams waiting for confirmation while incidents escalate. This delay wastes valuable time, prolongs investigations, and risks customer trust.

Authentication Model in OpenTelemetry

In any type of software that involves the movement of data or information, there is a pressing need to make the passage of data secure. One way of achieving this is by authentication. You must have experience authenticating API calls or other data streams. In modern systems, where even a small mishap can wreak havoc and you might wake up to a $$$ bill the next day, we should do whatever is within our capacity to secure our systems.

Traceroute vs. Ping: When to Use Each

Let’s talk about the most fundamental network diagnostic tools: ping and traceroute. These command-line utilities have been the backbone of network troubleshooting for decades, yet many IT professionals struggle to use them in the right context. Knowing which tool to use (and when) can mean the difference between a five-minute fix and hours of frustration. While both ping and traceroute help diagnose network connectivity issues, they serve distinctly different purposes.

Network Monitoring for Data Centers

Kentik NMS (Network Monitoring System), part of the Kentik Network Intelligence Platform, brings true visibility and context to network operations. See how device metrics, traffic data, and application insights come together to eliminate blind spots—so your critical workloads, like AI training and inference, run smoothly and reliably.

The Monitoring Blind Spot That Could Cost You Black Friday

With Black Friday and the holiday season looming, IT teams everywhere are bracing themselves for what is, year after year, the most daunting stress test of your entire service delivery chain. Under relentless peak demand, every link in your digital experience is scrutinized by customers whose tolerance for friction is at an all-time low. It’s not just about uptime, monitoring dashboards, or technical metrics.

AI Agent for Incident Resolution: Combining Intelligence with Autonomous Actions

Incident management is a high-stakes function. IT operations teams and SRE teams may play different roles, but when a priority incident surfaces, it is often all-hands-on-deck to ensure it is resolved in minimal time. That’s because of the high impact of incidents: if not resolved in time, they can cascade and impact other IT systems, leading to downtime, business disruptions, and monetary losses, and affecting brand value, compliance, and regulatory standing.

Datadog Cloud Cost Management: Telemetry-driven cost allocation

See why Datadog is a leader in cloud cost allocation. In this demo, learn how Datadog leverages high-resolution observability data to deliver accurate, dynamic cost attribution across clouds and containerized environments. You’ll see how Datadog: Discover how Datadog combines cost, performance, and business context to make cost reporting both accurate and actionable.

The Agentic Enterprise Needs a Nervous System

Over the weekend, when Salesforce introduced the concept of the Agentic Enterprise, it wasn’t defining a new market trend. It was signaling an inflection point. A moment when the conversation about artificial intelligence stopped being about tools and started being about trust. For the first time in decades, enterprise software isn’t simply enabling decisions. It’s making them. Systems are reasoning, choosing, and acting in real time across sprawling digital ecosystems.

Bridging partners in pursuit of agentic AI - Part 2: How leaders can position themselves for the future

From ecosystem foundations to future advantage. In Part 1: Why partnerships matter for enterprise intelligence, we explored how enterprises are moving from experimentation to scalable impact with agentic AI and how ecosystems make that possible. But naturally, the next question is: Where do we go from here?

AI-Powered Translation Tools: A Hidden Asset for Scaling DevOps Globally

DevOps or development (Dev) and IT operations (Ops) teams are no longer confined to single geographic locations or language groups. With over 80% of organizations now practicing DevOps (a figure projected to reach 94% in the near future), the challenge of scaling operations globally has never been more critical. Yet, one persistent bottleneck continues to slow down even the most sophisticated DevOps workflows: language barriers.

Making logs work smarter: Evolving your observability strategy

When you start building an observability stack, it’s natural to reach for logs first. They’re familiar, easy to generate, and often already part of a developer’s workflow. And sending logs to a centralized system feels like a quick win, too. Simply add a log shipper, and voila, your application is observable.

Bridging partners in pursuit of agentic AI - Part 1: Why partnerships matter for enterprise intelligence

The pace of change in AI development has been dizzying. In just a few years, we’ve moved from experimenting with AI, machine learning (ML), retrieval augmented generation (RAG), and agents to asking how these innovations can solve real business problems. Enterprises are no longer impressed by the novelty and possibilities; instead, they expect outcomes.

Navigating the Database Ecosystem in 2025

In 2025, the database ecosystem is more diverse and interconnected than ever before. From AI-assisted natural language queries that analyze your data to open table formats that make it easy to bridge systems, data infrastructure is moving towards openness, intelligence, and composability. Modern databases are no longer isolated systems; they are part of a broader ecosystem where interoperability is as important as performance.

RED Metrics & Monitoring: Using Rate, Errors, and Duration

The RED method is a streamlined approach for monitoring microservices and other request-driven applications, focusing on three critical metrics: Rate, Errors, and Duration. Originating from the principles established by Google's "Four Golden Signals," the RED monitoring framework offers a pragmatic and user-centric perspective on service assurance and service performance.
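
As a rough illustration of the three RED signals, the sketch below computes request rate, error ratio, and a 95th-percentile duration from a batch of request records. The `RequestRecord` shape, the choice of 5xx responses as "errors", and the nearest-rank percentile are all assumptions made for this example, not part of the RED method's definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def red_metrics(requests: List[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Summarize Rate, Errors, and Duration for requests seen in one time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: ceil(0.95 * n) - 1, done in integer arithmetic.
    p95_index = (95 * total + 99) // 100 - 1
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[p95_index],
    }


sample = [RequestRecord(duration_ms=10.0 * (i + 1), status=500 if i < 2 else 200)
          for i in range(20)]
print(red_metrics(sample, window_seconds=10.0))
```

A production setup would usually derive these from counters and histograms in a metrics backend rather than raw records, but the arithmetic is the same.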

Get started with Grafana Alerting: Route alerts using dynamic labels

In this tutorial, you will learn how to configure notification policies for dynamic routing based on query values. Don't miss the rest of the "Get started with Grafana Alerting" series! Each part dives into a different feature to help you get the most out of alerting in Grafana.

Application Performance Monitoring (APM) Guide: Monitor and Optimize Application Performance

Every millisecond your application takes to respond can decide whether a user stays or leaves. But here’s the catch: you can’t improve what you can’t see. Behind every slow page load, failed API call, or random spike in latency lies a story your application is trying to tell. Application Performance Monitoring (APM) is how you listen to that story.

Demo of Raygun's remote MCP

This Raygun remote MCP demo highlights the new depth of context available. The agent isn’t just fetching error lists; it’s reasoning through stack traces to find the issues. Combine this with the ability to now view associated deployment versions, browser information, breadcrumbs, customer data, and more, and the agent becomes infinitely more capable at solving errors. We’ve even heard of some of the early testers going from having errors in production to having them solved within minutes.

AWS Outage: How do you prepare for the failure of your own safety net?

When AWS’s massive outage struck, it didn’t just take down cloud services, apps, and enterprise platforms. It also knocked out many of the monitoring systems organizations depend on for real-time answers. Observability companies, including Datadog, New Relic, Checkly, Dynatrace, SpeedCurve, and Splunk Observability, lost visibility or functionality precisely when organizations needed them most.

Unreal Engine crash reporting now available on gaming consoles with trace-connected logs

With the first major release of the Sentry Unreal SDK (now on v1.2.0, which you can also explore in our interactive sandbox), we’ve made some important improvements for cross-platform Unreal developers in platform coverage, debugging with user feedback, and performance monitoring. Here’s what’s new.

10 Best Log Monitoring Tools

Log monitoring stands as the backbone of resilient, secure, and high-performing digital operations. Every digital service, application, cloud platform, and network device leaves behind a trail of log files, containing raw, unstructured data that chronicles system events, user actions, errors, security activities, and business transactions. For organizations striving to achieve operational excellence, these logs are more than archives; they're the heartbeat of every mission-critical system.

Microsoft Teams Troubleshooting for Teams Performance and Connection Issues

How many times has this happened? You're on a Microsoft Teams call, and your call disconnects, lags, or freezes, so you go to Google to look up how to solve the problem. Well, look no further! If you're using Microsoft Teams, there are proven ways to troubleshoot those pesky performance and connection issues that are putting a damper on your team's collaboration.

Show me the (meeting) money: How to monitor the real-time costs of a meeting in Grafana

This meeting could’ve been an email. It’s a phrase most of us have said (or at least thought) at some point in our careers. For me, that realization hit years ago while working for a government organization. I’d frequently sit through long, agendaless meetings that seemingly went nowhere. I wasn’t sure why I was there. And because I’m an engineer at heart, I started to wonder: what were these meetings actually costing the organization?

A deep dive into Java garbage collectors

Historically, developers have relied on languages like C and C++ for explicit control over memory allocation and deallocation. This approach can yield very low overhead and tight control over performance, but it also increases complexity and risk (e.g., memory leaks, dangling pointers, and double frees). This often results in runtime issues that are difficult to diagnose, which can become a drag on team velocity.

Ingest OTLP metrics directly into Datadog with the new OTLP Metrics API

Many organizations rely on OpenTelemetry (OTel) to standardize observability across distributed systems. These organizations are at varying stages of adoption and are implementing OTel in complex environments with diverse configurations. To support this range of use cases, Datadog offers many ways to use OpenTelemetry with Datadog.

Track, debug, and roll back changes with Version History for Synthetic Monitoring tests

A synthetic test is only useful if you can trust what it’s telling you. When one fails, the reason may not be obvious. Was the application updated? Did the test change? Or both? As more people contribute and refine the same test, it becomes harder to understand what changed or restore a working version. Without clear visibility into those updates, teams can spend more time tracking down the cause of a failure than resolving it.

From court to code: Build an agentic RAG assistant with Elasticsearch

Want to see what it really takes to build a smart AI assistant? How about one that can help you make the right fantasy basketball picks? In this live session, we’ll demonstrate how to instantly activate and ground a high-performance AI agent using the Elastic Agent Builder, and we’ll show how it powers real-world use cases like smarter player picks. Join JD Armada, developer advocate, for a 20-minute live coding session to learn about.

How Leading Businesses Achieved Greater Uptime with Atatus Monitoring

When every second of downtime can mean lost revenue and frustrated customers, leading businesses can’t afford to leave performance to chance. That’s why they are turning to Application Performance Monitoring (APM) tools like Atatus, a Datadog alternative, to keep their applications healthy, detect issues before customers do, and achieve higher uptime than ever. But how exactly are they doing it?

Powering Mexico's Digital Future: Expanded Internet Observability with Catchpoint

As of 2025, more than 110 million Mexicans are online, putting digital‐access penetration at roughly 83% of the population. Mexico is already one of Latin America’s anchor markets, leading the region in startup momentum, cloud adoption, and cross-border digital trade. A few days ago, CloudHQ announced a $4.6B investment in Mexico to open multiple datacenters. Yet even with this scale, service quality still varies dramatically across cities, states, and ISPs.

SharePoint Server Monitoring: Uptime, Performance & SLAs

SharePoint is the backbone of internal collaboration for countless organizations. It hosts documents, drives workflows, powers intranets, and underpins team communication across departments. But when it slows down—or worse, goes dark—productivity grinds to a halt. The problem is that most monitoring approaches treat SharePoint like a static website. They check availability, not experience.

Onboarding Microsoft Sentinel data lake with DataStream

Modern security operations teams face an overwhelming challenge: a rapidly growing volume of logs, alerts, and telemetry from cloud services, on-premises infrastructure, and third-party security tools. Traditional SIEM platforms often struggle to scale cost-effectively and provide the agility needed for advanced analytics and threat hunting.

The Hidden Barrier to Network Automation Isn't Your AI - It's Your Data

For years, the promise of AI-driven network automation has loomed large. Vendors and analysts alike have painted a future where autonomous operations handle outages before they happen, root causes are explained instantly, and teams finally escape the endless cycle of alerts, tickets, and manual troubleshooting. But in practice, most automation initiatives stall long before they reach that vision.

Tech Talk #10 Building a VictoriaMetrics PaaS: Setting up Metrics, Logs, and Traces

From Blueprint to Reality: This episode is designed to be a practical, step-by-step guide. We will show you how to leverage the VictoriaMetrics Kubernetes Stack (our "easier button") to simplify the deployment process and get your components running quickly.

From Datadog to Checkly in minutes

Looking to cut your Datadog bill and modernize your monitoring workflow? In this session, Dan Giordano and Giovanni Rago show how to migrate your Datadog synthetic monitors to Checkly in minutes, unlocking Playwright, Monitoring-as-Code, and AI-powered automation. In the intro, "Why Migrate from Datadog," Dan introduces the session, what will be covered, and who it’s for.

Introducing Obkio's Visual Traceroute Tool

Introducing Obkio's New Visual Traceroute: See Your Network Issues, Not Just Hops. After years of evolution, we're launching the most advanced Visual Traceroute we've ever built, now fully integrated into the Obkio app. What's New:
✓ Fully integrated visual network mapping
✓ Historical timeline that actually remembers
✓ Correlated with Network Performance and SNMP data
✓ No extra setup required
✓ See the complete story of network issues, not just individual hops

What is an Anycast network and how does it help handle high volume network traffic and effective query resolution?

When an organization's authoritative DNS server faces a high volume of network traffic from multiple client devices, the organization needs more than one DNS server to handle it. But manually routing the network queries to each DNS server would be a tedious job for network admins. In turn, this would slow down network service responses, leading to delays and disruptions.

Network Diagnostic Tools: What They Are, What They Do, and Why Network Pros Need Them

If you’ve ever been the “network person” in the room, you know how it goes: the moment something slows down or disconnects, everyone looks at you. The pressure’s on, and you need answers fast. Is it the Wi-Fi? The ISP? A misconfigured switch? Or maybe that new cloud app is hogging bandwidth? That’s where network diagnostic tools come in.

25 Sumo Logic updates to better monitor and secure your Azure environments

If you manage workloads across multiple clouds, you know how easy it is for critical alerts or performance issues to get lost in the noise. Switching between consoles, correlating logs, and tracking metrics across platforms can slow down troubleshooting, delaying incident resolution and increasing the risk of missing critical alerts.

How Legal IT Can Escape the Graveyard of Recurring Tickets

It’s 3:30 p.m. A partner’s laptop refuses to authenticate to the VDI. The urgent filing is in two hours. The ticket title reads like a headstone you’ve seen a hundred times: “Can’t connect, tried rebooting, please help.” Another “undead” incident claws its way out of the queue. By home time, the backlog becomes a graveyard of recurring tickets, and your team, although brilliant and capable, is exhausted and applying the same fixes again and again.

ISP Monitoring Explained: How to Measure, Manage, and Improve Internet Performance

Reliable internet connectivity isn’t a convenience. It’s mission-critical infrastructure for modern organizations. Every organization today depends on high-speed, reliable internet access for daily operations—from cloud collaboration and data transfer to streaming, remote work, and customer engagement. As digital transformation accelerates, the rise of AI, large language models (LLMs), IoT, and device sprawl has massively increased bandwidth demand and network complexity.

InfluxDB 3 on Amazon Timestream for InfluxDB: Real-Time Performance, Now Fully Managed on AWS

Today, we’re announcing a major milestone for developers building the next generation of intelligent, real-time systems: InfluxDB 3 is available on Amazon Timestream for InfluxDB, now the default time series database offered directly in the AWS Management Console. This brings InfluxDB 3, our next-generation time series database, directly into the AWS ecosystem for the first time.

Network Intelligence in the AI Era #network #networktraffic

Transform your network strategy from guesswork to data-driven. In this session, you'll learn how to: Build a peering & transit strategy that cuts costs. Model real connectivity costs per customer. Use dashboards to improve margins and renewals. Ask natural-language questions with Kentik AI. Join experts from Kentik, NetMavens, and Seaborn Networks, hosted by Capacity Media, to align your network reality with your commercial goals.

SOC 2 Type 2: Netdata's Security Controls Validated Over Time

We’re excited to share that Netdata has successfully achieved SOC 2 Type 2 attestation. Following a five-month audit conducted by Sensiba LLP, we can now confirm that our security controls work consistently in practice. The audit covered the period from April 1 to August 31, 2025, and tested whether our controls operated effectively throughout that entire timeframe.

From pillars to rings: How interconnected observability in Grafana Cloud optimizes performance and reduces telemetry waste

In observability, we’ve traditionally been taught to think in terms of pillars, namely logs, metrics, and traces (and more recently, profiles). But pillars are rigid and disconnected. They don’t reflect how modern systems actually work or how we troubleshoot in real time. So let’s change that.

Top 9 APM Tools for Node.js Performance Monitoring

When a Node.js app slows down, you don’t get a clear picture right away. One service stalls, another spikes in CPU, and somewhere in between, requests start piling up. You can’t fix what you can’t see. Application Performance Monitoring (APM) tools close that gap. They capture request traces, latency, and errors across your stack — showing you what’s running slow and why.

Obkio's Visual Traceroute Tool: Feature Release

Today, Obkio’s Network Performance Monitoring solution is announcing the release of our all-new Visual Traceroute Tool integrated into Obkio’s application. This feature is a re-invention of Obkio’s standalone Visual Traceroute Tool (Obkio Vision), and has been transformed to help users better understand network path performance and the source of network issues.

Implement Distributed Tracing with Spring Boot 3

A slow checkout request. A background job stuck waiting on another service. A log message that looks fine until performance drops. In a Spring Boot microservices setup, these are the moments that test your observability. You know something's wrong, but tracing the request across dozens of services feels impossible. Distributed tracing changes that. It connects every span in the request's journey, showing exactly where time is spent and where things start to break down.

Reality Bytes: Jon Leighton Returns! How Community Continues to Shape DEX

Head of Nexthink's Digital Community and User Groups Jon Leighton rejoins Reality Bytes with Tom, Sean, and Dina to explore how community remains the beating heart of Digital Employee Experience (DEX). Fresh from Experience London and heading into Experience Boston, Jon shares how Nexthink’s Ambassador Program, user groups, and learning initiatives empower practitioners to grow, collaborate, and lead change. From storytelling and communication to real-world impact and career development, this episode celebrates the people and connections driving DEX forward.

The 2025 Guide to Open Source Status Page Software

This is an updated version of the 2024 article. Maintaining transparent communication about service availability is crucial for businesses of all sizes. Status pages are an important part of your communication strategy during times of outages and maintenance events. You can choose to go with a fully managed status page provider or host an open-source one yourself.

Best APM Tool for Modern Teams | Site24x7's Application Performance Monitoring

Your apps are the heartbeat of your business, and you risk user satisfaction when app performance drops. ManageEngine Site24x7's Application Performance Monitoring (APM) gives you the visibility you need into your application environment. Its features range widely: code-level insights, distributed tracing, centralized log management, and much more.

CriblCon 25 Keynote Livestream

IT and security data professionals stand at a crossroads. The practices and technologies that have served you for the last ten years are at their breaking point, facing an onslaught of data growth and complexity that will only accelerate as AI goes mainstream. You have a choice. Stay earthbound or take your telemetry to the stratosphere and beyond.

Monitor logs from Amazon EKS on Fargate with Datadog

Amazon EKS on Fargate is a managed service that reduces the operational overhead of maintaining a Kubernetes cluster by abstracting away the underlying infrastructure. In a serverless Fargate environment, each pod is assigned its own isolated compute resources; there is no direct host-level access.

CIDR blocks vs. IP ranges: Aligning network discovery with business value

At every turn, IT leaders are required to prove the value of every technology investment. Technology business management (TBM) practices encourage connecting tech spend directly to business outcomes, demanding accurate data about what’s in your network and how it supports the organization.

Baking in site reliability with observability and AI: How SpotOn uses Grafana Assistant to keep restaurants running

When you operate a restaurant, the last thing you want to do is shut your doors and turn away guests and staff because of some technology failure. And if you’re the one providing that tech, it’s your job to make sure that doesn’t happen. “For us, observability is about a lot more than just dashboards and alerts.

Kentik in Motion: Unlocking the Power of Data Explorer

Kentik Data Explorer is the heart of Kentik, where raw network telemetry is transformed into actionable insights. Yet many users don’t realize just how much they can do with it, or how Data Explorer connects to other parts of the Kentik platform. In this session, we walk through the fundamentals of using Data Explorer effectively, provide real-world examples, and highlight how it ties into workflows such as alerting, dashboards, and troubleshooting.

Teams issues are inevitable - but your users don't need to know that

Our previous blog gave a quick overview of an all-too real scenario involving poor Microsoft Teams performance and frustrated VIP users. The situation, picking up on our recent Power Moves webinar, centered on a big board meeting held over Teams that suffered from multiple call quality issues — spurring the CEO to pay a stormy visit to IT. In that case, the issue had already happened, and our point was that with native Microsoft tools, it can be hard to get to a precise root cause quickly.

APM vs Observability: Both-and, not either-or

I'll start this, the third and final entry in my series on APM and Observability, which was originally inspired by my contribution to an APMdigest article, by once again pointing out that APM tools can be built with observability in mind. Many are, in fact. And the ones that aren’t don’t turn into a different type of tool. In my experience, it's more that there's a difference of mindset.

Rolling Out AI Application with Confidence: How Nexthink's AI Drive + Adopt Makes AI Compliant, Insightful, and Effective

From Microsoft Copilot to ChatGPT, AI applications are quickly becoming everyday workplace tools. But for many organizations, turning on these capabilities isn’t as simple as flipping a switch. Enterprise licenses for AI tools can cost millions, yet few companies can confidently say employees are using them effectively, or safely. The reality is that most AI rollouts start strong but stall fast.

Distributed Historian Architecture with InfluxDB 3

From pipelines to warehouses, modern operations generate more distributed data than ever, with equipment and connected devices spread across factories, grids, and remote sites. A single, centralized historian can no longer handle this volume or distribution. Without change, organizations risk fragmented visibility, higher costs, and slower responses.

Choosing the Right APM for Go: 11 Tools Worth Your Time

If you’re building high-performance systems, Golang has probably earned a spot in your stack. Its speed, lightweight concurrency, and quick compile times make it ideal for scalable APIs, microservices, and distributed systems. But those same qualities that make Go powerful can make performance monitoring tricky. Goroutines run fast and in parallel, which means a simple CPU or memory graph doesn’t always tell you what’s slowing things down.

AI-First: Agentic AI needs a new architecture

At Cribl, we’ve talked a lot about epochs. A moment in time when there was a before and after. AI, and specifically agentic AI, is an epoch. The way we work is going to forever change. There have been many such events in our lifetimes: the PC, the Internet, and the smartphone. AI will change how we work forever. Prior to the PC, there were people whose jobs were literally titled “computer”.

Introducing Cribl Notebooks: One Tab For Your Entire Investigation

Investigations move fast. Data is messy. And today’s analysts are expected to connect the dots across massive datasets and various tools—while documenting every step and sharing results with stakeholders. What does that look like? A security investigation may involve 10 or more queries—each one filtering, transforming, and analyzing data from a different angle—duplicated across multiple browser tabs so nothing gets lost.

Introducing Cribl Insights: A central hub for monitoring and alerts

What happens when your data pipelines slow down, drop volume, or quietly change shape? Most monitoring tools won’t catch those shifts until it’s too late—when downstream systems are already impacted, dashboards are broken, or critical information is missing. That’s why we’re excited to introduce Cribl Insights, to give you real-time visibility into every part of your Cribl environment: data flows, operations, processing, user activity, configuration changes, and more.

Managing observability costs at scale: A look at the latest cost management features in Grafana Cloud

The benefits of observability are clear: deep visibility into system health, faster troubleshooting, and improved reliability (to name a few). But what’s equally clear is that, as organizations scale and evolve their observability strategies, they need a way to tap into these benefits without runaway costs. According to Grafana Labs’ 2025 Observability Survey, 74% of respondents say cost is a top priority for selecting tools.

Optimize Cloud Costs with Datadog Cloud Cost Management

Datadog Cloud Cost Management unifies observability and cost data so engineering and FinOps teams can drive efficiency together. In this demo, see how you can: allocate cloud costs across AWS, Azure, Google Cloud, OCI, and SaaS providers with precision; empower engineers by surfacing costs in their daily workflows; automate recommendations to accelerate optimization; and monitor your daily Datadog costs at no additional charge.

Break production less with AI code review

Prod is down, the errors feed is on fire, and your code is to blame. You’ve got the info you need to debug, but it would’ve been nice to have before you shipped this mess. In this workshop, we’ll do a complete walkthrough of Sentry’s new AI code review features. This workshop will cover: how Sentry predicts errors to save you from shipping high-impact bugs; using AI-powered PR review instead of making your teammates search for every typo; and getting AI-generated unit tests that cover your changes and catch potential issues.

Introducing Cribl Notebooks: Investigate, Visualize, and Share - All in One Tab

Run every part of an investigation in one workspace with Cribl Search’s new Notebooks feature. Bring queries, visualizations, and annotations together to make sharing and collaboration easier. Speed up investigations and turn complex workflows into narratives anyone can follow.

What Is SolarWinds, And Should You Use It?

Downtime is brutally expensive and damaging. Enterprises can lose about $9,000 every minute systems are down, while smaller businesses lose hundreds of dollars per minute. A single outage can often cost over $100,000, and nearly a third of companies lose customers due to downtime. That’s why many organizations turn to platforms like SolarWinds to maintain reliable systems and minimize the risk of costly disruptions.

Observability in Fraud Detection: How Transaction Monitoring Tools Can Help Spot Money Laundering

In today's increasingly digital financial landscape, transaction monitoring has become a critical component of global fraud detection strategies. As financial crimes evolve in complexity, institutions must strengthen their ability to detect anomalies and uncover suspicious activity before it causes damage. Observability, a concept long used in IT and data operations, is now emerging as a powerful approach for improving visibility into complex financial transactions.

Real-Time Outage Alerts in Slack and 4 Ways To Set It Up

When a third-party service you depend on goes down, every minute counts. The sooner your team knows about the outage, the faster you can respond and reduce downtime. Since most IT and operations teams live in Slack, it makes sense to receive real-time outage notifications directly in Slack channels where you already collaborate. There are several ways to do this, from integrating an all-in-one status page aggregator like StatusGator, to setting up RSS feeds or building your own Slack app.

From Idea to Deployment: How To Build a Practical AI Roadmap

AI is being adopted at a faster rate than ever across the business world. According to Stanford, 78% of organizations had implemented AI in some form by 2024. And if that’s not convincing enough, 92% of companies plan to expand their AI investment over the next three years. Practically everyone, including your competitors, is already using AI to gain a competitive edge. If you don’t act soon, there's a real risk of falling behind.

9 Essential Network Administration Tools

Network administration has become more complex than ever. IT professionals are tasked with managing sprawling infrastructures, maintaining uptime, optimizing performance and defending against increasingly sophisticated security threats. With hybrid environments, cloud integrations and remote workforces, the pressure to maintain seamless connectivity and security is relentless.

Simplify server issue diagnosis with service monitoring

It's well known that an alert that just states “the server is down” is not particularly helpful for your already overworked SysAdmins and SRE teams. Diagnosing why the server went down is their challenge. The problem is that memory spikes, CPU overload, failing services, or blocked ports can all look the same from a distance. Too often, these issues are responsible for delayed fixes, alert fatigue, and hours wasted switching between tools for data correlation.

Strengthen the server back end with server URL checks

In distributed architectures, the back-end service reliability of microservice endpoints and internal APIs relies on the health of local URLs. These local URLs are not exposed to the public internet and are essential for your IT infrastructure health and automation suites. Site24x7’s server URL check is engineered for operations teams that require immediate visibility into these server-level endpoints. These granular endpoints are often overlooked by traditional external monitoring tools.

How We Saved 70% of CPU and 60% of Memory in Refinery's Go Code, No Rust Required

We've just released Refinery 3.0, a performance-focused update which significantly improves Refinery's CPU and memory efficiency. Refinery has a big job: it performs dynamic, consistent tail-based sampling that maintains proportions across key fields, adjusts to changes in throughput, and reports accurate sampling rates.

Application Observability Done Right: Best Practices & Tips

Companies invest millions of dollars in observability platforms, yet they often still struggle to get application monitoring right. This is because most organizations focus on the technology while neglecting the business. In this article, we’ll show you how to combine business requirements with technological needs. These recommendations are based on my experience as the CTO of Logz.io, working with global companies on their application observability needs.

Big Week at Logz.io: Major Product Announcements Signal New Era of AI-First Observability

Four months ago, we announced our vision of AI-first observability. Today, we’re not just talking about the future, we’re shipping it. This week marks a significant milestone with several major product announcements that demonstrate our continued momentum as the industry’s leading AI-first observability platform.

How to Monitor Microsoft Teams Issues & Fix Microsoft Teams "We're sorry - we've run into an issue"

Welcome to the world of Microsoft Teams! When it comes to video conferencing and messaging, Microsoft Teams is one of the most popular players in the game. When we get error messages like Microsoft Teams “We're sorry—we've run into an issue,” or “something went wrong,” it’s important to have a tool to help monitor and troubleshoot Microsoft Teams performance issues and connection issues.

From Data to Dashboards: Building Streamlit Applications with InfluxDB 3

Python developers often reach for Streamlit when they need to construct compelling web applications quickly. It provides a fast way to transform Python scripts into interactive applications without complex web frameworks. When paired with InfluxDB 3 Core, the leading time series database, engineers can build powerful real-time analytics dashboards entirely in Python.

Keynote: Clarity from chaos: turning data sprawl into operational intelligence

Join us as we explore how to cut through the chaos and transform fragmented data into a single source of truth. Discover how SquaredUp helps you visualize the bigger picture, connect the dots, and unlock operational intelligence that drives smarter, faster decisions.

Clarity: Explore Out-of-the-Box Data for Smarter Reporting and Insights

Good reporting starts with the right data — and with Clarity’s Out-of-the-Box Data, the heavy lifting is already done. This hands-on simulation gives you an inside look at Clarity’s built-in data features within the Reporting Workspace. Learn how to use preconfigured data to accelerate reporting, ensure governance, and drive faster insights. Whether you’re new to Clarity or looking to improve reporting efficiency, this video will show you how to build smarter, more reliable reports — without starting from scratch.

15 PHP APM Tools Worth Using in 2025

PHP powers a large swath of the web, from blogs to storefronts to APIs. But with microservices, third-party dependencies, and scaling complexity, performance can slip in subtle ways. Your app might mostly work, but small, unnoticed delays, occasional spikes, or hidden bottlenecks build up. An APM tool helps you see inside the black box: which functions are slow, which DB queries are hogging time, which external calls are failing or stalling.

What Is Cloud Monitoring? Everything You Need To Know

Cloud computing offers several undeniable benefits to businesses. Some of the biggest ones are agility, cost savings, data recovery, and developing new apps and services to meet changing customer needs. Despite these benefits, the cloud can be complex, demand specialized skills, and require companies to follow up-to-date cloud security best practices. Why?

Optimize your end user computing with M365 reports

Unlock the full potential of your end user computing with SquaredUp’s unified M365 dashboards, designed to empower you to make smarter, faster decisions. This session highlights the challenges of fragmented reporting across the M365 suite and shows how SquaredUp’s native M365 plugin leverages the power of MS Graph API to deliver unified dashboards.

Micro Lesson: Sumo Logic Dojo AI Summary Agent

In this video, we'll introduce the new AI powered Summary Agent to help security teams using Cloud SIEM understand and prioritize cybersecurity insights in a faster and more efficient manner. The summary agent provides AI generated summaries of the component signals within an insight, giving analysts a clear view of the underlying evidence without having to spend time reviewing raw logs or multiple events individually. The summary agent is part of Sumo Logic's new Dojo AI platform, featuring a number of useful AI agents across all Sumo Logic products and services.

Best Practices for Public Status Pages

When things go wrong, your public status page is the most important way to communicate with your users. They all want to know what’s going on and when they can get back to the site. A well-made public status page builds trust, transparency, and confidence in your brand. In this blog post, you’ll learn what a public status page is and how to make the best ones.

NiCE VMware vSphere Management Pack 6.1 Walkthrough 2025Q4

The next generation of VMware monitoring with Microsoft SCOM is here! Watch the NiCE VMware vSphere Management Pack 6.1 walkthrough and see how we’ve re-architected VMware monitoring to be smarter, faster, and more secure: native SCOM architecture (no more WMI or external services); high availability and load balancing via SCOM Resource Pools; near real-time discovery of VMware changes for maximum accuracy; event-driven monitoring for faster, more reliable alerts; and built-in compliance and security monitoring.

From SNMP to Modern Telemetry: A Network Monitoring Journey

Simple Network Management Protocol (SNMP) has been a backbone of network management since the 1980s. While it’s still useful for monitoring devices remotely, it’s showing its age. This Short takes you through the evolution of SNMP and shows how newer, more efficient methods for collecting network data are changing the game. Learn how to move beyond older protocols and better monitor and manage your modern network environment.

What Is Email Blacklist Monitoring?

When legitimate emails start bouncing or disappearing into spam folders, the cause is often a hidden one: your domain or mail server has been blacklisted. Email blacklist monitoring is the process of continuously checking your domain and IP address against major spam-tracking databases. Its purpose is to detect blacklisting early, so you can act before it damages your communication, reputation, or revenue.
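
As a rough illustration of what checking an IP address against a spam-tracking database can look like, here is a minimal Python sketch of the common DNS-based blocklist (DNSBL) convention: reverse the IPv4 octets, append the blocklist zone, and treat a successful DNS answer as "listed". The zone `dnsbl.example.org` and the function names are hypothetical placeholders, not any specific provider's service, and real monitoring tools will do considerably more (multiple lists, return-code interpretation, rate limits).

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolver=socket.gethostbyname) -> bool:
    """Return True if the lookup resolves, False if the name does not exist.

    The resolver is injectable so the logic can be exercised without
    touching the network.
    """
    try:
        resolver(dnsbl_query_name(ip, zone))
        return True
    except OSError:  # socket.gaierror (name not found) is a subclass of OSError
        return False


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address; the zone is a placeholder.
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Running a check like this on a schedule for each sending IP, and alerting when `is_listed` flips to `True`, is one simple way to approximate the "continuous checking" described above.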

A serverless approach to CI/CD observability with GitLab and Grafana

In today’s fast-paced development environment, it’s critical that you understand what’s happening in your CI/CD pipeline. And yet, many teams struggle with fragmented tooling that makes it difficult to get a holistic view of their dev lifecycle. For example, if you’re using GitLab for CI/CD and Grafana for observability, you’ve probably faced this challenge: how do you bring your GitLab events into your existing observability and alerting infrastructure?

How OpenTelemetry Auto-Instrumentation Works

Most developers use auto-instrumentation as it’s meant to be used — run the Java agent, add NODE_OPTIONS, and telemetry starts flowing. When it stops, though, figuring out why can be tricky. Maybe the agent didn’t load, maybe there’s a framework version mismatch, or something else entirely. Understanding how auto-instrumentation works makes it easier to spot and fix these issues.

Gaming Latency Monitoring: How to Detect & Reduce Lag

Latency isn’t just a technical metric in gaming—it’s an emotion. Players don’t measure milliseconds, they feel them. A button press that lands a fraction late, a flick shot that fires just off target, a character that rubber-bands at the worst possible time—all of it translates to frustration. In fast-paced multiplayer environments, a 50ms delay can decide outcomes, erode trust, and send players to competitors who seem “smoother.”

NiCE VMware vSphere Management Pack 6.1

In times of rapid transformation within the VMware ecosystem, IT teams are reassessing how best to keep virtual environments as stable, secure, and efficient as possible. With numerous monitoring options available on the market, the question arises: Why stick with Microsoft System Center Operations Manager (SCOM)?

You're in Good Microsoft SCOMpany

When it comes to enterprise monitoring, consistency and reliability matter more than ever. That’s why organizations across industries, from financial services to healthcare, manufacturing, and government, turn to NiCE IT Management Solutions to extend and optimize their Microsoft SCOM environments. And the results speak volumes.

5 simple habits to beat digital fatigue

Top tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re tackling a common struggle for anyone living the digital life: how to beat digital fatigue and bring back real focus in a screen-heavy day. Ever hit that 3pm wall where your eyes sting from staring at the screen, your shoulders feel like bricks, and even the third cup of coffee isn't helping?

Mobile session replay - now live in Coralogix

Coralogix Real User Monitoring (RUM) already gives teams a complete view of how users experience their websites. Now, that same visibility comes to mobile. With Session Replay for iOS and Android, you can watch real sessions unfold and understand exactly what users saw and did, without relying on vague support tickets or incomplete crash logs. Session replay captures exactly how users interact with your mobile app: taps, swipes, scrolls, and screen transitions.

Manage and optimize your OCI costs with Datadog Cloud Cost Management

Engineering teams need to deliver reliable, secure, and high-performing applications, all while keeping costs under control. But engineers often lack visibility into cloud cost data, relying on finance-driven reports that they receive only after the billing cycle closes. Without daily cost insights alongside observability data, they don’t know until it’s too late that an infrastructure change caused a significant cost increase.

Stop decision overload: How discovery filters optimize device onboarding for efficient network monitoring

Every network administrator encounters the same question during discovery scans: Should this device be monitored or ignored? Routers are critical, but what about test servers, lab switches, or that aging and unused printer still on the network? Manually deciding for each device creates decision overload and risks overlooking what really matters.

How to Scale Prometheus APM for Modern Applications

When developers monitor application performance, they pick one of two paths: traditional APM tools with distributed tracing and code profilers, or metrics-driven monitoring with Prometheus. The second approach — Prometheus APM — tracks the signals that matter most: request rates, error rates, latency, and resource utilization. No agents to install, no per-host pricing, just exporters and PromQL. For most teams, Prometheus APM is where monitoring starts.
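
To make the "request rate" and "error rate" signals concrete, here is a simplified Python sketch of how a per-second rate can be derived from samples of an ever-increasing counter. It is an illustration only, not Prometheus's own `rate()` implementation (which also extrapolates to the window boundaries); the function names and sample data are assumptions made for this example.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value).
    A drop in value is treated as a counter reset (for example a process
    restart), so the new value is counted as an increase from zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def error_ratio(request_samples, error_samples):
    """Fraction of requests that failed over the sampled window."""
    request_rate = per_second_rate(request_samples)
    if request_rate == 0:
        return 0.0
    return per_second_rate(error_samples) / request_rate


if __name__ == "__main__":
    requests = [(0, 100), (15, 130), (30, 160)]  # scraped every 15 seconds
    errors = [(0, 5), (15, 8), (30, 11)]
    print(per_second_rate(requests), error_ratio(requests, errors))
```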

Improving browser tracing step by step

Browser tracing has always been one of those things that feels invisible until it isn’t. When it works well, you get clear, actionable insights into how your app is performing in the wild. When it doesn’t, you’re left staring at noisy data, gaps in traces, and spans that don’t quite tell the story. Over the last few months, we’ve been chipping away at that problem.

Dashboarding OCI costs: A guide to building a usage API with OCI functions and SquaredUp

Oracle Cloud Infrastructure (OCI) provides powerful tools for managing your cloud resources, but getting a clear, real-time view of your usage and costs can sometimes feel locked away behind complex reports. What if you could build beautiful, shareable dashboards that show you exactly what you're spending, where you're spending it, and how it trends over time? In this guide, we'll walk you through deploying a simple, secure OCI Function that acts as a proxy to OCI's Usage API.

How image generation models are creating new infrastructure demands for DevOps teams

The rapid adoption of generative AI has moved far beyond research labs and creative studios. Image generation models, in particular, have become critical components in content production pipelines, marketing platforms, design workflows, and enterprise applications. What began as a novel way to create digital art has evolved into a class of workloads that behave very differently from traditional web services.

Maximize data value and cut costs: Adaptive Telemetry for metrics, logs, traces, and profiles in Grafana Cloud

When it comes to observability, more data doesn’t always mean more clarity. In fact, as telemetry volumes grow, it only becomes more difficult to discern the signals from the noise and to keep overall costs in check. This is exactly why we built Adaptive Telemetry, a suite of features in Grafana Cloud that analyzes how your telemetry is used and then automatically recommends actions like aggregating, sampling, dropping, or reducing low-value data.

Enhanced Icinga 2 Container Images

As some of you might have already noticed, we recently gave our official Icinga 2 container image builds a complete overhaul. These new images are currently available only as snapshot builds but will replace the existing stable images with the next Icinga 2 v2.16.0 release. In this blog post, we’ll walk you through the key changes and improvements that come with the new images, as well as the reasons behind these changes.

Nobody Cares About Your MTTR

I’ve been in those late-night "war room" calls where, after hours of painstaking work, the team finally resolves a critical outage. The dashboards all turn green, a collective sigh of relief is shared, and the next day’s report highlights a victory: Mean time to resolution (MTTR) was reduced by 15% compared to the last major incident. It feels like a win.

Tag(ging)-You're It: How to Leverage AppNeta Monitoring Data for Maximum Insights

Today’s enterprise networks are a far cry from the centralized, predictable infrastructures of the past. Instead, they are sprawling, dynamic ecosystems that stretch across cloud services, SaaS applications, on-premises data centers, distributed branches, and thousands of end users connecting from every imaginable location. This complexity creates a huge challenge for IT and network operations teams: How do you get a clear, real-time view of what’s really happening?

ObservabilityCON 2025 Keynote: Grafana Assistant GA and Full-Stack Observability in Grafana Cloud

Join Grafana Labs CEO Raj Dutt, CTO Tom Wilkie, and engineering leaders to kick off ObservabilityCON 2025 with the latest in AI-powered observability in Grafana Cloud. See how Grafana is making observability smarter, simpler, and more scalable. This ObservabilityCON 2025 keynote unveils: AI-powered observability → Grafana Assistant (GA) and Assistant Investigations (Public Preview). Observability at scale → The Adaptive Telemetry suite is now complete (Traces GA, Adaptive Profiles in Private Preview) plus BYOC for flexible, cost-efficient cloud deployment.

AI-powered observability: Resolve incidents faster, reduce alert fatigue, and expand access

When an incident lands in your lap, you’ll often start with a lot of questions: Why is latency so high? What’s causing this outage? How much money are we losing at this very moment? The uncertainty—and the pressure to quickly find answers—has always been one of the more nerve wracking parts of being an on-call engineer, but it doesn’t have to be that way any more.

Top 9 LLM Observability Tools in 2025

Organizations are adding GenAI to their current and future architectures and product roadmaps, requiring Ops teams to ensure LLMs are accurate, fast, secure and cost-efficient. LLM observability tools directly address these needs, helping identify and prevent common LLM errors and issues. LLM observability provides the telemetry data for this analysis. LLM observability tools trace requests end-to-end, evaluate outputs, and correlate quality with latency, cost, prompts, tools, and data sources.

Vibe Coding: Closing The Feedback Loop With Traceability

I have begun to truly embrace vibe coding over the last few months, using Cursor as my main code editor and Claude Sonnet 4 for my agent's LLM. It's an exciting time to be a developer: we get to experiment with something that promises to 100x our productivity while pioneering the new workflows and strategies for implementing these tools. But, as most people who have done any extensive development with LLMs in a sufficiently sized code base know, it's a bit like trying to herd scared cats.

ObservabilityCON 2025: A guide to all the announcements from Grafana Labs

Today at ObservabilityCON 2025 in London, we unveiled a number of exciting announcements and updates to Grafana Cloud that reimagine SaaS economics, simplify the complexity of running your observability stack at scale, and provide AI tooling that’s actually useful. (Root cause analysis via chatbot? Yes, please!) Check out the keynote to learn more about how we’re helping you do more with the open observability cloud, and read on for a quick recap of all the news from ObservabilityCON 2025.

The Best Tools for Synthetic & Infrastructure Monitoring-A Comparative Guide

Both user and server-side monitoring are important to make your apps better. Tools that offer monitoring of just one side leave gaps in your diagnosis, causing negative experiences and reliability issues. Here are the top 10 tools you should consider based on their benefits and coverage.

Agentic AI Explained: How Autonomous Systems Are Changing Cybersecurity

Discover how agentic AI enhances cybersecurity by augmenting security teams’ existing security tools and workflows. See how Retrieval-Augmented Generation (RAG) enables faster threat detection, streamlined investigations, and smarter incident response — empowering SOC teams to work more effectively. Join cybersecurity experts Lisa Jones-Huff and Mohammed Anas Khatri to discover how agentic AI can help your security team multiply its impact.

Complete guide to OpenTelemetry Tracing (with code examples)

Distributed tracing is an essential technique for monitoring modern, cloud-native applications. It provides a holistic view of a request's entire journey as it propagates through a multi-service architecture, making it invaluable for performance optimization and root cause analysis. But how do you generate and collect this trace data in a standardized, vendor-agnostic way? That's where OpenTelemetry comes in.

Optimizing Your Cart with Signals: Smarter State, Better Debugging

In the first two parts of this series, we introduced Angular Signals and built a reactive shopping cart. Our CartService already supports core operations like adding, removing, and clearing items, as well as computing total price and item count using computed(). All of this was done without touching RxJS, subscriptions, or change detection hacks. But a real-world cart does more than tally up numbers.

OpenTelemetry + ignio: The Foundation for Intelligent, Unified Observability

In the previous post, What is OpenTelemetry?, we went over the What, Why, and the How of OpenTelemetry. We also went over the telemetry data lifecycle (data generation → collection → storage → usage) and how telemetry data (MELT) could be put to use to troubleshoot a representative web application scenario.

Closing Visibility Gaps in the Modern Data Center

In today’s high-performance data centers, “all green” dashboards can mask catastrophic issues hiding just beneath the surface. If you’re missing the microbursts, hidden oversubscription, and routing imbalances that are devastating application performance, you’re flying blind. Learn how to close these visibility gaps and shift from reactive firefighting to proactive network intelligence.

Python performance monitoring for Django, Flask, Celery, and more

Here's some excellent news for the Pythonistas in the room: You can now monitor the performance of your Python applications with Honeybadger. Last year, we launched Honeybadger Insights, a new logging and observability tool bundled with Honeybadger. Insights enables you to query your application logs and events to answer performance questions, perform root-cause analyses, and create charts and dashboards to see what's happening in real time.

Telemetry Now Teaser: "Tracking the Red Sea Cable Cuts with Kentik's Cloud Latency Map"

Go behind the scenes of a major internet analysis. When the recent Red Sea cable cuts disrupted global connectivity, Kentik's Director of Internet Analysis, Doug Madory, turned to the Cloud Latency Map to track the fallout in real-time. In this clip from the latest Telemetry Now podcast, Doug walks through how he identified the latency spikes and rerouting caused by the damage.

3 real-world generative AI strategies for executives

Everyone is excited about AI, but few companies have successfully implemented it. While enthusiasm for generative AI (GenAI) has helped accelerate AI adoption across enterprises, the promises of artificial intelligence have yet to translate into measurable impact on most organizations’ bottom lines. The trouble isn’t the tech — it’s a lack of executive ownership.

Real Estate App Development for Ops & Product Teams: From MVP to Scale

In the competitive world of real estate technology, developing an app that can scale from a Minimum Viable Product (MVP) to a fully-fledged solution is crucial. For operations and product teams, this journey involves strategic planning and execution to ensure the app meets evolving market demands and user expectations.

Live in London: Adoption & AI Confessions @ Nexthink Experience

Tim and Tom are back with another special DEX Show Live!—recorded last week at Nexthink Experience London at the Intercontinental by the iconic O2. Day 1 of the world's biggest DEX event saw over a thousand IT pros gather for two days of innovation, insight, and energy. In this lively episode, the hosts are joined by Guillaume Charles, Senior Director of Product Management (Diagnostics) at Nexthink, and Gabriela Moraes from Electrolux to explore the state of digital and AI adoption in the enterprise.

Downtime on the Docket: The Death Sentence for Productivity in Legal Firms

When minutes matter, IT leaders need more than quick fixes; they need foresight. That’s where Teneo’s Managed DEX (Digital Experience Monitoring) comes in. Managed DEX is designed to detect what legal teams can’t afford to miss. It monitors for “ghost traffic” (those eerie, unexplained signals of abnormal network activity that often signal compromise or instability) and other anomalous device behaviors that can precede full-blown outages or cyber incidents.

Announcing Honeycomb for Frontend Observability React Native Beta

React Native apps straddle two worlds: JavaScript powering your UI and native modules running underneath. Add in backend services, and when something goes wrong, there are many possible culprits. Was it JS logic, the native bridge, the native API call, or a downstream API call? Most tools give you parts of the picture. A crash tool can tell you where the app failed but not what else happened in a session.

SRE Report Retrospectives - Have AIOps Predictions Held Up?

Welcome to a new blog series where we take a candid look at the predictions, insights, and bold claims we've made in previous SRE Reports and ask the uncomfortable question: How did we do? For the uninitiated, Catchpoint's SRE Report is our annual, practitioner-driven effort to capture the pulse of the global reliability community.

Redis Performance Monitoring: Combine Logs and Metrics for Complete Visibility

Redis earns its place in modern stacks because it’s an in-memory data store with microsecond latency and rich data structures, making it perfect for things like caching, sessions, and rate limiting. Since it often sits on the request path, small issues (connection churn, blocked commands, memory pressure) can quickly ripple into user-visible incidents.

Monitoring Encrypted Network Traffic

How do you spy on a secret message? That's the challenge for network monitoring tools like Suricata today. Encryption is essential for privacy, but it creates massive blind spots for security. Dive into the modern-day cat-and-mouse game of monitoring encrypted traffic. How do you deal with security blind spots in your network?

Ep 13: Everyone is winging it: Hope for an AI future

In this episode, we welcome Naomi Buckwalter, Sr. Director of Product Security at Contrast Security, to chat about the evolving landscape of security threats and the dual role of AI in both facilitating and combating these challenges. We explore the increasing sophistication of modern phishing attacks and discuss how security teams must rapidly adapt to stay ahead of emerging threats. We debate the transformative impact of AI on the future job market, where personal qualities and soft skills may increasingly take precedence over traditional technical competencies.

Happiest Minds boosts IT efficiency and service delivery with Site24x7

As a born-digital, born-agile IT services company, Happiest Minds delivers 24/7 strategic, transformation, and managed services across product digital engineering services, infrastructure management and security services, and generative AI business services. As its customer base and complexity grew, the company needed unified observability, multi-tenant monitoring, and real-time root cause analysis—without the burden of manual effort or siloed tools.

How we use Datadog to get comprehensive, fine-grained visibility into our email delivery system

Visibility into email performance is indispensable to any organization that counts on its ability to reach people through their inboxes, including Datadog. SREs, FinOps, and many other teams rely on email as a critical channel for communications from our platform, including monitor alerts, usage reports, and service account notifications. At Datadog, we depend on the visibility provided by our integrations for Mailgun, SendGrid, and Amazon SES to optimize our email performance and ensure deliverability.

What's New in VictoriaMetrics Cloud Q3 2025? From new region in Asia to proactive alerts

The third quarter of 2025 has been a busy one for VictoriaMetrics Cloud! We expanded globally, polished the user experience, introduced new enterprise debugging tools, and delivered smarter alerts to help users make the most of their observability data. If you missed our Quarterly Live Update, don’t worry! The full recording is available to watch. Let’s recap what’s new in VictoriaMetrics Cloud this quarter.

Get Third-Party Outage Alerts in Discord with StatusGator

When SaaS tools go down, teams need fast, reliable alerts right where they communicate. Now, with the StatusGator integration for Discord, you can receive real-time third-party outage alerts directly in your server. Whether you’re monitoring the status of AWS, Slack, GitHub, or Google Workspace, StatusGator keeps your team informed instantly when disruptions happen.

Zoom Troubleshooting Performance and Connection Issues: The Complete Guide

In an era of remote work and virtual meetings, Zoom has emerged as a lifeline, connecting people across distances and facilitating seamless collaboration. However, like any technological tool, it's not without its fair share of challenges. From occasional performance hiccups to frustrating connection issues, navigating the world of Zoom can sometimes be a daunting task.

Observability-as-Code: Bring synthetic monitoring into your pipeline

Your team just deployed to production. The infrastructure spun up in 90 seconds, but recreating your monitoring? That’ll take hours. It’s added late in the process, managed through dashboards, and prone to inconsistency. Short-term, this slows delivery and creates visibility gaps that surface only during incidents. Long-term, it leaves a business-critical capability out of your observability pipeline.

Datadog vs Splunk: A Side-by-Side Comparison [2025]

Datadog and Splunk are both leading tools for monitoring and observability. Each offers a range of features designed to help you understand and manage your data. Datadog provides tools for tracking application performance and analyzing logs in real-time. Splunk, meanwhile, is known for its powerful log analysis and search capabilities. In this post, we will compare Datadog and Splunk on important aspects like APM, log management, search capabilities, and more.

SQL performance improvements: analysing & fixing the slow queries (part 2)

This is part 2 of a 3-part series on SQL performance improvements. A few weeks ago, we massively improved the performance of the dashboard & website by optimizing some of our SQL queries. In this post, we'll dive deeper into the optimisations of queries with indexes.

Scaling Datadog observability: 1,000 integrations and counting

Integrations have always been central to the Datadog platform, enabling customers to collect the data they need directly from the technologies they use every day. By unifying signals from infrastructure and applications to security and SaaS applications, teams gain both high-level visibility and the ability to drill into the details that matter the most. With more than 1,000 integrations now available, the Datadog ecosystem continues to expand alongside the platforms our customers rely on.

Pastries with SREs: Leveling up observability and donut dunkability

In this episode of Pastries with SREs, we explore what it really means to shift left with observability, moving from reactive firefighting to proactive performance. And yes, it starts with donuts. We unpack how SREs and IT Ops teams are often stuck reacting to incidents, battling alert fatigue and swivel-chair triaging. But what if you could pull in developers earlier, and give everyone a unified view of observability data?

How to Perform Ping Tests: Different Tools and Techniques

If you’re a remote worker struggling with video calls, or a gamer noticing lag, a quick Internet ping test using an online ping tester can give you a simple answer: Is my connection alive, and how fast does it respond? But if you’re a network admin or IT professional, that’s just scratching the surface. Business networks are more complex beasts.

How to automate sending SquaredUp dashboards to Slack with the Notification API

SquaredUp's existing notifications fire when monitors change state. With Notification API, you control the trigger. Send dashboards on a schedule, before meetings, or on-demand through chat commands. In this step-by-step guide, you’ll learn how to automate sending SquaredUp dashboards to Slack. I’ll use Power Automate as the example, but the same approach works with other automation tools such as Zapier, Make, n8n, or even a custom script, as long as it can send an HTTP request.

LLM Observability Explained: Prevent Hallucinations, Manage Drift, Control Costs

Large Language Models (LLMs) are transforming how businesses interact with users, automate workflows, and deliver insights in real time. But as powerful as these models are, running them at scale comes with unique challenges, from hallucinations and latency spikes to cost overruns and user trust issues.

The observability maturity curve: How IT leaders are shifting from tools to outcomes

Observability has come a long way from its origins in monitoring logs and metrics. Today, it sits on a maturity curve: Organizations move from fragmented tool stacks to unified platforms to proactive engineering practices that tie reliability to business outcomes. To better understand where IT leaders are on this curve, Grafana Labs surveyed 150 decision-makers across industries in advance of ObservabilityCON 2025.

Why DEX Scores Must Be Part of Every Total Cost of Ownership Study

Price is not the same as cost. When organizations evaluate new end-user technology investments, whether that’s laptops, operating systems, or management tools, the conversation inevitably turns to Total Cost of Ownership (TCO). TCO studies traditionally focus on direct, measurable costs: hardware procurement, software licensing, support contracts, and lifecycle services. But there’s a growing blind spot in these calculations: the employee experience.

Keep stakeholders informed with Datadog Status Pages

When incidents occur, clear communication can be just as important as fast remediation. Your internal teams need timely updates to stay aligned, and your users want to know what is happening and when they can expect a fix. Without a reliable way to proactively share updates, support teams can get flooded with tickets and customer trust can erode. Datadog Status Pages, now generally available, makes it easy to keep everyone informed through a public or internal web page during outages.

How DreamHost Slashed Memory Usage by 80% and Scaled to 76 Million Time Series

For any growing business, there comes a point where the tools that once worked perfectly begin to show their limits. This is especially true for monitoring infrastructure. As your user base, services, and data volumes expand, the pressure on your monitoring stack intensifies. For web hosting leader DreamHost, with over 1.5 million websites to manage, their existing open-source solutions simply couldn’t keep up.

How Technology Improves Commercial HVAC Efficiency

Efficient heating, ventilation, and air conditioning (HVAC) systems are important for maintaining comfortable, healthy, and cost-effective commercial spaces. As energy costs rise and environmental concerns grow, businesses are increasingly looking for innovative ways to optimize their HVAC operations. Technological advancements are transforming how systems are monitored, controlled, and maintained, resulting in improved performance and lower operating costs.

How to Monitor Zoom Performance & Fix "Zoom Your Internet Connection is Unstable"

Zoom calls have become a staple of modern life, connecting us with friends, family, and colleagues from all over the world. But have you ever experienced the frustration of a laggy, glitchy Zoom call that leaves you feeling like you're in a bad sci-fi movie? Laggy video, packet loss, and jitter make it difficult to have a clear and coherent conversation over Zoom - which is why it’s important to identify these Zoom issues before your next call.
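
For readers who want to quantify the packet loss and jitter mentioned above, here is a small Python sketch that computes both from a series of latency probes. It assumes a deliberately simple definition of jitter (the mean absolute difference between consecutive latency samples); other definitions exist, such as the smoothed estimator used in RTP, and the function names and sample values here are illustrative only.

```python
from statistics import mean


def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of probes that never came back."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def mean_jitter_ms(latencies_ms):
    """Mean absolute change between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:]))


if __name__ == "__main__":
    latencies = [20, 24, 22, 28]  # round-trip times in milliseconds
    print(packet_loss_percent(sent=50, received=48), mean_jitter_ms(latencies))
```

A steady 80 ms connection has zero jitter by this measure, while one that swings between 20 ms and 80 ms has high jitter even though its average latency is lower, which is why the two numbers are usually tracked separately.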
Sponsored Post

3 secure ways to handle user data in Raygun

You know the feeling: You're right in the middle of cracking a really convoluted coding problem, when an urgent support ticket pops up. It's not just any ticket; it's from a VIP customer with a high-severity issue demanding resolution within an hour. You have to drop what you're doing and scramble, completely context-switching and losing all your momentum.
Sponsored Post

Top 10 Reasons Why You Need a Status Page Aggregator

Managing dependencies on multiple third-party services has become a critical challenge for modern engineering teams. A status page aggregator solves this by centralizing monitoring across all your vendors' status pages into a single dashboard, giving you real-time visibility into potential issues before they impact your users. Whether you're managing a complex microservices architecture or simply relying on various SaaS tools, understanding when and why your dependencies fail is crucial for maintaining service reliability.

Top tips: Mastering browser extensions without overwhelming yourself

Top tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at how browser extensions can boost productivity when used wisely—and how to avoid being overwhelmed by them. Extensions are like candy for your browser. One promises to save time, another blocks ads, a third manages your tabs, and before you know it, your browser looks like a Swiss army knife.

How to Use Synthetic Monitoring in CI/CD Pipelines

CI/CD pipelines are the heartbeat of modern software delivery. They automate builds, run unit tests, package applications, and deploy them to production with a speed that traditional release cycles could never match. For engineering teams under pressure to move fast, pipelines are the mechanism that makes agility possible.

Your big VIP Teams call just went south. Do you have the tools to troubleshoot - fast?

Imagine you’re the IT lead responsible for your organization’s Microsoft Teams experience. A big call with the board comes up, loaded with company VIPs — and it’s chock full of issues. Lag, choppy audio, bad connections. After the call, there’s a knock at your door. Not a happy knock. You answer and standing there is your CEO, stamping her foot demanding to know what went wrong.

How to Identify Network Bottlenecks: From Snail Mail to Warp Speed

Welcome, network admins and IT pros, to a world where network bottlenecks become nothing more than a distant memory. In an era where the need for speed is paramount, identifying and eliminating network bottlenecks is the key to achieving warp-speed connectivity. Your network is like a bustling metropolis, with data zipping through its veins like cars on a busy highway. But suddenly, the flow slows down to a snail's pace, causing frustration and hindering productivity.

New Dashboards and Reports for Kubernetes Monitoring

This is just a quick blog to draw attention to some new and enhanced monitoring dashboards and reports we have added to eG Enterprise in our latest release (v7.5) to provide quick and powerful overviews when monitoring a range of Kubernetes technologies. As with all our dashboards, color-coded overlays provide guided drilldown for help desk operators and administrators.

Elastic named a Leader in The Forrester Wave: Cognitive Search Platforms, Q4 2025

Today, we’re excited to share that Elastic has been named a Leader in The Forrester Wave: Cognitive Search Platforms, Q4 2025. We believe this recognizes our continued innovation in AI-powered search and the momentum of the Elasticsearch Platform.

Observability vs. Visibility: What's the Difference?

In modern IT systems—distributed services, cloud-native platforms, and dynamic networks—just knowing that something is “up” isn’t enough. Green checkmarks on dashboards don’t tell you why performance shifted, why latency crept in, or why a perfectly healthy-looking service suddenly failed. This is where the conversation around visibility and observability begins. They sound similar, but they solve very different problems.

Scheduling discovery jobs for dynamic enterprise networks

Networks have evolved far beyond simple data conduits. They're now the backbone of decentralized digital enterprises, serving as critical channels for information exchange. Modern networks connect dispersed locations and devices, driving performance, security, and cost efficiency. However, decentralization also scatters assets, creates blind spots, and increases operational complexity.

Understanding NetFlow: The Key to Network Insights

Is your network data CRASHING your database? NetFlow offers incredible insights, but there's a hidden catch: cardinality explosion. Collecting every IP address can overload time-series databases (even VictoriaMetrics!), killing performance. Watch to learn how to tame the data beast! What's the worst 'cardinality explosion' you've ever witnessed?
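One common way to limit the cardinality problem described above is to roll host addresses up into network prefixes before writing flow records to a time-series database. The following is a minimal sketch of that idea, not something taken from the video: the function names, the `(src, dst, bytes)` tuple layout, and the /24 and /64 prefix lengths are all illustrative assumptions.

```python
import ipaddress
from collections import Counter

def to_prefix(ip: str, v4_bits: int = 24, v6_bits: int = 64) -> str:
    """Collapse a host address into its enclosing network prefix.

    The prefix lengths are assumed defaults; pick ones that match
    how much detail you actually need to keep.
    """
    addr = ipaddress.ip_address(ip)
    bits = v4_bits if addr.version == 4 else v6_bits
    # strict=False lets us pass a host address and get its network back
    return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

def aggregate_flows(flows):
    """Sum bytes per (source prefix, destination prefix) pair.

    `flows` is an iterable of hypothetical (src_ip, dst_ip, byte_count)
    tuples; real NetFlow records carry many more fields.
    """
    totals = Counter()
    for src, dst, nbytes in flows:
        totals[(to_prefix(src), to_prefix(dst))] += nbytes
    return dict(totals)

flows = [
    ("10.0.1.5", "192.168.7.20", 1200),
    ("10.0.1.9", "192.168.7.21", 800),
    ("10.0.2.1", "192.168.7.20", 500),
]
aggregated = aggregate_flows(flows)
for (src_net, dst_net), nbytes in sorted(aggregated.items()):
    print(src_net, "->", dst_net, nbytes)
```

Three host-level series collapse into two prefix-level series here. The trade-off is that per-host detail is lost, so this suits long-term trend storage better than forensic investigation.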

September product updates

September was a busy month at StatusGator! We rolled out several major updates designed to give you more visibility, better integrations, and deeper control of your monitoring workflows. From new Early Warning Signal integrations to AWS Health support — plus our biggest API release yet — here’s a quick recap of everything we shipped last month.

Why Citrix VAD/DaaS Customers Using VMware Should Consider Migrating to XenServer

For years, VMware vSphere was the undisputed leader in enterprise virtualization. Its reliability, feature set, and ecosystem made it the go-to hypervisor for organizations. For organizations running Citrix Virtual Apps and Desktops (VAD) or Citrix DaaS, too, VMware was synonymous with virtualization excellence. But the landscape has changed dramatically. If you're still running your Citrix workloads on VMware, it's time to take a serious look at XenServer, and here's why.

Announcing Scout's MCP Server for AI-Native Monitoring!

We’re excited to introduce the Scout Monitoring MCP Server — a new way to bring AI-native monitoring directly into your coding assistant. Instead of flipping between dashboards and logs, the MCP (Model Context Protocol) server surfaces performance data, errors, and slow endpoints right where you work. Ask plain-language questions like “show me the latest five errors” and get answers grounded in live telemetry. You can even let your coding assistant propose and push fixes!

AI for Network Leaders by Selector - Strategic Imperatives in an AI World by William Collins

Strategic Imperatives for Infrastructure Leaders in an AI-Enabled World. William Collins, Director of Technical Evangelism at Itential, explores the strategic imperatives facing infrastructure leaders in today’s AI-enabled world. He unpacks the Gartner Hype Cycle, the true monetary costs of network downtime, and shows how Itential + Selector can close the loop on AIOps with autonomous agents and MCPs.

AI for Network Leaders by Selector - Building Your First RAG App by John Capobianco

Building Your First GenAI RAG Application. John Capobianco, Head of Developer Relations at Selector, walks through a 6-step process for building your first GenAI RAG application. From foundational building blocks to the path toward full AI agents, RAG remains a powerful tool with huge ROI. Even in a world of autonomous agents and MCPs, RAG is still one of the best ways for network engineers and IT leaders to query dozens of sources and unlock real value.

AI for Network Leaders by Selector - AI Agents and MCP by John Capobianco

AI Agents & Model Context Protocol. John Capobianco, Head of Developer Relations at Selector, dives deep into AI Agents and the Model Context Protocol (MCP). In this session, John demonstrates Selector MCP in action: running as a client-server, connecting multimodal inputs, and even talking to Selector using microphone + TTS audio via Gemini CLI. He also showcases Sebastian Maniak’s Claude Desktop integration, where Selector MCP powers a ChatGPT-like UI for network engineers. A practical look at how MCP is transforming AI into a true digital co-worker.

Cloud Microservices Monitoring on AWS and Azure with OpenTelemetry

Your checkout flow starts in an AWS Lambda function, calls a payment service running on EKS, then triggers notifications through Azure Functions. Three different compute platforms, two cloud providers, one distributed trace that you can't see. Cloud providers want you to use their native monitoring tools. AWS pushes X-Ray and CloudWatch. Azure promotes Application Insights and Azure Monitor. These tools work well within their ecosystems but lock you into vendor-specific implementations.

Observability - Not Just Dashboards and Alerts | Why Teams Like Uber & Salesforce Use Grafana Cloud

Grafana Cloud is a fully managed observability platform built on open source and open standards. From Fitbits to power grids, it helps teams monitor systems, cut through noise, and act faster. With 150+ integrations, Grafana Cloud unifies logs, metrics, and traces, giving visibility from backend to frontend. AI-powered guidance accelerates root cause analysis and simplifies on-call, while customers like Citigroup, Salesforce, Uber, and ASOS scale with confidence.

Honeycomb Observability Day SF - Kesha Mykhailov, Fin.ai: Human-Centric Observability in AI Systems

Empathy is one of the superpowers of modern teams, especially when building tools that interact with humans. This talk by Kesha Mykhailov tells the story of Fin, Intercom's Customer Support agent, and how they transformed their approach to Fin's.

Inside the InfluxDB 3 Plugin Ecosystem

Companies today face growing pressure to manage and analyze massive flows of time series data, from IoT sensors to cloud-native infrastructure. Storing this information is relatively straightforward. The greater obstacle is keeping it useful and consistent while balancing a wide range of tools and modern technology platforms that continue to evolve.

A closer look at Grafana k6 browser: alignment with Playwright, modern features for frontend testing, and what's next

Over the years, we’ve seen our community embrace Grafana k6 browser as a key component of their frontend testing strategies. By helping collect frontend web vitals, capture custom metrics, and simulate user actions like clicking buttons or completing forms, the module offers teams a deeper understanding of performance and availability from their end users’ point of view.

Sending beers all across Belgium, a throwback to how we named Oh Dear

We're obviously a little biased, but we believe we have one of the best website monitoring tools on the market today, leading in features compared to our competitors. We've already tried a variety of marketing techniques to promote our service, but none really had the impact we were looking for. Maybe we're better at actually building good software than we are at marketing it? Or are we trying what everyone else is also doing, thus making it all harder?

What the 2025 DORA Report Teaches Us About Observability and Platform Quality

The 2025 DORA State of AI-Assisted Software Development Report delivers a critical insight for technology leaders: AI is fundamentally an amplifier, not a solution. It magnifies the strengths of high-performing organizations with robust observability while exposing the dysfunctions of struggling ones. For organizations that have rushed to adopt AI coding assistants all while expecting immediate productivity gains, this finding demands a strategic pivot.

Debugging Microservices in Production with Distributed Tracing

Your production checkout flow just started returning 500 errors. Six microservices handle checkout. Logs show errors in three of them. Which service broke? Which error happened first? What caused the cascade? Traditional debugging doesn't work. You can't attach a debugger to production. Searching logs across six services gives thousands of lines with no obvious connection. By the time you correlate timestamps and trace IDs manually, customers have abandoned their carts.
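To make the manual correlation step above concrete, here is a minimal sketch of what it amounts to: group error records by trace ID and pick the earliest one per trace. It assumes every service already logs a shared trace ID and comparable timestamps; the record fields (`ts`, `service`, `trace_id`, `level`, `msg`) and the sample data are hypothetical, and a real tracing backend does this with spans rather than log lines.

```python
from collections import defaultdict

def first_error_by_trace(records):
    """Return {trace_id: earliest ERROR record} across all services."""
    errors = defaultdict(list)
    for rec in records:
        if rec["level"] == "ERROR":
            errors[rec["trace_id"]].append(rec)
    # The earliest error in a trace is the best first guess at where
    # the cascade started; later errors are often just fallout.
    return {tid: min(recs, key=lambda r: r["ts"]) for tid, recs in errors.items()}

# Hypothetical log records merged from several services.
records = [
    {"ts": 3.2, "service": "checkout", "trace_id": "t1", "level": "ERROR", "msg": "502 from payments"},
    {"ts": 3.0, "service": "payments", "trace_id": "t1", "level": "ERROR", "msg": "db timeout"},
    {"ts": 2.9, "service": "cart", "trace_id": "t1", "level": "INFO", "msg": "cart loaded"},
    {"ts": 5.1, "service": "inventory", "trace_id": "t2", "level": "ERROR", "msg": "stock lookup failed"},
]

first_errors = first_error_by_trace(records)
for trace_id, rec in sorted(first_errors.items()):
    print(trace_id, rec["service"], rec["msg"])
```

For trace `t1`, the checkout error is reported later than the payments error, so payments is the place to look first. Clock skew between hosts can reorder events, which is one reason span parent/child links are more reliable than timestamps alone.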

When BGP becomes UX: The inside story of a SaaS routing decision gone wrong (or right)

Most operations teams trust their green dashboards. If the internal monitoring says everything is healthy, the app must be fine, right? But as the Internet keeps proving, what’s green inside the firewall can look red for customers outside of it. Sometimes, a single change in how web traffic moves can suddenly slow logins, disrupt websites, or hurt business results, even if everything looks fine inside.

Agentic AIOps in Action: LogicMonitor, IBM, and Red Hat Deliver Self-Healing IT

Your most skilled engineers shouldn’t be spending nights and weekends piecing together root causes of outages. Yet many organizations still rely on manual incident response across sprawling hybrid and multi-cloud environments. The result: slower resolution times, frustrated customers and lost revenue that can reach up to $1 million per hour according to IDC. At LogicMonitor, we believe the answer isn’t just better monitoring. It is systems that can heal themselves.

September 2025 - Early Warning Signals

In September 2025, StatusGator Early Warning Signals identified dozens of outages across cloud, fintech, and education platforms. Many of these incidents were detected before providers acknowledged them — and in some cases, without any acknowledgment at all. We’ve highlighted several of the most significant outages as featured incidents, followed by a list of additional disruptions reported throughout the month.

Monitor Slurm with Datadog

Slurm (Simple Linux Utility for Resource Management) is an open source workload management system used to schedule jobs and manage resources for high-performance computing (HPC) Linux clusters. It ensures that jobs and resources are scheduled fairly and efficiently and is scalable across large clusters, an issue that native Linux process management tools struggle with.

How to know your data with Cribl's Ed Bailey and VisiCore Technology's Paul Stout.

Classifying and tagging data is the key to automating pipelines and improving visibility across the enterprise. We’ll share both the technical and business impact of truly knowing your data, and why Cribl makes it possible. Plus, we’ll talk CriblCon and why we’re excited to see you there.

Reality Bytes: Our Everyday AI Use (Personal & Professional)

The Reality Bytes team is back together again! Tim, Tom, Megan, Dina and Sean swap stories of how AI has reshaped their personal and professional lives and habits over the past year—from eerie chatbot encounters and creative breakthroughs to frustrations with hallucinations and the hunt for the true “human fingerprint.”

OTel Naming Best Practices for Spans, Attributes, and Metrics

An incident’s in progress. Services are slow, customers are frustrated, and your dashboards… look fine. At least, until you search for payment metrics and get 47 different names for the same signal. Suddenly, the real issue isn’t latency — it’s inconsistency. The OpenTelemetry project recently published a three-part series on naming conventions to solve exactly this problem.

How to check CPU usage on Linux

When your Linux system feels sluggish, one of the first things to investigate is the CPU usage. The CPU (Central Processing Unit) is the brain of your machine, and if it’s overloaded, everything else slows down. In this guide, you’ll learn different ways to check CPU usage on Linux with command-line tools, how to interpret the metrics, and why automatic monitoring with Icinga ensures long-term system stability.
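As background for interpreting those metrics, most command-line tools derive overall CPU usage from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below shows that calculation; it is an illustration rather than the method used in the guide, the function names are made up, and the two sample lines are invented values.

```python
import time

def parse_cpu_line(line: str):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line.

    Field order in /proc/stat: user nice system idle iowait irq softirq
    steal guest guest_nice.
    """
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    total = sum(fields[:8])  # guest time is already included in user/nice
    return idle, total

def cpu_percent(before: str, after: str) -> float:
    """Busy percentage between two samples of the 'cpu' line."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / elapsed)

def sample_cpu_percent(interval: float = 1.0) -> float:
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    def read():
        with open("/proc/stat") as f:
            return f.readline()
    first = read()
    time.sleep(interval)
    return cpu_percent(first, read())

# Invented sample lines, so the arithmetic can be checked by hand.
before = "cpu  100 0 50 800 50 0 0 0 0 0"
after = "cpu  160 0 70 860 60 0 0 0 0 0"
usage = cpu_percent(before, after)
print(f"{usage:.1f}% busy")
```

In the sample, 150 ticks elapse and 70 of them are idle or waiting on I/O, so the CPU was busy for roughly 53% of the interval. On a Linux host, `sample_cpu_percent()` applies the same calculation to live data.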

Easiest Way to Ship Docker & Nginx Logs to Loki with Promtail

Effective monitoring catches problems before users do, and with Promtail, Loki, and LogQL, it’s a lightweight, approachable option for any DevOps team. This guide shows how to monitor Docker itself (pull failures, restarts, health flaps) so you’ve got a baseline on container runtime health.

Why 1% Packet Loss Is the New 100% Outage

For years, you had an unspoken agreement. Your networks were built to be resilient, and your applications were, for the most part, forgiving. You sent emails, transferred files, and backed up data. If a few packets went missing along the way, the protocols would quietly clean up the mess. A little bit of packet loss was just background noise, an expected imperfection in a system that was, by and large, incredibly robust. You could tolerate it.

The Importance of Community Knowledge in Tech

Tools alone aren’t enough. How you use them and the expertise you tap into make all the difference. In this Short, we explore why even the best tools need the proper guidance to unlock their full potential. Open-source communities are goldmines of knowledge and support. Connecting with experts can save you serious time and headaches. While enterprise support is valuable, the community often has your back. Get practical tips to get the most out of your tools, and remember: it’s not just what you use; it’s how you connect, learn, and grow along the way.