Why Browser Monitoring Is Essential for Early Outage Detection in Multi-Cloud Environments

Businesses are quickly adopting multi-cloud strategies because using AWS, Google Cloud, Azure, and other cloud providers at the same time can make their systems more reliable, scalable, and effective. This distributed approach gives them more freedom and reduces their dependence on any single vendor, but it also adds complexity and increases the chance of outages that are hard to identify. In multi-cloud settings, a failure does not necessarily mean that everything stops functioning.

Comprehensive Browser Monitoring for Modern Web Apps: Mastering API & SPA Performance

Modern websites and applications are no longer simple HTML pages. As applications evolve into sophisticated single-page applications (SPAs) built with frameworks like React and Vue and rely heavily on API-driven architectures, the need for advanced browser monitoring has never been more critical.

Best 30 Remote Network Monitoring Software for IT Professionals of 2026

Remote network monitoring is the process of monitoring and managing network resources and activity from a remote location using software and tools that allow network administrators to monitor network performance, troubleshoot problems, and ensure that the network is secure and operating efficiently.

Using Tech to Keep Your Industrial Operations Safe

In modern industry, safety is pretty much a founding principle of good business, and without it, your company would not exist for very long, whether due to reputational damage or legal action. Keeping everyone safe is vital, and the good news for you is that tech has made it easier than ever to achieve.

Stop Manual Checks: Automating DORA Compliance for Cloud Dependencies

Financial services organizations face increasing pressure to comply with the Digital Operational Resilience Act (DORA) while managing complex cloud infrastructures. Manual compliance checks for third-party dependencies are no longer sustainable. This guide explores how to automate DORA third-party risk management, ensuring continuous compliance without the overhead of manual processes.

Sentry x Acquired

On October 23, 2025, Sentry hosted a special live event featuring the hosts of the Acquired podcast, Ben Gilbert and David Rosenthal. Together, we spoke with some of today’s most inspiring leaders across tech and sports. Whether you’re a longtime Acquired listener or new to Sentry, tune into this rare chance to hear directly from some of the people shaping the next generation of everything from technology and storytelling to leadership and more.

Grafana Campfire - Using Git Sync (Grafana Community Call - Nov 2025)

For our upcoming Grafana Campfire Community Call, we'll explore Grafana Git Sync, a powerful new feature that enables seamless dashboard-as-code workflows. Git Sync bridges the gap between Grafana's UI and your version control system, allowing you to manage dashboards with the same rigor and collaboration practices you use for application code.

Protecting PII in Synthetic Monitoring: How to Monitor Safely

Synthetic monitoring feels like the safest layer in the observability stack. It uses artificial users. It runs scripted journeys. It never touches real customer accounts. Yet this is exactly why many teams overlook the privacy exposure hidden inside it. Synthetic tests often produce screenshots, network captures, HTML snapshots, console logs, authentication artifacts or even short screencasts.

A Financial Business Case for Migrating Citrix Workloads from VMware to XenServer

In our previous blog post, we outlined why Citrix VAD/DaaS customers using VMware should consider migrating to XenServer. With Citrix now officially supporting XenServer for non-Citrix workloads and Broadcom’s recent licensing changes, many organizations are reevaluating their virtualization strategy.

Building Your DORA Register: Map Critical ICT Third-Party Providers

The Digital Operational Resilience Act (DORA) requires financial institutions to maintain a comprehensive register of their critical ICT third-party providers. This register isn't just a compliance checkbox; it's a strategic tool for managing operational risk and ensuring business continuity. Building an effective DORA register requires systematic identification, classification, and continuous monitoring of your SaaS dependencies.

Third-Party Vendor Status Monitoring: Automate Visibility and Accountability

Modern IT teams rely on dozens, or even hundreds, of third-party SaaS tools. From Workday and Cloudflare to Slack, Zoom, and GitHub, your business depends on these services to function. When a vendor outage occurs, the consequences are immediate: internal tools fail, support tickets spike, and leadership demands answers. Recent events, like the widespread Cloudflare outage in November 2025, highlight how quickly things can spiral out of control.

9 Tools and Integrations for InfluxDB

InfluxDB is the go-to database for developers working with high-velocity time series data for use cases like application performance monitoring and real-time analytics. But InfluxDB exhibits its true power when combined with the right tools and integrations. The tools covered in this blog post can help at all stages of your workflow, from data collection to visualization and analysis, so you can get the most out of your InfluxDB deployment.

Why am I getting R14/R15 errors in NodeJS? | MetricFire

How to detect, alert on, and resolve memory issues before they cause downtime. When applications scale on Heroku, memory-related issues are among the most common (and most frustrating) sources of instability. Two of the most notorious culprits are the R14 (Memory Quota Exceeded) and R15 (Memory Quota Hard Limit) errors.

VirtualMetric DataStream + Elasticsearch: A Smarter Way to Send Logs to Elastic

Elasticsearch has long been the backbone of security analytics for organizations that need fast search, flexible dashboards, and scalable visibility across massive datasets. It powers everything from threat hunting to compliance reporting and real-time investigation. But anyone who has operated Elasticsearch at scale also knows a quiet truth: Elasticsearch is only as strong as the data you feed it. And getting clean, consistent, usable telemetry into Elastic is often the hardest part.

How to monitor Amazon Bedrock AgentCore AI agent infrastructure in Grafana Cloud

Modern AI agents are now highly advanced, frequently becoming essential components of engineering workflows and deployment pipelines. However, operating these systems often feels like trying to navigate a ship through a dense fog. When an agent errors, slows down, or consumes excessive resources, engineers find themselves adrift, lacking the navigational charts needed to diagnose the problem. The absence of deep insight makes debugging, performance tuning, and cost management unnecessarily difficult.

The Role of Browser Monitoring in E-Commerce Conversion Optimization

In today’s hyper-competitive e-commerce environment, customers expect websites to be fast, responsive, and error-free. Even a few seconds of delay can lead to abandoned carts, lost revenue, and poor user experience. With more shoppers browsing on mobile devices and expecting real-time interaction, performance optimization has become a key driver of conversion growth.

ManageEngine Recognized in 2025 Gartner Magic Quadrant for Digital Experience Monitoring

ManageEngine is recognized in the 2025 Gartner Magic Quadrant for Digital Experience Monitoring. We have been committed to delivering reliable digital experiences through powerful monitoring and AI-driven insights. Thanks to our customers and partners for their continued trust and support.

Why Indoor Air Quality Declines Quickly When Mold Goes Untreated

Mold growth inside a home is more than an unpleasant sight; it can dramatically reduce indoor air quality in a surprisingly short amount of time. When mold spores multiply, they spread through the air and settle into various areas of the house, creating an environment that becomes increasingly difficult to live in comfortably. As mold thrives in warm, damp, and poorly ventilated spaces, even a small patch can spread rapidly if ignored. Understanding how and why air quality declines helps homeowners act early, protect their health, and prevent serious structural problems.

Instrument Jenkins With OpenTelemetry

You can instrument Jenkins with OpenTelemetry using the official plugin and an OpenTelemetry Collector, then send the data to a backend like Last9 to understand where pipeline latency and failures actually originate. Jenkins provides job status and console logs, but it doesn't show how time is distributed across stages, agents, plugins, and external systems. OpenTelemetry fills that gap by emitting traces, metrics, and logs in a standard format that any OTLP-compatible backend can process.

Amazon AppStream 2.0 Multi-session Service Monitoring

In late 2023, Amazon introduced the ability to deliver AppStream 2.0 using Microsoft Windows Server OS rather than the desktop of the OS. This feature enables IT admins to host multiple end-user sessions on a single AppStream 2.0 instance, helping to make better use of instance resources.

Building high performance dashboards with SquaredUp and ClickHouse

ClickHouse is redefining the boundaries of analytical database performance. Trusted by hyperscalers like Netflix, OpenAI, and Disney, it delivers sub-second query responses on billions of rows and scales seamlessly to petabyte workloads. The great news is that it is open source, so this power is available to everyone. You can spin up a local instance running in a Docker container in a matter of seconds.

Observatory Video Podcast - Episode 1 | KubeCon Atlanta Recap

Did you miss KubeCon + CloudNativeCon Atlanta, or are you still processing it? Join Marc Sherwood (Marketing), Mathias (SE Magician), Stephan Burns, and Diana (Evangelist) as they take you behind the curtain for an unfiltered look at the biggest cloud native event of the year! We’re diving deep into the massive shifts, key announcements, and major trends that are set to redefine the Kubernetes ecosystem in the next 12 months.

New features: Introducing Metrics Usage and Query Usage analyzers

As teams grow and telemetry scales, it becomes harder to keep track of which metrics matter. Labels pile up, cardinality increases, and costs start rising faster than anyone expected. At the same time, dashboards often stay quiet and alerts go untouched. The truth is, most teams don’t actually know how and how much of their metric data is being used, let alone which metrics are driving cost. This is exactly the problem we set out to solve.

Transaction Monitoring Software for Businesses and Financial Institutions

In today’s digital financial landscape, transaction monitoring software has become essential for businesses and financial institutions aiming to prevent fraud, money laundering, and other illicit activities. A transaction monitoring tool automatically tracks and analyzes financial transactions in real time, using predefined rules, algorithms, and AI-driven analytics to detect suspicious behavior.

Deeper Coverage with Less Complexity - New in DataStream

This month’s DataStream update brings meaningful improvements across pipeline management, MSSP workflows, and endpoint visibility. We’ve focused on giving security teams more control over how data moves through their environment, expanding coverage for both Windows and Linux, and strengthening governance for multi-tenant deployments. Let’s walk through what’s new.

How to monitor AI agent applications on Amazon Bedrock AgentCore with Grafana Cloud

Today’s AI agents have grown increasingly sophisticated, moving into production environments and becoming integral parts of engineering workflows. But these agents can also be black boxes for engineers, which makes observability more critical than ever. Without proper monitoring, you’re often left feeling like you’re flying blind as you try to debug agent failures, understand performance bottlenecks, and track costs.

The Complete Guide to Solving HTTP Server and HTTPS Transaction Monitoring Challenges

In today’s digital-first world, website uptime, speed, and security directly influence user experience and business success. Whether you’re managing an enterprise web application, a SaaS platform, or a multi-region eCommerce site, HTTP and HTTPS monitoring is non-negotiable. However, DevOps teams often face complex challenges—ranging from inconsistent server response times to SSL/TLS misconfigurations and broken transactions.

How to Receive Cloud Outage Alerts in Microsoft Teams

Cloud outages like the recent ones at Cloudflare, Microsoft Azure, and AWS can have a significant impact on your business with downtime, lost revenue, and unhappy customers. They can also disrupt your team's ability to work effectively. To stay on top of such outages, your team needs to know about them in an easy and timely way. In this article, we will see how to integrate IncidentHub cloud outage alerts with Microsoft Teams.

Stop Leaking PII in Your #Telemetry with Cribl Guard

Sensitive data sneaks into destinations more often than teams realize. In this clip, we capture live events, spot emails and login tokens slipping through, and fix it instantly with Cribl Guard. A few clicks, a commit and deploy, and Guard redacts the data in real time. No complex configs. No regex nightmares. Just fast protection that keeps your telemetry clean and your security tight.

How to Monitor Unmanaged Networks & Remote Workers

Your remote developer can't access the VPN. Is it his home router? His ISP? Your network? You have no idea and no way to find out. This is the reality of modern IT. Your network doesn't end at your office perimeter. It extends into hundreds of homes, coffee shops, branch offices, and third-party locations you'll never set foot in. And when performance tanks, you're troubleshooting blind.

How Log Management and NDR Work Together to Speed Up Incident Response

Log management and Network Detection and Response (NDR) solutions are closely related but offer different layers of visibility. Rather than overlapping, they complement each other, together providing a connected view of what’s happening in your environment. How exactly? Let’s take a closer look.

Introducing Dataspaces & Datasets

Observability data has a habit of outgrowing everything else. As telemetry volume, variety, and velocity increase, staying organized gets harder. Governance becomes messy, and the cost of digging through “everything” keeps rising. Over the past year, Coralogix’s DataPrime engine has been addressing these challenges by laying a new foundation for observability at scale.

Detecting Anomalous Spans at Scale with DataPrime

Tracing is one of the most transformative gifts of observability. It allows engineers to follow a single request through a distributed system and see every span and dependency along the way. However, even with that visibility, some of our most basic questions stay unanswered. Why did a specific span behave differently today than it did yesterday? Why did latency rise even when nothing “broke”?

How to Stream AWS CloudWatch Metrics into Grafana Cloud (10× Cheaper + Near Real-Time)

Unlock faster, cheaper, and more reliable AWS observability with CloudWatch Metric Streams in Grafana Cloud. In this video, Tristan from Grafana Labs gives a full walkthrough of our new AWS Metric Streaming integration, showing how to stream CloudWatch metrics directly into Grafana Cloud using Amazon Data Firehose and Terraform.

The Silent Sabotage of Configuration Drift

Your network is not a static entity. It is a living system that has been running for years, absorbing countless changes. While your infrastructure may appear healthy on the surface, a slow and silent saboteur is often at work, methodically undermining your infrastructure from within. This is not the work of a malicious actor. It’s the inevitable result of a process you may not even be tracking: configuration drift.

Turn feedback into action across your engineering org with Datadog Forms

Engineering teams rely on forms for everything from approvals to checklists, yet the process usually lives outside engineering operations. Spreadsheets, one-off surveys, and external form builders capture inputs, but they create scattered data, slow follow-ups, and manual translation into actionable work. Datadog Forms enables teams to create and share interactive forms directly within Datadog.

How Email Blacklist Monitoring Works

Email servers can be added to blacklists without any visible warning. When this happens, emails stop reaching inboxes and businesses lose communication reliability. Email blacklist monitoring solves this problem by checking your IP addresses and domains against global blacklist databases. This article explains the monitoring process in a clear, simple, and structured way, so you understand how it protects your deliverability and reputation.

In the age of AI, measurement becomes our superpower

The last few years have felt less like a product roadmap and more like a scene from science fiction. Artificial intelligence didn’t simply arrive, it erupted. In what feels like a blink, we’re building software by prompting instead of programming. Our words now generate code, compose music, translate languages, and create entire digital experiences.

Pastries with SREs: Holding onto extra observability data and desserts

In this episode of Pastries with SREs, we dig into why you should keep all of your observability data, even if you don’t need it quite yet. With enriched logs and flexible, cost-effective storage, you can stop worrying about what you might need later and start answering questions with confidence, no matter when they arise.

How to Reduce Your Cloud Costs with Coroot

Cloud costs often grow quietly until they suddenly command everyone’s attention. Gartner estimates that companies overspend on cloud services by up to 70 percent, mostly because they lack clear visibility into where the money is actually being spent. Cloud invoices speak the language of infrastructure: nodes, instance types, regions, volumes, and egress. Engineering teams speak the language of services, deployments, and code.

Golang Monitoring Guide - Traces, Logs, APM and Go Runtime Metrics

Golang (Go) applications are known for their high performance, concurrency model, and efficient resource use, making Go an easy choice for building modern distributed systems. But just because your Go application is built for speed doesn't mean it's running perfectly in production. When things go wrong, just checking if your service is "UP" isn't enough.

How to Run a Page Speed Test with Uptime.com

In this video, we explore the features and reports of our Page Speed Check, including both the free tool and the subscriber-exclusive version. Learn how optimizing your page load speed can benefit user experience, search engine rankings, and overall site performance. We utilize Google Lighthouse for in-depth analysis and provide actionable insights through detailed reports. Discover how to run a Page Speed Test, navigate through different report sections, and share your findings with stakeholders.
Sponsored Post

Future-Proofing Your Business: Why RISE with SAP is More Than a Technical Upgrade

2025 and 2027 are well-documented dates in SAP circles. Many SAP users have viewed the transition to S/4HANA and Cloud ERP as an opportunity for transformation, eliminating technical debt and enabling innovation. Just as many have recognized the complexity and size of the project and opted for a technical upgrade, often called a brownfield migration or a lift-and-shift, as the first step of a larger Cloud ERP journey. End-of-support dates for existing systems drive the urgency of the migration project, though commercial incentives and transition options from SAP may ease that pressure.
Sponsored Post

The Critical Role of CPU Monitoring for Modern Network Admins

For network administrators, maintaining seamless and uninterrupted system performance is an ongoing, vital responsibility. In environments ranging from hundreds of endpoints to complex hybrid clouds, CPU monitoring stands out as a critical tool. Without it, proactively identifying and resolving performance slowdowns, service lags, or outages is impossible, leaving you to reactively guess at solutions.
Sponsored Post

Early IT Outage Alerts in Action: 20+ Major Cloud Incidents of 2025

The IT cloud outages in 2025 are already shaping up to be a wake-up call for IT teams, MSPs, and developers worldwide. Even the most reliable services can experience disruptions, impacting workflows, customer experience, and business continuity. While major providers often take time to acknowledge incidents publicly, StatusGator's Early Warning Signals empower organizations to detect outages in real time, sometimes hours before official confirmation.

Beginner's Guide to OpenTelemetry & Django (2025)

Django is a popular open-source "batteries-included" Python web framework that enables rapid development while taking out much of the hassle from routine web development. By providing pre-built components like ORM integrations, authentication/authorization systems and more, it enables developers to focus on business logic and iterate fast. As such, developers and organizations worldwide use Django to build web apps of varying complexities.

From data management to an intelligent data fabric architecture

Large enterprises today manage more machine data than ever before. From legacy and modern applications, ERP, and supply chain systems to cloud infrastructure, cybersecurity, and customer-facing applications, much of this valuable data remains trapped in silos, limiting its potential to drive faster decisions, strengthen resilience, and meet the demand for optimum service availability.

Microsoft Sentinel Cost Optimization with Staged Routes and Commit Processors

As security data volumes grow, so do the costs of processing and storing them. Microsoft Sentinel and other SIEM platforms charge based on data ingestion, which makes every decision about normalization rules critical and every duplicate log a direct expense. Enterprise-scale security data pipelines face a persistent problem: data duplication across normalization tiers. As logs move through multiple transformation stages, it’s often impossible to know in advance which version will succeed.

Rollbar Debugging with ChatGPT using Service Links

In this guide, we will walk you through how to add a service link that connects directly to the Rollbar Debugging Assistant—our ChatGPT-powered tool that helps you quickly analyze and debug an occurrence’s raw.json without needing access to your code repository. The Rollbar Debugging Assistant makes it easier and faster to understand what went wrong in a specific error occurrence.

Skylar One Juneau: Real-World Intelligence for Service-Centric Ops

Service-centric operations demand more than observability; they demand understanding. The Juneau release of ScienceLogic Skylar One brings that understanding into sharp focus with greater clarity, intelligence, and ease-of-use for the IT and service operations teams who keep modern digital businesses running. Engineering enhancements in this release of Skylar One (formerly SL1) make it even more accurate, more intuitive, and more aligned with the way operations teams actually work.

Smart Home Monitoring with InfluxDB 3, Google Nest, and Grafana

Your smart home devices generate vast amounts of scattered data. This tutorial shows you how to centralize it into a unified platform using InfluxDB 3 and Grafana. You’ll not only track your home’s vital signs but also learn professional software development concepts, such as time series database design and building resilient data pipelines, applicable to various monitoring and analytics systems.

SaaS Monitoring Best Practices

SaaS Monitoring is the process of continuously tracking the performance, availability, and reliability of Software-as-a-Service (SaaS) applications to ensure they operate efficiently and deliver the best possible user experience. In an era where organizations depend heavily on cloud-based tools for communication, project management, and analytics, maintaining optimal SaaS performance is more critical than ever.

Anatomy of an OTT Traffic Surge: Thursday Night Football on Amazon Prime Video

In this edition of Anatomy of an OTT Traffic Surge, we look at Thursday Night Football on Amazon Prime Video. Based on traffic stats, TNF is the most watched program on the streaming service. Using Kentik’s OTT capabilities, we’ll see how this program gets delivered and how that has changed over 11 weeks of the NFL season.

Ipl-html: Introducing new Form Element Decorators

Decorators have always been a powerful concept in Icinga Web’s form system — letting developers control how form elements are displayed without hardcoding markup everywhere. But until recently, the decorator system had its limits. The new implementation of form element decorators completely reimagines this approach, offering cleaner logic and better flexibility. In this article, we’ll explore what’s new, why it’s better, and how to use it effectively in your own forms.

Introducing SigNoz's LLM-Powered Datadog Migration Tool

But migration is painful. Moving from Datadog means manually rebuilding dashboards, rewriting every query, and reconfiguring panels one by one. What took months to build takes weeks to migrate. Engineering teams get pulled away from actual product work to rebuild monitoring infrastructure they already had working. Critical monitoring setups and the context around why dashboards were built a certain way often get lost. We kept hearing about this from teams evaluating SigNoz, so we built a solution.

What is OpenTelemetry? [Everything You Need to Know]

Observability used to be a fragmented mess. You had one agent for logs, a different library for metrics, and a proprietary SDK for distributed tracing. If you wanted to switch vendors, you had to rewrite your instrumentation code from scratch. OpenTelemetry (OTel) fixed this. It has become the second most active project in the CNCF (Cloud Native Computing Foundation), right behind Kubernetes.

Reality Bytes: The Rise (and Risks) of Vibe Coding

In this Reality Bytes reunion, Tom, Sean, Tim, Oriana and Megan unpack the buzzy rise of vibe coding — the AI-assisted development trend coined by Andrej Karpathy and already explored by companies like Meta and Microsoft. The panel digs beneath the hype: from accelerated prototyping and accessibility gains to serious risks around technical debt, shadow applications, governance, security and the loss of human accountability. Oriana and Megan highlight the importance of schema, context and genuine creativity, while Tim warns against mistaking speed for quality. Is vibe coding the future - or just another fragile shortcut?

What Is Voice over Internet Protocol (VoIP) Monitoring?

Voice over Internet Protocol (VoIP) has transformed business communication by enabling voice calls over the Internet rather than traditional phone lines. While VoIP provides flexibility and cost savings, call quality can be affected by network issues such as latency, jitter, and packet loss. VoIP monitoring is the process of tracking and analyzing the performance of your VoIP system to ensure smooth, clear, and reliable voice communication.

Define, run, and scale custom LLM-as-a-judge evaluations in Datadog

Teams deploying LLM applications face a critical blind spot: They can measure speed and cost, but not whether their AI is actually giving good answers. To build user trust in these applications, teams also need to measure response quality, including factual accuracy, safety, and tone. Operational metrics show how a system behaves, but not whether its responses are correct or on brand.

Simplify multi-cloud monitoring with Site24x7 | One tool for any cloud

If you’re juggling multiple tools and dashboards, running workloads across AWS, Azure, Google Cloud, and Oracle Cloud can be chaotic. That’s where ManageEngine Site24x7 steps in. With one unified platform, you can monitor all your cloud environments in real time, and gain full-stack visibility across every resource. Whether it’s a VM, container, or serverless function, you can detect performance issues early.

How MSPs can simplify multi-cloud cost management for their customers

Multi-cloud cost management has quickly evolved from a nice-to-have capability to a core expectation with the accelerated adoption of the cloud. Organizations that depend on managed service providers (MSPs) for cloud operations now look to them for financial clarity as well. Now, MSPs are not only expected to monitor and manage cloud usage, but also to act as trusted cost advisors. They must deliver transparency, predictability, and strong governance across all the cloud environments their customers use.

FinOps Strategy for Hybrid IT: Interview with Tim Conley

FinOps continues to grow in importance as organizations balance cloud services with on-prem systems, legacy applications, and evolving business demands. Many teams want to manage their costs more effectively but are unsure how to apply a FinOps strategy for hybrid IT outside the cloud.

Side-by-Side Variable Comparison for Snapshot Debugging

When you’re debugging a tricky issue in a distributed system, “what changed?” is often the most important question. You add logs, you capture data, you redeploy, and suddenly your browser is full of open tabs, copied JSON blobs, and screenshots of log lines. Comparing behavior between two requests, two users, or two releases turns into a manual, error-prone chore. Lightrun Snapshots were built to fix the data collection side of that story.

[Webinar] How AI monitoring can cut downtime twice as fast as traditional tools

Unlock the future of IT monitoring with Site24x7’s advanced AI capabilities! In this exclusive webinar, Varalakshmi, Site24x7 product specialist, guides you through Site24x7’s unified observability platform. Discover how Site24x7 uses AI to solve modern IT challenges like eliminating tool sprawl, reducing alert fatigue, and enabling faster, more accurate RCA. We also explore relatable customer scenarios for dynamic anomaly detection, predictive forecasting, automated event correlation, cloud cost forecasting, and AI-powered ITOps with Zia.

Monitor Temporal Workflows seamlessly: Introducing the Temporal Cloud integration for Grafana Cloud

Nishad Krishnan is a Software Engineer at Temporal Technologies, where he’s focused on observability and making the “unknown unknowns” slightly less unknown. At Temporal Technologies, our goal is to make it easier for developers to build and operate reliable, scalable applications without sacrificing productivity. Our platform, Temporal, helps ensure that code runs to completion once started, no matter how long it takes or what failures occur along the way.

Pastries with SREs: Enriched logs and filled donuts

In this episode of Pastries with SREs, we take a sweet dive into one of the most exciting evolutions in observability: enriched logs, also known as wide events. Gone are the days of toggling between tools and stitching together logs, metrics, and traces. Enriched logs consolidate the context, providing everything you need to understand and resolve issues in a single log entry.

How to Monitor Java Applications on Windows with SolarWinds Observability | APM Setup Guide

This video provides a step-by-step walkthrough for configuring monitoring for Java applications running on Windows using SolarWinds Observability. The demonstration covers the complete process, from adding a new service to instrumenting the application with the Java APM library and verifying connectivity. This guide is designed for developers, DevOps engineers, and system administrators who need to instrument Java applications on Windows for performance monitoring, distributed tracing, and full-stack observability.

Mezmo + Catchpoint deliver observability SREs can rely on

For SREs juggling multiple services, third-party dependencies, and constant alerts, a critical service slowdown can quickly turn into chaos. APM dashboards may show everything is fine, yet users are still experiencing problems. That gap, between application telemetry and real-world performance, can turn a five-minute fix into a two-hour war room.

Build custom apps in seconds with conversational AI in App Builder

Datadog App Builder is a low-code tool for creating internal apps, making use of a drag-and-drop interface that allows engineering teams to troubleshoot issues, optimize operations, and enable self-service while connecting directly to their Datadog data and permissions. Now, with conversational AI, teams can go from idea to working prototype even faster.

Introducing Bits AI SRE, your AI on-call teammate

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

Eliminating N+1 Queries with Seer's Automated Root Cause Analysis

When I was working at Shopify, Black Friday and Cyber Monday were our Super Bowl. We initiated code-freeze weeks before to make sure merchants wouldn't have any unexpected issues during one of the most important times of the year. Sometimes, though, you need to ship updates last minute. Picture this: It's Black Friday Eve, 11:47 PM. You've just deployed a new /sale page with 50+ products at discounted prices. Marketing is about to email 500,000 subscribers. Everything tested fine with your sample data.

<100ms E-commerce: Instant loads with Speculation Rules API

In e-commerce, we all know that speed = money. I know it, you know it, Amazon knows it, eBay knows it, Shopify knows it, everyone knows it. In this article we’ll see how we can improve the perceived performance of our site’s critical pages, like the Product Details page, the Cart page, the Checkout page. We’re going to use the Speculation Rules API (SRA) to prerender/prefetch them, and also explain how certain frameworks like Next.js offer their own prefetching mechanisms.
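
As a rough illustration of the prerender/prefetch idea, the sketch below builds the JSON rule set that a page would embed in a `<script type="speculationrules">` tag. The helper name `speculation_rules_tag` and the paths `/product/*`, `/cart`, and `/checkout` are assumptions made for this example, not taken from the article, and the `moderate` eagerness value is just one plausible choice.

```python
import json


def speculation_rules_tag(prerender_pattern, prefetch_urls):
    """Render a <script type="speculationrules"> tag as a string.

    Hypothetical helper: the pattern and URLs are supplied by the caller.
    """
    rules = {
        # Prerender same-site links matching the pattern (document rule).
        "prerender": [
            {"where": {"href_matches": prerender_pattern}, "eagerness": "moderate"}
        ],
        # Prefetch an explicit list of URLs (list rule).
        "prefetch": [{"urls": list(prefetch_urls)}],
    }
    return '<script type="speculationrules">' + json.dumps(rules) + "</script>"


# Example paths are placeholders for a shop's critical pages.
print(speculation_rules_tag("/product/*", ["/cart", "/checkout"]))
```

Browsers that do not support speculation rules ignore the tag, so a server-side template can emit it without feature detection.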

How continuous profiling cut our cloud spend

At Coralogix, we’re constantly looking to evolve the measurements we take to better understand the efficiency of our infrastructure. We constantly assess and investigate sources of cost in our cloud infrastructure, to ensure we’re getting the best return on investment. This activity, often referred to as FinOps, is becoming a cornerstone of engineering teams.

7 Observability Solutions for Full-Fidelity Telemetry

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives). This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.
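
To make the tail-based sampling part concrete, here is a minimal sketch of a keep/drop decision taken after a trace has completed. The `Span` type, the 500 ms threshold, and the one-in-ten baseline are assumptions chosen for illustration; real collectors expose their own policy configuration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans, latency_threshold_ms=500.0, baseline_keep_every=10):
    """Decide whether to keep a completed trace (tail-based sampling)."""
    if not spans:
        return False
    # Always keep traces that contain an error or a slow span.
    if any(s.is_error or s.duration_ms >= latency_threshold_ms for s in spans):
        return True
    # Otherwise keep a deterministic 1-in-N share of healthy traces,
    # derived from a hash of the trace id so the decision is repeatable.
    digest = hashlib.sha256(spans[0].trace_id.encode()).hexdigest()
    return int(digest, 16) % baseline_keep_every == 0
```

Because the decision is made once the whole trace is visible, errors and slow requests are never dropped, while routine traffic is thinned to a predictable fraction.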

What's Special About MCP?

AI agents can interact with the world using tools. Those tools can be generic or specific. The most general ones, like “run a bash command” and “read and write files”, are built into the agent. More specific ones are provided through Model Context Protocol (MCP) servers. Every tool provided to the agent comes with instructions sent as part of the context.

How To Enable Real-Time Endpoint Visibility for L1 Support

In today’s digital workplace, speed and precision in IT support can make or break the employee experience. Long resolution times, repetitive troubleshooting, and lack of visibility often frustrate both users and support teams. That’s where Nexthink comes in, bringing powerful capabilities like Amplify, Device View, and Assist to transform how Service Desk teams operate.

Installing TrackJS on CertKit

I recorded a video showing how to properly set up TrackJS for a new production website, specifically CertKit, our new certificate lifecycle management tool. The key to effective error monitoring isn’t just installing the tracking snippet, it’s configuring the system to surface real issues while filtering out the noise. I configure a forwarding domain (errors.certkit.io) to bypass ad blockers that might prevent error reporting.

Grateful for Good Connections: Finding Calm in a Demanding Financial World

As the year winds down, my inbox is overflowing with Black Friday offers and festive greetings. It’s that time when Thanksgiving and the run-up to December holidays remind us to pause and appreciate what truly matters. Yet, in my recent conversations with IT leaders in financial services, I’ve noticed something: the time and calm needed to do this feel elusive.

How Live Monitoring Supports Smarter Decision-Making and Safety

Here's the reality: most businesses don't fail because of small ideas; they fail because warning signs were missed until it was too late. The gap between reacting to problems and preventing them can make or break operations. That's where live monitoring comes in. By combining real-time surveillance with smart analytics, businesses can spot risks before they escalate, protect people and assets, and make informed decisions on the fly. It's not just about watching, it's about anticipating.

Top 7 Observability Platforms That Auto-Discover Services

You can use an observability platform that automatically discovers your services and provides ready-to-use dashboards with minimal setup. If you're running a system where microservices come and go, containers shift around, or serverless functions scale up quickly, this kind of experience saves you a lot of time. You gain visibility as soon as something goes live, without requiring any additional steps on your part. In this blog, we talk about the top seven platforms that offer these capabilities.

What to Expect When You Migrate to Atatus APM

As organizations aim for exceptional software reliability and user satisfaction, migrating to Atatus APM is a key upgrade in application monitoring. With nearly 80% of companies facing costly downtime exceeding $300,000 per hour, robust APM solutions like Atatus are crucial. It helps teams quickly identify bottlenecks, optimize performance, and improve the customer experience through comprehensive, real-time insights.

ScienceLogic Named a 2025 NVTC Tech100 Honoree

ScienceLogic is proud to be recognized as part of the 2025 Northern Virginia Technology Council Tech100. The annual list highlights the companies, executives, entrepreneurs, and emerging leaders who are shaping the region’s technology landscape and strengthening its economic growth. Earning a place on this list again underscores our momentum and commitment to helping organizations modernize IT with trusted data, intelligence, and automation.

A Comprehensive Guide for Synthetic Transaction Monitoring

Synthetic Transaction Monitoring is a technique that uses automated scripts to simulate user activities on an application to test performance and functionality. The scripts create simulated transactions such as logging in, searching for a product, or completing a purchase, without requiring real users. These transactions are executed regularly from various locations to ensure the application is performing smoothly and as expected, even during off-peak hours.
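
A scripted transaction of this kind can be sketched as an ordered list of named steps that are timed and checked one by one. The function name `run_transaction`, the 2000 ms per-step budget, and the placeholder steps are assumptions for illustration; a real check would drive a browser or call the application's endpoints.

```python
import time


def run_transaction(steps, budget_ms=2000.0):
    """Run named steps in order and stop at the first failing step.

    A step fails if it raises an exception or exceeds the time budget.
    """
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        passed = ok and elapsed_ms <= budget_ms
        results.append({"step": name, "ok": passed, "ms": elapsed_ms})
        if not passed:
            break  # later steps depend on earlier ones
    return results


# Placeholder steps standing in for login, search, and checkout.
steps = [
    ("login", lambda: None),
    ("search", lambda: None),
    ("checkout", lambda: None),
]
for row in run_transaction(steps):
    print(row["step"], "ok" if row["ok"] else "FAILED")
```

Scheduling the same script at fixed intervals from several locations is what turns this single run into continuous synthetic monitoring.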

What Is Infrastructure Monitoring? - Dotcom-Monitor

In today’s always-on digital world, the health of your IT infrastructure directly impacts business performance and customer satisfaction. Even a few minutes of downtime can result in lost revenue, broken user trust, and costly disruptions. As organizations increasingly adopt hybrid and cloud-native architectures, keeping track of every server, database, container, and network component has become more complex and more critical than ever.

Icinga Notifications v0.2.0 Release

Some of you might have already heard about this at OSMC, or you may have received a release notification from GitHub already: our Icinga Notifications project made a step forward and we are happy to announce that version 0.2.0 is now available for you to try out. It addresses feedback that we have received for the previous versions with the most important changes highlighted below.

The Hidden Cost of Untagged Cloud Resources for SMBs

Cloud computing is a powerful enabler of growth and agility for small and medium businesses (SMBs). However, untagged cloud resources are one of the primary challenges most SMBs face in cloud environments. These untagged resources lead to a lack of visibility and accountability over cloud spending, which leads to wasted budgets and cost overruns.

Data Observability: Build confidence in the data life cycle

Datadog Data Observability provides a complete solution with quality checks (e.g., volume, row changes, freshness), custom SQL-based monitors, anomaly detection, column-level lineage across systems like Snowflake and Tableau, full pipeline visibility, and targeted alerts when data issues arise.

Breaking siloes: How to use cross-store correlations with Grafana

Grafana is great at hopping between signals in its native backends (Grafana Loki, Grafana Mimir, Grafana Tempo). But your data doesn’t have to live there to get the same smooth workflow. After all, we don’t just pay lip service to our “big tent” philosophy; we want to meet all our users’ diverse needs, regardless of what kind of data you have or where you store it.

Coordinate large-scale engineering initiatives with IDP Campaigns

As organizations grow, engineering leaders often need to drive cross-team initiatives such as reducing cloud spend, upgrading runtimes, or strengthening security controls. Tracking this work can quickly become fragmented across spreadsheets, dashboards, and status meetings. Progress is hard to measure, accountability is unclear, and the impact of each effort can be difficult to demonstrate.

Synthetic Monitoring for ServiceNow: Tables, Rules & Endpoints

ServiceNow is one of those platforms that looks simple from the outside but turns into a labyrinth the moment an organization starts customizing it. What begins as a service catalog or an HR portal quickly evolves into a tangle of custom tables, UI policies, business rules, Flow Designer actions, and scripted REST endpoints. None of this is bad. In fact, it’s the whole reason companies choose ServiceNow: you can build anything.

Stop the guesswork: Troubleshoot with confidence with process monitoring

If your organization runs on tech, everyday issues can be expected. This includes application downtime, erratic connectivity, and failures in remote access, database reachability, site-to-site VPNs, and web-based services. But how do you know if an issue is caused by: Sysadmins usually learn the root cause of an issue after a ticket comes in from the team or customer.

Search Telemetry Without Limits in a Multi Cloud and AI World

Cribl Search gives you one lens across all your telemetry data no matter where it lives. Instead of forcing teams to move data into one system or jump between tools, you get a familiar pipe based query experience with dashboarding and alerting built in. Storage and query processing stay separate so you decide where your data lives while your users get fast, simple access in one place.

How to Choose the Best Synthetic Monitoring Solutions & Software

A fast, reliable digital experience takes more than resolving issues after they occur. This is why teams turn to synthetic monitoring, which simulates real user actions at regular intervals. With this method, businesses can detect performance shortcomings and technical issues early. Everything from website load to the full checkout flow can be tested before users run into problems.

What Are AI Workloads? Everything Ops Teams Need to Know

AI workloads break every assumption you have about infrastructure management. AI is everywhere. Machine learning-based tools are answering customer service questions, accelerating incident resolution, catching fraudulent transactions, spotting defects on production lines, and powering late-night searches that delve into the random topic that pops into your head right before bedtime. Behind every prediction, response, or generated sentence is massive computing power doing serious, continuous work.

AI for Good: Securing Networks in the Age of Autonomous Attacks

The rise of autonomous AI attacks operating at machine speed demands that network security evolve beyond human capacity and manual processes. Kentik AI Advisor counters this threat by using AI for good, reasoning across full network context to proactively eliminate vulnerabilities and guide immediate, confident defense.

AI Workload Infrastructure Requirements: What You Actually Need

Artificial intelligence (AI) infrastructure requires four pillars working in tandem as a system (compute, storage, networking, and orchestration) tailored to your actual workload needs, not hype. Artificial intelligence (AI) infrastructure isn’t just more hardware. It’s a new class of system—highly distributed, resource-intensive, and tightly coupled across compute, storage, and network layers.

AI Monitoring, Explained: Challenges, Core Components, and Why Observability Is the Next Step

Monitoring AI systems isn’t business as usual. Monitoring AI isn’t like monitoring traditional systems. You can’t just track uptime or response times and call it a day. AI models evolve, data shifts, and behavior drifts over time, which means your monitoring has to evolve, too. If you’re running AI workloads in production, you already know this. Your models might look healthy according to your infrastructure metrics, but they’re still making bad predictions.

AI Observability: How to Keep LLMs, RAG, and Agents Reliable in Production

AI observability closes the gap between “something’s wrong” and “here’s what to fix.” If you run AI in production, you might have felt the whiplash. Yesterday, your LLM answered in 300 milliseconds (ms). Today p99 crawls, costs spike, and nobody’s sure if the culprit is model behavior, data freshness, or GPUs stuck at the ceiling. Dashboards light up, but they don’t tell you which issue puts customers at risk. That’s the gap AI observability closes.

Use OpenTelemetry with Observability Pipelines for vendor-neutral log collection and cost control

Today, many DevOps and security teams operate in a world of complex, hybrid, or multi-vendor environments. As more teams look to avoid lock-in by adopting open standards, OpenTelemetry (OTel) is quickly gaining adoption as the primary open source method for DevOps and security teams to instrument and aggregate their telemetry data. However, OTel alone may lack the advanced processing functions, native volume control rules, and hybrid environment support that large organizations need.

How to Reduce Log Data Costs Without Losing Important Signals

You can cut your log costs by removing repetitive, low-value logs early and keeping only the parts that genuinely help you understand issues. Modern systems generate logs far faster than you expect. Even when your workload stays stable, infrastructure components, retries, and background workers continue producing a steady stream of repeated entries.
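
One simple way to picture the early removal of repetitive logs is to collapse lines that differ only in their numbers into a single entry with a count. This is a sketch under that assumption; the function name `collapse_repeats` and the sample lines are made up for the example, and production pipelines usually apply more careful templates.

```python
import re
from collections import Counter

# Treat any run of digits as a variable part of the message.
_VARIABLE = re.compile(r"\d+")


def collapse_repeats(lines):
    """Collapse log lines that differ only in numbers into (first_line, count) pairs."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        key = _VARIABLE.sub("<n>", line)
        counts[key] += 1
        first_seen.setdefault(key, line)  # keep one concrete example per pattern
    return [(first_seen[key], n) for key, n in counts.items()]


sample = [
    "retry 1 for job 42",
    "retry 2 for job 42",
    "disk full",
    "retry 3 for job 7",
]
for example, count in collapse_repeats(sample):
    print(count, example)
```

Keeping the first concrete line plus a count preserves the signal (what happened and how often) while dropping the repeated bytes.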

What's New in InfluxDB 3.7: One-Click Monitoring, Faster Configuration, and Better Operational Clarity

InfluxDB 3.7 is now available for both Core and Enterprise, landing alongside version 1.5 of the InfluxDB 3 Explorer UI. This release focuses on giving developers faster visibility into what their system is doing with one-click monitoring, a streamlined installation pathway, and broader updates that simplify day-to-day operations. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2.

Is Your Network Modernization Frozen by Fear?

Have you ever stood before a critical piece of network infrastructure, knowing it desperately needs an upgrade, yet felt a wave of paralysis wash over you? You’re not alone. It’s a common feeling when facing a project as significant as a data center migration or a move to a modern leaf-spine architecture.

Inside the Cloudflare Outage: Real-World Data from UptimeRobot

On November 18th, 2025, a large Cloudflare outage briefly broke big chunks of the internet. For several hours, users around the world were greeted with 500 errors, including on platforms like X, ChatGPT, Spotify, and many others that run behind Cloudflare’s network. At UptimeRobot, we sit in a slightly unusual spot during events like this. So when Cloudflare has a bad day, we see it twice: once in the alerts we send to our customers, and again in how it affects parts of our own infrastructure.

Introducing Logs, User Feedback, and more in the Sentry Godot SDK

With the first stable releases out of the gate, we’re happy to announce that Sentry’s Godot SDK is now ready for general use, supporting Windows, Linux, macOS, iOS and Android. We started full-time development a year ago with just a few prototypes, and now it's finally here - built on top of the mature Sentry platform SDKs, it comes as a GDExtension add-on that you can easily add to your Godot projects.

Cloud Status Check Overview

In this video, we provide an overview of Uptime.com's Cloud Status check feature, designed to monitor the status of common cloud services within your technology stack. We walk you through the step-by-step process to configure a Cloud Status check, including how to select third-party services, add contacts, and organize checks with tags. Learn how to view incident history and get detailed updates from third-party providers. For more information, visit our documentation or contact our support team.

Episode 1 - Preparing the workforce for AI | The Intelligent Enterprise

In our first podcast episode of The Intelligent Enterprise, Ricardo Costa, Senior Vice President and Chief Technology Officer at Purolator, gives us his views on how to prepare the workforce for AI. In his role as a technology "translator" connecting business strategies with tech implementations, Ricardo highlighted the importance of translating complex tech concepts into simple, understandable stories and addressing leadership challenges in preparing the workforce for AI, including upskilling and ethical considerations.

AI Isn't Here to Replace Your Dashboard... Yet

Non-deterministic UIs are the future and will replace your dashboards, but they’re not here yet. So until then, we’re stuck with conversational interfaces. In an effort to try and describe what I consider the future of UIs to look like, I wrote about how you (and I) have been designing dashboards wrong. The core insight was that we've been designing for static representations of data that sit on a TV in the office, when the actual use case is someone at a desk using them to debug an issue.

Best New Relic Alternatives & Competitors in 2026

If you have explored monitoring and observability solutions for your program, New Relic One is hard to miss. It is a comprehensive monitoring and management application started by Lew Cirne in 2008. Since then, it has expanded its product base to include over twenty products ranging from front-end to backend, infrastructure, logs, and even vulnerability management. Today, it is one of the most successful analytics platforms for enterprises dealing with data.

Architecture for the agentic era: How AI will reshape data, security, and observability

As AI agents move from copilots to autonomous systems, they’re generating and consuming data at unprecedented scale. The result is a new kind of infrastructure pressure — one that’s quietly reshaping how organizations think about data, cost, and control. Across IT, Security, and Observability, leaders are realizing a hard truth: too much data is too costly.

Lightweight Open-Source APM with OTel Demo (Grafana OpenTelemetry Community Call)

We’re back with the second Grafana OpenTelemetry Community Call! Join us as we continue exploring how to get observability into your apps and infrastructure with Grafana, powered by OpenTelemetry. In this session, we’ll walk through the basics of application monitoring using the OpenTelemetry Demo — a realistic example of a distributed system built on a fully open-source stack: Prometheus, Jaeger, and OpenSearch, with dashboards powered by Grafana.

The Internet broke again. StatusGator can help

November 18, 2025 — Cloudflare is back online after a sweeping global outage disrupted millions of people across the world. For several hours, websites, apps, APIs, and entire business operations were knocked offline. IT teams everywhere were once again scrambling to make sense of the chaos. Outages are no longer rare. Every week it feels like another provider takes down a huge portion of the internet.

Top 7 reasons behind poor user experiences and how to fix them

User experience (UX) has become a pivotal factor in influencing the success of a product. You've probably experienced it yourself by clicking away from a slow website or abandoning an app that just doesn't work right. For product owners, the difference between success and failure often comes down to how smoothly users can interact with your product. But here's the problem: Creating that seamless experience is tougher than it looks.
Sponsored Post

Extending Microsoft SCOM's Reach

As IT landscapes evolve toward Azure, SaaS, and multi-cloud platforms, traditional monitoring approaches often leave gaps that hinder performance and reliability. Modern Management Packs provide a practical solution, enabling Microsoft SCOM to seamlessly monitor new technologies, specialized applications, and non-Microsoft systems, without the need for separate tools.

Introducing Honeycomb Private Cloud

More and more enterprises are shifting toward private cloud and hybrid deployments for control, data residency, and security. At the same time, observability is no longer a “nice to have” tool. It's mission-critical for teams driving rapid change across cloud-native, multi-service architectures. Leaders are realizing they need deep visibility and rapid debugging everywhere their systems run.

Enhancements to Honeycomb Telemetry Pipeline Deliver Greater Visibility, Smarter Control, and Lower Costs

In July, we introduced powerful new Honeycomb Telemetry Pipeline features that helped teams take control of their observability data with safe sampling, flexible rehydration, and a visual pipeline builder. Since then, we’ve built on that foundation. Today, we’re introducing the latest enhancements to Honeycomb Telemetry Pipeline, which give teams deeper visibility into pipeline health, more efficient access to archived telemetry data, and reduced operational complexity.

The "Meh-trics" Reloaded: Why I Was 100% Wrong About Metrics (and Also 100% Right)

Okay, I'm going to say something that would make 2016 Charity want to throw her laptop across the room: we're making a major investment in metrics at Honeycomb. I know, I know. "But Charity, you literally called them ‘shit salad!’" I did. Also "nerfed dimensions." I said they would "fucking kneecap you." For most of the past decade, I've been social media’s most reliable anti-metrics evangelist. Have I repented? No.

KubeCon North America 2025: OpenTelemetry Recap from Atlanta

KubeCon + CloudNativeCon North America 2025 wrapped up in Atlanta last week, and it sure did feel like a big one for OpenTelemetry. Between Observability Day, the project updates, and the activity around the OpenTelemetry Observatory booth, you could feel how quickly the ecosystem is maturing.

Why Gaining Control of Your #telemetry Data Is a Game Changer

Disconnected pipelines. Unknown data sources. Costs that do not add up. Many teams struggle to answer a simple question. What data do we have and where is it going? In this clip, a Cribl customer explains how bringing all telemetry data together changed everything. With Cribl, their team can finally see what they collect, where it flows, and what it costs. That clarity unlocked smarter reduction, better routing decisions, and major optimization across security and observability workflows.

Canvas Is Now GA: AI-Guided Observability for Modern Teams

When we introduced Canvas in beta, our goal was to reimagine how teams explore and collaborate around their observability data without requiring manual querying. Canvas has quickly become the AI-guided workspace that helps teams transform raw telemetry into meaningful, shared understanding faster than ever before. And today, we’re thrilled to announce that Canvas is now Generally Available (GA) for all Honeycomb users.

How to Onboard AWS & Azure Hosts in SolarWinds Observability

Connecting your cloud infrastructure has never been easier. In this quick walkthrough, you’ll see how SolarWinds Observability natively integrates with AWS and Azure to onboard virtual machines and supported managed services, fast. Select your hyperscaler, click “Add Data”, choose “Hosts”, and follow simple steps to connect your cloud environment via API. Whether you're running AWS EC2, Azure VMs, or other managed services, SolarWinds helps you get visibility in minutes.

Pastries with SREs: FinOps is to ROI as a coffee is to cannoli

In this episode of Pastries and SREs, our hosts tackle one of the hardest questions observability leaders face: "How do you prove the ROI of observability?" This isn’t just about uptime or dashboards. It’s also about aligning observability with business outcomes, cloud cost savings, and FinOps metrics that matter to leadership.

AI as Monitive's CEO

I recently attended Lisbon's Web Summit, a 3-day event with 70,000 participants, 15 stages, and 800+ speakers. Even though there was a track called "AI Summit", all the talks were about AI and AI agents: how the future of the web, business, and the economy is more and more AI, and how businesses and people should take steps to adapt as soon as possible to an online world managed and operated by Artificial Intelligence.

OnlineOrNot's lessons from Cloudflare's outage on 2025-11-18

On 2025-11-18 at 11:48 UTC, Cloudflare declared an incident affecting the global network (that also affected OnlineOrNot). OnlineOrNot monitors websites, APIs, web apps, and cron jobs, while providing status pages as well. While we partially mitigated the issue by enabling a fallback to AWS-based monitoring, between 13:00 UTC and 14:33 UTC failing checks went unreported, heartbeat checks over-reported, and status pages were unavailable.

How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

AI-Suggested Alert Thresholds for Mobile Telemetry

Life is pretty good. I’ve shipped a mobile app and I’m (happily) drowning in telemetry. Battery impact, time in foreground/background per screen, crash rates, slow frames, network retries – the works. The data is brilliant; the challenge is turning signals into reliable alerts that catch real issues which are relevant to my app’s functions. So… what should I actually listen for, and where should I set the thresholds?

Navigating External Outages: How Selector Cuts Through the Cloudflare Noise

Yesterday’s widespread Cloudflare outage reminds us how crucial external dependencies are to the stability of our own applications. When a key edge provider like Cloudflare goes down, the impact on your internal monitoring systems can look like a catastrophic, internal system failure triggering a massive storm of alerts and sending engineering teams into frantic, misdirected debugging sessions.

What is AWS Fargate for Amazon ECS?

As cloud applications moved from VMs to containers and then to microservices, the amount of background work needed to keep everything running grew just as quickly. You gain speed and flexibility, but you also end up managing clusters, scaling rules, and capacity choices that don’t really add to the product you’re building. AWS Fargate steps in right there. It lets you run your ECS tasks without looking after any servers at all.

OTel Updates: Complex Attributes Now Supported Across All Signals

OpenTelemetry now supports maps, heterogeneous arrays, and byte arrays across all signals. Here’s where these new types shine — and where simple primitives still fit naturally. If you’ve been working with OpenTelemetry for a while, you’re likely familiar with the straightforward key-value approach to attributes. It’s simple, fast, and works well with how most telemetry backends store, index, and query data.

The metrics product we built worked - But we killed it and started over anyway

Two years ago, Sentry built a metrics product that worked great on paper. But when we dogfooded it, we realized it was not what our customers really needed. Two weeks before launch, we killed the whole thing. Here’s what we learned, why classical time-series metrics break down for debugging modern applications, and how we rebuilt the system from scratch.

Node.js Performance Monitoring Guide

Node.js applications power millions of APIs, microservices, and real-time systems. But without proper monitoring, performance issues, memory leaks, and errors can go undetected until they impact users. This guide explains how to monitor Node.js applications in production, what metrics to track, and which tools deliver the best results.

Explore Cloud Instance Pricing and Performance with Datadog Instance Explorer

Meet Datadog Instance Explorer, a way to explore, compare, and monitor cloud instance pricing and performance across AWS, Azure, and Google Cloud in one place. Start exploring your instance options today and make smarter, data-driven infrastructure decisions.

Grafana 12.3 release: Interactive learning experiences, new and improved logs visualizations, and more

Grafana 12.3 is here, delivering new features for interactive learning, deeper insights into logging data, and so much more. Overall, a big theme in the latest minor release is to make data exploration easier, faster, and more customizable. Grafana 12.3: Download now! Below are just some of the highlights from Grafana 12.3. If you want to explore all the latest updates, please refer to the changelog or our What’s New documentation, and be sure to check out the TL;DR video below.

Grafana Data Visualization Update: Panel Time Settings & Time Comparison in 12.3

The new panel time settings drawer gives you greater control over time ranges and shifts at the panel level without editing the dashboard. The time comparison feature, in particular, was a request from the community, and allows you to easily perform time-based (for example, month-over-month) comparative analyses in a single view. This eliminates the need to duplicate panels or dashboards to perform trend tracking and performance benchmarking.

Monitor SolarWinds in Grafana: Demo + Setup

Matt from Grafana’s Enterprise Data Sources team demos the SolarWinds plugin: URL/credentials setup, TLS options, health check, and built-in dashboards. See how to query SolarWinds data with SWQL in Grafana and where to learn more (docs + SolarWinds SWQL resources). As of Nov 19, 2025, this is available in public preview in Grafana Cloud (including the free tier) and Grafana Enterprise.

StatusGator earns SOC 2 Type 2 certification

We are absolutely thrilled to share some momentous news: StatusGator has officially achieved SOC 2 Type 2 certification! This isn’t just another checkbox on a compliance list – it’s a powerful validation of our dedication to safeguarding your data and delivering the reliable service you depend on.

Outage map now available in your StatusGator board

We’re excited to introduce a helpful new update to your StatusGator experience – the service outage map is now built directly into your StatusGator account. StatusGator has displayed outage heatmaps on our public website’s service landing pages. These maps helped users understand where issues were being reported across the globe. Now, we’ve taken that same valuable visibility and placed it inside your board.

Stay audit-ready with real-time file change alerts in Site24x7 server monitoring

Maintaining the integrity of server files and directories is essential for security, operational resilience, and compliance. Whether it’s business-critical application configurations, sensitive data files, or audit logs, any unauthorized, unexpected, or accidental modification can jeopardize service continuity and expose an organization to regulatory risks. Manual file monitoring is impractical at scale.

How OpManager powered IT reliability for DWIHN

In healthcare, every moment counts, and for Detroit Wayne Integrated Health Network (DWIHN), every heartbeat depends on a network that doesn’t skip one. Serving over 75,000 patients across Detroit and Wayne County, DWIHN’s IT network powers essential behavioral health services, from autism care to crisis intervention. When its systems started showing signs of strain, DWIHN turned to ManageEngine OpManager to bring reliability, clarity, and calm back to its IT operations.

How to Speed Up Incident Response With Guided Remediation

Most teams picture incident response as a linear sprint from alert to resolution. A notification appears, an analyst pivots across screens, a decision gets made, and the workflow moves on. It works, but it is mechanical, tiring, and fragile. Graylog 7.0 aims for something more impactful. Guided remediation gives analysts clarity during the moments when pressure rises and context usually scatters. It takes raw detection data and turns it into a clear path forward. No theatrics.

Optimizing Ruby performance: Observations from thousands of real-world services

Over the past three decades, Ruby has assumed a pivotal role in the modern web stack and become a fixture in the tool kits of countless DevOps and platform teams. Today, it is a driving force in contemporary application development, testing, automation, and CI/CD. For this blog post, we used data from our always-on continuous profiling of more than 3,000 real-world services from hundreds of organizations to track trends in Ruby usage and performance.

Introducing Datadog Agent Builder: Build agentic workflows for alert response and remediation

Building automated workflows that adapt to real-world complexity can be a challenge. As systems scale and scenarios multiply, teams often end up hardcoding endless logic branches just to handle every potential outcome. That’s why we’re introducing Datadog Agent Builder, a powerful new tool that lets you create custom AI agents that are fully hosted by Datadog.

The Shifting Nature of Organic Search in 2025

For decades now, search engine optimization (SEO) has been viewed as a “cheat code” channel – a method for businesses of any size or budget to achieve organic growth and scale against larger competitors. Industry research over the past two years has valued the SEO industry itself at over $150 billion and projected it to grow by an additional 20% by 2030, and thousands of case studies evidence the value of investing in SEO as a growth channel.

Elasticsearch: The context engine for grounding and orchestration in Microsoft Azure AI Foundry Agent Service

The rise of large language models (LLMs) and agentic applications promises to transform enterprise workflows. Yet, the core challenge remains: How do we ensure these powerful agents generate accurate, relevant, and trustworthy responses based on proprietary enterprise data rather than relying solely on their generic training knowledge? The answer lies in grounding — connecting the LLM to verified, trusted, and up-to-date information.

How to pair Grafana Drilldown with Loki for faster logging insights

Our logs can tell us so much about the state of our systems, but they can also be a bit overwhelming. Yes, Grafana Loki—and, by extension, Grafana Cloud Logs, which is powered by Loki—reimagined the way log aggregation systems could meet modern engineering demands, but logs, by their very nature, are still voluminous.

Azure Monitor offers Grafana dashboards natively for immediate, real-time operational monitoring

Editor’s note: This blog originally published in May 2025 when Azure Monitor dashboards with Grafana became available in public preview. It was updated in November 2025 to reflect general availability. The Grafanaverse just got a little bit bigger.

Uptrends x OpenTelemetry: Stream browser-level synthetic data into your observability stack

Dashboards and alerts can tell you something’s wrong, but they don’t immediately tell you why. A red indicator or synthetic test failure prompts detective work. You flip between dashboards, timestamps, and logs, trying to line up what the check saw with what the system did. Now imagine your monitoring could explain itself by sending traces directly into your OpenTelemetry (OTel) backend.

Introducing webvitals.com: Find out what's slowing down your site

Developers don’t need another “run this tool, stare at a number, and feel bad about it” website. So we built something different. WebVitals helps you analyze, optimize, and ship faster websites, all in one place. Built by the same folks who obsess over stack traces and slow queries, it connects the dots between performance metrics and what’s actually slowing your users down.

Cloudflare outage: another wake-up call for resilience planning

Another day, another massive Internet disruption, and this time it’s Cloudflare taking huge parts of the Internet offline. This incident is not an anomaly. It is part of a recurring pattern that has become standard in digital infrastructure. We have reached an inflection point in digital operations. Outages at major cloud and content delivery network (CDN) providers are now expected. The only real uncertainty is when it will happen next.

Prioritize errors and create tickets using Rollbar's MCP Server

Production errors can feel overwhelming. Your Rollbar dashboard is filling up with alerts, your team is scrambling to understand what needs immediate attention, and critical revenue-impacting issues might be buried among less urgent problems. Sound familiar? In this post, I'll walk you through a workflow that transforms production error chaos into organized, prioritized action items. We'll cover everything from analyzing Rollbar errors to creating properly linked Linear tickets.

Introducing Kentik AI Advisor: The Future of Network Intelligence

Introducing Kentik AI Advisor, a powerful new AI designed to deeply understand your network, reason through complex issues, and deliver clear, actionable guidance for designing, operating, and protecting your networks. By autonomously querying Kentik’s rich telemetry and tools, it explains what’s happening, why it matters, and what to do next — from troubleshooting and capacity planning to cost optimization and risk mitigation.

#observability needs more than tools. It needs the right data.

Good observability starts with good data. In this clip, we hear how Cribl gives teams real control over their data pipelines so they can collect, enrich, and route telemetry from any source to the right destination. It is not just about more dashboards or another platform. It is about building an observability ecosystem that connects IT, security, and the business through cleaner data and smarter AIOps. Tool rationalization and AI-driven pipelines are not future goals. They are happening right now.

Distributed Tracing for Microservices: 10 Essential Best Practices for 2026

Distributed tracing tracks how a single request moves across multiple microservices, helping teams see the entire execution path end to end. In modern architectures where dozens of services interact, it becomes difficult to understand where latency starts, why bottlenecks appear, and which component breaks under load. Traditional monitoring only shows isolated metrics. Distributed tracing connects those dots.

Introducing Kentik AI Advisor

Introducing Kentik AI Advisor. AI with a comprehensive understanding of your network that thinks critically and advises how to design, operate, and protect infrastructure at scale. With the rise of hybrid cloud networks and the growing demands of AI infrastructure, network teams are under pressure to balance cost, performance, and security, often with limited resources that delay critical strategic initiatives.

Ep 18: AI has a memory problem, just like you do

In this episode of Masters of Data, we dive into how AI learns, examining both how we teach it and what it derives from human performance, as well as why context plays a crucial role in AI interactions. We break down five key components of AI training and talk about why we should view AI as a tool under human control rather than an autonomous entity. We explore the challenge of maintaining context in AI—much like our own memory struggles—and discuss methods, such as retrieval-augmented generation, that can help AI retain context more effectively.

How to Monitor RabbitMQ

A queue quietly fills up overnight. Memory hits the configured watermark and RabbitMQ blocks all publishers. Your entire message pipeline freezes, and you discover the problem when users start complaining. This scenario repeats across thousands of production systems because teams don't monitor RabbitMQ properly. The broker exposes comprehensive metrics, but most engineers don't know which ones predict failures or how to track them.

Datadog GPU Monitoring: Optimize and troubleshoot AI infrastructure

With Datadog GPU Monitoring, engineering and ML teams can monitor GPU fleet health across cloud, on-prem, and GPU-as-a-Service platforms like Coreweave and Lambda Labs. Real-time insights into allocation, utilization, and failure patterns make it easy to spot bottlenecks, eliminate idle GPU spend, and resolve provisioning gaps. By tying usage metrics directly to cost and surfacing hardware and networking issues impacting performance, Datadog helps teams make fast, cost-efficient decisions to keep AI workloads running reliably at scale.

Better together: Cribl and Microsoft Fabric just got radically simpler

In September, I wrote about how Cribl and Microsoft Fabric Real-Time Intelligence provide a powerful combination, unlocking new analytics capabilities for security and IT teams. I also said there was more to come… Today, Cribl is thrilled to announce a new Cribl Destination for Microsoft Fabric Real-Time Intelligence, marking another big step forward in our collaboration with Microsoft to make it much easier for Cribl customers to use Fabric.

Agentic AI: Ushering in the Next Era of Intelligent IT

IDC predicts agentic AI will command over 26% of global IT spend, hitting $1.3 trillion in 2029. How do IT Ops teams prepare for the reality of agentic systems being embedded across workflows, interfaces, and enterprise platforms? We went straight to the source—IT Ops leaders—to learn how they’re tackling agentic AI.

How to Achieve Deep Network Visibility with SolarWinds Observability SaaS

Looking for a faster way to discover every device on your network? This video walks through how SolarWinds Observability automatically scans and classifies network gear, including routers, switches, access points, firewalls, and SD-WAN devices, in seconds. This is the easiest way to get full network visibility without scripts, config files, or manual inventory work.

Unlocking Full Application Visibility with LogicMonitor

In today’s digital landscape, application performance isn’t just about monitoring several key apps and “keeping the lights on,” it’s about understanding the full breadth of your interconnected business services and ensuring you’re delivering seamless, reliable experiences to customers and teams alike. But as applications grow increasingly distributed across cloud, on-prem, and hybrid environments, monitoring them holistically can become a serious challenge.

How to Monitor .NET Applications on Linux with SolarWinds Observability | Step-by-Step Setup

This video provides a step-by-step walkthrough for configuring monitoring for .NET applications running on Linux using SolarWinds Observability. The demonstration covers the full setup process, from adding a new service to verifying the APM library connection. This guide is intended for developers, system administrators, and DevOps engineers who need to quickly and reliably instrument .NET applications on Linux for performance monitoring and observability.

Why IT Outsourcing Is Becoming a Must-Have for Modern Operations

There's a quiet shift happening inside many organizations. Not the kind that makes headlines, but one that shows up in smoother workflows, fewer emergency calls, and teams that aren't constantly scrambling to "just keep things running." Operations leaders are realizing that the technology foundation of a company is no longer something that can be handled casually or reactively. Everything - processes, productivity, customer experience, and even employee morale - leans on the stability of IT.

Prioritize errors and create tickets using Rollbar's MCP Server

Production errors can feel overwhelming. Your Rollbar dashboard is filling up with alerts, your team is scrambling to understand what needs immediate attention, and critical revenue-impacting issues might be buried among less urgent problems. In this post, we'll walk you through a workflow that transforms production error chaos into organized, prioritized action items. We'll cover everything from analyzing Rollbar errors to creating properly linked Linear tickets.

What Is a Data Pipeline

In today’s tech world, IT and security technologies are the functional equivalent of Pokemon. To gain the insights you need, you “gotta catch ‘em all” by ingesting, correlating, and analyzing as much security data as possible. Data pipelines organize chaotic information flows into structured streams, ensuring that data is reliable, processed, and ready for use.

Grafana Play updates: A redesigned homepage to celebrate our community

Grafana Play is a free, publicly accessible sandbox environment where anyone can explore and learn about Grafana, no setup or sign-in required. It comes preloaded with sample dashboards demonstrating how to connect to data sources, build visualizations, and experiment with Grafana’s advanced features. Hosted on Grafana Cloud, Grafana Play has grown significantly over the years. With thousands of public dashboards, it’s now a go-to destination for Grafana learning and exploration.

Catchpoint Peak Performance Summit 2025: Redefining Observability for the Outcome Economy

We recently hosted our first-ever Peak Performance Summit in Bangalore, India, a one-day event focused on how value-based observability drives digital business outcomes. The summit brought together customers, partners, and technology leaders to share real-world experiences, live demos, and forward-looking ideas. The message running through every session was clear: performance isn’t just about speed. It’s about measurable business results.

Top 9 Web Application Performance Monitoring Tools for 2025

You know that uneasy pause before opening your monitoring dashboard? The one where you're hoping nothing's broken—but a part of you knows something probably is. Performance issues often start quietly: a few slow endpoints, a checkout that takes longer than usual, a graph that looks a little off. Before long, those small signals turn into alerts and support tickets.

MachineGPT: Speaking the Language of Machines to Shape the Future of AI

At .conf25, we took a bold step forward by introducing the concept of MachineGPT, which brings the power of generative AI to one of the most overlooked resources: machine data. MachineGPT speaks the language of machines. Just like ChatGPT learned the grammar of words and sentences to understand questions and respond in human language, MachineGPT can learn the hidden “grammar” of how systems behave through machine data.

Sentry has a bold new look

As you may have noticed, Sentry just got a major glow-up. For too long our product looked like boring enterprise software, while our brand screamed bold and irreverent. No more. From this moment forward our product now matches the vibe you’ve come to expect from us. The result is something that’s more vibrant, more tactile, and more Sentry. Welcome to the S.C.R.A.P.S.

Introducing the New Cloud Dedicated Admin UI

InfluxDB Cloud Dedicated provides hosted and managed InfluxDB Cloud clusters in a single-tenant environment and is optimized to handle high write and query loads. Today, InfluxData is releasing a visual overhaul and new features for its Admin UI. Among the recent updates are live observability for customer clusters, overhauled site navigation, and improved visibility into table schemas.

Agentic AI and the End of Traditional IT (w/ Robb Wilson)

In a wide-ranging conversation, Robb Wilson—CEO and co-founder of OneReach.ai and author of The Age of Invisible Machines—joins Tim and Tom to explore the rise of agentic AI and its seismic implications for IT, organizations, and society. Robb breaks down the concept of agent runtimes, why conversational interfaces matter more than ever, and how adaptive, self-orchestrating systems will reshape work far beyond today’s service models.

Modernising Middleware and B2B Integration with Assurance

Modernising enterprise middleware is now a strategic necessity for cost efficiency, AI-readiness, and operational clarity. Hybrid estates of IBM MQ, Apache Kafka, and other brokers hide inefficiencies that drain profitability, but an operating model built on Assurance and Optimisation restores transparency and control. By unifying data, rebalancing workloads, and enabling safe AI autonomy, organisations can build a resilient “Confidence Economy.”

A tale of two incident responses: How our AI assistant found the root cause 3.5x faster

About two months ago, an incident at Grafana Labs was kicked off in typical fashion: A series of alerts were triggered, our on-call engineer acknowledged it on Slack, and the rest of the team quickly began hypothesizing about the potential culprit. But the way the incident was resolved was anything but typical. Yes, our internal team followed best practices to resolve the incident as quickly as possible.

The Dawn of the 10x Team

Previously, I wrote about how debugging, whether done by humans or AI powered tools, depends on context. Without it, even the most capable systems can only tell you what code is broken, but not why it broke. Now that AI can access the same depth of context developers rely on (stack traces, traces, logs, commits, and code), the way we build and operate software is changing. We’re moving from an era of monitoring to one of reasoning.

Mezmo's AI-powered Site Reliability Engineering (SRE) agent for Root Cause Analysis (RCA)

We are thrilled to announce the availability of Mezmo’s AI-powered Site Reliability Engineering (SRE) agent for Root Cause Analysis (RCA)—a truly transformative leap forward for engineering and operations teams included in your existing subscription at no additional charge. We are paving the way for a new era of observability, moving beyond passive, reactive monitoring to a world of proactive AI-driven observability.

What is Network Observability vs. Network Monitoring?

Network observability may be seen as a newer term in the world of networking, but it has become critical for managing modern distributed networks. As networks grow more complex with cloud services, remote workers, and distributed applications, traditional network monitoring approaches no longer provide sufficient visibility into network health and performance.

Synthetic Monitoring for Internal Applications: SAP, ERP & More

Modern IT teams know the story by heart: uptime dashboards look green, the public website is fast, yet somewhere inside the corporate network, the finance team can’t submit purchase orders and the factory floor’s ERP terminals are frozen. What broke isn’t the internet—it’s the internal backbone. These internal systems—SAP, Oracle, Microsoft Dynamics, homegrown ERPs, HR and payroll platforms—keep the business running.

Google Workspace outage on November 12: How StatusGator detected it first

On November 12, 2025, users around the world faced difficulty accessing Google Workspace products including Google Drive, Google Docs, Google Sheets, and Google Slides. While the outage did not impact every user, it was widespread and disruptive. StatusGator detected the incident early using real user data and issued an Early Warning Signal long before Google officially acknowledged the issue.

The Hidden Bottleneck in Latency: GetYourGuide's Database Performance Journey

Fast front-end and back-end code alone won’t guarantee low end-to-end latency as hidden bottlenecks in the database can undermine even the best engineering efforts. In this session, Oleksii Serhiienko, Senior Site Reliability Engineer at GetYourGuide, will share how his team put database performance at the center of their monitoring strategy. He will highlight how they identified and fixed slow queries, uncovered load balancing issues that drove significant cost savings, and built monitoring practices that improved both reliability and investigation workflows.

From Error to Fix: AI-Powered Debugging with Sentry and GitHub

This session will focus on the agent-based features of Sentry for debugging an issue in a web application. We'll walk through a broken issue and show how tools like Sentry Seer and the GitHub repo integration make it easy to determine the root cause by bringing the context in Sentry and the code in GitHub together, and how the Sentry MCP makes it easy to pull all that context down into GitHub Copilot to fix the issue locally.

LogicMonitor Named to CRN's 2025 Edge Computing 100: Proof That the Edge Finally Has Some Brains

Edge computing has been the buzzword of the decade. Everyone is talking about pushing intelligence closer to the edge, but most of that intelligence still needs a map and a flashlight. This week, CRN named LogicMonitor to its 2025 Edge Computing 100, recognizing companies that are actually doing something useful at the edge instead of just hyping it. We are honored. We are also a little amused.

Beyond Isolated AI: How the Selector MCP Server Connects Agents, Context, and Action

AI in network operations is evolving faster than ever. But while new models and agents are emerging almost daily, they’re often working alone, with each confined to its own context, data, and domain. One model might analyze telemetry, another handles automation scripts, and a third generates summaries or recommendations. Each model might be intelligent on its own, but without a way to share context, they end up thinking in isolation, limiting what they can achieve together.

Sysdig Team - What does good collaboration look like?

In this video, our team shares how we work together to move fast, stay aligned, and build impact across engineering, product, design, marketing, and beyond. Whether you’re part of Sysdig or just curious how high-performing teams operate, this behind-the-scenes look highlights the mindset and culture that power everything we do.

Elastic named a Leader in the IDC MarketScape: Worldwide Observability Platforms 2025 Vendor Assessment

We're proud to share that Elastic has been named a Leader in the IDC MarketScape: Worldwide Observability Platforms 2025 Vendor Assessment (doc, November 2025). We believe this recognition validates our ongoing mission: to deliver an observability platform that is open, extensible, and AI-driven to power full-stack observability that unifies operational and business data at scale, allowing SRE teams to detect and resolve problems faster.

Expanding Access, Not Risk: Using the Read-Only Role in Honeycomb Teams

Observability works best when everyone who needs visibility can get it without the risk of unintentional changes. Honeycomb’s role-based access control system helps teams strike that balance with a selection of Owner, Member, and Read-Only member roles. This control gives teams more flexibility in how they share access across their organization, helping you scale visibility safely without sacrificing control.

Bringing Observability to Data

While observability practices have evolved in recent years, they have largely focused on application services and infrastructure. Yet it is data that powers our applications, businesses, and AI models. When data issues occur, the consequences can be far reaching, from poor product experiences to billing errors to misinformed AI outcomes. In this session, Jonathan Morin, Group Product Manager at Datadog, shares real-world examples of incidents and explains how data observability can address them, helping teams detect issues earlier, reduce costly downtime, and restore trust in their data.

Announcing 1B+ Downloads & Product Development With Logs, Traces, Metrics

We’re currently at KubeCon + CloudNativeCon North America 2025 in Atlanta, and it’s a great opportunity to connect with the community and share some of the progress we’ve made this year. It’s been a busy period of development, new releases, and community engagement, all guided by our focus on delivering simple, reliable, and efficient monitoring & observability solutions.

Graylog MCP Integration: Real-Time LLM Access to Your Data

Graylog V7.0 supports integration with the Model Context Protocol (MCP), which allows large language models (LLMs) to access and interact with Graylog data and workflows in real time. Graylog exposes an MCP-compatible endpoint for LLM clients, such as Claude and LM Studio. With MCP, an LLM can connect directly to Graylog as a remote tool interface, performing queries, retrieving system information, and assisting with common administrative or investigative tasks.

How to Measure Digital Employee Experience (DEX)

Digital Employee Experience is quickly moving from an IT concern to a boardroom priority. According to Gartner, “By 2026, 50% of digital workplace leaders will have established a DEX strategy and tool, up from 30% in 2024.” Even so, enterprises can still lose up to 470,000 hours per year due to poor DEX, highlighting the need for organizations to pay close attention to the experience of their employees. However, implementing a DEX tool alone isn’t enough.

ignio AI Agent for IT Event Management | AI Agent for alert noise reduction

Discover how ignio’s AI-powered agents are transforming IT event and alert management by combining Agentic AI, AI/ML algorithms and automation. In this video, we introduce ignio AI Agent for IT Event Management, a purpose-built, autonomous agent designed to reduce alert noise, group related alerts and predict future events. Whether you’re managing a large-scale enterprise infrastructure, cloud-native environment, or hybrid IT setup, this AI agent empowers your SRE and IT operations (ITOps) teams with real-time observability, automated alert correlation and suppression, and predictive intelligence.

Unleashing Progress Flowmon 13: Speed, Smarts and Security Redefined

At Progress, we continue to develop and enhance the Progress Flowmon product family. The latest update brings the core Flowmon product to release 13.0, and it includes remarkable performance improvements, strengthened security and expanded protocol support. Full details of what’s new and improved in the latest release are available on the Flowmon product page. In this blog, we’re excited to highlight the newest features and improvements to the Flowmon solution.

The High Stakes of Aerospace Reliability

Aerospace systems operate in one of the most unforgiving environments imaginable. Each flight test, orbital maneuver, or satellite transmission subjects avionics, propulsion systems, sensors, and telemetry hardware to extreme conditions. Even a minor failure can cascade into grounded aircraft, interrupted communications, or compromised missions.

Customer panel: Transforming IT & security

In an era where telemetry data grows at a 28% compound rate while budgets remain flat, traditional IT and Security approaches are facing unprecedented pressure. Join our distinguished customer panel as they share their transformative journeys with Cribl's data engine solutions. Our panelists will discuss how Cribl's vendor-neutral portfolio has enabled them to regain control over their data infrastructure, achieving both immediate operational improvements and strategic long-term advantages.

APM vs Observability: What comes next?

Remember how I said that blog was going to be my last entry on the topic of "APM vs Observability?" Well, it turns out I had a little more to say. I'd like to spend a few moments talking about the future of APM and Observability. I think it comes down to two major initiatives: AI and OpenTelemetry. (NOTE: in this section, I'm using the word "observability" to refer to the discipline of monitoring and observability as a whole, rather than any specific tool, technique, or vendor-based solution.)

The Seven Wastes of Network Operations

Does it ever feel like your network operations team is constantly running, yet always struggling to keep up? The ticket queues are long, troubleshooting is a complex detective story, and every new application deployment adds another layer of anxiety. This constant state of reactive firefighting isn't a sign of a bad team; it's the symptom of a broken process. This operational friction, the invisible tax on every action your team takes, has a name: waste.

Your NOC's Most Important New Skill? Ignoring Things

I want to challenge a deeply held belief in our industry, one that I once championed myself: the idea that more data is the answer. We've spent a fortune building vast data lakes of network telemetry, believing that if we could just collect everything, we would achieve a state of operational nirvana.

Introducing the Splunk Technology Add-on for Ollama: Illuminating Shadow AI Deployments

Without strong visibility and governance, local LLMs risk replicating the fragmented, unsupervised sprawl once seen in shadow IT, complicating security postures and making it difficult for organizations to ensure proper oversight and compliance as these powerful AI tools become embedded in daily workflows. To address this challenge, the Splunk Threat Research Team has released the Splunk Technology Add-on for Ollama, which provides comprehensive monitoring and observability capabilities specifically designed for local LLM deployments.

OpenTelemetry Java Agent for Spring Boot: Complete Setup Guide

The OpenTelemetry Java Agent provides zero-code instrumentation for Spring Boot applications through bytecode manipulation. This guide covers setup, configuration, auto-instrumentation capabilities, and production deployment strategies for implementing distributed tracing and observability.

Understand, diagnose, and optimize SQL queries: Introducing Grafana Cloud Database Observability

It’s widely acknowledged that most application performance problems stem not from the application itself, but from the underlying database. Slow or inefficient database queries are often the primary cause of these issues, acting as the biggest driver of application performance incidents. If you’ve been troubleshooting slow API calls or sluggish services, chances are the root cause resides within your database layer.

Build Your Kubernetes Monitoring Foundation with kube-prometheus-stack

When you run Kubernetes at scale, one of the first challenges is understanding what the cluster is actually doing. Workloads shift around, pods restart for normal reasons, and traffic doesn't always follow the patterns you expect. Having clear signals makes day-to-day operations much easier. That's where kube-prometheus-stack helps. It brings Prometheus, Grafana, Alertmanager, and supporting components together as a single package.

Network Monitoring vs. Network Observability: What Do You Need?

A decade ago, network monitoring was straightforward. You had a data center, some branch offices, MPLS circuits connecting everything, and a handful of applications running on-premises. Set some SNMP thresholds, configure a few alerts, and you were covered. When something broke, the problem was usually obvious: a failed switch, a saturated link, a misconfigured router. Today's networks bear zero resemblance to that world.

OpManager streamlined IT for Detroit Wayne Integrated Health Network

When Detroit Wayne Integrated Health Network needed reliability at every heartbeat, they turned to ManageEngine OpManager. From chaos to clarity, OpManager unified their IT, reduced downtime, and powered faster, smarter care delivery. Discover how you can do the same.

How OpenTelemetry can enhance observability in distributed systems: Practical examples

Observability has become one of the fundamental elements of performance and reliability as modern applications move toward cloud-native architectures, microservices, and multi-cloud. Traditional monitoring techniques often fall short in such dynamic, distributed environments. That’s where OpenTelemetry (OTel), an open-source observability framework, comes into the picture.

What Is Synthetic Monitoring?

Synthetic Monitoring is a proactive approach to testing a website or web server to ensure that digital services stay available, responsive, and functional at all times. Instead of waiting for real users to encounter a problem, synthetic monitoring uses automated scripts to imitate user interaction, such as visiting pages, submitting forms, or performing transactions from multiple global locations.

OTel Updates: OpenTelemetry eBPF Instrumentation (OBI) Hits Alpha

Some parts of a system don’t lend themselves to quick instrumentation changes. You might have a production binary that hasn’t been rebuilt in years, or a stack made of several languages where each team manages telemetry differently. In those situations, getting consistent signals often means touching code you’d rather leave alone or coordinating updates across many services. OpenTelemetry eBPF Instrumentation (OBI) approaches this from the kernel side.

If it Wanted to, it Would: The Bitter Lesson for LLM Users

There’s a viral saying folks use about flaky crushes, spouses, and forgetful friends: "if he wanted to, he would." The idea is straightforward: when someone cares, they make the effort. As it turns out, the same principle applies surprisingly well to AI. Systems, like people, have things they "want" to do. Each model has patterns of reasoning and synthesis it performs naturally.

What is Active Telemetry

Active Telemetry is the evolution in how organizations collect, process, and use observability data. In traditional observability, telemetry is passive: systems emit logs, metrics, and traces that are stored and visualized after the fact. This model worked when systems were simpler and changes were predictable. But in today’s world with distributed microservices, Kubernetes, and AI workloads, passive telemetry can’t keep up. Active Telemetry changes that.

Conquer Complexity, Accelerate Resolution with the AI Troubleshooting Agent in Splunk Observability Cloud

The digital landscape has transformed dramatically, and with it, the demands on our systems have grown exponentially. Traditional monitoring tools struggle to provide sufficient insight into complex, distributed, cloud-native environments. Observability is the answer, moving beyond merely knowing "what" is happening to understanding "why" it's happening, and its impact on user experience and business outcomes.

Redgate Software recognized as a Strong Performer in Gartner Peer Insights Voice of the Customer for Infrastructure Monitoring Software

We’re thrilled to share that Redgate Software has been recognized as a ‘Strong Performer’ in the 2025 Gartner Peer Insights Voice of the Customer for Infrastructure Monitoring Tools category with our Redgate Monitor solution. We believe this recognition is a reflection of the trust and feedback from the people who matter most: our customers.

Top DevOps Challenges in 2025 and How APM Solves Them

In 2025, DevOps continues to grow and change quickly, helping teams deliver software faster and more securely. But as systems become more complex with microservices, cloud platforms, and AI-driven tools, new challenges arise. Teams now need to balance speed with security, manage too many tools, control rising cloud costs, and still maintain high-quality software. This is where Application Performance Monitoring (APM) becomes essential.

Cisco & Auvik: Total Visibility and Control for Your Network with Auvik

Managing modern networks is complicated, and it’s easy for critical Cisco gear to quietly hit End-of-Sale (EOS) or Last Date of Support (LDOS) without anyone noticing. That can open the door to serious risks, technical debt, and compliance issues. Manual tracking and scattered tools just can’t keep up anymore. Watch this video to see how to stay ahead: Save Money and Reduce Headaches: Lower costs and tackle technical debt with smarter lifecycle management for your Cisco hardware.

Use Grok parsing to extract fields from logs | Datadog Tips & Tricks

When your logs don’t follow a standard format, it can be difficult to extract valuable information, like key-value pairs and nested JSON objects. Grok parsing lets you define flexible patterns that match unstructured log data so you can extract specific fields to query, filter, and visualize. In this video, you’ll learn how to: By refining your Grok parsers, you can make your logs more useful for analytics, dashboards, or alerts, and get even more value from your logs.
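As a rough illustration of the idea behind pattern-based field extraction, the sketch below uses plain Python regular expressions with named groups. It is not Datadog's Grok syntax, and the log layout, field names, and the `parse_line` helper are all hypothetical, chosen only to show how one pattern can turn an unstructured line into queryable fields.

```python
import re
from typing import Dict, Optional

# Hypothetical log layout: "<timestamp> <LEVEL> [<service>] <free text with key=value pairs>"
LINE_RE = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<service>[^\]]+)\]\s+"
    r"(?P<message>.*)"
)

# Key-value pairs embedded anywhere in the message, e.g. "user_id=42".
KV_RE = re.compile(r"(?P<key>\w+)=(?P<value>\S+)")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the extracted fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Promote key=value pairs to top-level fields without
    # overwriting the fields captured by the main pattern.
    for kv in KV_RE.finditer(fields["message"]):
        fields.setdefault(kv.group("key"), kv.group("value"))
    return fields


if __name__ == "__main__":
    sample = "2025-11-10T12:00:00Z ERROR [checkout] payment failed user_id=42 latency_ms=310"
    print(parse_line(sample))
```

Lines that do not fit the pattern return `None` rather than raising, so unmatched logs can be counted or routed to a fallback parser.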

Pastries with SREs: No compromises on cost-effective observability or donuts.

In this episode of Pastries and SREs, we dig into how vendor lock-in and sky-high observability costs are forcing teams to choose between coverage and budget, AND why you shouldn’t have to settle. With donuts in hand, we explore how to take back control of your observability strategy by making it cost-effective, comprehensive, and flexible.

Investigating SIEM Incidents with Logz.io

A short demo showing how Logz.io, powered by the AI Agent, helps investigate security incidents by analyzing and correlating data. The AI Agent uses natural language to query and correlate SIEM questions with related logs, detect anomalies and highlight unusual activity, summarize findings to speed up root cause analysis, and provide recommended actions. This video demonstrates a practical SIEM use case for the AI Agent inside Logz.io.

Performance testing best practices: How to prepare for peak demand with Grafana Cloud k6

For many organizations, periods of high customer activity are anything but relaxing. Events like Black Friday, product launches, or major sales can put intense strain on the software and infrastructure systems that support a company’s web applications. Without proactive performance testing, these moments can quickly turn into poor user experiences and lost revenue.

Customer Corner: Driving Innovation at Scale with Kyle Hill, CTO, ANS Group

At LogicMonitor’s Senior Leadership Team Offsite in July, I sat down for a candid conversation with Kyle Hill, CTO of ANS Group. As a longtime LogicMonitor customer and leader of a 700+ person tech powerhouse, Kyle offered sharp insights into scaling infrastructure, unlocking AI-driven value, and what true partnership looks like in today’s MSP world. Here’s an edited and condensed version of our conversation.

How to Manage Grafana Access Groups for Team Control

Managing team access in Grafana can be tricky—especially as your organization grows. That’s where Grafana access groups (also known as Limited Access Groups in Hosted Graphite) come in. They allow you to define groups of dashboards and restrict which team members can access them. If you’re using Hosted Graphite with Grafana dashboards, this feature helps you organize teams, maintain data privacy, and simplify access control—all while giving users just the permissions they need.

Observabili-Mystery Solved: From Clues to Answers in 3, 2, 1...

Observability doesn’t have to be a mystery. Join SolarWinds Tech Evangelist Chrystal Taylor and THWACK MVP Jez Marsh, Owner of Silver Back Systems, as they crack the code on turning noisy data into actionable insights. Part of the THWACKcamp lineup from SolarWinds Day, in this session you’ll learn how to analyze raw logs and metrics to uncover trends, catch issues early, and make smarter, faster decisions. Discover practical techniques for linking cloud and on-premises data, reducing false alerts, and automating repetitive tasks with tools like Custom Properties.

Reality Bytes: Beyond the Shiny DEX Tool (w/ Monica Filak)

Tim, Tom, and Oriana sit down with Monica Filak, Director of Customer Success at Nexthink, to explore what it really takes to turn DEX from a tool into a transformative methodology. Monica shares how her team helps organizations move beyond technology to rethink their people, processes, and communication — bridging the long-standing divide between IT and the business. From creative internal branding to AI-driven efficiency gains, she explains how companies can evolve from “shiny tool” thinking to achieving measurable, human-centered value.

Sync your Backstage catalog with Datadog IDP

Backstage is a popular open source framework for building internal developer portals (IDPs) used by organizations to aggregate service metadata and create a single source of truth for their software developers. However, data stored in the Backstage Software Catalog can quickly become siloed and inaccessible from monitoring tools such as Datadog.

How to Monitor AI Agents in Commerce Systems

Artificial intelligence (AI) isn’t just writing text or generating images anymore. It’s starting to make real-world decisions. Now, with agentic systems, we’re entering an era where AI models don’t just respond; they act autonomously, buying, booking, and negotiating on behalf of users. That may sound promising, but those of us in the trenches of reliability know that progress always comes with trade-offs. Make no mistake, this shift fundamentally changes how observability works.

Getting Started with InfluxDB 3 Core: From Installation to First Query in 10 Minutes

Getting started with any database technology can be daunting, and nothing is ever as easy as a snap of the fingers. With InfluxDB 3, we’ve made it as painless as possible. If you want to do some testing, development, or exploration, you’ve read the title: you should be up and running in under 10 minutes with very little hassle.

Messaging Infrastructure Is Still in the Dark: The Observability Illusion Costing Millions

In today’s always-on digital world, even the best messaging platforms—like Apache Kafka and Apache ActiveMQ—can become blind spots that undermine resilience. This article exposes the “observability illusion” many organizations face, showing how limited visibility and manual processes lead to outages, high costs, and constant firefighting. Learn how meshIQ transforms reactive operations into proactive engineering through unified observability, automation, and self-service.

Introducing Network Destinations: ICMP Monitoring for Any IP

For those who don't know Obkio, we're a synthetic Network Performance Monitoring, Troubleshooting and Diagnostics platform. We help network teams identify, diagnose, and resolve performance issues across distributed networks, from remote offices to cloud applications. For years, we've focused on what we do best: agent-to-agent performance monitoring.

Making Observability AI-Native with the Logz.io MCP Server

Now available: Secure, real-time access to your observability data via Logz.io’s Model Context Protocol (MCP) Server. The Logz.io MCP Server brings your logs, metrics, and telemetry data into the Model Context Protocol (MCP), an emerging open standard that lets AI systems query real data securely and contextually, in real time. That means any MCP-compatible LLM, like Claude Desktop, Cursor, your own AI agent… can now connect directly to your Logz.io environment.

VoIP Jitter Survival Guide: How to Diagnose, Monitor & Troubleshoot

VoIP jitter is the variation in packet arrival time during voice calls, measured in milliseconds. When voice packets travel across your network at inconsistent intervals, some arriving faster and others slower, you experience jitter. Acceptable jitter for VoIP is 30 milliseconds or less. Above this threshold, you'll notice choppy audio, robotic voices, delays, and call drops that disrupt business communication.
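To make the definition concrete, here is a minimal sketch of one common way to estimate jitter: the smoothed interarrival jitter estimator from RFC 3550 (the RTP specification). The excerpt does not say which formula the article uses, so treat this as an assumption. The function names and the `(send_ms, recv_ms)` input format are hypothetical; the 30 ms threshold is the one quoted above.

```python
from typing import Iterable, Tuple

ACCEPTABLE_JITTER_MS = 30.0  # threshold quoted in the excerpt above


def interarrival_jitter_ms(packets: Iterable[Tuple[float, float]]) -> float:
    """Estimate jitter from (send_ms, recv_ms) pairs given in arrival order.

    Follows the RFC 3550 running estimate: for each consecutive pair of
    packets, D is the change in transit time, and J += (|D| - J) / 16.
    """
    jitter = 0.0
    prev_transit = None
    for send_ms, recv_ms in packets:
        transit = recv_ms - send_ms
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


def is_acceptable(jitter_ms: float, threshold_ms: float = ACCEPTABLE_JITTER_MS) -> bool:
    """True when the estimate is at or below the threshold."""
    return jitter_ms <= threshold_ms


if __name__ == "__main__":
    # Packets sent every 20 ms; the third one is delayed by an extra 16 ms.
    samples = [(0.0, 50.0), (20.0, 70.0), (40.0, 106.0)]
    estimate = interarrival_jitter_ms(samples)
    print(f"jitter={estimate:.2f} ms acceptable={is_acceptable(estimate)}")
```

Because the estimate is smoothed with a gain of 1/16, a single late packet moves it only slightly; sustained variation is what pushes it toward the threshold.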

AFK: The Only Status You Need When AFW - SolarWinds TechPod 103

What does real work-life balance look like in tech? In this honest and funny conversation, Sean Sebring and Chrystal Taylor share stories from their sabbaticals, burnout moments, and how they learned to unplug, for real this time. Topics we get into: why burnout hits tech pros harder than most; the art of saying “no” and setting healthy boundaries; how to stop chasing “perfect” and embrace good enough; why taking time off sets a better example for your team; and reframing success from “always on” to sustainably productive.

SIEM Migration in 68 Days

In this session, we will discuss how the University of Pittsburgh was able to modernize their data processing strategy, migrate to a new SIEM solution, and avoid ballooning SIEM costs all within 68 days from the first install of a Cribl product. We will showcase how we were able to use Cribl's software to easily handle the following scenarios: 100% agent replacement and consolidation using Cribl Stream Workers and Edge.

Improve Observability in Your CI/CD Pipeline

The backbone of modern software development is automation and at the heart of that lies the CI/CD pipeline. It’s what turns code into deployable software, delivering changes to users faster, safer, and more predictably. In simple terms, a CI/CD pipeline automates everything from the moment developers push code to when it reaches production. It integrates, tests, builds, and deploys software continuously ensuring faster releases with fewer human errors.

What the RFC?! Making sense of syslog before you migrate

Syslog: it's everywhere, it’s ancient, and let’s be honest — it rarely shows up the way the RFC says it should. Before you cut over to Cribl Stream, it pays to understand exactly what you're dealing with and why it matters. In this talk, we’ll demystify the syslog format (yes, the actual RFC 3164 and 5424 stuff), look at what happens when data goes rogue, and explore how Cribl can help bring order to the chaos.
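As a small illustration of what both RFCs have in common, the sketch below decodes the leading `<PRI>` value, which RFC 3164 and RFC 5424 both define as facility × 8 + severity. The check for a version number directly after the PRI is only a heuristic for telling the two formats apart, and the `parse_pri` helper and `SyslogHeader` type are hypothetical names rather than anything from Cribl.

```python
import re
from typing import NamedTuple, Optional

# "<PRI>" followed, in RFC 5424, by a version number and a space (e.g. "<34>1 ").
PRI_RE = re.compile(r"^<(\d{1,3})>([1-9]\d{0,2} )?")


class SyslogHeader(NamedTuple):
    facility: int
    severity: int
    looks_like_5424: bool  # heuristic only: a version field follows the PRI


def parse_pri(line: str) -> Optional[SyslogHeader]:
    """Decode the PRI field, or return None when it is missing or out of range."""
    match = PRI_RE.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    return SyslogHeader(
        facility=pri // 8,
        severity=pri % 8,
        looks_like_5424=match.group(2) is not None,
    )


if __name__ == "__main__":
    print(parse_pri("<34>1 2003-10-11T22:14:15.003Z host su - ID47 - failed"))
    print(parse_pri("<34>Oct 11 22:14:15 mymachine su: failed"))
    print(parse_pri("Oct 11 22:14:15 mymachine su: no PRI at all"))
```

Returning `None` for lines with a missing or invalid PRI makes it easy to count how much traffic does not show up the way the RFC says it should before deciding how to handle it.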

The Modern SOC: Transforming security operations with AI and automation

Security teams are dealing with massive data growth, siloed tools, and constant alert fatigue. All of this makes it harder to detect and respond to threats. AI has become a key part of the solution, but its effectiveness depends on having access to complete, high-quality data. In this session, Palo Alto Networks and Deloitte will explore how AI and automation are redefining the modern Security Operations Center (SOC). Learn how leading organizations are leveraging intelligent workflows, automated threat detection, and machine learning to accelerate response times, reduce analyst fatigue, and strengthen overall security posture.

5 Ways to Strengthen IT Governance Through Better AI Visibility

AI is transforming businesses fast, but most organizations are diving in without a clear view of what's actually running in their systems. That lack of visibility is more than a small oversight; it's a ticking time bomb. When you don't know which AI tools are active, it's nearly impossible to protect sensitive data, stay compliant, or manage costs effectively.

IBM TechXchange 2025 Takeaways: Key Insights for IT Leaders

This year at IBM TechXchange 2025, we had the privilege of not only attending but also sponsoring the event and hosting a booth in the expansive expo hall. From the moment we arrived, it was clear: IBM’s ecosystem is thriving once again. Between the buzz of innovation, the depth of technical sessions, and the sheer energy of the crowd, TechXchange 2025 stood out as one of the most impactful IBM events in recent memory.

Eliminate unnecessary costs in your Amazon S3 buckets with Datadog Storage Management

Cloud object storage powers a wide range of workloads, from AI training datasets to customer-facing media libraries. As your data grows into the petabyte scale, managing storage costs and ensuring reliability requires fine-grained visibility. You need answers to questions like: Which specific teams, services, workloads, or datasets are driving spend? Which data is cold and should be archived? What fixes will have the biggest impact on cost and performance?

The Architecture of Automation: Why IT Doesn't Lie

Let’s start with something most people get wrong. Automation isn’t magic. It’s math. It does exactly what it’s told. Nothing more, nothing less. Every action, every response, every output is a reflection of truth in motion. And that’s where value actually begins. Most organizations still treat automation like a shortcut: a way to go faster, to handle more alerts, to “keep up.” But speed isn’t the value. Truth is.

OpenTelemetry Metrics in Quarkus Explained

When you run services on Quarkus, you need a steady stream of signals to understand how the application behaves—CPU trends, request timings, memory patterns, and how each endpoint responds under load. Metrics give you that visibility. They help answer questions like: OpenTelemetry fits well here because it gives Quarkus a common way to generate and export metrics without locking you into a specific monitoring tool.

The four pillars holding up your digital business, and what happens when they crumble

When we published the first Internet Resilience Report in 2024, the world was still reeling from the CrowdStrike outage that left airlines grounded and financial institutions scrambling. A year later, the stakes are even higher. The 2025 edition confirms what many of us already feel every day in IT Operations: resilience is no longer about uptime alone. It’s about protecting revenue, customer trust, and digital performance at scale.

From Crashes to Clarity: What's New in Percepio Detect 2025.2

Think of Percepio Detect as a security camera for your firmware—always monitoring, but only storing data when something unusual happens, such as crashes or performance anomalies. By providing rich debugging information when needed while keeping the overall data volume to a minimum, Detect enables continuous observability over unlimited time, even on resource-constrained devices such as 32-bit microcontrollers.

From Middleware to Mission Control: How Transaction Visibility Turns into Operational Intelligence

In today’s digital enterprises, middleware isn’t just infrastructure—it’s the heartbeat of every mission-critical operation. Yet too often, it operates as an invisible black box. This article explores how organizations can transform middleware management into true Operational Intelligence—gaining complete transaction visibility, intelligent automation, and unified governance.

Rollbar + Vercel built for how you ship

Vercel helps you ship fast. We help you ship safe with code‑first observability that connects errors to the code and deploys behind them. Together you get speed with clear insight into what is running in production. Today we’re launching our native integration in Vercel’s Observability category so you can connect Rollbar to your Vercel projects in minutes, map environments cleanly, and track deployments from day one.

MCP found a thankless bug faster than us, and it was actually fun

Once, when I was a very junior developer, I was discussing a bug with a very senior developer (let's call him Burt). Satisfied with the fix, I said something like "oh, that was a great bug". He looked at me as if his eyes were going to fall out of his head. Clearly, this enraged him. He briefly went off about how there are no great bugs, there are only bugs to squash – and that’s all.

Observability and FedRAMP in Action: The VA's Mission to Deliver Reliable Digital Service

Ensuring digital services remain accessible, reliable, and secure is a high priority for any organization operating at scale. For the Department of Veterans Affairs (VA), this focus is central to its mission of providing quality care to veterans, their families, and caregivers. Often described as “the largest IT shop in the United States,” the VA manages 2.7 million pieces of equipment across a vast network of interconnected systems.

How feedback loops power progressive software delivery

Modern engineering teams face competing priorities. Developers are expected to deliver new features faster than ever, but users expect rock-solid reliability with every release. Shipping quickly can feel like you’re gambling with user trust. If you move too fast, you risk outages, but if you move too slowly, innovation stalls.

Splunk Developer Program

A short video that introduces the Splunk Developer Program, highlights the end-to-end support and tooling it offers, and showcases how developers can build, test, and grow impactful apps with confidence. The video will follow the journey of a first-time app builder who discovers the program, uses its resources, and becomes an active, recognized contributor in the Splunk community.

Why the Gaming Industry Needs Application Performance Monitoring (APM)

Performance defines player experience. When a game lags, crashes, or delays inputs, players lose patience. In competitive and live-service titles, even a few hundred milliseconds can decide whether someone keeps playing or uninstalls for good. Modern games rely on complex ecosystems built on cloud servers, microservices, and real-time data synchronization. Millions of concurrent players generate massive workloads that test the limits of any infrastructure.

Detecting an AWS Outage and DR Lessons

A few weeks ago, on 20th October 2025, AWS suffered a widespread outage in its US-EAST-1 region that affected a large number of customers globally. More than 1,000 apps and websites were impacted including major banks and popular games, streaming and social platforms such as WhatsApp, Snapchat, Fortnite and Pokémon Go.

Network Destinations: ICMP Monitoring Feature Highlight - Obkio

Introducing Network Destinations, Obkio's all-new ICMP Monitoring feature. Quick question: are you paying for a second tool just to ping a few IPs? We heard that a lot. So we fixed it. Now, here's the thing about Obkio. We've always been laser-focused on agent-to-agent performance monitoring. Our secret? Distributed monitoring agents. Deploy them at your sites and over the Internet, and you get complete visibility between them, all in the app.

Generation AI (Episode 5): How generative AI Is shaping the future of the marketing technology stack

The next golden age of artificial intelligence has arrived, but the path forward is far from certain. Technology leaders are presented with a tremendous opportunity to revolutionize their business, provided they can find a way to tap into the full potential of their organization's data. In Episode 5 of Elastic's new limited series, Generation AI, marketing and IT leaders share how they believe AI will shape the future of marketing technology and workflows.

Unify Observability, Surface Business Impact, and Solve Problems Using AI Agents with Latest Splunk Observability Innovations

In September at .conf25, we announced how Splunk is shaping the future of digital resilience in the age of AI. Agentic AI is rewriting what it takes to build a leading observability practice. As vibe coding gains steam, applications will be built with less human involvement. At the same time, the rise of AI agents demands specialized telemetry to ensure models are performing as intended, aligned to their business purpose and cost.

Splunk Advances the OpenTelemetry Project with Its Latest Donation, the OpenTelemetry Injector

Splunk is very excited to be sponsoring KubeCon North America once again, kicking off this week in Atlanta, GA. As many know, Splunk is one of the top contributors to the OpenTelemetry project. We’re happy to have sent many of the Splunkers who serve as project maintainers and contributors to lead SIG meetings and engage with the greater community in the OpenTelemetry Observatory, sponsored by Splunk.

Building the Next Generation of Defenders: From the Classroom to the SOC of the Future

Singapore’s digital economy is growing at a remarkable pace, but with that growth comes a challenge: the nation is on track to need more than a million additional digitally skilled workers by 2026, particularly in cybersecurity, data, and AI. This is not just about filling jobs — it’s about ensuring the country’s long-term digital resilience.

How Smart Robots Work: AI Perception, Planning & Execution Explained

Imagine a future where machines not only perform physical tasks but also learn, adapt, and make intelligent decisions in dynamic environments. This future is rapidly becoming a reality with the advent of smart robots, poised to revolutionize industries from manufacturing to healthcare. In this article, we'll delve into smart robots: what makes these intelligent machines 'smart', how they perform tasks, and how they are reshaping the operational landscape.

From Messaging Burden to Business Assurance: Rethinking MQ, Apache Kafka, Apache ActiveMQ, and RabbitMQ

Every enterprise depends on messaging and streaming platforms to keep transactions flowing, from purchase orders and invoices, to payments and claims, to the events that trigger customer experiences in real time. And yet, the very systems meant to assure reliability often create the opposite effect: cost, complexity, and blind spots that silently drain profit.

Salesforce API Monitoring: Synthetic Tests That Catch Failures

Salesforce APIs sit quietly behind countless customer interactions. They connect CRMs to billing, sync leads to marketing, and power dashboards that executives depend on daily. Yet when one of those APIs slows down or breaks, it often happens without alarms. Dashboards still load, integrations keep attempting retries, and somewhere data silently stops flowing. That’s the danger of invisible API failure—by the time someone notices, the damage has already been done.

Microsoft Teams Monitoring to Troubleshoot & Optimize Performance

Microsoft Teams has become the collaboration backbone for businesses worldwide. But when call quality drops during a crucial client meeting or files won't upload before a deadline, productivity grinds to a halt. The frustration? These issues are usually preventable. The challenge with Teams is that problems can originate from multiple sources: user devices, network infrastructure, or Microsoft's platform itself.
Sponsored Post

Preparing for cloud failures: Monitoring strategies for distributed hybrid infrastructure

When AWS experienced its recent outage, the ripple effect was immediate. Critical workloads slowed, dashboards went blank, and many teams realized multi-cloud isn't automatically resilient. Cloud-level failures are inevitable because of interdependent components and complex IT architecture. The recent AWS disruption reminded many teams that the cloud isn't a magic uptime guarantee. Even the most mature providers can, and do, experience large-scale service interruptions.

Best 35+ Black Friday and Cyber Monday Software and SaaS Deals in 2025

The biggest shopping days of the year are coming up fast, and SaaS vendors are launching their most exciting discounts yet. Together with our SaaS partners, StatusGator has rounded up the best Black Friday and Cyber Monday deals you won’t want to miss. Are you a software provider offering a deal? Share it with us by filling out this form!

200 EPISODE ANNIVERSARY SPECIAL! PEDRO BADOS RETURNS!

It’s a milestone moment! Our 200th episode of The DEX Show (and the first ever late one — sorry about that!). To mark the occasion, Nexthink Founder and CEO Pedro Bados returns to reflect on Nexthink’s incredible journey and discuss the company’s next era — from the recent investment by Vista Equity Partners to the accelerating fusion of DEX and AI. Pedro shares his perspective on how AI is reshaping the workplace, Nexthink’s vision for “an IT agent for every employee,” and why he’s optimistic about the future of technology and innovation. A landmark conversation to celebrate our big birthday.

AI Agents Observability with OpenTelemetry and the VictoriaMetrics Stack

AI agents are becoming increasingly popular and are often deployed as part of production systems. However, this rapid adoption brings unique observability challenges that require flexible solutions. On the one hand, AI agents are fundamentally just like any other software services that produce the same classic observability signals we’re familiar with: metrics, logs, and traces.

Streamline Incident Management with the New Netdata-ServiceNow Integration

When a critical alert fires at 2 AM, the last thing your on-call engineer should be doing is manual administrative work. Yet, for many teams, that’s exactly what happens. You see the alert in your monitoring tool, then you have to switch contexts, open a new browser tab, log into your ITSM platform, and manually create an incident—all while your systems are failing.

Show Me the AI: Rethinking How AI Fits Into Network Operations

Over the last couple of years, nearly every network and infrastructure observability platform has added the word “AI” to its messaging. Some have introduced helpful capabilities. Others have simply added a chatbot on top of the same dashboards that have existed for a decade. In many ways, the term has started to lose meaning. But inside network operations, the conversation hasn’t disappeared. It has simply become more blunt.

Service Observability, Service Operations and Service Orchestration: Unifying Visibility and Action Across the Enterprise

For large enterprises, the health and resilience of Business Services define customer experience and business reputation. Yet as technology estates grow in complexity, fragmented toolsets and siloed teams make it difficult to maintain service availability and prevent incidents before they impact the business and ultimately, customers.

Fix an error in Copilot without leaving your IDE

Production errors are every developer's nightmare. You're enjoying your coffee when suddenly alerts start firing - users are experiencing crashes, and you need to find and fix the issue fast. In this video, we'll walk you through how to use AI to diagnose and fix critical errors in an application using Rollbar's MCP (Model Context Protocol) server.

How Auvik Helps MSPs Eliminate Network Alert Fatigue

When alerts come in hot and fast, alert fatigue can quickly set in, overwhelming you with the volume and becoming one of the biggest operational problems for MSPs. Not knowing what to handle first and prioritize in a long list of alerts puts a strain on one of the most valuable resources you have: focus. When your technicians are constantly switching contexts and sifting through a flood of low-priority alerts, it’s asking a lot of them to stay sharp. That constant mental juggling takes a toll.

What is APM? Understanding application performance monitoring

The rapid advancement of technology has revolutionised the way businesses operate and engage with their customers. A delay of even a few seconds can lead to significant drop-offs in engagement and conversions. According to Google's findings, "just a 100-millisecond lag can reduce revenue by 1%, and a half-second delay can cause a 20% drop in search engine traffic".

Top tips for staying focused in a notification-heavy world

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we’re tackling one of modern work’s biggest challenges: staying focused in a world overflowing with notifications. Focus has become an uncommon ability in today's hyper-connected world. Every ping, pop-up, and alert demands our attention, pulling us away from focused work and substantive thinking.

Atatus 2025 Highlights: G2 Wins and Product Milestones

As we approach 2026, we’re taking a moment at Atatus to reflect on a year that pushed us forward in every way. 2025 was about raising the bar by expanding integrations, deepening data insights, broadening language support, and rolling out new capabilities that empower teams to see more and do more. Most importantly, the response from our customers and community made it clear that the work we’re doing is making a real difference.

How to Use MetricFire Logging: Visualize Logs & Metrics Together in Grafana

Want full visibility into your systems? In this step-by-step tutorial, we show you how to use Grafana Loki with Promtail on Hosted Graphite by MetricFire to stream logs alongside your metrics, all visualized in Grafana dashboards. No more toggling between tools: get the full observability stack in one place.

Choosing the Right Load Balancing Approach for Your Network: Static, Dynamic, & Advanced Techniques

Load balancing is the process of distributing network traffic among multiple server resources. The objective of load balancing is to optimize certain network operations. By spreading a workload evenly among the computing resources, this “balanced load” improves application responsiveness and accommodates unexpected traffic spikes, all without compromising application performance. Let’s take a deeper look at this important networking function.
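
To make the static and dynamic approaches named in the title more concrete, here is a minimal Python sketch, not taken from the article itself. It assumes round robin as the example static strategy and least connections as the example dynamic one; the class names and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Static strategy: hand out servers in a fixed rotation,
    ignoring how busy each one currently is."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self):
        return next(self._rotation)


class LeastConnectionsBalancer:
    """Dynamic strategy: pick the server with the fewest open connections."""

    def __init__(self, servers):
        # Track open connections per server (all start at zero).
        self.active = {server: 0 for server in servers}

    def pick(self):
        # min() returns the first server with the lowest count.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a connection to this server closes.
        self.active[server] -= 1


servers = ["app-1", "app-2", "app-3"]  # hypothetical server names

rr = RoundRobinBalancer(servers)
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnectionsBalancer(servers)
lc_picks = [lc.pick() for _ in range(3)]
lc.release("app-2")   # app-2 finishes its request first
lc_next = lc.pick()   # so the next request goes back to app-2
```

The round-robin balancer never looks at server state, which keeps it simple but lets a slow server accumulate work. The least-connections balancer reacts to current load, at the cost of having to track when connections open and close.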

Bits AI SRE, Flex Frozen, and GPU Monitoring | DASH 2025

Get a first look at Datadog’s biggest product reveals from DASH 2025. Meet Bits AI SRE, your 24/7 autonomous AI Site Reliability Engineer, Flex Frozen for up to 7 years of managed log retention, and GPU Monitoring for full visibility into your AI workloads. Experience the future of observability in action.

When payments pause: lessons from a global payments outage

In digital commerce, payment reliability is non-negotiable. The rise of instant payments highlights this need: global instant payment transaction volume reached 195 billion in 2022, with projections to surpass 500 billion transactions by 2027 as more countries adopt faster payment systems. This growing reliance on real-time payment rails raises the stakes for reliability, with any disruption posing major risks to trust and revenue.

Why Email Blacklist Monitoring Matters?

Email deliverability determines whether your messages reach inboxes or disappear without notice. When your domain or mail server appears on a blacklist, communication stops instantly, affecting customers, partners, and revenue. Blacklisting can happen silently, even to legitimate senders. Continuous email blacklist monitoring ensures that issues are detected early, keeping your reputation strong and your communication uninterrupted.

How to Visualize Time Series Data with InfluxDB 3 & Apache Superset

Learn how to visualize time series data from InfluxDB 3 Core using the popular open source tool Apache Superset. This tutorial walks you through setting up both systems with Docker, writing sample IoT data, and creating your first visualization. For more information about Apache Superset, this article may be helpful.

Connecting the dots: Solving IT asset visibility with Dataprime

In large tech organizations, keeping track of every laptop, desktop, and endpoint is one of the IT department’s toughest challenges. Each device needs to be accounted for, properly assigned, and compliant with the organization’s policies, all while teams, offices, and contractors constantly change.

How Prometheus Exporters Work With OpenTelemetry

Running distributed systems means you need clear visibility into how your services behave. Prometheus has been the standard for metrics for a long time, and OpenTelemetry is now giving teams a more consistent way to collect telemetry across their stack. In many setups, you'll have both: existing Prometheus instrumentation that's already in place, and new components instrumented with OpenTelemetry.

Import Snowflake, Salesforce, ServiceNow, and Databricks metadata into Datadog with Reference Tables

Engineering, operations, and security teams can struggle to make sense of their telemetry data in isolation. Logs, metrics, and events tell what is happening but are often missing critical metadata like who owns what, where it's coming from, or indicators of attack. These gaps in visibility slow down incident response, complicate cost control, and make business or security analytics much harder.

Catch and remediate ECS issues faster with default monitors and the ECS Explorer

Organizations that run applications on Amazon Elastic Container Service (Amazon ECS) often juggle signals across container and task metrics, logs, and events while they hunt for the change or condition that broke a deployment. This work adds operational overhead and extends incident timelines as teams switch between tools and manually correlate symptoms.

Key learnings from the State of Containers and Serverless report

We recently released the 2025 State of Containers and Serverless report, which examines cloud usage data from tens of thousands of Datadog customers. The study shows adoption trends across container orchestration platforms and serverless offerings, and it explores how organizations use those resources to optimize workloads for efficiency, cost, and simplicity.

Not so "mini"-dumps: How we found missing crashes on SteamOS

We shipped an improvement to Sentry's game engine and native SDKs that most developers probably didn’t even notice until now – unless they were explicitly aiming to test their Windows-built games on Linux with Wine/Proton compatibility layers. That's exactly the point. While we were focused on improving our game engine SDKs, what we learned while investigating a mysterious issue applies to any Windows application running on Linux via Wine or another compatibility layer.

What Are AI Guardrails

When you're shipping LLM features, a lot of the work goes into keeping the model's behavior predictable. That is an everyday concern when you integrate LLMs into production systems. Guardrails AI provides a Python framework that helps you enforce those expectations. You define the schema or constraints you need, and the framework validates both the inputs going into the model and the outputs coming back.

Cybersecurity Monitoring Best Practices: Building a Stronger Defense Against Modern Threats

Let’s be honest—the cybersecurity battlefield keeps changing fast. Attackers evolve their tactics, networks expand, and data flows in every direction. If you’re responsible for protecting your organization’s security, staying ahead of cyber threats can feel like chasing shadows. This is where effective cybersecurity monitoring comes in.

Pastries with SREs: From AIOps to GenAI and LLMs (lactose-free latte making)

In this episode of Pastries with SREs, we look at AIOps, where it fell short, where it worked, and how generative AI (GenAI) is reshaping what’s possible in observability today. If you’re wondering whether generative AI is different this time, this episode offers a grounded, practical look at how it’s evolving observability workflows.

Solutions to hands on exercises | Grafana Alloy for Beginners Ep 14

In this episode, Lisa Jung and Mischa Thompson reveal the solutions to all four hands-on missions, showing how Alloy can collect, process, and secure telemetry data with precision and style. From uncovering hidden keys to securing sensitive data, this episode puts Grafana Alloy’s problem-solving skills to the test. This repo is a great resource for learning about the Grafana stack end to end, so check it out if you'd like a full end-to-end working example!

Triaging an Incident with a Critical Data Pipeline at #rivian

Rivian makes electric vehicles to advance its mission to keep the world adventurous forever. As software-defined vehicles, Rivian’s R1T and R1S are connected to the cloud from day 1, and telemetry data is at the heart of enabling mobile notifications, remote diagnostics, fleet management, and more. With so many critical pipelines in the cloud, observability is a top priority for the data platform.

Top 10 APM Tools [2026 Guide]

In 2026, application performance isn’t just a technical metric—it’s a business-critical factor. As organizations move deeper into cloud-native architectures, distributed systems, and AI-driven workflows, ensuring speed, reliability, and uptime has become non-negotiable. According to Gartner, by 2026 more than 70% of new APM implementations will be cloud-native, and businesses that leverage advanced observability platforms are expected to reduce downtime by up to 60%.

Can a Human Beat Grafana's AI at Its Own Game?

Grafana Assistant just went GA at ObservabilityCON, and it’s already changing how developers onboard, troubleshoot, and build dashboards in Grafana Cloud. In this video, we put it to the ultimate test: a head-to-head challenge between me and the Grafana Assistant. Who can onboard an app into Grafana Cloud faster and more accurately? Watch as we explore how the Grafana Assistant simplifies onboarding and setup, building dashboards for Redis, Kafka, and Postgres, the power of using community dashboards vs. manual configuration, and whether AI can truly speed up observability workflows.

Hands on Exercises Overview | Grafana Alloy for Beginners Episode 13

The moment to test your mad Alloy skills is here! Join Lisa Jung and Mischa Thompson from Grafana Labs for an overview of the four hands-on exercises and see how much you’ve learned throughout the series. This hands-on exercise section consists of four missions: The Hidden Key, The Cardinality Crisis, Attribute Alignment, and Redact and Protect. Your mission, should you choose to accept it, awaits you in the Mission section of the series repo.

The New Open 360 AI Experience

Experience the new Open 360 AI, built to help you explore, analyze, and act on your observability data in a smarter way. See how the AI Agent works directly inside dashboards to explain anomalies, summarize trends across your telemetry data, and guide you to root cause, without switching views or writing queries. Everything you know and love is still here, now enhanced with AI.

Make privacy compliance a competitive advantage with Cribl Guard

As Chief Legal Officer, I’ve personally navigated the complex, ever-shifting landscape where privacy compliance meets rapidly evolving technologies. Whether it’s the sweeping reach of a law protecting personal data in the EU, the specific demands of a law giving California residents more control over their personal information, or the critical protections of a law safeguarding sensitive patient health information in the U.S., one challenge remains.

From Observability to Network Intelligence: How Kentik Built the Foundation for Networks That Think

The age of dashboards is ending, as observability has only created more noise for network teams to sift through. Kentik SVP of Product, Mav Turner, lays out why true network intelligence requires a clean, contextual data foundation to finally create a network that thinks.

Turn fragmented runtime signals into coherent attack stories with Datadog Workload Protection

Security teams face a constant trade-off between detection coverage and alert fatigue. Broad, rule-based detection approaches surface every possible indicator of compromise (IoC) but generate unmanageable alert volumes. Narrow, tightly scoped rules reduce noise but risk missing critical signals. And while individual indicators of compromise can highlight suspicious behavior, they often lack the surrounding context needed to tell a complete story of how an attack unfolded.

What's New at Logz.io - October 2025

We’re expanding the Open 360 AI experience to more users with a modernized navigation and full access to Grafana and OSD dashboards. Your existing dashboards, alerts, bookmarks, and integrations remain unchanged, while new AI-powered capabilities provide deeper explanations and actionable insights. Existing customers can request early access through their account team.

Agentic AI in Action: How OpenAI, Tribe AI and LogicMonitor See Enterprises Preparing for Autonomous IT

Recommendation: Focus your next AI initiative on one high-impact workflow. Measure, iterate, and scale. Agentic AI has quickly become the next frontier of enterprise automation. Instead of static AI tools that wait for human prompts, agents act on behalf of users by autonomously reasoning, sequencing steps, and taking action within defined guardrails.

Flight watch: Optimizing flight operations with real-time monitoring

Aviation has always relied on precise planning and timely communication. However, the rapid development of digital tools has transformed the way airlines, operators, and dispatchers manage flights. One technological advancement at the forefront of this progress is flight tracking and monitoring software, which enables real-time oversight of aircraft movements across the globe. Among these tools, flight watch stands out for its intuitive interface and rich functionality, offering operational teams a crucial advantage in a complex landscape.

Why Your Website's Speed & Structure Affect Visibility

Website performance and organization are vital for a brand's digital success in today's competitive online environment. Users demand quick responses and seamless experiences, and delays can lead to frustration and lower search engine rankings. Focusing on load speed, straightforward navigation, mobile compatibility, and technical stability is crucial for businesses to stay relevant and competitive. A fast, well-organized website provides users with instant access to information, easy navigation, and low friction. Neglecting these aspects can lead to missed opportunities, reduced organic traffic, and poor online engagement.

Transform your workflow with Raygun's remote MCP

We're happy to announce Raygun's new remote MCP server, giving AI tools direct access to live error data so they can investigate issues, surface root causes, and take action with real context, not guesses. It's been nearly a year since Anthropic released the Model Context Protocol (MCP), and a lot has changed in the AI space. Since then, almost all major players have added MCP support, allowing them to tap into the massive and ever-expanding catalogue of MCP servers. When MCP first launched, we shipped our own Raygun MCP within 48 hours of the spec dropping, which was an early step toward giving LLMs visibility into Raygun data.

October 2025 Azure outage: How StatusGator detected it first

When Azure Front Door began to fail on October 29, 2025, hundreds of downstream services, including Microsoft 365, Teams, SharePoint, and Azure SQL, went dark. While Microsoft didn’t publicly acknowledge the issue until 12:35 PM ET, StatusGator dashboards were already lighting up nearly 50 minutes earlier. StatusGator notified its subscribers of an Azure outage 42 minutes prior to the official status page at 11:53 AM ET.

Top 10 Statuspage.io Alternatives in 2025

Choosing the right status page solution can make the difference between customer trust and customer churn during incidents. This guide compares the top status page alternatives to help you find the perfect fit for your team's needs, whether you need public incident communication, internal vendor monitoring, or enterprise-grade features.

Top Observability Tools for 2026: The Definitive Guide

As we move toward 2026, observability is evolving from an engineering luxury to an operational necessity. Modern applications span microservices, containers, APIs, and data pipelines, and when something breaks, users expect instant recovery. That urgency is fueling rapid market growth. According to Market.us, the Global Data Observability Market is projected to reach several billion dollars by 2033, growing at a CAGR exceeding 20% between 2024 and 2033.

A different view for the performance timings of an uptime monitor

When you monitor a website at Oh Dear, the monitoring also includes the historical performance insights that belong to that monitor. It gives you a historical overview of the speed of that monitor, allowing you to see anomalies and changes over time. As of today, there's a second view available, one that matches the web browser visualisation of the timing of a single request. This view shows the same waterfall information you'd find in Chrome or Firefox, providing a familiar view to developers worldwide.

Part 1: Digital Twins and Predictive Maintenance

As machines and systems grow more connected and complex, the traditional toolbox for managing them feels increasingly outdated. Engineers and operators need new approaches that match the realities of software-driven products and data-intensive environments. Digital twins provide that leap forward. By creating a virtual model of a physical asset and continuously feeding it with real-time data, digital twins reveal both current performance and likely future outcomes.

How to build the ideal engineering team dashboard

Most developers spend too much time digging through tabs and switching between tools, rather than actually writing code. According to an IDC survey, only 16% of their week goes to coding, while the rest is lost to what researchers call “organizational inefficiencies” – all those little things that slow teams down.

How to install On-Premise Poller for Windows

Learn how to install the Site24x7 On-Premise Poller on a Windows machine to monitor your internal resources securely. This step-by-step guide will help you set up monitoring in minutes. Whether you're an IT professional, a DevOps engineer, or an MSP managing resources behind the firewall, this video will help you understand how easy it is to securely install the On-Premise Poller for efficient monitoring.

Business Continuity vs. Business Resilience: Key Differences

In IT, change is the only constant, and sometimes it arrives as a major disruption. This could include a power outage, a cyberattack, or even a global pandemic. While it’s impossible to foresee every crisis, you can be ready for them. Two key concepts for this are business continuity and business resilience. Although these terms are often used interchangeably, they refer to two separate yet complementary strategies for ensuring your organization keeps operating under any circumstances.

Why Simplicity Beats Sprawl in Modern IT

In enterprise boardrooms today, what was once an arms race to adopt more tools and chase every new capability has now crystallized into a single mandate, “Make the platform work harder without spending more.” The industry has reached a saturation point. The buyers who once greenlit expansions now demand efficiency. And the ones who built the stack? They’re rethinking it entirely. It’s no wonder platformization is taking off.

Grafana Tempo: Setup, Configuration, and Best Practices

As systems grow, understanding how a request moves across multiple services becomes harder. Traces help bring this picture together by showing the exact path a request takes, along with the timings that matter. Grafana Tempo is built for this kind of workload. It stores traces efficiently, works well with OpenTelemetry, and keeps the operational overhead low.

The Top Five Business Continuity Software

Disaster can strike any business at any time. Businesses must be prepared to continue critical operations with minimal disruption, whether it’s a flooded server room, a data breach, or any other kind of exploit. That’s why it’s essential to have strong measures in place—including a business continuity plan (BCP)—and the right tools to support these measures.

Coffee and Claude: How Honeycomb MCP Makes AI Work for You

If you caught our recent Introducing Honeycomb MCP: Your AI Agent’s New Superpower webinar, you know it was a lively mix of big ideas, demos, and a few laughs about the messy, fast-moving world of AI. Hosted by Austin Parker, Morgante Pell, and James Bland from AWS, the conversation explored how Honeycomb’s new Model Context Protocol (MCP) is changing the way developers and AI agents interact with data.

Compliance Under the Microscope

I wanted to share a story of a recent engagement with a law firm to highlight the strategic importance of compliance in today’s legal sector. It started with a single email. A mid-sized law firm received a regulator’s request for evidence following a client complaint. The issue wasn’t malpractice; it was a missed filing deadline caused by a system slowdown. The firm had no audit trail to prove the delay was technical, not procedural.

From Telemetry to Truth: Why Observability Must Be Service-Centric

Modern enterprises depend on systems that appear calm: dashboards glow, availability reads steady, and metrics suggest composure. But the signals only tell part of the story. Conversion softens at the margins, regional sign-in times drift, a compliance report misses an expected field. The puzzle isn’t visibility; it’s meaning. Components describe status; services carry outcomes.

Safely Roll Out Features with Datadog Feature Flags

In this short demo, see how Datadog Feature Flags help teams release new functionality safely and efficiently. Datadog provides advanced targeting, progressive rollouts, and automatic rollbacks — all integrated with powerful observability data. Learn how you can use simple on–off flags or multi-variant configurations to test and deploy features with confidence. With built-in monitoring of key guardrail metrics, Datadog can automatically pause or reverse rollouts when issues are detected, keeping your releases stable.

How Datadog is Reinventing On-Call #Datadog #OnCall #DevOps

Datadog is reimagining how engineers handle incidents—moving beyond simple alerts to an intelligent, voice-driven on-call experience. With Datadog On-Call, teams can acknowledge alerts, access runbooks, post to Slack, and collaborate in real time, all before even touching their computer. See how Datadog brings incident response, communication, and automation together so you can respond faster and keep customers informed.

Debugging in Elixir with Observer

Erlang's Observer is often discussed in passing and regarded as a curiosity during Elixir courses. However, Observer provides many powerful tools for monitoring and debugging your application, both in development and production. Together, we will learn how to access the Observer GUI and debug a project that leaks memory, both locally and through a remote node. We will set up process tracing and track garbage collections to find the offending code in our sample project. Let's get started!

Logs Are Your Data Platform: Dynamic, Queryable, S3-Backed

Modern systems move fast. Features ship daily, user behavior shifts hourly, and risks surface in minutes. In that reality, logs are not just a troubleshooting artifact. They are your most expressive data source. Logs capture the words developers write to their future selves. They carry the full story of requests, users, experiments, errors, feature flags, and revenue events.

Building Smarter AI Products #Datadog #DASH #AI

AI capabilities are advancing faster than ever — transforming how teams design, build, and ship intelligent products. In this teaser from Building Successful AI-powered Products at Datadog DASH, experts discuss the rise of agent-based systems, evolving model capabilities, and how to stay ahead in the new era of automation.

How IT teams can finally break free from manual AD management

If there’s one thing every IT leader can agree on, it’s this: Manual Active Directory (AD) management never ends. There’s always one more access request, one more approval chain, and one more audit reminder flashing on your screen. By the time you’ve closed your last ticket of the day, there’s already another one waiting. For many teams, 2025 became the year of “we’ll automate next quarter.” But next quarter came and went without any automation.

Embracing failure and chaos to improve system reliability and SRE team performance

In this interview with Alex Hidalgo, Field CTO at Nobl9 and author of Implementing Service Level Objectives (O’Reilly Media), we explore how traditional metrics like MTTR and MTTx can give a false sense of reliability. Alex shares how SRE teams can embrace failure, build psychological safety, and design systems that reflect the human factor behind uptime, outages, and real-world reliability.

AI Agent for Proactive Problem Management: A Shift Toward a Ticketless Future

As organizations rely on increasingly complex IT infrastructures, incident management often turns into a constant cycle of alerts, escalations, and fixes. While reactive responses may keep operations running, they rarely address the deeper systemic issues that slowly erode performance. Recurring incidents, silent failures, and hidden patterns are usually symptoms of unresolved root causes that traditional approaches struggle to uncover.

Observability vs. Monitoring: Key Differences Explained (2026 Guide)

People often confuse monitoring and observability, using the terms interchangeably in DevOps. However, they represent two distinct yet complementary concepts that play a crucial role in ensuring application reliability and performance. As modern applications evolve, over 90% of new digital services are built using microservices and cloud-native architectures. Traditional monitoring alone can’t provide full visibility into distributed systems.

The APM paradox

Application Performance Monitoring (APM) means many things to many people. At its core, it enables developers to diagnose why their applications are slow and helps them provide a better experience to their users. Traditionally, this is accomplished by collecting a lot of data and displaying it in the form of dashboards and request traces. The problems you're trying to solve are generally known up front.

Common Microsoft Teams Issues & How to Troubleshoot

Microsoft Teams is one of the most popular tools for work communication today. Whether you're chatting with your team, jumping on a video call, or sharing files, it helps keep everyone connected. But let's face it – MS Teams isn’t perfect. You’ve probably run into issues like calls dropping, bad audio, or slow Teams performance. These problems can be frustrating, especially when you’re in the middle of an important meeting or deadline.

Discover resources smarter with deep discovery in internet services

Discover how Deep Discovery from Site24x7 simplifies your website monitoring by automatically detecting, grouping, and managing all related resources, so you don’t miss a thing. In this video, we walk you through a real-world use case, the problems Site24x7 solves, and how its time-saving features like Bulk Addition make managing multiple monitors effortless. Whether you’re tracking SSL, DNS, APIs, or website performance, Deep Discovery gives you complete visibility without manual hassle.

OTel Updates: Declarative Config - A Steadier Way to Configure OpenTelemetry SDKs

Application configs change over time, often in small ways that are easy to miss. They may start simple — a few environment variables, one exporter, nothing unexpected. As your instrumentation grows, you add rules for filtering health check spans, adjust sampling based on attributes, or introduce environment-specific resource settings. Each change makes sense on its own. But months later, the picture can look different across dev, staging, and production.

Synthetic Monitoring for GraphQL Endpoints: Beyond the Query

GraphQL isn’t just another API protocol—it’s a new layer of abstraction. It collapsed dozens of REST endpoints into one flexible interface where clients decide what data to fetch and how deep to go. That freedom is a gift for front-end teams and a headache for anyone tasked with reliability. Traditional monitoring doesn’t work here. A REST endpoint can be pinged for uptime.
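
As a rough illustration of why an uptime ping is not enough here, the sketch below is not taken from the article. It assumes the common case where a GraphQL server answers with HTTP 200 and reports failures in an `errors` entry of the response body, next to `data`. The function name and the `required_fields` parameter are hypothetical.

```python
def graphql_check_passed(status_code, body, required_fields):
    """Return True only if the response is healthy at the GraphQL level.

    status_code     -- HTTP status of the response
    body            -- parsed JSON response (a dict)
    required_fields -- top-level fields the check expects under "data"
    """
    # A transport-level failure fails the check, as it would for REST.
    if status_code != 200:
        return False
    # A 200 response can still carry resolver errors in the body.
    if body.get("errors"):
        return False
    # Every expected field must be present and non-null.
    data = body.get("data") or {}
    return all(data.get(field) is not None for field in required_fields)


# Hypothetical responses, both delivered with HTTP 200.
healthy = {"data": {"user": {"id": "1"}}}
partial_failure = {"data": {"user": None}, "errors": [{"message": "timeout"}]}

healthy_ok = graphql_check_passed(200, healthy, ["user"])
partial_ok = graphql_check_passed(200, partial_failure, ["user"])
```

A status-only probe would report both responses as up, while this check flags the second one as a failure.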

AWS & Splunk: Accelerating Innovation Through Partnership

Discover how AWS and Splunk are pushing the boundaries of innovation to empower your security, observability, and cloud transformation journey. This video highlights our joint commitment to driving digital resilience through unified visibility, faster threat detection, and seamless integration across AWS services.

Monitor OCI spend, AI in DDSQL Editor, OTLP Metrics API, and more | This Month in Datadog

See how you can gain insights into cloud costs by tracking OCI spend and easily comparing instance types in October’s episode of This Month in Datadog. Join us for a spotlight of Cloud Cost Management’s support for Oracle Cloud Infrastructure, and the product’s new feature, Instance Explorer, which enables you to visualize and easily compare the cost and performance of instances across AWS, Azure, and Google Cloud.

Grafana Mimir 3.0 release: performance improvements, a new query engine, and more

In 2022, we introduced Grafana Mimir, our open source, horizontally scalable, multi-tenant time series database (TSDB) designed for long-term storage of Prometheus and OpenTelemetry metrics. Over the years, Mimir has become a go-to metrics backend within the open source community, with 30 project maintainers and more than 4.7k GitHub stars.

Stop the guesswork: Troubleshoot with confidence with process monitoring

IT infrastructure is vast, complex, and interdependent. At any point in time, businesses rely on thousands of servers running thousands of processes. Detecting server downtime is fairly easy—but true observability is when you know precisely which processes are working as intended and which are silently contributing to performance degradation. A failed database worker or a memory-leaking background service can silently drain resources until your most critical apps grind to a halt.

Understand user experience through network performance with Datadog Synthetic Monitoring

When an application slows down or fails, pinpointing the cause isn’t always simple. Is it a backend regression, a misbehaving API, or a bottleneck somewhere deep in the network? Without full visibility, teams waste precious time troubleshooting across disconnected tools and layers. Datadog Synthetic Monitoring now supports Network Path to help you proactively identify whether user-facing issues stem from your code or from the underlying network.

Accelerate your Azure integration setup with guided onboarding

Getting started with monitoring for Microsoft Azure environments can be a lengthy and manual process. Many tools require users to create app registrations, assign permissions, and enable log forwarding or telemetry data collection across multiple portals and scripts. These fragmented steps slow down onboarding and introduce opportunities for misconfiguration, making it harder for teams to quickly achieve full visibility.

Gobbling Up Insights: Graylog 7.0 Serves Up a Feast

A feast of new features. A cornucopia of new capabilities. A banquet of breakthroughs (and the T-day puns are just getting started). Graylog 7.0 brings a full plate of advancements that help security teams cut through noise, control cloud costs, and respond with confidence. We’re serving practical improvements across dashboards, automation, and AI support so analysts can focus on action instead of manual effort.

40 Best Cloud Network Monitoring Tools of 2026 for All Platforms and Giants like AWS, Google, Azure, IBM, and Oracle

Cloud network monitoring software is a type of software designed to monitor and manage the performance, availability, and security of networks and network devices in cloud environments. These tools use various techniques to gather information about network traffic, bandwidth utilization, application performance, and other metrics related to network health and availability.

Everything You Need to Know About the SSL Certificate Monitoring

In today’s hyper-connected world, website security is not optional. It is the foundation of digital trust. Whether you run an e-commerce store, manage a SaaS platform, or operate a corporate platform, your online presence depends on SSL certificates to encrypt sensitive data and authenticate your identity. However, too many organizations treat SSL certificates as a “set-and-forget” task.