Operations | Monitoring | ITSM | DevOps | Cloud

Sponsored Post

Data-Led Growth: How FinTechs Win with App Event Analytics

In the rapidly shifting world of financial technology (FinTech), acquiring and retaining new customers to achieve long-term business growth requires a proactive approach to user experience and application performance optimization. As FinTech companies compete against rivals to grow a user base and revolutionize how consumers manage their finances, they increasingly depend on data-driven insights to optimize their mobile applications and deliver exceptional user experiences. This is where application event analytics comes into play.

Why Modern Enterprises Still Get Blindsided And How Business Process Observability Changes That

Traditional observability misses business failures. Modern monitoring tools can show that systems are technically healthy while critical business outcomes are quietly failing. Business Process Observability (BPO) closes this gap by tracking entire business transactions, like orders, payments, and shipments, instead of just infrastructure and application metrics.

There Is No Good Spring Boot Alternative (Unless You're Doing One of These Three Things)

Every few months a new "we migrated off Spring Boot" post washes across r/java or DZone. The numbers are always impressive. 60% memory reduction, 85% faster startup, cloud bill cut in half. The comments are always full of developers convinced they should be doing the same thing. Is Spring Boot really that bad now? I decided to do my own research. I read every credible public migration case study I could find. I ran benchmarks. I built the internal business case for switching two of our services.

High-cardinality metrics at scale: why the standard playbook is wrong

The “high cardinality is expensive” sentence has become observability’s version of “in this economy” — said so often that nobody questions whether it’s true. Every vendor pricing page invokes it. Every glossary article repeats it. Every architecture diagram shows aggregation buffers placed before the storage layer.

Your telemetry, your apps. Inside apps on the Cribl Platform

You already use Cribl to tame your telemetry data. Now you can turn that data into apps your teams actually want to use. In this video, we walk through how to create apps in the Cribl Platform and show how real apps solve real problems: guided troubleshooting for noisy incidents, opinionated security views, and exec-friendly ROI dashboards. You’ll see how apps sit on top of Cribl Stream, Edge, Search, and Lake, so you reuse the data and logic you already have instead of building custom tools from scratch.

498 Fake FIFA World Cup Domains and How Phishing Sentinel Catches Them

The FBI published a warning last week. Threat actors have registered more than 498 fake domains tied to the 2026 FIFA World Cup. Fake ticket sites. Fake job listings. Fake merchandise stores. All live in DNS right now. Every one of those domains is catchable. Not after victims report fraud. Before anyone gets hurt. That is what DNS Spy’s Phishing Sentinel is built to do.

Phone numbers now supported in status page contact field

We’ve rolled out a small but useful improvement to the Status Page → General settings. Previously, the Support contact field in the footer accepted only a limited set of contact formats. Based on feedback from our users, the field now also supports phone numbers. Status pages in StatusGator already offer a variety of customization options – including custom branding, layouts, monitor visibility, subscriber settings, and privacy controls.

Why Shared Context Matters in Hybrid Cloud Operations

The first post in this series explored why traditional observability breaks down in hybrid cloud environments. As infrastructure, applications, and dependencies stretch across on-premises networks and cloud services, isolated monitoring views leave teams with an incomplete understanding of what is happening and why. That challenge raises the next question: what kind of operational model actually works in a hybrid environment?

Monitor LLM routing with the Kubernetes Inference Extension

If you serve LLMs on Kubernetes without inference-aware routing, your load balancer is likely wasting inference capacity. Generic HTTP traffic management blindly routes requests, assuming the backends in your cluster are interchangeable. But your model-serving backends are stateful and unevenly prepared to handle any given request. As a result, requests are often routed to a backend that is not the one best suited to respond.

How a unified data model improves feature flag rollout decisions

Consolidation is reshaping the experimentation and feature management landscape. Tools are merging, and partnerships are being repackaged as platforms. But marketing a unified experience is not the same as building one. Right now, engineering leaders and product managers are reassessing whether the tools they depend on are built for the long term. It’s irrelevant which vendor has the most products.

7 Observability Platforms With Built-In SIEM (2026 Comparison)

Your SIEM flags a threat. Then someone loses ten minutes pivoting to a second tool just to find the trace, host, or deployment behind it. That gap, where security and observability live in separate products, is exactly what the 7 platforms below are built to close. This list is scoped deliberately to platforms that run real SIEM detection on the same data plane as your APM, logs, and infrastructure telemetry, not standalone security-only tools like QRadar or Wazuh.

Patch Management vs Vulnerability Management: What are Key Differences?

What keeps systems secure in real IT environments: applying fixes quickly or knowing what needs attention first? Most IT teams do not struggle because they lack tools or processes. They struggle because two critical functions are often mixed together: patch management and vulnerability management. This creates a gap between what is being fixed and what actually needs to be fixed. The challenge is that teams deal with constant alerts, regular updates, and growing security risks.

Overview of TCP Port and UDP Checks

Welcome to Uptime.com! In this video, we'll guide you through setting up TCP Port and UDP checks. Learn how to monitor server responsiveness using TCP Port checks and how to configure UDP checks for applications requiring less packet accuracy. We'll cover the necessary steps and required information, then test your configuration to ensure it's correct.

Sensor Monitoring Tools for Modern Facilities

Modern facilities depend on real-time visibility. Buildings now need to monitor air quality, occupancy, water leaks, energy use, equipment vibration, access activity, temperature, humidity, and safety risks across multiple zones. Sensor monitoring tools help facility managers detect problems earlier, reduce downtime, protect assets, and improve occupant comfort. They also support better maintenance planning because teams can respond to data instead of waiting for complaints or failures.

Top Tips: How to stop doing everything yourself and delegate to AI before you burn out

Top Tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at which tasks you can delegate to AI. We've all struggled to delegate tasks. Whether you're a junior struggling to prioritize your tasks on a daily basis or a manager unsure of assigning responsibilities, you know how messy task delegation can get. Some people just improvise while others have a method to this madness.
Sponsored Post

Clouds Without the Fog: Unified Control for Hybrid SAP

SAP customers with complex SAP environments know well the challenges of managing multiple landscapes. While classic tools like SAP Landscape Management (LaMa) and Focused Run served us well for years, they were built for a static, on-premises world. Now, with the 2027 end-of-support deadline for legacy solutions looming, the "fog" of hybrid management is getting thicker.

8,000+ Services and counting: One place to monitor what matters

StatusGator now monitors more than 8,000 services! From cloud platforms and AI tools to communication apps, payment providers, developer infrastructure, and business software, we continue expanding our monitoring coverage every day so teams can track everything that matters in one place.

Our First Take on Citrix Platform Flex

On May 12, 2026, Citrix officially announced Platform Flex, a new consumption-based model for the Citrix Platform. At first glance, it looks like yet another licensing change, something Citrix customers have become fairly used to over the years. But after digging through the documentation, Platform Flex appears to be more than just another packaging exercise. The real change is that Citrix is moving toward a credit-based desktop consumption model.

Unified observability for Alibaba Cloud with Datadog

Alibaba Cloud is a major cloud provider in APAC, offering industry-leading foundational AI models in addition to compute, managed databases, object storage, and Kubernetes through its Container Service for Kubernetes (ACK). Teams choose Alibaba Cloud for its infrastructure availability across Asia Pacific and its managed services. For SREs and platform engineers, that often means running Alibaba Cloud alongside AWS, Google Cloud, or Microsoft Azure.

How to Install and Configure an OpenTelemetry Collector

Originally published June 2024. Updated May 2026. A lot has changed since the first version of this guide. In May 2026, OpenTelemetry officially graduated within the CNCF, the highest maturity level a project can achieve. All three core signals (metrics, logs, and traces) are now stable across every major language SDK. Collector adoption has never been higher, and the ecosystem around it, particularly OpAMP for remote management, has matured significantly. This update walks through three things.

Best Cloud Monitoring Tools in 2026 [20+ Analyzed, Top 6 Picks]

The best cloud monitoring tools are Hyperping (uptime, server monitoring, status pages, and on-call at a flat rate), Datadog (full-stack observability with the broadest integration catalog), New Relic (usage-based observability with the most generous free tier), Dynatrace (AI-driven automatic root-cause analysis for large enterprises), Better Stack (monitoring paired with logs and incident response), and Prometheus + Grafana (the open-source standard for cloud-native metrics).

Cybersecurity Tips for Small Businesses

Small businesses are now among the most frequently targeted organizations in the world. Attackers focus on them not because they have the most to steal, but because they tend to have fewer defenses, smaller teams, and less time to spend on security. The good news is that the majority of attacks rely on a small set of well-understood techniques, and most of them can be prevented or contained with practical, affordable controls.

Deploy Datadog Kubernetes Autoscaling at scale

Every Kubernetes environment accumulates waste over time. Teams overprovision CPU and memory requests to avoid performance risk, run idle replicas to preserve headroom, and leave Horizontal Pod Autoscalers (HPAs) untouched long after workload behavior has changed. Some of this waste can be addressed at the node level, where Datadog Cluster Autoscaling helps teams rightsize capacity.

Monitor Azure Managed Redis with Datadog

Azure Managed Redis is Microsoft’s fully managed, enterprise-tier in-memory data store. It is designed for the low-latency caching, session storage, and real-time data needs of modern applications, including AI workloads that depend on fast vector and embedding lookups. Because user-facing applications often query Redis directly, even small regressions in latency, hit rate, or memory pressure can degrade the user experience.

Monitor JavaScript framework routing with Datadog RUM

Modern web applications rely on frameworks like Next.js, Vue, and Angular to handle routing and rendering. In these architectures, navigation happens within the application rather than through full page loads, which makes it difficult for traditional browser instrumentation to capture what users actually experience. As a result, teams often see misleading view names, missing navigations, and errors that are either misattributed or not captured at all, especially during hydration or lazy loading.

Top 9 Network Performance Metrics You Should Measure in 2026

How do you know if your network is actually healthy right now? For most IT teams, answering that question means jumping between multiple tools, dashboards, and alerts, only to end up with more uncertainty than clarity. The problem is not missing data. It is knowing which signals matter, what normal really looks like, and when performance issues start affecting users and business operations. Modern networks generate thousands of metrics every minute, but not every spike or alert deserves attention.

Federated Search | From Silos to Insight | Splunk Cloud with Apache Iceberg REST and AWS S3

This walk-through shows how Splunk Cloud can search AWS S3 data through an Apache Iceberg REST catalog backed by Nessie. Learn how Iceberg table metadata, S3 storage, and Splunk Federated Search work together so analysts can query historical security data where it lives without reingesting it into Splunk.

Instrument LangGraph agents with Datadog: a practical guide

AI agents tend to function as black boxes, and it can be difficult to trace and understand agent workflows end-to-end in order to characterize performance. In particular, you need visibility into several aspects of agent behavior. By tracing full agent runs with LLM Observability, Datadog AI Agent Monitoring enables you to visualize workflows with flame graphs and quickly spot sources of failures and latency.

How we cut build times by two-thirds by deleting our CMS

At Sentry, we’re obsessed with things not breaking. It’s kind of our whole deal. But for a while, our own marketing site was testing that obsession. Much of what you see on sentry.io (the marketing site, blog, open source microsite, etc.) was running on a fleet of legacy Gatsby sites powered by a traditional headless CMS. On paper, it worked.

From Insight to Action: Operationalizing Logicmonitor + Catchpoint for Unified Observability

Visibility without control is just expensive awareness. Most IT teams can see when something’s wrong, but can’t easily tell who’s affected, why, or what to do next. In other words, they lack real control. In this in‑depth session, LogicMonitor’s Callum Brown and Brandon Delap showed how to move past that.

Root Cause Analysis: How Engineering Teams Fix Production Issues Faster?

When a production incident strikes (a sudden latency spike, a cascading API failure, a service returning 500s at scale), every minute of downtime has a cost. Root cause analysis (RCA) is the process that turns that chaos into a clear answer: what actually broke, and why. Not the symptom that triggered the alert. The underlying cause.

Why Autonomous IT Is Becoming Essential for the Modern Industry

Autonomous IT shifts enterprises from reactive to proactive operations. By combining AIOps, agentic AI, predictive analytics, and self-healing automation, Autonomous IT helps organizations detect issues early, automate remediation, and prevent downtime before it impacts customers or revenue.

Underminr Proved Your DNS Filter Has a Blind Spot. Here's the Other Layer You Should Be Watching.

A new attack technique called Underminr was disclosed this week. It slips past protective DNS by abusing shared CDN edge IPs. The DNS query looks clean. The connection lands on malware. This post walks through what Underminr is, why protective DNS misses it, what actually stops it, and the OTHER DNS layer most teams forget to watch.

FinOps KPIs for IT Infrastructure: A Practical Field Guide for Cost Visibility

Infrastructure cost visibility has become a critical part of IT decision-making. Performance still matters, but for many infrastructure leaders, that’s no longer the full conversation. Leadership teams increasingly want clarity around cost movement, upgrade exposure, underutilized resources, and whether infrastructure decisions are financially defensible. That creates a different requirement for operations teams: visibility that connects technical behavior to business impact.

Microsoft 365 backup best practices: A practical guide for IT teams

Microsoft 365 plays a critical role in modern business communication and collaboration with services such as Exchange Online, SharePoint Online, and OneDrive for Business. However, many organizations overestimate Microsoft 365’s native protection and recoverability. In reality, Microsoft 365 operates under a shared responsibility model. While Microsoft ensures infrastructure availability and uptime, organizations are responsible for protecting and recovering their data.

Bridging Bedrock Skills with AI: A Conversation with Jeremy Bradberry

What happens when decades of operational experience meet modern AI-driven networking? In the latest episode of Next-Gen Network Heroes, Bob Slevin sits down with Jeremy Bradberry, Senior Network Engineer at Delaware North, to explore how network engineers can modernize infrastructure without losing sight of the operational realities behind the technology. Jeremy shares lessons learned from working on legacy manufacturing systems, how AI is helping engineers analyze data and automate workflows faster than ever before, and why strong standards still matter in today’s AI era.

Investigate funnel drop-offs with Product Analytics

For most product teams, funnels are a staple of the analytics toolkit despite a frustrating limitation. You can see which step users are dropping off at, but understanding why requires hours of manual slicing across segments, separate comparison views, and a lot of trial and error before you land on a useful hypothesis. And even when you find something meaningful, taking action typically means jumping to another tool, building a new segment, or filing a request with a data team.

What's Next for WhatsUp Gold: Unified Network Visibility and Security

In this session, we’ll walk through the Progress WhatsUp Gold roadmap - linking recent releases and what’s next - to show how the platform is growing toward greater visibility, stronger security, and more consistent operational workflows for hybrid, multi‑site, and security‑focused environments.

Hybrid Cloud Monitoring Explained: On-Prem + Cloud + Kubernetes in One View

Understand what hybrid cloud monitoring is and why it’s critical for managing modern distributed IT environments. Hybrid cloud monitoring helps organizations unify visibility across on-prem infrastructure, public cloud platforms, virtual machines, containers, and Kubernetes clusters in a single monitoring platform. In this video, learn how fragmented monitoring tools create operational blind spots and slow down incident response across hybrid environments.

Game On: What Retro Gaming Teaches Us About Modern Networks with Jeremy Bradberry

What can decades of hands-on operational experience teach us about the future of AI-driven networking? In this episode of Next-Gen Network Heroes, host Bob Slevin sits down with Jeremy Bradberry, Senior Network Engineer at Delaware North, for a conversation that spans everything from legacy manufacturing systems and mainframes to modern AI-assisted network operations. Jeremy shares how his early career working in industrial environments shaped the way he approaches networking today, giving him what he calls an “X-ray vision” into how technology connects directly to business operations.

What is AI-Powered Observability? A Complete Guide for IT Teams in 2026

Is your monitoring stack really giving you clarity, or just more alerts? Your monitoring stack is probably working exactly as designed. That is the problem. As systems grow, most IT and platform teams start to see the same patterns. At this point, traditional monitoring starts to feel limited. This is where teams begin exploring AI in observability. In this guide, we will explain what AI-powered observability actually means, how it works, and when it is useful.

AI SRE Agent: How Autonomous Incident Investigation Is Eliminating Manual Root Cause Analysis

A critical production alert wakes you up: p99 latency just hit 4 seconds. You drag yourself to a terminal, open five dashboards, start correlating log timestamps with trace IDs, dig through 47,000 log lines across eight services, and 90 minutes later, you finally find the culprit: an N+1 database query introduced in a deployment that shipped four minutes before the spike started. An Atatus AI SRE Agent would have identified that root cause and drafted a remediation plan in 28 seconds. Not approximation.

IPL: How to use the ipl-web TermInput

Most form fields ask users for a single value like a name, an email, or a date. But some need a list of values. A plain text input with comma-separated values can technically do the job, but it gives no feedback while typing, no suggestions, and one invalid entry rejects the whole field. The ipl-web TermInput solves this problem. Each value becomes a separate term with its own validation; terms can be enriched, and the input even supports suggestions.

The inside scoop on alerting changes in Kubernetes Monitoring

Kubernetes Monitoring in Grafana Cloud comes out of the box with preconfigured alert rules that notify you about issues like CPU throttling, crash-looping pods, and nodes going offline. These rules are installed automatically when you set up the app, and they start evaluating immediately. But if you've recently reinstalled the Kubernetes Monitoring app and your alert notifications stopped arriving, or started looking different, you're not alone.

Spend less time on repetitive tasks with the new automation feature in Grafana Assistant

The ability to schedule regular tasks, such as cron jobs, has been around for decades. So why are we still running the same AI prompts by hand every day? As you use Grafana Assistant, our AI-powered observability agent, to stay on top of the state of your system, you likely find yourself asking the same questions. Maybe you want to know what changed overnight, or whether yesterday's deployment hurt latency, or which dashboards or skills are drifting out of date.

Best Cron Job Monitoring Tools in 2026 [25 Analyzed, Top 5 Picks]

The best cron job monitoring tools are Hyperping (cron monitoring, uptime, on-call, and status pages at a flat rate), Healthchecks.io (free open-source heartbeat monitoring), Cronitor (schedule-aware cron analytics), Better Stack (monitoring with integrated logs and incidents), and UptimeRobot (budget-friendly uptime with basic heartbeat checks).

You don't need to pick one: how Sentry and OpenTelemetry work together

You already instrumented the backend with OpenTelemetry. Your services emit spans. Your teams know the OTel APIs. Maybe you already run a Collector. So when you start evaluating Sentry, the obvious question is: Do you need to replace your OpenTelemetry setup with the Sentry SDK? No. The practical answer is usually: keep OpenTelemetry where it already works, add the Sentry SDK where it gives you more application context, and send OpenTelemetry Protocol (OTLP) events to Sentry.

Builder in the loop: Eric Lake on making AURA smarter after every incident

Builder in the Loop is a Mezmo interview series focused on the engineers, product leaders, and operators shaping AURA, an open-source, MCP-native agent harness for production operations. The goal is to get past the polished product layer and talk through the decisions that matter when AI starts interacting with real systems. Key questions include: What should agents be allowed to do? How do they get better over time? Where should humans stay in the loop?

Everything We Talked About at O11yCon 2026

We just wrapped O11yCon 2026, and this year's conversations hit differently. Agent-based software development is here, now. It's no longer an optional choice, and everybody is struggling to understand what their agents are doing and how to make them cost less and perform better. Over the course of fifteen talks, we saw clearly that the old assumptions about how our software is written, and who (or what) writes it, have been upended. Here are some highlights. We'll have videos available in the near future.

Web Accessibility Monitoring: an Ops Team Guide

Web accessibility monitoring is the automated, scheduled scanning of a website for accessibility failures. Unlike a point-in-time audit, monitoring runs continuously. Code changes, content updates, and third-party scripts all introduce regressions. Monitoring catches them before they become complaints. This guide covers how it works, and where it fits in an ops stack.

Why Clean Dashboards Improve Reporting and Decision-Making

Reporting affects how leaders judge performance, catch strain points, and set priorities. Yet many teams still work from crowded views, disconnected files, and stale exports. That arrangement slows review, invites doubt, and weakens confidence in every figure shown on screen. Clean dashboards correct that problem by presenting important measures in a clear order, limiting visual clutter, and making changes easier to spot. Better reporting, in turn, supports steadier choices across finance, sales, operations, and service.

Introducing Microsoft DHCP management in OpUtils: From monitoring to full control

If you manage enterprise networks, this scenario probably sounds familiar: An IP conflict surfaces, connectivity drops for a group of users, and the confusion begins. You check your DHCP server, dig through scope utilization, and try to piece together what went wrong, often after the disruption has already occurred. For years, network administrators have needed a single console for visibility and control into DHCP.

Explore for Spans: One View with Infinite Depth

It’s 20 minutes into a P0 incident, and you have already switched between four different tools, re-authenticated twice, and translated queries across three incompatible query syntaxes. The root cause you are searching for? Well, that is still out there somewhere. The reality of investigative latency is that most engineering teams face navigation problems, not data problems. During high-pressure incidents, teams lose cognitive momentum due to context switching between disconnected telemetry silos.

What Is Hybrid Cloud Monitoring (And How To Actually Do It Well)

Most IT teams running a real hybrid setup are not short on data. They are short on a place where the data agrees with itself. By the end, you will know what to ask a vendor for, where teams usually trip, and how to scope a proof of concept that does not burn a quarter. Hybrid cloud monitoring is the ongoing collection of telemetry across your on-prem kit and one or more public clouds, treated as one environment instead of two or three. The goal is not just visibility.

Operator now has Long-Term Support (LTS) version

VictoriaMetrics Operator has been developing at a breakneck pace, bringing numerous improvements, features, and fixes to our community. We usually make at least one release every two weeks. While this rapid iteration cycle is great for delivering fixes and improvements quickly, it can be challenging for administrators managing critical production environments.

Getting Started with gcx: A CLI for AI Agents and Grafana Telemetry | Demo

AI agents are only as useful as the context they can access. With gcx, your coding agents can connect to Grafana and query real-time production telemetry from your Cloud, Enterprise, or OSS environment. The best part: it avoids the upfront context bloat that can come with loading tools before you even send a prompt. gcx uses a CLI approach, so there’s zero token cost until your agent actually needs to run a query.

Lessons From a CI/CD Supply Chain Attack at Grafana Labs

When a compromised GitHub Actions workflow targets your CI/CD pipeline, how do you respond — and what do you change so it never happens again? Nick and David from Grafana Security walk through a real supply chain incident triggered by a pull_request_target misconfiguration, showing exactly what broke, what tools caught it, and what the team rebuilt afterward.

Measure the real impact of AI coding tools on software delivery with Datadog AI Impact

Engineering teams have rapidly adopted AI coding tools, but organizations still struggle to understand their impact. Existing dashboards focus on activity, such as daily active users, acceptance rates, or lines of generated code, but these metrics don’t answer a more important question: Are teams actually shipping more, faster, and with fewer issues?

Building a Defensible AI Compliance Framework

Organizations have moved past theoretical conversations about AI adoption. Models, agents, and autonomous workflows are entering production environments. Business leaders are optimistic about potential gains in efficiency, decision support, and operational scale. Yet beneath this momentum, compliance and risk teams feel a different pressure.

Search Azure Blob data in-place with BYOS for Cribl Lake

See how Bring Your Own Storage (BYOS) in Cribl Lake allows teams to connect directly to Azure Blob Storage and instantly search data in place — without moving, duplicating, or rehydrating telemetry. In this demo, Cribl Product Manager Risk Salsa walks through setup, dataset creation, and how to run fast investigations across your Azure-hosted data using Cribl Search.

How to Reduce Help Desk Demand (Hint: It's Not a Help Desk Issue)

Most IT organizations are trying to reduce help desk demand the same way they have for years: by making the help desk itself more efficient. They improve routing, tighten SLAs, expand self-service, and add AI into the support flow. These changes can make the queue move faster, but they do not stop the work from arriving in the first place. The same problems keep finding their way back to IT. Employees lose time to slow devices, unreliable apps, failed updates, access issues, or confusion after a rollout.

What Is Internet Congestion and How to Fix It

Your VoIP calls are choppy. File uploads are crawling. Your team is complaining that the CRM is sluggish, and remote desktop sessions keep freezing. You check your firewall, your switches look clean, and there are no alerts on your LAN. The problem isn't inside your network. It's upstream, and it's happening quietly every day during peak hours.

OpenTelemetry Monitoring with Netdata

If you've standardized on OpenTelemetry (or you're heading that way), you probably know the collector gets your data out, but where it lands and how useful it is once it gets there are separate problems. Netdata now ingests both OTLP metrics and OTLP logs natively, so your OTel pipelines feed directly into the same monitoring experience as everything else in your infrastructure: same dashboards, same alerting, same query interface. No separate backends, no context switching.

New Explore: Faster answers, less friction, and a better way to investigate your data

There is a moment every engineer knows too well. Something is wrong in production. You have an alert, a vague symptom, and pressure to find the one signal that explains what changed. You open your logs and traces, and you immediately hit the same two problems: the dataset is huge, and the path from “I see something odd” to “I understand why” is full of tiny, exhausting steps. Meet new Explore, our redesigned investigation experience for logs, traces, and spans.

Ameet Talwalkar on Building the AI Research Lab

"We're doing cutting-edge AI, focused on real translational impact: getting our research over the wall and into production." Ameet Talwalkar, Datadog's Chief Scientist, shares what it took to build the AI Research Lab from the ground up — and what makes DAIR different from traditional research teams. At Datadog, research ships. Recent work from the lab includes Toto 2.0, open-weights time series forecasting models ranked on leading benchmarks, and ARFBench, a new benchmark for evaluating AI on real incident data.

Your agent can't fix what it can't see

Agents are getting better and better at fixing bugs. They’re even getting better at testing their work, thanks to headless browsers, sandboxes, simulators, etc. But what about the bugs that only show up once you bring in different browsers, languages, extensions, internet speeds, and all the other variables that get mixed in the second you ship to prod? Or all the bugs that only show up when you account for… well, humans being humans and doing weird stuff you didn’t expect them to do?

We Built a Better DNS Propagation Checker. Here's What Makes It Different.

Today we are launching the DNS Spy DNS Propagation Checker. It is free. It works on any domain. It shows you what is happening in more places, in more detail, and faster than the tools you have been using. You can try it right now: dnsspy.io/dns-tools/dns-propagation-checker.

WHOIS & RDAP Domain Lookup & Expiry Check

In this video, we’ll walk you through how to set up and configure your Whois and RDAP Domain Lookup & Expiry Checks in Uptime.com. Learn how to monitor and receive alerts before your domain expires, and protect your registration information from unauthorized modifications. We cover step-by-step instructions for setting up checks through the Uptime.com UI and via API.

Future Solving with Brian Evergreen (Or: How to Escape those AI Career Jitters)

Brian Evergreen joins the show to challenge the fear-driven narrative around AI and work. Rather than treating the future as something coming for us, Brian argues that leaders and individuals should decide what future they want to create, then work backwards. He explores why “start with the problem” thinking limits AI strategy, how visible strategy and relational leadership can unlock better transformation, and why human connection may become more valuable—not less—in an AI-enabled world. A thoughtful conversation on escaping AI career anxiety, building resilient networks, and creating value beyond efficiency.

Inside the Grafana AI Team Weekly: Guards for AI Observability (May 5, 2026)

This is an excerpt from a real AI team weekly meeting where we talk about the stuff we build and occasionally demo it! In this one, Principal Software Engineer Sven Großmann shows a new feature he's working on for AI Observability, called "guards". We're showing parts of our team meetings to build in public in some small way and give you a sneak preview of what's to come. But not all features we show may make it to production! You've been warned. :)

DNS Monitoring for MSPs: A Complete Setup Guide

If you run an MSP, this is the call that ages you. The fix is almost always small. A record was edited at the registrar. A vendor changed an MX target. A new tool added a TXT record and pushed SPF over the lookup limit. None of that should reach a client. With the right monitoring, none of it does. Here is a real one. A 40-person law firm renews their EV certificate. The vendor needs a CAA record cleaned up.

Exploring Powerful Power BI Dashboards for Smarter Decision-Making

Operational dashboards help teams answer urgent business questions quickly. They show whether production is on track, inventory is healthy, downtime is rising, or resources are being stretched too thin. This article explores practical Power BI dashboard examples for operational efficiency across production, supply chain management, resource planning, and performance measurement. It also explains how to build dashboards that support real decisions rather than simply displaying data.

Essential Mac Maintenance Tips for Operations Professionals

Operations professionals rarely have the luxury of working slowly. Their day consists of managing deadlines, analyzing reports, communicating between teams, and organizing files. It also involves constantly switching between dozens of services. At this pace, the Mac becomes the hub of daily coordination. That's why system speed, stability, and macOS predictability have a direct impact on productivity. Most Mac issues arise from a lack of regular maintenance. Chaotic background processes, overflowing storage, outdated security settings, and more can gradually turn even a powerful MacBook into an unstable device.

Shopify outage on May 22, 2026 impacted merchants worldwide

On May 22, 2026, merchants using Shopify experienced a brief but widespread disruption that affected access to product pages, collections, and administrative tools. While the outage lasted less than an hour, it created immediate challenges for businesses that rely on Shopify to manage inventory, update products, and operate online stores. StatusGator detected the developing incident at 10:20 UTC using Early Warning Signals, 18 minutes before Shopify officially acknowledged the outage at 10:38 UTC.

Your Microsoft Azure storage, our data lake power: The best of both worlds

The wait is over for Azure-first organizations. Cribl just launched Cribl Lake Bring Your Own Storage (BYOS) for Microsoft Azure, giving you full data lake power without moving a single byte of telemetry out of your environment. Join us to see how you can finally get the flexibility of a modern data lake while keeping your data in Azure.

AI Won't Replace You. Someone Using It Will.

AI isn’t about replacing engineers. It’s about leverage. The teams that win will be the ones that triage incidents faster, correlate signals automatically, reduce manual investigation, and automate repetitive operational work. In observability, that means asking: AI won’t eliminate expertise; it amplifies it. The real risk isn’t AI taking your job. It’s competitors using AI to operate at a speed and efficiency you can’t match.

The New Agentic AI Job Roles IT Leaders Need

CIOs are under pressure from every direction. Budgets remain tight, geopolitical uncertainty is forcing organizations to rethink resilience, and workforce expectations continue to evolve. At the same time, AI is accelerating a broader shift across enterprise IT – changing not only how organizations operate, but also the skills and roles they will increasingly depend on. The question is not whether AI will reshape IT teams, but how quickly organizations can adapt to these new ways of working.

Anthropic Monitoring & Observability with OpenTelemetry and SigNoz

Learn how to implement end-to-end monitoring and observability for Anthropic (Claude) API-based applications using OpenTelemetry and SigNoz. In this video, we walk through instrumenting your Anthropic API calls, collecting traces, metrics, and logs, and visualizing everything in SigNoz to gain real-time visibility into performance, failures, and bottlenecks. You'll see how to move from basic logging to production-grade observability, so you can debug faster, optimize latency, and confidently run Claude-powered AI systems at scale.

Monitor your Render services with AppSignal

AppSignal now supports Render's Metrics Stream. Configure it once in your Render workspace and Render forwards OpenTelemetry metrics to the AppSignal Collector. From there, the metrics show up in your AppSignal app as host metrics and automated dashboards. You only have to set up the stream once per workspace.

Why Traditional Observability Breaks Down in Hybrid Cloud Environments

Hybrid cloud has reshaped the way enterprises build, run, and troubleshoot digital services. Applications now stretch across on-premises infrastructure, cloud platforms, regional services, interconnects, and distributed dependencies that change constantly. Operational complexity has expanded with that footprint, yet many observability practices still reflect assumptions from an earlier era of simpler architectures and clearer boundaries. That gap shows up fast during an incident.

How to measure developer experience (DevEx) in the AI era

As AI coding assistants dramatically inflate PR counts, commit frequency, and lines of code, the limitations of individual output metrics have never been more apparent. A developer can now produce significantly more lines per session, but higher volume doesn’t guarantee that the code is stable, maintainable, or successfully running in production. GitClear analyzed over 200 million lines of code and found that code churn nearly doubled following widespread AI adoption.

How to Deploy Serv-U Gateway to Achieve Secure File Transfer in DMZ

Serv-U Gateway acts as a reverse proxy for secure file transfer for DMZ networks. Serv-U Gateway terminates all incoming connections to the DMZ and never stores any data at rest in the DMZ. Serv-U Gateway is supported on both Serv-U FTP Server and Managed File Transfer (MFT) Server. Comply with PCI DSS by using Serv-U Gateway.

Episode 11 - Human Choices in an AI Future (Part 1)

What if the biggest risk in the AI era isn't the technology, but waiting for someone else to tell you what to do with it? In this episode of The Intelligent Enterprise, host Tom Stoneman sits down with Karthik Ravindran, General Manager of Enterprise Data and AI at Microsoft, to unpack what it really takes to thrive alongside AI, not in spite of it.

Zero to Dashboard with Grafana Assistant and the Infinity datasource plugin

Senior Developer Advocate Nicole van der Hoeven demonstrates how to go from zero to dashboard in a few minutes without writing any queries, with the help of Grafana Assistant and the Infinity datasource plugin for Grafana. Nicole uses the rawg.io video game database API to visualize games and get recommendations for what to play next!

The Checkly Playwright Reporter: Live Demo, Rocky AI RCA & Production Monitoring

Your Playwright tests catch bugs. The hard part is figuring out what actually broke — and sharing that context with your team. This session shows exactly how the Checkly Playwright Reporter solves that: one shared home for all your test runs, AI-powered root cause analysis, and a direct path from failing test to production monitor. María de Antón, PM for Playwright features at Checkly, runs a live demo on a real app with real failures.

A Runnable Reference Architecture for Industrial IoT on InfluxDB 3

Industrial teams keep telling us the same thing: the data is there, but the stack to act on it isn’t. PLCs, CNCs, SCADA systems, vibration sensors, and quality stations all generate high-frequency telemetry that gets stranded in proprietary historians or stitched together with point integrations nobody wants to own. By the time anyone looks at it, the moment to act has passed.

Sponsored Post

Multi-Cloud Monitoring And Why Status Pages Aren't Enough

Multi-cloud environments make outage detection harder. Relying on individual status pages from Amazon Web Services, Google Cloud Platform, and Microsoft Azure often leads to delayed, incomplete, or conflicting signals during incidents. This article explains how fragmented visibility impacts incident response, and how aggregating status across cloud and SaaS dependencies helps DevOps teams detect outages faster and respond with confidence.

Meet the new Mobot: Your log analysis partner

Every single day, the Sumo Logic Platform analyzes more than four exabytes of log data. The good news? The answers to your application performance, infrastructure health, and security incidents are hidden in those logs. The challenge? Historically, uncovering those answers required query language fluency. That’s why we built Mobot, our conversational interface that connects users to advanced AI capabilities using natural language.

Closing the Evidence Gap

Compliance teams are entering a moment where the expectations placed on them far exceed the visibility tools they have available. AI-driven environments introduce new forms of variance, drift, and distributed decision-making that unfold across infrastructure, models, agents, and services. These patterns do not map cleanly to the evidence structures that compliance processes rely on.

SIEM alerts: everything you need to know

Let's walk through setting up SIEM (Security Information and Event Management) alerts to monitor security threats in applications. We will explain what SIEM alerts are, why they're relevant with regard to application security, and provide practical examples of common alerts a developer could implement. We will show how to configure simple alerts with Honeybadger Insights.

Project and manage cloud spend with Datadog budget forecasting

Cloud and SaaS spending continues to grow across teams, services, and providers, changing too quickly for retrospective cost management workflows to keep up. Finance and engineering leaders often rely on last month’s reports or manually maintained spreadsheets, which don’t reflect current usage. As a result, teams lack context on how spend is trending and often discover budget overruns only after they’ve occurred.

Generate test scripts from natural language with Grafana Assistant: introducing k6 Script Authoring

Performance testing is critical to ensure your applications stay reliable under load, but writing the scripts themselves often feels like a chore. Most engineers already know the scenario they want to test; the hard part is translating that intent into a working performance test. Even experienced developers who use k6 can lose time looking up syntax, configuring load stages and thresholds, or debugging boilerplate code before they can run a meaningful test.

Elevate Your MSP: From Reactive IT to Proactive Digital Experience Assurance

Internet Performance Monitoring (IPM) is essential for MSPs to move from reactive support to proactive experience assurance. Green lights on your internal dashboard don’t mean users are having a good experience. That was the central tension in this conversation between LogicMonitor RVP of Managed Services, Daniel Gad, and Catchpoint Field CTO, Gerardo Dada, and it’s a problem most MSPs haven’t fully solved.

A Runnable Reference Architecture for Network Telemetry on InfluxDB 3

Networks generate the most data of any system in your stack and have the least patience for stale dashboards. Interface counters tick every second. BGP sessions flap. Flow records arrive in bursts. When something goes wrong, you don’t have 10 seconds to wait for an aggregation to finish.

The product analytics you already have

You already have everything you need. If you’re using Sentry, you have traces, structured logs, and now application metrics. Most teams use that stuff for debugging and stop there. But get this: that same data can answer most of the product questions you’ve been sending to a separate analytics tool, maintained by a separate team, with a separate data model and a separate bill. (Not all of them.

Using AI to Instrument Applications with OpenTelemetry

OpenTelemetry is one of the best things that’s happened to observability in the last decade. It’s open. It has SDKs for every language that matters. It’s vendor neutral. The OTel community has been doing the hard work of standardizing how applications emit telemetry, so that you, the engineer, don’t have to learn five different agent formats to monitor five different services.

What is Patch Management and Why is It Important? A Complete Guide

Patch management is one of the cheapest security steps you can take, and one of the most often ignored. Most IT teams know they are behind on patching. They just disagree on how far behind they actually are. Here is the simple truth: That waiting period is the problem patch management exists to solve. This guide covers what patch management actually is, how the full process runs from start to finish, where most teams quietly fall behind, and what to look for in a tool that holds up today.

Inside the Grafana AI Team Weekly: Workspaces and Investigations (April 28, 2026)

This is an excerpt from a real AI team weekly meeting where we talk about the stuff we build and occasionally demo it! In this one, Staff Product Design Engineer Ben Darlow demos improvements to Workspace Home. Staff Software Engineer Sonia Aguilar and Principal Software Engineer Sven Großmann also demo a new dependency graph view for Investigations. We're showing parts of our team meetings to build in public in some small way and give you a sneak preview of what's to come. But not all features we show may make it to production! You've been warned. :)

Avantra Platform Overview

Introducing Avantra: Purpose-built for SAP, Avantra's AIOps platform empowers the world’s top global enterprises and MSPs to run at their best — preventing costly unplanned downtime by automating the detection and resolution of operational issues before they impact the business. An SAP Partner Edge Build and Cloud ALM Silver Partner, Avantra partners with SAP to enable teams to observe everything, automate what matters, govern continuously, and navigate SAP transformation with clarity and confidence... anywhere SAP runs.

How to Overcome Government Payment Fraud with Speed and Scale

Government payment fraud is a fast-growing risk for public sector organisations in Australia and globally. From welfare and healthcare payments to business grants and disaster relief, increasingly sophisticated organised criminal networks and other actors exploit complex, high-volume government programs to unlawfully access public funds. The impact is significant—billions lost, program integrity undermined, and essential resources diverted.

What is Service Request Management? A Complete Guide

If you run a service desk, you’ve likely seen this pattern: Service requests, incidents, and change requests often end up in the same queue under the same SLA, even though they require different handling. Many requests that could be resolved through self-service still go through manual intervention, while misclassification adds further delays and confusion. Service request management brings structure to this by defining how requests are handled end to end.

The "Single Pane of Glass" Is Dead - What Network Teams Actually Need Is Intelligence

The infrastructure industry spent two decades chasing a single pane of glass. The future looks different: domain-expert AI platforms that reason deeply within their own data, connected through tool chaining when problems cross boundaries.

Inside the Anthropic + Claude Code Hype at AWS Summit London: Live Laugh Logs ep. 2

Are companies blowing through their entire 2026 AI budget in a matter of months? Welcome to Episode 2 of Live Laugh Logs, the podcast from Annie, Lewis, and Andre from the Coralogix Developer Relations team, where we get together and recap everything going on in our worlds!

What is Log Management? The IT Team's Guide to Taming Log Data

Understand what log management is and why it’s essential for troubleshooting, security, and observability across modern IT environments. Log management helps organizations collect, centralize, parse, and analyze logs from servers, applications, cloud platforms, containers, and network devices in one searchable platform. Learn how centralized log monitoring reduces mean time to resolution (MTTR), eliminates siloed troubleshooting, and helps IT teams detect anomalies faster using AI-powered analytics.

The Complete Guide to Observability Pipelines

Modern engineering teams are drowning in telemetry data. A mid-sized Kubernetes cluster running 50 microservices can generate millions of log lines per minute. Add distributed traces, Prometheus metrics, cloud provider events, and application-level instrumentation and you're looking at terabytes of observability data every day. The problem isn't just volume. It's what you do with it.

Error Budget in SRE: The Complete Guide (2026)

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response: feature freezes, postmortems, or infrastructure investment. The formula is blunt: Error Budget = 1 - SLO Target, and Error Budget (time) = (1 - SLO Target) × Window Duration. For a 30-day window: That last number should make you uncomfortable.
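As a rough illustration of the formula above, here is a minimal Python sketch that turns an SLO target into a time budget. The function names and the three SLO targets are assumptions chosen for illustration; they are not figures taken from the original guide.

```python
# Sketch of: Error Budget (time) = (1 - SLO Target) x Window Duration.
# Function names and the sample SLO targets below are illustrative assumptions.

def error_budget_fraction(slo_target: float) -> float:
    """Fraction of the window that may be 'bad' (e.g. 0.001 for a 99.9% SLO)."""
    return 1.0 - slo_target


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unreliability over a window of `window_days` days."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return error_budget_fraction(slo_target) * window_minutes


# Example targets (assumed); each extra "nine" shrinks the budget tenfold.
for slo in (0.99, 0.999, 0.9999):
    minutes = error_budget_minutes(slo)
    print(f"SLO {slo:.2%}: about {minutes:.1f} minutes of budget per 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes in a 30-day window, which is why small changes to the target have a large effect on how much failure a team can absorb.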

How to Create Your Own Plugins and Check Commands in Icinga 2

If you’ve been using Icinga 2 for a while, you probably know the built-in checks cover a lot of ground: disk space, CPU, memory, ping. But sooner or later you’ll run into something specific to your setup that no existing plugin handles. That’s where writing your own plugin comes in. The good news? It’s simpler than it sounds. Icinga 2 doesn’t care what language your plugin is written in. It just runs the script, reads the exit code, and displays the output. That’s it.
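To make the "runs the script, reads the exit code, displays the output" idea concrete, here is a minimal Python sketch of a check. It follows the common monitoring-plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). The spool-directory scenario, the thresholds, and the function names are hypothetical and not taken from the original article.

```python
import os

# Conventional plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(count: int, warn: int, crit: int) -> tuple[int, str]:
    """Map a measured value to an exit code and a one-line status message."""
    if count >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    # Text after '|' is optional performance data: label=value;warn;crit
    return state, f"{label} - {count} files queued | files={count};{warn};{crit}"


def check_spool(path: str, warn: int = 10, crit: int = 50) -> tuple[int, str]:
    """Count entries in a (hypothetical) spool directory and evaluate them."""
    try:
        count = len(os.listdir(path))
    except OSError as exc:
        # If the check itself cannot run, report UNKNOWN rather than guessing.
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    return evaluate(count, warn, crit)


# In a real plugin file you would finish with something like:
#   code, message = check_spool("/var/spool/myapp")   # hypothetical path
#   print(message)
#   sys.exit(code)   # requires `import sys`
```

Keeping the threshold logic in `evaluate` separate from the measurement makes the check easy to test without touching the filesystem.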

The Productivity Tax of Repeat IT Failures in Technology Companies

Technology companies are being pushed to deliver faster outcomes while justifying growing investment in AI, SaaS, and digital infrastructure. But productivity does not improve just because new tools are deployed. It improves when employees can use those tools without the constant drag of slow devices, unstable applications, and fixes that do not fully solve the problem. That is the productivity tax of digital friction.

Unlock telemetry value with a well-planned data lake

Your SIEM only holds a slice of your telemetry. Your data lake holds the rest. We'll show you how to use that to your advantage for investigations, threat hunting, and reporting. Why your data lake beats your SIEM for investigations – Your SIEM keeps a short window of expensive, filtered data. Your data lake keeps everything. When something goes wrong, that difference matters more than you think. Threat hunting without the handcuffs – Hunting across months of data in a SIEM is painful and costly. We'll show you how a well-planned lake makes broad, deep searches practical and affordable.

Teach Your AI Coding Agent to Instrument, Monitor, and Troubleshoot Infrastructure with netdata/skills

There’s a growing ecosystem of AI coding agents: Claude Code, Cursor, Copilot, Codex, Gemini CLI, Windsurf, and others. They’re good at writing code, but they don’t inherently know how to instrument that code for observability, configure monitoring infrastructure, or troubleshoot production systems using real telemetry data. That knowledge lives in documentation, runbooks, and the heads of your senior SREs.

AI Powered IT Operations & Autonomous Resilience | Full SolarWinds Day Q2 2026 Event Replay

Watch the full SolarWinds Day 2026 event on-demand and discover how AI is transforming IT operations, observability, and incident response. In this exclusive event, SolarWinds CEO Sudhakar Ramakrishna and product leaders unveil the company’s vision for Autonomous Operational Resilience—powered by AI, automation, and unified visibility across hybrid and multi-cloud environments.

Honeycomb Canvas: The Multiplayer Workspace for the Agentic Era

Last week, we launched a major update to Canvas, our investigation workspace. The new Canvas has evolved from an AI co-pilot you chat with to a place where your whole team, human and agent, can work the same problem on the same surface. Auto-investigations begin the moment a trigger, SLO, or anomaly fires. Custom skills encode your team's runbooks so every agent investigates with your team's expertise built in.

How we made a SQL query optimization agent 59% more accurate using autoresearch and LLM Observability

Without experiment infrastructure to help you test your LLM applications, every research session starts with the same questions: What have we tried previously? What were the numbers? Which prompt version produced that result? Why did we discard that approach? The answers live in scattered notes, terminal history, and half-remembered conversations. Each handoff between sessions loses context. In practice, iteration can slow down as teams get bogged down in testing and analysis.

How to audit and clean up monitors effectively

Alert fatigue and blind spots develop together. Monitoring stacks that generate noise while missing critical issues may have incomplete coverage or poorly configured alerts. As they grow reactively and without structured coverage assessment, both issues worsen. Teams will often add monitors when something breaks and tune thresholds when alerts become unbearable, but rarely audit their overall setup to see if it works.

Introducing Atatus Sensitive Data Classifier

Your logs know too much. Every debug statement, every traced request, every APM span carries the risk of capturing something it shouldn't. A customer email. A JWT token. A credit card number. An API key that was never meant to leave your payment service. It doesn't look like a breach. There's no alert. Your observability platform just quietly accumulates sensitive data: indexed, replicated, and accessible to every engineer with log query access.

Building a CloudWatch metrics pipeline: parsing OpenTelemetry data

AWS delivers CloudWatch metrics in OpenTelemetry format via Firehose, but AppSignal uses its own internal format. Building the parser to bridge these two formats presented several technical challenges. The metrics arriving through this pipe power AWS automated dashboards. When AppSignal detects metrics from a supported AWS service, it creates a dashboard for it automatically, with pre-built charts grouped by category: compute, databases, networking, messaging, storage, and others.

How Airbnb Built a High-Volume Metrics Pipeline with OpenTelemetry and vmagent

We always knew that Airbnb’s engineering is operating on a completely different scale, and their new high-volume metrics pipeline is proof of that. This is one of those rare stories where scale and efficiency go hand in hand - they modernized their observability stack with open source components and reduced cost by an order of magnitude. Airbnb is now processing more than 100 million samples per second on a single production cluster.

From Signal Corps to Space: Building Networks That Can't Fail with Troy MacDonald

What does it take to succeed in networking when complexity is constantly increasing, and change never slows down? In this episode of Next-Gen Network Heroes, host Bob Slevin sits down with Troy (David) MacDonald, a network engineer at Blue Origin and former U.S. Army Chief Warrant Officer, to explore a career that spans from infantry beginnings to designing and managing large-scale, mission-critical networks.

Optimizing Team Strengths for Effective Operations

Most people think great network engineers are defined by technical expertise. This episode challenges that idea. What Troy MacDonald shows is that the real differentiator isn’t just technical skill—it’s the ability to translate complexity into clarity. From military operations to enterprise networks, one lesson keeps showing up.

Microsoft Fabric outage disrupted analytics workloads on May 18, 2026

On May 18, 2026, organizations using Microsoft Fabric experienced a multi-hour outage that disrupted analytics workloads, reporting systems, and access to platform services across several regions. StatusGator detected the developing incident at 14:00 UTC using Early Warning Signals, 37 minutes before Microsoft officially acknowledged the outage at 14:37 UTC.

The $600 billion wake-up call: New Splunk research reveals downtime is a systemic business crisis

$600 billion annual impact: Aggregate downtime costs for the Global 2000 have soared 50% in two years. $15,000 per minute: The average cost of downtime for organisations, highlighting the immediate financial impact of service disruptions. 3.4% stock price drop: The average decline in shareholder value following a single downtime incident.

Reality Bytes: The Birth of Mobile DEX (Opening the Black Box)

On this edition of Reality Bytes, Dina and Tom welcome Rose Cicala, Director of Product Marketing, and Mile Djokic, Senior Product Manager, to discuss the launch of Mobile Experience — and what it means for the future of Digital Employee Experience. Together, they explore why mobile devices have become mission-critical for frontline and hybrid workforces, why mobile visibility has remained a major blind spot for IT, and how Mobile DEX changes that. The conversation covers healthcare, retail and manufacturing use cases, AI compliance, application insights, VDI convergence, and the growing shift toward mobile-first work strategies.

Multiple API Keys Are Here - More Keys, Better Control, Stronger Security

Today we're rolling out a major upgrade to API Keys in Bindplane. You can now create up to 25 API keys per project, give each one a description, set an expiration date, and delete keys you no longer need. Under the hood, every key is now hashed with Argon2, the modern standard for credential storage. If you've been working around the old single-key limit by sharing one key across CI jobs, scripts, and teammates, this release is for you.

Diagnose slow PostgreSQL queries faster with explain plan correlation

When a PostgreSQL query runs slowly, engineers often start with EXPLAIN ANALYZE. The output is a tree of plan nodes, each one describing a step the database took to execute it. A query with several joins and a subquery can produce 20 or more nodes. But the plan gives no visual indication of which node corresponds to each clause in the SQL text. Diagnosing the problem means viewing the plan in one window and the query in another, manually tracing connections between them.

Explore Datadog metrics with Natural Language Queries

Metric exploration often begins with a simple question, but answering that question can require deep familiarity with metric names, tag structures, and query syntax. Experienced users spend time refining queries through trial and error, and newer users struggle to get started. As a result, teams face delays in troubleshooting and analysis. Valuable observability data, including metrics that are difficult to discover and query, also goes underused.

Community Spotlight: A Native iOS App for Your InfluxDB Data

One of the things we love most about building an open source platform is seeing what the community creates with it, and independent developer Anton Havekes recently built something we just had to share. Anton put together Influx Dashboard, a native iOS app that connects to your InfluxDB instance and brings your time series data straight to your phone. We’re genuinely thrilled to see this kind of work come out of the community.

12 IT Infrastructure Best Practices Every IT Leader Should Follow

Why do IT infrastructure issues continue to slow down teams even when tools keep improving? In most IT environments, the challenge is not a single failure. It is a set of ongoing operational gaps that are easy to overlook but difficult to control over time. A few of the common challenges include: In 2026, IT environments are more distributed and fast-changing than before. Hybrid infrastructure, cloud adoption, and strict compliance requirements make consistency harder to maintain.

The New Compliance Crisis: AI Is Outrunning Its Controls

Enterprises have spent decades refining compliance frameworks around workflows that were linear, predictable, and well-documented. These frameworks were built for systems that executed actions deterministically and for human operators who made decisions slowly enough for oversight to keep up. In that environment, compliance could function as a retrospective discipline because the evidence required to validate behavior generally existed in complete, stable form.

What's New in Graylog V7.1 Webinar

What to Expect? Graylog 7.1 is built for lean security and IT operations teams who need real outcomes, not more tools, more add-ons, or more manual work. This 30-minute deep dive session covers what's new and what it means for your team. What you'll learn: See Graylog 7.1 in action: detection, triage, and documentation without compromise.

Why SRE agents need orchestration, not just more tools

Single agents are a useful starting point for SRE workflows. They are not where the architecture should end. The first version is simple enough: connect an LLM to a few tools, give it a system prompt, and point it at your infrastructure. It can summarize an alert, pull logs, answer questions, and draft a useful next step. Then the workflow gets real. You add GitHub for runbooks, Kubernetes for cluster state, PagerDuty for incident context, Prometheus for metrics, and Mezmo for telemetry.

Media Monitoring Evolved: How AI Makes Website Tracking Tools Essential

The average person would need 180 million years to read everything published online in a single day. For organizations trying to track what people say about their brand, manual monitoring stopped being viable somewhere around 2015. AI-powered media monitoring tools now process this impossible volume automatically, detecting brand mentions, analyzing sentiment, and flagging potential crises before they spiral.

Agent Timeline: The Flight Recorder for Your AI Agents

Last week, we introduced Agent Timeline, a powerful new observability experience purpose-built for debugging AI agent workflows in production. Agent Timeline uniquely connects AI-layer visibility to full-stack observability by organizing telemetry around an agentic conversation. A conversation contains one or more agent executions, each of which may contain LLM calls, tool invocations, handoffs, retries, human escalations, and downstream system calls.

How Ecommerce Brands Track Regional Price Differences Online

Many online stores display different prices depending on the user's location. The same product may cost less in Eastern Europe, more in the United States, and have completely different discounts in Germany or France. There are several reasons for this: This is especially common in marketplaces, electronics, fashion, and travel-related ecommerce. For international brands, understanding these pricing differences has become an important part of market analytics.

Commercial Trucking Technology for Better Driver Awareness

Modern highways demand constant focus from professional drivers. New tools help fleets stay safe on long trips across the country. Fleet operators can monitor road hazards much better than in past decades. New onboard systems protect both the driver and the cargo from unexpected road events. High highway speeds mean split-second decisions dictate safety margins. Stay aware of your surroundings to prevent severe accidents before they happen. New updates give teams better visibility than ever. Drivers feel more secure when they have technology backing them up on dark roads.

The Importance of Time Synchronization in Windows Authentication

Kerberos is a secure network authentication protocol that allows users and systems to prove their identity over a network without sending passwords in plain text. It is widely used in enterprise environments (for example, in Windows domains) to enable single sign-on (SSO). At its core, Kerberos uses a trusted authority called the Key Distribution Center (KDC) to issue encrypted “tickets” that verify identity.
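To illustrate why time synchronization matters here: Kerberos rejects requests when the client and KDC clocks drift too far apart. The following is a minimal sketch of that kind of check, assuming the commonly cited default tolerance of five minutes. The function name and constant are hypothetical and not taken from any Windows API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a five-minute tolerance, the commonly cited Kerberos default.
MAX_CLOCK_SKEW = timedelta(minutes=5)


def within_clock_skew(client_time: datetime, kdc_time: datetime,
                      max_skew: timedelta = MAX_CLOCK_SKEW) -> bool:
    """Return True when the two clocks differ by no more than max_skew."""
    return abs(client_time - kdc_time) <= max_skew


kdc_now = datetime(2026, 5, 14, 12, 0, 0, tzinfo=timezone.utc)
print(within_clock_skew(kdc_now + timedelta(minutes=3), kdc_now))
print(within_clock_skew(kdc_now - timedelta(minutes=7), kdc_now))
```

A client whose clock falls outside the tolerance would fail authentication even with a correct password, which is why domain members are normally kept in sync with a shared time source.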

Cache-busting magic variables for uptime checks

Over the weekend, my own site went down and Oh Dear didn't catch it. The origin server had fallen over, but Cloudflare happily kept serving the cached HTML. Everything looked fine from the outside. Embarrassing. Scratching our own itch here, we just shipped magic variables: short placeholders you can drop into your monitor URL, request headers, or POST payload. Right before each check, we replace them with fresh values, so every request is unique enough to slip past any cache and actually hit your origin.
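The substitution step described above can be sketched roughly as follows. The excerpt does not name the actual placeholders, so `{{timestamp}}` and `{{random}}` are hypothetical stand-ins, and `expand_placeholders` is an illustrative helper rather than Oh Dear's implementation.

```python
import re
import secrets
import time
from typing import Optional

# Hypothetical placeholder syntax: {{name}}. The real product may use other names.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def expand_placeholders(template: str, now: Optional[float] = None,
                        token: Optional[str] = None) -> str:
    """Replace known placeholders with fresh values right before a check runs."""
    values = {
        "timestamp": str(int(time.time() if now is None else now)),
        "random": secrets.token_hex(8) if token is None else token,
    }

    def substitute(match: "re.Match[str]") -> str:
        # Unknown placeholders are left untouched rather than guessed at.
        return values.get(match.group(1), match.group(0))

    return PLACEHOLDER.sub(substitute, template)


print(expand_placeholders("https://example.com/health?cb={{random}}"))
```

Because the query string differs on every request, a cache keyed on the full URL has no stored copy to serve and must forward the request to the origin.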

Get Lightrun AI Skills: Expert Workflows for AI Agents

Today we’re launching Lightrun AI Skills, structured, repeatable investigation workflows built for AI coding agents. With Lightrun MCP, agents like Claude Code, Codex, and Cursor can already instrument live production services and reason over live runtime evidence without a redeployment. But AI agents remain non-deterministic by design, using the same tool differently every session.

SOA Expire Value Out of Recommended Range: What It Means and How to Fix It

The Start of Authority record is the first record in any DNS zone file. It's the record that says "this zone exists, this is the primary nameserver in charge, and here are the timing rules that govern how this zone behaves." A full SOA record looks like this when you query it: Each of those numbers does something different. The one that triggered your warning is the Expire value, the fourth number. In this example, 1209600 seconds, which is exactly 14 days.
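As a rough illustration of the field order, the sketch below parses the data portion of an SOA record and checks the Expire value. The sample record is hypothetical, and the 14 to 28 day range is an assumption based on the two-to-four-week suggestion in RFC 1912; a given checker may use different thresholds.

```python
from typing import NamedTuple


class SoaTimers(NamedTuple):
    serial: int
    refresh: int
    retry: int
    expire: int
    minimum: int


# Assumed range (RFC 1912 suggests 2 to 4 weeks); your tool's limits may differ.
EXPIRE_MIN_SECONDS = 14 * 24 * 3600   # 1209600
EXPIRE_MAX_SECONDS = 28 * 24 * 3600   # 2419200


def parse_soa(rdata: str) -> SoaTimers:
    """Parse 'mname rname serial refresh retry expire minimum' into its numbers."""
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError("expected 7 SOA fields, got %d" % len(fields))
    return SoaTimers(*(int(value) for value in fields[2:]))


def expire_in_range(timers: SoaTimers) -> bool:
    """Return True when Expire sits inside the assumed recommended range."""
    return EXPIRE_MIN_SECONDS <= timers.expire <= EXPIRE_MAX_SECONDS


# Hypothetical record for illustration only.
sample = "ns1.example.com. hostmaster.example.com. 2024010101 7200 3600 1209600 3600"
timers = parse_soa(sample)
print(timers.expire, expire_in_range(timers))
```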

Reverse DNS Does Not Match SMTP Banner: What It Means and How to Fix It

When your mail server connects to a recipient server to deliver email, the very first thing it does after the TCP connection is established is introduce itself. That introduction happens through the EHLO command (or its older predecessor HELO), and it looks like this: That hostname in the EHLO line is your SMTP banner. It is what your server claims to be.
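The comparison behind this warning can be sketched as below, assuming you already have the hostname from the EHLO line and the PTR name for the sending IP. The helper names and hostnames are hypothetical, and the sketch performs no DNS lookups.

```python
def hostname_from_greeting(line: str) -> str:
    """Extract the hostname argument from an 'EHLO host' or 'HELO host' line."""
    parts = line.split()
    if len(parts) < 2 or parts[0].upper() not in ("EHLO", "HELO"):
        raise ValueError("not an EHLO/HELO line: %r" % line)
    return parts[1]


def normalize_hostname(name: str) -> str:
    """Lowercase and drop the trailing root dot so equal names compare equal."""
    return name.strip().rstrip(".").lower()


def banner_matches_ptr(banner_hostname: str, ptr_hostname: str) -> bool:
    """Return True when the announced hostname equals the reverse DNS name."""
    return normalize_hostname(banner_hostname) == normalize_hostname(ptr_hostname)


announced = hostname_from_greeting("EHLO mail.example.com")
print(banner_matches_ptr(announced, "mail.example.com."))
print(banner_matches_ptr(announced, "host-203-0-113-5.isp.example.net."))
```

PTR answers usually carry a trailing dot and may differ in case, so normalizing both sides avoids false mismatches.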

How Honeycomb Is Embracing the Challenges of End-to-End Observability with Embrace

Customers regularly come to us looking to solve their observability problem by connecting the dots from frontend to backend. It sounds straightforward in theory, but in practice it's one of the hardest problems in modern application monitoring. The frontend monitoring tools they already have in place tend to be proprietary or narrowly scoped to frontend needs, leaving them without the context-rich backend data that makes real triage possible.

Cribl Notebook templates in Cribl Search

Investigations are time-sensitive, and analysts shouldn’t waste time recreating the same workflows or rewriting familiar queries. Whether troubleshooting infrastructure, investigating suspicious IPs, or analyzing host activity, teams often rely on duplicating old processes and copying query snippets — a slow, inconsistent approach that’s hard to scale.

Server Monitoring: The Complete Guide to Metrics, Tools, and Best Practices

If you run IT operations, you already know servers carry most of what your business depends on: When a server slows down or goes offline, the impact spreads fast, and the team feels it before the dashboard does. That's the core problem server monitoring is built to solve. It watches the health and performance of your servers continuously, so issues get caught early instead of becoming outages. The cost of getting these wrong keeps climbing.

Autonomous IT Needs Internet Performance Monitoring: Why Internal Visibility Alone Is No Longer Enough

Internal visibility isn’t enough for modern incident response. Your app team has three dashboards open and everything looks fine. CPU is healthy, memory is stable, the application servers are responding normally. But users are still complaining. The checkout page is slow. Logins are timing out. Support tickets are piling up. And your monitoring tools have nothing useful to say about why.

Slack outage on May 14, 2026

On May 14, 2026, users across multiple regions began reporting problems with Slack, including messaging failures, sign-in issues, and problems loading attachments and images. While the outage did not affect every user, reports quickly showed the issue was widespread enough to disrupt business communication for organizations around the world. StatusGator identified the incident through customer outage reports and triggered an Early Warning Signals alert at 14:21 UTC.

How to embed Grafana dashboards into web applications

Note: This post was originally published in October 2023 and was updated in May 2026 to include new methods and options for embedding Grafana dashboards. Grafana dashboards are powerful and flexible tools for observing applications and infrastructure, so it’s no surprise we get a lot of questions from the community about how to embed them into their web applications.

Web API: your complete guide for custom integrations

Data is almost always scattered across too many tools. Usually, if you want to see it all in one place, you're stuck building messy pipelines or paying for a warehouse you don't really want. SquaredUp is a window into all those tools. It lets you see what’s happening across your entire stack in real time without moving any of the data. Think of it as a universal translator that lets your tools talk to each other so you can stop the manual digging and just see the big picture.

Product Update - May 2026

IncidentHub's latest product updates include a new Business plan with Teams support, early outage detection v1, and more integrations with ticketing systems. The public status now includes a disable feature. As before, many features are driven by feedback, and I am grateful to all our customers who have shared their feedback with us.

Best Network Monitoring Tools for 2026 (Top 12 Compared)

In 2026, the best network monitoring tools are Kentik, Datadog, SolarWinds NPM, LogicMonitor, Cisco ThousandEyes, Dynatrace, Auvik, Paessler PRTG, ManageEngine OpManager, Zabbix, OpenNMS, and WhatsUp Gold — spanning four overlapping categories: network intelligence platforms, full-stack observability, digital experience monitoring (DEM), and traditional network performance monitoring (NPM), including open-source tools.

Action trails: The missing link between AI and human trust

When people talk about trusting AI, they usually focus on the interface. It summarizes and uses confident language with a level of clarity that feels reliable. But that’s all window dressing. None of it builds trust. Trust doesn’t come from what the AI says. A verifiable record of what the AI did makes it trustworthy.

Proactive vs Reactive Monitoring: What are the Differences?

A single hour of unplanned downtime can cost a mid-sized enterprise more than $300,000, according to an ITIC report. Most of that cost comes from one place: teams find out about the problem after users do. That is the core limitation of reactive monitoring. It tells you something has failed, but it doesn't tell you something is about to fail. This guide is for IT operations leads, platform and SRE engineers, and IT directors deciding how to evolve their monitoring practice.

Building Real-Time Telemetry Pipelines for IRIG 106 compliance

Every second of a flight test produces a torrent of telemetry from engines, sensors, and control systems. Aerospace teams have captured this data for decades to verify performance and maintain safety, yet analysis often happens long after the mission ends. Engineers wait for downloads, conversions, and compliance checks before they can interpret results. That delay turns telemetry into a historical record instead of a feedback loop.

When your agents hallucinate at 2 am, it is not a model problem

The first time an AI assistant suggests "restart the service" during a live incident and nobody on the bridge can tell whether that suggestion came from a current runbook, a stale wiki page, or thin air, you stop caring about model benchmarks. You start caring about what the agent actually knew, where that knowledge came from, and whether you can trust the chain of reasoning behind it.

How to Identify LAN Issues (Local Area Network Problems)

Here is a reality that every network admin eventually runs into: users report slow apps, dropped calls, and broken connections, and the first instinct is to blame the ISP or the cloud provider. The ticket gets escalated, the ISP pushes back, and hours later, you find out the problem was sitting inside your own building the whole time. A saturated switch port. A misconfigured VLAN. A flaky patch cable in the server room.

ITSM Maturity Playbook Live, Episode 1: Incident Management Masterclass

Join this 5-part series designed to help IT teams move from reactive, fragmented processes to a more structured, connected way of working. Each session focuses on a core area, from incident resolution and CMDB visibility to employee experience, service catalog design, and change governance, giving you practical frameworks you can apply right away. You’ll walk away with: Faster, more consistent incident resolution.

Why Network Operations Needs Data-Centric AI

The discussion around AI in infrastructure and operations has become increasingly model-centric. Teams want to know what model a platform uses, how current it is, how much reasoning capacity it has, and how quickly it can be updated as the model landscape shifts. Those are reasonable questions, but they tend to arrive too early. In production operations, the more consequential question is what happens to the data before any model is asked to interpret it.

7 Proven Steps to Maintain Operational Continuity During S/4HANA Migration

Migrating to SAP S/4HANA is one of the most consequential system changes your organization will undertake. The technical complexity alone is significant. But the real risk is operational: maintaining uninterrupted service delivery while transforming the core systems your business depends on. Failure to manage this well causes outages, data inconsistencies, user disruption, and cost overruns. None of those are acceptable outcomes. The good news is these risks are manageable.

Getting started with Checkly dashboards

Checkly is a modern reliability platform that combines testing, monitoring and observability in one place. Its integration with Playwright and languages such as TypeScript means that developers can write tests using tools they are familiar with and then run them in Checkly. Its Monitoring as Code philosophy also means that Checkly tests can be incorporated into CI/CD pipelines.

From Phishing to SQL Injection: How Breaches Actually Happen

Critical vulnerabilities are critical because they're easy to exploit — but most breaches don't even need them. Tony explains why phishing remains the dominant attack vector, why strong instrumentation matters for forensics (tracing an API call through a database to see exactly what was leaked), and how observability data becomes security data when something goes wrong. The system is harder to breach than the human. And that's the whole game.

One Collector, Two Teams: How Bindplane Bridges Security and Observability with OpenTelemetry

Observability engineers will spend weeks tuning instrumentation. Security engineers? They want a collector installed and logs flowing, yesterday. And that's actually the magic of OpenTelemetry + Bindplane: from day one you're routing firewall logs, endpoint data, server logs straight into your SIEM with zero instrumentation lift. One toolchain. Two teams. No silos. Filmed at Google Cloud Next '26 in Las Vegas. bindplane.com

Best APM for Small Development Teams in 2026

Last updated: May 2026. If your team is 2 to 20 developers and you do not have dedicated DevOps, SRE, or platform engineering, most APM tools were not built for you. They were built for the team that has you: a team with specialists who can tune dashboards, configure alerting pipelines, manage data retention policies, and explain the monitoring system to everyone else. You do not have that team. You have developers who also handle deploys, on-call, and debugging production issues between writing features.

Honeycomb Innovation Week: Announcing Our Partnership With Embrace

Honeycomb and Embrace are extending the rigorous, data-driven practice that Honeycomb pioneered for foundational to mobile and web, giving site reliability and platform teams a complete, correlated picture of system health. The strategic partnership makes understanding performance and reliability for every user and every screen part of the observability practice, bringing new depth and standardization to how teams measure end user impact.

New ways to agentically build and edit dashboards

The traditional dashboard workflow, teams slowly handcrafting visualizations to track critical KPIs, is dying in a world of AI agents. A few years ago, in a pre-agentic-everything world, we tried to make it easier for developers to monitor critical experiences. We introduced Insights pages, which were pre-configured dashboards any Sentry user could adopt instantly that surfaced common health signals, like Web and Mobile Vitals.

Simplify micro-frontend observability with Datadog RUM

Micro-frontend architectures, where independent teams build and deploy separate parts of a frontend application, introduce an observability challenge: Telemetry data is fragmented across services, making it difficult to determine which micro-frontend caused a performance degradation or error spike.

Attribute AI costs across providers with Datadog Cloud Cost Management

AI adoption is accelerating across organizations, and spending often follows a similar pattern: rapid growth, multiple providers, and limited visibility into where costs originate. Each provider exposes billing data differently, with distinct schemas, dimensions, and interfaces. FinOps and engineering teams often spend significant time consolidating fragmented data, only to end up with partial attribution and limited context about who or what generated the AI spending.

Improvements to our status pages as we tackle a DDoS

The uptime & availability of our status pages hasn't been great these past few days. The root cause is a persistent and pretty aggressive DDoS attack targeted at our own status page, status.ohdear.app. As a result, the overload on our systems also affected all other status pages we host for clients. We're not yet at GitHub or Claude levels of uptime sadness, but this isn't acceptable to us. In this post, I'll share what's happening and what steps we've already taken.

You Are Building With AI. Who Is Watching What It Ships?

AI coding assistants have made it possible for a single developer to build and ship a production application in a weekend. Claude Code, Cursor, GitHub Copilot, and similar tools can scaffold a Rails app, write the models, generate the views, wire up the API, and push to production before Monday. This is genuinely exciting. It is also genuinely dangerous if you do not have monitoring in place before you ship.

Honeycomb Achieves the AWS Financial Services Competency

Honeycomb is proud to share that we have achieved the Amazon Web Services (AWS) Financial Services Competency. This recognition validates our technical expertise and proven customer success in assisting financial services organizations with building, running, and understanding their production systems on AWS. Securing this competency is a direct response to our customers’ feedback in this space: observability in regulated, high-stakes environments requires more than dashboards and alerts.

3 things you need to know about headless observability

If you're building agents and trying to figure out the best way to actually make them successful in production, you're going to want to know about headless observability. Headless observability means an agent can access information about the health of your system through a CLI instead of clicking around dashboards. It's the data layer that's going to unlock serious autonomy and allow you to scale with agentic workloads.

Cloud Outage History: Six Years of Recurring Failures

Cloud infrastructure has never been more reliable in theory. In practice, the last six years of cloud outage history have delivered some of the most disruptive incidents on record. Not because cloud providers got worse, but because the systems built on top of them got larger, more interconnected, and more brittle in ways that don't show up until everything breaks at once.

Get deeper insights with historical outage reports

StatusGator now includes a new Outage Reports tab on the service monitor detail page, giving users more visibility into recent service disruptions directly where they monitor services. Users can now quickly review recent outage activity for a specific monitored service without leaving the detail page.

How to Monitor Applications and End User Experiences

In this video, see how Skylar One helps you understand the impact of changes on application performance and the end user experience. By tracking service level metrics across an e-commerce environment, you can quickly identify when performance degrades and how it affects user behavior. Explore how Skylar One enables: With Skylar One, teams can quickly connect performance changes to real user impact, helping ensure a consistent and reliable digital experience.

Total Economic Impact study finds LogicMonitor Edwin AI delivered a 313% ROI and payback in 6 months or less

Forrester Consulting’s Total Economic Impact study found that a composite organization based on interviewed customers achieved 313% ROI and payback in less than 6 months with LogicMonitor Edwin AI. AI for IT operations has a credibility problem. The market is crowded with claims about speed, automation, and intelligence, while buyers are left doing the harder work of separating measurable impact from vendor language.

True Visibility: How Liang Chen is Rethinking Network Monitoring

What happens when deep networking expertise meets low-level programming and a passion for invention? In this episode of Next-Gen Network Heroes, host Bob Slevin sits down with Liang Chen, Senior Network Architect at Texas Children's Hospital and a true innovator in network performance and visibility. With more than 25 years of experience in networking, plus advanced expertise in programming languages like C and Assembly, Liang has built his own next-generation traffic analysis platform from the ground up—designed to provide real-time, packet-level visibility at massive scale.

Enhancing Your Search Skills with Liang Chen

What does it take to reinvent network visibility from the ground up? In this episode of Next-Gen Network Heroes, Bob sits down with Liang Chen, Senior Network Architect at Texas Children’s Hospital and creator of a next-generation network traffic analyzer built for real-time, packet-level visibility. Liang shares how he built a platform capable of analyzing traffic at up to 200Gbps with zero packet loss—unlocking deeper network forensics and faster troubleshooting in mission-critical environments.

Tips and Tricks for Handling Secrets in Icinga 2

Today, we are going to look at a few things related to handling secrets. While Icinga 2 has no dedicated mechanisms for secret handling, there are a few tricks you can do with standard features. This is not meant as a step-by-step tutorial, but rather as an inspiration where you can adopt the ideas that make sense in your setup.

Observability for the Agent Era: Day 1 | Keynotes

Honeycomb's Innovation Week: Observability for the Agent Era (May 12-14). For Day 1 of Innovation Week, Honeycomb co-founders Christine Yen and Charity Majors will share what it actually takes to understand and debug systems in the agent era, and what the best engineering teams are doing differently. A 3-Day Virtual Event for Teams Building the Future. May 12: Get insights on how the best engineering teams are tackling the challenges of the agentic era.

Redgate Monitor | AWS Database Migration Readiness

In this demo, we explore the AWS Database Migration and Modernization (D2M) framework, from Align and Assess through to Optimize, and show how Redgate Monitor helps you to establish performance baselines, right-size target environments, and continuously optimize RDS and Aurora spend for full cloud cost visibility. Learn how Redgate Monitor can give you a single view of your entire AWS and on-premises, multi-database environment.

Why Siloed Monitoring Increases Your MTTR and How to Resolve It

Are you spending more time figuring out whose problem it is than actually fixing it? If that feels familiar, you are not alone. Many IT teams start their day with multiple dashboards and tools, yet still struggle to understand what is wrong when something breaks. Everything may look fine in one view, and fine in another, but the customer impact tells a different story. Incidents end up taking longer to resolve than they should. This is not about effort or capability.

Builder in the loop: Henry Andrews on building AURA like production software

An interview series with the people building Mezmo’s open-source agentic harness for production operations. Builder in the loop is a Mezmo interview series focused on the engineers, product leaders, and operators shaping AURA, our open-source, MCP-native agentic harness for production operations. The goal is to get past the polished product layer and talk through the decisions that matter when AI starts interacting with real systems. What should agents be allowed to do?

ActiveMQ Message Persistence: KahaDB, Artemis Journal & JDBC

Every persistent message in ActiveMQ must survive a broker restart. That guarantee, the contract behind DeliveryMode.PERSISTENT, is what separates a messaging system from a memory buffer. It is also what makes message persistence configuration the most consequential decision in ActiveMQ architecture.

Turn StatusCake into a verified alerting and escalation flow with Hermes

Most monitoring setups have the same weak spot. Detection is easy. Decision-making is not. StatusCake is good at telling you that something might be wrong. What happens next is where things sometimes get messy. One alert goes straight to a chat room. Another wakes the wrong person. A third ends up getting missed because the site had a brief wobble and recovered before anyone looked. Hermes is useful in that gap.

What Is Log Monitoring? Pipeline, Pitfalls, and Practices for 2026

Catching a cascading failure in the first 90 seconds is one of the better feelings in production engineering, and it almost always comes back to your log monitoring pipeline doing its job upstream of the alert. The teams that land there consistently treat log monitoring as a real-time detection layer in its own right, and the choices you make in that pipeline shape how every incident plays out for years.

From Monitoring to Observability: How DEX Integrations Strengthen IT Visibility and User Productivity

When I started working in IT in the late ’90s, IT performance was always measured by the health of infrastructure: CPU utilization, network latency, and server uptime. For many organizations, little has changed in the last 30+ years. We became very good at keeping systems alive, yet users still struggled to get work done. That disconnect is exactly why Digital Employee Experience (DEX) has emerged as a critical discipline. But DEX on its own is not the end goal.

What Is APM? A Guide to Application Performance Monitoring

A well-instrumented service tells your on-call engineer which deploy broke checkout, which span ate the latency budget, and which line to revert before the support queue fills up. Getting there depends on how cleanly your application performance monitoring layer turns telemetry into answers. The sections ahead walk through how APM works, the metrics and components worth tracking, the cloud-native challenges at scale, and how to evaluate APM tooling against your real workload.

What Is an Incident Commander? Role, Skills, and Best Practices

The fastest incident response teams treat coordination as a craft. Someone owns the call, drives the decisions, and keeps everyone moving in the same direction while the team puts the system back together. That person is the incident commander (IC), and getting the role right is what separates your 15-minute fix from a four-hour war room where nobody’s sure who’s making the call.

Honeycomb Innovation Week: Debugging Agentic Workflows with Ken Rimple

Canvas skills are how your team's runbooks and tribal knowledge become an active part of the investigation instead of a document someone has to remember to open. Pre-built skills cover the most common investigation patterns out of the box. Custom skills let you encode the specific context, thresholds, and decision logic your team has accumulated, so every auto-investigation starts with your best thinking already applied.

Observability for the Agent Era: Day 2 | Launches

Honeycomb's Innovation Week: Observability for the Agent Era (May 12-14). For Day 2 of Innovation Week, Honeycomb's product and engineering teams will take you inside the new capabilities purpose-built for the agent era. Expect live demos, real scenarios, and a hands-on look at what it means to own observability for the agentic era, with AI in Honeycomb to observe AI in production. A 3-Day Virtual Event for Teams Building the Future. May 12: Get insights on how the best engineering teams are tackling the challenges of the agentic era.

Innovation Week Day 2: Observability for AI, and Observability With AI

AI is reshaping the SDLC in two directions at once. AI-generated code is shipping faster and with less human supervision than ever before, while agents and LLMs are running directly in production, where they behave very differently from traditional software: non-deterministic, with a wider blast radius than any single function or component, with no stack trace to catch when something goes wrong.

Why Some Roles Care About Open Source & Why Others Don't: 4th Annual Observability Survey | Grafana

Note: We're happy to share that since the recording of this video, OpenTelemetry *has* graduated from the CNCF! SREs, developers, and CTOs say open source is essential to observability. Engineering managers and directors? Not so much. Grafana's 4th annual observability survey — 1,363 responses — reveals a split inside the same orgs that's worth a conversation.

Honeycomb Innovation Week: Observability With AI With Kale and Taylor

Watch this video to see the re-imagined Canvas in action, where auto-investigation has already ranked your hypotheses before you open the tab, multiplayer agents build on each other's work in real time, and a custom skill encoding your team's own runbook can reprioritize the entire incident before you've had your morning coffee.
Sponsored Post

The SDLC: phases, popular models, benefits & more

The Software Development Life Cycle (SDLC) describes the process we follow to deliver software to customers. It captures each step of creating software, from ideation to delivery and eventually to maintenance. In this post, we've broken down everything you need to understand the SDLC.

Stop Guessing, Start Fixing: AI Root Cause Analysis

Automating root cause analysis is often regarded as the holy grail of IT operations. A solution capable of automatically identifying issues, resolutions and even prevention. Performed correctly, automated root cause analysis accelerates MTTI (Mean Time to Identify) and MTTR (Mean Time to Resolution). But for many platforms, this goal remains elusive: complexity, differences between deployments and different architectures make automating root cause challenging.

Contributing Distributed Partition Ownership to the Azure Event Hub Receiver

If you're running OpenTelemetry collectors against Azure Event Hubs, distributed partition ownership and checkpointing just got significantly better. Your fleet now self-organizes. Failover is automatic. Restarts don't lose data. Here's how we got here.

Monitor CAA Records with DNS Check

DNS Check now supports monitoring CAA records. A CAA record (Certification Authority Authorization record) tells public certificate authorities (CAs) which of them, if any, are allowed to issue TLS/SSL certificates for your domain. Public CAs have been required to honor these records since 2017, so CAA records act as an access control list for certificate issuance.
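The access-control idea can be sketched as a simplified check over a domain's CAA records. This is an illustration under stated assumptions, not a full RFC 8659 implementation: it only looks at the `issue` property and ignores `issuewild`, critical flags, and the climb to parent domains. The record tuples and CA names are hypothetical.

```python
from typing import Iterable, Tuple

CaaRecord = Tuple[int, str, str]  # (flags, tag, value)


def ca_may_issue(records: Iterable[CaaRecord], ca_domain: str) -> bool:
    """Simplified: decide whether ca_domain is permitted by the 'issue' properties."""
    issue_values = [value for _flags, tag, value in records if tag.lower() == "issue"]
    if not issue_values:
        # No 'issue' property present: issuance is not restricted by CAA.
        return True
    # The CA domain comes before any ';'-separated parameters; a bare ';' names no CA.
    allowed = {value.split(";", 1)[0].strip().lower() for value in issue_values}
    allowed.discard("")
    return ca_domain.strip().lower() in allowed


records = [(0, "issue", "letsencrypt.org"), (0, "iodef", "mailto:security@example.com")]
print(ca_may_issue(records, "letsencrypt.org"))
print(ca_may_issue(records, "other-ca.example"))
```

A monitor built on this idea would alert when the record set changes, since removing or widening an `issue` entry silently changes who may issue certificates for the domain.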

AI-assisted testing, extensions updates, and more: k6 2.0 is here

For years, teams have relied on k6 to take a more proactive approach to performance testing, ensuring they can catch issues early and deliver more reliable user experiences. That approach has helped make k6 one of the most widely used performance testing tools in the open source community today, with more than 30k stars on GitHub. Last year, we introduced k6 1.0, a major release that brought TypeScript support, native extensions, revamped test insights, and production-grade stability guarantees.

Innovation Week Day 1: The SDLC Is Collapsing, and Observability Has Never Mattered More

The software development lifecycle is collapsing. The multi-stage pipeline that defined how software got built and shipped for decades is compressing into rapid loops of intent and validation, with agents now part of the teams building and running it. Day 1 of Innovation Week was about what that shift means for how software gets validated, where observability fits, and the problems that have always been hard but are now genuinely urgent.

Dashboard Playlists: Cycle Through Dashboards in TV Mode

When we shipped TV mode, we heard almost immediately: “Great, but I have five dashboards and one screen.” A single dashboard on a wall display covers one view of your infrastructure. If you want to rotate between your network overview, database health, application metrics, and infrastructure summary, someone has to walk over and click, or you’re buying more screens. Dashboard playlists solve this.

What is the Mean Time to Resolution (MTTR)? Why It Matters and How to Resolve

How quickly can you restore service when an incident hits your system? Most IT teams are not slowed down by detecting incidents. The challenge starts after something breaks, when the goal is to bring services back online as quickly as possible. Modern systems are highly distributed. Alerts arrive from multiple tools, dependencies are complex, and it is often difficult to immediately understand what actually failed.
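For readers who want the arithmetic, here is a minimal sketch of one common way to compute MTTR: the average time from detection to resolution across incidents. The function name and the sample timestamps are assumptions made for illustration; teams differ on where they start and stop the clock.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - detected) over a list of datetime pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents lasting 30, 60, and 90 minutes.
start = datetime(2026, 5, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=60)),
    (start, start + timedelta(minutes=90)),
]
mttr = mean_time_to_resolution(incidents)
print(mttr)
```

Because the result is a plain average, a single long incident can dominate it, which is one reason some teams also track the median.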

What Leading Engineering Teams Teach Us About Operational Truth

Modern operational environments are intricate ecosystems shaped by distributed architectures, accelerating change cycles, and a constant influx of telemetry. The complexity itself is not the issue. The issue is how teams construct understanding inside that complexity. After years of expansion across cloud, edge, third-party services, and internal modernization efforts, many organizations now have abundant data but limited confidence in the meanings behind it.

Getting Started with XcodeBuildMCP: Let AI Agents Debug Your iOS Apps

XcodeBuildMCP gives AI agents the ability to build, test, and debug native iOS and macOS apps. In this hands-on workshop, we show you how to use the open source MCP server to unlock the full developer loop — build, run, debug, interact, and verify — without leaving your preferred AI coding environment.

OpenTelemetry Fleet Management: Scalable Control

OpenTelemetry has turned observability pipelines into production infrastructure, but managing them at scale often creates a massive operational burden. In this demo, we show how Coralogix Fleet Management acts as the central control plane for your OTel ecosystem, providing the governance and orchestration required for modern DevOps. Stop the "manual marathon" of PRs and Helm upgrades. Move toward a safer, more predictable operating model where telemetry is consistent, audited, and scalable.

Turn Noisy Logs Into Structured Data with Uptrace Grouping Rules

Same log pattern. Hundreds of useless groups. In this video, we show how to use Uptrace Grouping Rules to automatically turn noisy logs into structured, searchable data, without changing application code. Perfect for OpenTelemetry users, backend engineers, SREs, and anyone dealing with noisy logs.

Security Integrations in Observability Self-Hosted

Integrating security data with observability data provides a comprehensive view for better threat detection and response. Security observability helps connect the dots between seemingly innocent events that, when correlated, reveal complex attack patterns. SolarWinds security products integrate into observability self-hosted, including Security Event Manager for log data and event correlation, Access Rights Management for identifying potential attack vectors, configuration management for compliance monitoring, and Patch Manager for tracking critical updates.

Why the Operational Complexity of E-Commerce Reaches a Critical Point in 2025

Modern webshops no longer run on a single system. Behind the digital storefront lies an architecture made up of dozens of components: from product information management to caching layers, from search engines to payment providers. For operations teams, this means the classic LAMP stack from 2010 is now a distant memory.

From vibe code to production-ready: observability for Next.js and Supabase apps

The way we build software has drastically changed over the past few years. What hasn’t changed is that this software ends up in front of real people: you, me, my mom. And when those users inevitably run into something broken, you as the application’s developer need to be equipped with the right tools, context and understanding of what broke, where it broke, and how to fix it as quickly as possible. Every day we’re inching closer to self-healing software.

Why Alert Fatigue Solutions Still Miss the Root Cause

Alert fatigue solutions have never been better, but on-call engineers are still burning out. Threshold tuning, AI triage, and alert correlation reduce the noise, but every alert that clears filtering lands with the same incomplete telemetry and triggers the same manual investigation cycle. This post explains why the evidence gap survives every fix, and how runtime context changes that.

The Best Kubernetes Monitoring Tools of 2026

Effective Kubernetes monitoring in 2026 is critical due to increased cluster scale and microservices complexity, demanding a shift toward unified observability (logs, metrics, and traces). The core focus is leveraging AI-driven features to automate anomaly detection, correlate diverse data, and significantly reduce Mean Time to Recovery (MTTR).

Best Elixir APM Tools in 2026: A Developer's Guide

Last updated: May 2026. Elixir applications have performance characteristics that are genuinely different from Ruby or Python. The BEAM virtual machine handles concurrency through lightweight processes, supervision trees restart failed processes automatically, and Phoenix channels can hold tens of thousands of persistent connections on a single node. These are strengths, but they also mean that the performance problems you encounter are different from what most APM tools were built to detect.

What is an Enterprise Knowledge Graph? Definition, Benefits, and Use Cases

Are your AI systems giving answers your teams cannot trust? Most enterprises deploy LLMs expecting reliable outputs, but the results often feel inconsistent or incomplete. The problem is the missing structure behind it. Enterprise data is usually fragmented across multiple systems, teams, and tools. Your AI does not understand how customers, products, policies, and operations connect. Without that context, it fills gaps with assumptions, which leads to unreliable results.

Making Semantic Conventions Work for You With OpenTelemetry Weaver

Your dataset has hundreds of attributes. Some are self-explanatory: http.response.status_code, server.address. Others are not: meta.refinery.reason, dataset.slug, sli.latency_target_ms. If you don't know what an attribute means, you can't write a good query. And if an AI agent doesn't know what it means, it guesses.

Easily connect any AI assistant (Claude, Codex, ...) to your Oh Dear data

Oh Dear keeps a watchful eye on your websites: uptime, performance, SSL certificates, broken links, DNS, cron jobs. If something can quietly break, we're already checking it for you. Today we're connecting that data to a new place: your AI assistant. We just shipped an MCP integration. If you use Claude, Cursor, or any other client that speaks the Model Context Protocol, you can now ask questions like "any broken links on my site?" or "when does my certificate expire?" in plain language.

Migrating Your DX NetOps Integrations from OData 2 to OData 4

If you integrate DX NetOps with external dashboards, reporting engines, or IT service management tools, you likely rely on our API framework. We are currently migrating this framework from OData 2 to OData 4. This transition requires you to update your existing integrations so they continue to function properly. Let me walk you through exactly what is changing, how to identify your active API queries, and the specific adjustments you need to make to your setup.

What is AI Agent Orchestration? Concept + How It Works

Have you tried using AI at work and felt it works well for small tasks, but not beyond that? It can handle simple things like creating a summary, writing a draft, or answering a question. This works because the task is clear. But most tasks are not that simple. They involve multiple steps. One step depends on another. Data comes from different systems, and some decisions need checks before moving ahead. This is where a single AI system starts to struggle.

Monitoring Your Azure to Azure Local Migration: One Dashboard for Both Sides

More organizations are moving workloads from Azure public cloud to Azure Local (formerly Azure Stack HCI) than most people realize. The reasons vary: data sovereignty requirements, latency-sensitive workloads that need to be closer to the edge, cost optimization for predictable workloads where reserved cloud capacity doesn’t make financial sense, or regulatory constraints that require data to stay on-premises.

AURA in Practice: Mezmo's SRE bot, demo walkthrough

A walkthrough of the Slack-based SRE bot Mezmo's engineering team built on AURA, the open-source agent harness, running against Mezmo's own production tooling. Adrian Furlong shows the bot answering questions in a DM with tool calls visible inline, then in a shared channel where it reads the conversation before responding. He opens a fresh PagerDuty incident on camera. The webhook fires AURA, and within seconds, the agent posts a triage note back on the incident and a structured analysis in the dedicated incident channel.

Managing OpenTelemetry at Scale: Why OTel Pipelines Need a Control Plane

OpenTelemetry made telemetry possible everywhere – turning observability pipelines into distributed production infrastructure. Distributed infrastructure requires a control plane for inventory, governance, and safe change. At 500 collectors across hybrid environments, operational overhead becomes a production risk. The moment telemetry pipelines become a distributed infrastructure, they inherit the operational problems of one.

Geo Maps: See Where Your Infrastructure Lives

When your infrastructure is spread across regions, data centers, branch offices, or edge locations, knowing where a node is physically located matters more than people usually admit. During an incident, “the node in the Singapore POP” communicates faster than a hostname. When you’re planning capacity, seeing geographic clustering tells you something that a flat list of nodes doesn’t.

AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. The original region of AWS located in Northern Virginia, us-east-1 or just “US East” as it is known, has been the subject of some of the internet’s most high profile and destructive outages and remains Amazon’s least reliable region.

A Runnable Reference Architecture for Battery Energy Storage Systems on InfluxDB 3

A battery is a complex electrochemical system where safety and revenue are decided in milliseconds. Cell temperatures, voltages, and state of charge change in real-time; dispatch decisions and thermal alarms must fire in real-time. Anything in between—your data pipeline, your historian, your alerting layer—has to disappear into the background.

Federated Search | From Silos to Insight | AWS S3 Schema Discovery with Splunk-Managed Tables

This walk-through shows how Splunk's crawler, available through the Data Management app, can discover schema and partition keys for S3 backed datasets and create Splunk managed catalog tables. Once the data is mapped, analysts can search AWS S3 data through Splunk and bring it into broader security, observability, and operational workflows.

How Modern Ops Lost Their Bearings

Modern operations carry a quiet contradiction. Organizations have never had more data, more dashboards, or more instrumentation, yet teams increasingly struggle to gain a reliable sense of what the environment is actually doing. The problem is not the absence of information. It is the absence of bearings. This drift did not happen suddenly. It accumulated across years of transformation.

Multi-tiered Observability: A Practical Way to Handle Diverse Workloads

Observability in large companies is rarely one-size-fits-all. The VictoriaMetrics topologies guide shows why different deployment patterns are needed as scale, isolation, and reliability requirements grow. Different workloads require different trade-offs: some need long retention for audits and trend analysis, while others need higher resolution for debugging. Business-critical systems also demand dependable alerting and high availability, often with several 9s of reliability.

Diagnose and resolve database performance issues faster with Database Investigator

When your database performance degrades, diagnosing the root cause is rarely quick or straightforward. Your existing tools might surface metrics like CPU utilization, wait events, and query duration, but then leave you to correlate the data and identify what went wrong. Worse, what first appears to be the root cause can often just be a downstream effect of multiple interrelated issues.

Zero-Code OpenTelemetry for Vert.x

Drop a JAR on the JVM. Get distributed tracing, RxJava context propagation, log-trace correlation, and Vert.x internal metrics. No code changes. No Maven dependency. Java 8–21. Inside the design of last9/vertx-opentelemetry v2.3.4. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

From noise to knowledge: How GenAI is revolutionizing log management and analytics

Focusing on GenAI and logs for IT efficiency: Efficiency is everything for managing today’s digital systems. Technology is constantly transforming, and expanding operations are driving an explosion in data. Consequently, data ingest and storage costs have soared. But it’s not just data storage costs that keep teams behind. The challenge of managing all that observability data forces IT teams to choose between efficiency and the bottom line.

Monitor Unreal Engine Game Performance with Application Metrics

Your Unreal game can ship with zero errors and still not feel great. Stutters during combat, a frame-rate cliff on the big boss, rubber-banding in multiplayer, none of it shows up as a crash and none of it shows up in Sentry, leaving you without any visibility into what your players are actually experiencing in the wild. Well, until now. Unreal Engine already gives you plenty of tools to measure game performance and collect runtime stats, but all that data stays on the dev’s machine.

The Journey to Production AI: Five Steps for SRE and Platform Teams

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

Operational Intelligence and the Hidden Structure in System Logs

Most IT teams do not suffer from a lack of data. They suffer from the amount of effort required to make sense of it. Every network device, application, cloud service, and infrastructure component generates a constant stream of machine output. Logs capture state changes, failures, retries, warnings, and thousands of other small signals about how systems behave. The problem is that raw logs are hard to use at operational speed.

Retroactive sampling reduce trace traffic and costs

In this short, our software engineer Zhu Jiekun explains how retroactive sampling can reduce trace traffic and ingestion costs by sending minimal data for sampling decisions and retrieving full spans only when needed, at the cost of added system complexity.
Sponsored Post

How to Reduce MTTR When Third-Party Services Go Down

Most MTTR guides assume the problem is in your infra. For modern apps, it's often not - it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission. The lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.

Why Blast Radius Analysis Does Not End When Alerts Fire

Modern distributed systems fail in ways that can bypass even well-designed isolation patterns. When a failure is actively propagating across services at four in the morning, the question shifts from “how do we limit the blast radius” to “how do we confirm what it actually is.” Monitoring shows which services are in the impact zone, but it cannot show what code path caused the failure to spread, or whether it has stopped.

Data Sovereignty: How to Keep All of Your Services in Europe (AppSignal + Hatchbox)

Over the last decade, a great deal of data privacy regulations have been passed in the European Union. Like it or not, measures like GDPR, the Digital Services Act, and the upcoming Artificial Intelligence Act are exerting increasing influence across industries over how and especially where the data of European customers is stored. In this article, we will explore the ways to keep the simplicity of a Platform as a Service (PaaS) while utilizing only European providers.

Span or Attribute in OpenTelemetry Custom Instrumentation

TL;DR: Attribute. More information on one event gives us more correlation power. It’s also cheaper. When you want to add some information to your tracing telemetry, you could emit a log, create a span, or add a piece of data to your current span. Adding a piece of data to your current span is the best! Usually.

How one partnership powers search for over 2 million WP Engine users

How do you make search faster, smarter, and more scalable? During our recent webinar, I sat down with Luke Patterson, senior product manager at WP Engine, and Delphin Barankanira, independent software vendor partner engineering lead and data & AI specialist at Google Cloud, to answer that question. We dug into the mechanics behind WP Engine’s ability to deliver near-instant updates to over 2 million users.

Faster OpenTelemetry Migrations from Splunk to SecOps with Bindplane

Many security teams are looking to move off Splunk, whether to reduce licensing costs, consolidate their SIEM, or take advantage of Google SecOps' built-in threat intelligence and YARA-L detection capabilities. But migrations aren’t easy, and no one wants to run blind while they evaluate and move to a new platform. With OpenTelemetry and Bindplane, you can easily make the switch to SecOps without impacting your existing stack.

Eliminate noisy log lines with Adaptive Logs drop rules

Most platform and observability teams have logs they know are noise. These could be throwaway health check logs, forgotten DEBUG logs, or verbose INFO logs from little used services that only serve to inflate your bill. Regardless of what they contain and why they're there in the first place, the hard part is getting rid of them. Centralized teams want to easily and quickly prevent these logs from being ingested, without having to work with toilsome infrastructure change management to do so.

Fixing JavaScript observability, one library at a time

Over the past few weeks, we have been driving a cross-ecosystem effort to replace the “monkey-patching” that powers all JavaScript APM tools today with something built into the runtime. Here is why, how, and where it stands. This applies to server-side JavaScript only (Node.js, Bun, Deno, Cloudflare Workers). Browsers do not have diagnostics_channel and lack the async context propagation primitives needed to polyfill it.

7 Best Practices to Improve Digital Employee Experience in Modern IT Environments

Digital employee experience isn’t just a nice-to-have anymore. In hybrid, SaaS-heavy IT environments, Digital Employee Experience (DEX) is where productivity can live or die. Employees don’t care whether the culprit is Wi‑Fi connectivity, CPU/RAM load, poor battery life, or a misbehaving cloud app. They just know work got harder.

ActiveMQ Monitoring & Alerting Setup: The Complete 2026 Guide

Most ActiveMQ outages are not sudden failures. They are visible in the metrics for minutes, sometimes hours, before they become incidents. A memory usage graph climbing past 60%. A queue depth that isn't draining. An enqueue time that doubled after a deployment. A consumer count that dropped from 3 to 1 at 2 AM.

Auvik Aurora and the Future of AI in IT Operations

We built something called Auvik Aurora, and before you scroll any further, I can already hear your thoughts. “Wait a second, Anto. Is this going to be another blog post giving me the hard sell on using AI?” Fair enough, I don’t think anyone would blame you, especially when we’re seeing AI adoption across nearly every industry, tool, hobby, workflow, or even ____. The blank is intentional: AI is everywhere, and chances are that you already know that it matters.

Observability and Security for the AI Era

Datadog has always been driven by a broader vision of helping teams understand and operate complex systems. In this session, you’ll hear from Michael Whetten, Product SVP, and Abrar Hussain, Senior Director, Product Management, as they share the latest updates across the Datadog product suite and discuss how that vision continues to shape the platform’s evolution and support the next generation of AI-driven applications.

Major .de Outage: DNSSEC Failure at DENIC Takes Down German Domains

On May 5, 2026, a major .de outage disrupted access to websites across Germany and Europe. The incident, caused by a failure at DENIC, the operator of the .de top-level domain, resulted in widespread DNS resolution failures. This was not a typical service outage. It was a failure at the DNS layer that made entire domains unreachable. As DNS caches expired, more services went offline, creating the appearance of a spreading outage across unrelated companies.

Powering Autonomous IT with Edwin AI in ServiceNow Now Assist

Edwin AI extends ServiceNow Now Assist with real-time incident intelligence, acting as a context broker between observability data and ServiceNow incidents. Responders get the context they need inside the IT operations workflow they already use. The Edwin AI Agent for ServiceNow brings real-time incident intelligence into Now Assist and Workspace, giving ITOps teams root cause, impact, and recommended next steps directly inside the ServiceNow incident record.

How to Prevent AI Agents From Deleting Production Data

There’s a new question teams are asking: how can we prevent AI agents from deleting production data? When Cursor deleted PocketOS’s entire production database in nine seconds, the agent wasn’t malfunctioning. It had full technical capability, but it was inferring operational authority from static code rather than live environment state. That gap between capability and context is the root cause. This article breaks down exactly how that happens, and what runtime visibility does to stop it.

Get Valid TLS Certificates for Icinga Web Despite a Firewall

Lots of big companies lock down their IT infrastructure in the internal network; sometimes they even use only locally mirrored repositories. I totally understand this, especially since our CVE-2024-49369. Nowadays, when LLMs find security holes even in OpenBSD, you definitely shouldn’t expose any services to the public without need.

Troubleshoot performance issues faster with the new Grafana Assistant integration for Database Observability

So your database is slow. Now what? Grafana Cloud Database Observability already gives you visibility into your SQL queries with RED metrics, individual execution samples, wait event breakdowns, table schemas, and visual explain plans. But visibility is just the starting point. You can see that a query's P99 latency spiked, but what should you do about it? You can see wait events like wait/synch/mutex/innodb firing, but what does that actually mean?

Elasticsearch 9.4 powers the next phase of the Elastic AI Ecosystem: Dell AI Data Platform with NVIDIA

AI is moving fast. Enterprise adoption needs to move with purpose. Over the past year, one thing has become clear: Organizations are not looking for more AI hype. They are looking for a path to production — one that connects infrastructure, data, and intelligence in a way that delivers real business value. That is exactly what the Elastic AI Ecosystem is built to do. At Elastic, we believe AI is only as powerful as the data foundation behind it. Great models matter.

Datadog for Government achieves FedRAMP High certification

Modern government missions depend on software platforms that can perform under demanding conditions. As agencies update systems that support public safety, benefits delivery, financial operations, and national priorities, they face security and compliance requirements that shape how technology is adopted as well as how it is built, operated, and evolved over time.

Analyze cloud costs with flexible spreadsheets in Datadog Sheets

Cloud cost data is most useful when teams can adapt it to their own reporting and planning needs. In addition to viewing cost breakdowns, FinOps teams often need to calculate forecasts, reshape datasets, and present tailored views to finance and leadership teams. In many workflows, those steps happen outside the observability platform. Once the data is exported, it quickly becomes outdated and requires repeated manual updates.

Navigating the Middleware Maze: How meshIQ 12.1 Redefines Scale and Simplicity with Agentic AI

meshIQ v12.1 transforms middleware management with petabyte-scale data processing and agentic AI. The new intelligent launchpad, simplified onboarding, and context-aware safeguards move teams from reactive monitoring to proactive, AI-driven operations across the enterprise.

Inside the .de DNS Outage: Real-World Data from UptimeRobot.

On the evening of May 5th, 2026, large parts of the German web briefly went dark. For a few hours, anyone trying to load a .de address through a major DNS resolver got errors instead of websites. Bahn.de, Amazon.de, and Spiegel.de were among the affected. Major brands like Telekom, DHL, and Sparkassen felt it too, along with hosting providers Hetzner, Strato, and Ionos.

What kind of correlations become impossible without depth and breadth?

Most teams don’t have a data problem. They have a correlation problem. When visibility is fragmented, marketing sees a conversion drop and engineering sees API latency, so the wrong call gets made. Example: checkout drops, pricing gets blamed, and discounts are applied. Reality: a backend API timeout was killing transactions. That’s what happens when you can’t connect user impact (the what) to system behavior (the why).

Improved debugging for Expo apps with the React Native SDK

Events from Expo apps account for about 75% of the total event volume we receive from React Native apps. That number made it an easy decision to invest in updates to the Sentry React Native SDK to improve the debugging and performance workflow for your Expo apps.

VictoriaMetrics April 2026 Ecosystem Updates

We’re excited to learn that our vmagent helped Airbnb migrate its high-volume metrics pipeline from StatsD and Veneur to OpenTelemetry. Airbnb is now handling 100 million samples per second. In other news, April saw releases across the VictoriaMetrics Observability Stack. We have released several important bugfixes for VictoriaMetrics and many new features in VictoriaLogs.

Monitoring from Private Locations

Not everything worth monitoring is on the public internet. In this 30-minute hands-on session, Daniel Paulus deploys four Checkly private location agents on AWS EKS with Terraform, then uses a coding agent to scaffold 200 internal checks in seconds — uptime, TCP, DNS, ICMP, and Playwright browser checks against legacy apps that never leave the firewall.

The cost of knowledge

In the world of observability, “cardinality” has become a heavy word. It is a ghost used to justify skyrocketing bills or degraded query performance. When cardinality rises, the advice is almost always the same: reduce it. Drop your labels, or reduce the dimensions. It is usually framed as “optimization.” Every label you add to a metric is a dimension of knowledge. Each one gives you a way to slice, compare, and explain the chaos of production.

Driving Innovation: A Bias Towards Action with Greg Freeman

AI is changing network operations faster than ever. In the latest episode of Next-Gen Network Heroes, Bob sits down with Greg Freeman of Lumen Technologies to talk about what it takes to innovate across one of the world’s largest telecommunications networks. From deterministic workflows to agentic AI, Greg shares how his team is using automation, analytics, and AI to improve network reliability, customer experience, and operational efficiency at scale.

Bias Toward Action: Driving AI Innovation Across Global Networks with Greg Freeman

What does it take to lead innovation across one of the world’s largest telecommunications networks? In this episode of Next-Gen Network Heroes, host Bob Slevin sits down with Greg Freeman, Vice President of Network and Customer Transformation at Lumen Technologies, to explore how AI, automation, and curiosity are reshaping the future of network operations.

How to Measure your Most Expensive Milliseconds

In the fast-paced world of mobile development, reliability rarely fails with a loud crash; instead, it degrades quietly through micro-regressions that erode user trust and engagement. While most companies track backend health and API latency, they often fly blind regarding the actual screen-level responsiveness that defines the true user experience. When Expedia Group underwent a major technical evolution, the team realized they lacked a consistent baseline to compare performance across platforms, leaving them unable to validate improvements before rollout.

Introducing AppSignal Labs

We've been shipping faster. A dark mode for the UI, AppSignal MCP, the AWS dashboard templates — things we would have kept internal a year ago until everything was polished. Now we don't. A v1 in your hands beats a v3 in our heads. We learn more from a week of real use than from a quarter of internal review. So we're giving that work a home. AppSignal Labs is where you'll find the earlier versions. Real software, available today, with a direct line to the team building it.

The World Beneath The Dashboards

Most people assume the modern enterprise runs cleanly on the dashboards and cloud consoles that dominate today’s digital workspaces. Anyone who operates these environments understands a more complicated truth. The real work happens beneath those surfaces, in systems few people notice until something slips. Across industries, engineers face the same recurring scenario: a routine shift disrupted by signals of degradation somewhere in the environment.

SmartAssist and SQL Analytics - AI-powered querying

SQL Analytics has always been one of my favourite SquaredUp features. That's not just because I can use raw SQL to achieve complex data transformations. The fact that I can run SQL queries over data from all sorts of sources — not just relational databases, gives incredible power and flexibility. The great news is that SQL Analytics now ships with our AI-driven SmartAssist technology.

What Is a Linux Server? Everything You Need to Know (2026)

An open-source foundation for resilient infrastructure: on-prem, cloud, and hybrid. IT downtime costs organizations an average of $9,000 per minute, or more than $1 million per hour. That’s real money lost when websites crash, transactions fail, or internal systems go offline. For many organizations, avoiding those losses starts with choosing the right server operating system (OS). Why? The OS sets the foundation for how stable, secure, and cost-efficient your infrastructure will be.

How to Monitor Your Node.js App on Hetzner with AppSignal

More and more developers are choosing self-hosting over traditional PaaS. At first, self-hosting may seem like unnecessary heavy lifting, especially when you can deploy as fast as creating a repo. However, with correct tooling, it’s easy to see why devs are moving away from PaaS. You get dedicated resources and (if needed) a European data center at a fraction of the cost.

How Scalability Works in SolarWinds Observability Self-Hosted

Cheryl Nomanson, SolarWinds staff technical trainer, provides a comprehensive overview of SolarWinds architecture and scaling options for self-hosted deployments. She explains the centralized deployment model starting with a single SolarWinds server that handles polling, web console, and database connections. The presentation covers key scaling indicators including polling thresholds that warn users at 85% capacity and alert at 100%. She demonstrates how to add up to 100 polling engines per server and additional web servers to handle more concurrent users.
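The two polling thresholds mentioned above (warn at 85% capacity, alert at 100%) can be pictured as a simple classification. This is only an illustrative sketch: the function name, the constants, and the choice to treat each threshold as inclusive are assumptions, not part of any SolarWinds API.

```python
# Hypothetical thresholds taken from the summary above: warn at 85%, alert at 100%.
WARN_PCT = 85
ALERT_PCT = 100


def polling_status(load_pct: float) -> str:
    """Classify a polling engine's load (percent of capacity).

    Assumption: thresholds are inclusive, so exactly 85% already warns.
    """
    if load_pct >= ALERT_PCT:
        return "alert"
    if load_pct >= WARN_PCT:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for pct in (60, 85, 99.5, 100):
        print(pct, polling_status(pct))
```

A load that stays in the "warning" band is presumably the cue to add another polling engine before the "alert" level is reached.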

Moving Beyond SolarWinds: A Guide to Modern Observability

Industry-leading observability experts provide strategic guidance on why and how modern IT teams are successfully moving beyond SolarWinds to more resilient, cloud-native platforms. IT teams running SolarWinds often know the pain points well before they start evaluating alternatives: separate modules for different monitoring needs, a self-hosted deployment model that requires ongoing maintenance, and pricing that gets harder to predict after each acquisition.

Ep 41: The cost of not thinking: Who's responsible when AI agents get it wrong?

In this episode of Masters of Data, we get into the messier side of AI adoption, tackling questions like who actually owns the output when AI gets it wrong, and whether chasing efficiency is making us forget what it means to be human in the first place. We discuss tech CEOs proudly announcing they no longer think for themselves and debate whether AI is quietly eroding our critical thinking skills. We make the case that purpose-built, narrow AI is genuinely exciting, but that no efficiency gain is worth losing the human touch that makes work, connection, and creativity meaningful.

Observability vs Monitoring: What's the Real Difference in 2026?

Understand the real difference between observability and monitoring, and why modern IT teams in 2026 need both. Monitoring tells you something is broken; observability explains why. See real examples, faster troubleshooting workflows, and how Motadata ObserveOps unifies both in one platform.

Introducing the Coralogix CLI: Headless Observability for Every Agent

This article is a high-level overview of the Coralogix CLI. For a deeper look at how it works in practice, read the full technical deep dive here. Agent-driven investigation sounds simple: read the alert, query the data, return the cause. In reality, most agents either overload their context window with raw logs or guess at queries and return incorrect results.

ActiveMQ JMS 2.0 Implementation Guide: Simplified API, Transactions & Spring

For most of JMS's lifetime, writing a simple producer required creating a ConnectionFactory, creating a Connection, starting it, creating a Session, creating a MessageProducer, creating a Message, calling send(), and then closing the producer, session, and connection with the close calls safely wrapped in finally blocks to prevent resource leaks. Every developer knew the pattern. Every developer wrote it slightly differently. Every code review had the same comments about resource management.

Introducing Application Metrics: Track the signal, see the spike, jump to the trace

A few weeks ago we had a bug with Session Replay. Replays were failing in some browsers once more than 1,000 video segments loaded. We had no idea how often it happened or who was hitting it, and because the failure didn’t always produce an error, we had no way to find affected users to reproduce it. Before, we could’ve answered this with spans or logs, but it’s clunky — spans are often sampled, so you can miss outliers; logs are less structured and tend to change over time.

May the Logs Be With You: Graylog 7.1 Is Here

A long time ago, in a SOC far, far away…analysts were drowning in alerts, chasing context across fragmented screens, and watching real threats slip past detection gaps. Today, the Rebellion fights back. This isn’t a release built around a single marquee feature. It’s the result of our team listening to you on the front lines with an ear for removing the friction that makes your jobs harder than they need to be.

Monitor and optimize Supabase query performance with Datadog Database Monitoring

Built on Postgres, Supabase is an open source, all-in-one backend platform for developers who want to ship applications without managing infrastructure. This makes it especially popular with frontend developers and vibe coders who may have little to no database expertise. Datadog's Supabase integration provides high-level infrastructure metrics, but developers also need query-level visibility to easily diagnose, optimize, and trace performance issues back to their source.

What Is AWS EKS, and How Does It Work with Kubernetes?

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service that simplifies deploying, scaling, and running containerized applications on AWS and on-premises. EKS automates Kubernetes control plane management, ensuring high availability and seamless integration with AWS services like IAM, VPC, and ALB.

Obkio Microsoft Teams Monitoring vs. Microsoft Teams Admin Center

Most IT teams rely on Microsoft Teams Admin Center as their default monitoring tool to find and fix Microsoft Teams issues, but there's a gap between what it shows and what actually causes call quality problems. Teams Admin Center gives you Microsoft's perspective on what happened after an MS Teams call ended. It doesn't tell you what was happening on your network, on your users' devices, or in the five minutes before the complaints started coming in.

NVIDIA DCGM Collector: Deep GPU Monitoring for Data Center and AI Infrastructure

GPU infrastructure is expensive and increasingly central to production workloads. Whether you’re running ML training jobs, inference serving, video transcoding, or HPC workloads, understanding what your GPUs are actually doing, and what’s going wrong when performance degrades, is not optional.

Taming Log Noise With the OpenTelemetry Collector's Drain Processor

Do you receive 50 million log lines per day and struggle to see what actually matters? Health checks, heartbeat pings, connection pool messages—they all drown out the errors and anomalies you're trying to find. Most teams deal with this by writing filter rules to drop the noisy patterns. But those rules are manual, per-pattern, and brittle. A new deployment changes a log format and the filter misses it. A new service starts logging a chatty startup sequence nobody thought to exclude.

What's New with Progress WhatsUp Gold 2026.0

Progress WhatsUp Gold 2026.0 helps IT teams improve network visibility, strengthen security and work more efficiently. In this recorded webinar, explore what’s included in this free upgrade for customers with an active service agreement. Learn how Progress WhatsUp Gold 2026.0 can deliver proactive visibility with trusted security across your IT infrastructure.

This Month in Datadog - April 2026

In the latest episode of This Month in Datadog, Jeremy shares how to run autonomous Cloud SIEM investigations, remediate vulnerabilities with auto-generated fixes, and use natural language to explore Datadog. Later, Sumedha Mehta spotlights the Datadog MCP Server, which gives AI agents real-time access to Datadog’s observability data. Then, Chetan Sharma walks through Datadog Experiments, which measures how product changes impact the user journey.

AI Supply Chain Attacks Are Here. And Most Organizations Aren't Ready

When I read about the Vercel breach tied to a Context AI compromise, I wasn’t surprised. I’ve been talking with customers for a while now about how AI was going to introduce a new kind of supply chain risk. This is exactly what that looks like. What stands out to me is how familiar the pattern is. We saw it with open source, then again with SaaS, and again with cloud.

What is Cloud Threat Detection? An Ultimate Guide for 2026

What if the next breach in your cloud is already in motion, and your team has no idea how to see it? Cloud workloads are growing fast. APIs, identities, and data are spread across AWS, Azure, GCP, and on-prem systems all at once. Every layer creates its own logs, its own alerts, and its own blind spots. Most security teams are short on visibility, context, and time. That is the gap cloud threat detection is built to close.

How the Coralogix CLI Adds Production Intelligence to Any Agent for Any Use Case

The new interface into production telemetry is a tool call, made from whichever agent runtime the operator happens to be using at that moment. A finance lead in Claude Code, a product manager in Cursor, an engineer in Codex. Three different jobs, three different agents, three different reasoning loops. The thing they have in common is the data layer underneath.

Why Does MTTD Stay High Despite Observability Tools Running?

Monitoring coverage, anomaly detection, and SLO-based alerting have significantly narrowed detection windows for most failure types, but MTTD remains stubbornly high for a specific class of silent failure. This blog covers why type mismatches, swallowed exceptions, and values that pass validation can occur without triggering errors, and what changes when your monitoring stack can generate those signals without waiting for a failure to surface them.

Federated Search | From Silos to Insight | Unified Datasets in AWS S3 with Ingest Processor

Are storage costs and data silos slowing down your investigations? In this video, we dive into the Unified Dataset Experience to show you how to search data where it lives. Learn how to use the Splunk Ingest Processor to route high volume logs directly to AWS S3 while maintaining instant visibility via Federated Search. No more re-hydrating data, just fast cost-effective insights.

ActiveMQ Security Hardening: TLS, JAAS, LDAP & CVE Patch Guide

In October 2023, security researchers published CVE-2023-46604, a CVSS 10.0 remote code execution vulnerability in Apache ActiveMQ. Within days, it was being actively exploited in ransomware campaigns. The attack required nothing more than network access to port 61616. No authentication, no credentials, no social engineering. The attacker connected to the standard ActiveMQ port and executed arbitrary code on the server.

OpenTelemetry VM Setup Guide: SigNoz Collection Agents Explained

About This Video: If you're working with OpenTelemetry, managing collector configurations across environments like VMs can quickly become difficult. In this video, we focus on VM-based setups and walk through how to configure SigNoz Collection Agents step by step. We start with an introduction to VM collection agents, then move into a practical project walkthrough using the OpenTelemetry demo. From there, we explore the documentation, set up configurations, run the collector, and finally validate everything inside SigNoz.

Get Observability in the Terminal, for You and Your Agents: gcx

The way you write code is changing, which means the way you observe your systems and respond to issues needs to change, too. Engineers today spend much of their day working via command line, as agentic tools like Cursor and Claude Code have become highly effective at handling many day-to-day engineering tasks. This greatly accelerates code generation, but it doesn't solve for the context switching that comes when you have to jump into another tool that's not part of this new, faster workflow.

Accelerating MTTR with Faster Root Cause Diagnosis: AI Advisor Now Supports On-Demand Connectivity, Config Context, and Device Diagnostics

Knowing something is broken is easy. Figuring out why is hard. Introducing three new, native AI diagnostic capabilities in the Kentik Network Intelligence Platform to accelerate root cause analysis and keep your network running better.

AI Diagnostics in Kentik NMS (Network Monitoring System)

Network problems are easy to spot. Proving root cause is the hard part — and it’s where most of MTTR gets burned. Kentik’s new AI diagnostics in the Network Monitoring System (NMS) close the gap between detection and diagnosis by bringing three capabilities directly into Kentik AI Advisor.

April 2026: IsDown Users Saved 16.5 Hours with Early Outage Detection

In April 2026, IsDown's early detection system gave users a 3.6-hour head start on a major outage — plenty of time to implement workarounds before the vendor even acknowledged the problem. Across 45 early detections, our users saved a collective 16.5 hours by knowing about outages an average of 22 minutes before official status pages were updated.
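The headline figure can be sanity-checked from the two numbers quoted above. The sketch below assumes the collective total is simply the number of early detections multiplied by the average lead time; the variable names are illustrative.

```python
# Figures quoted in the summary above.
detections = 45          # early detections in April 2026
avg_lead_minutes = 22    # average head start per detection, in minutes

# Assumption: collective time saved = count x average lead time.
total_minutes = detections * avg_lead_minutes
total_hours = total_minutes / 60

print(f"{total_minutes} minutes = {total_hours} hours")  # 990 minutes = 16.5 hours
```

Under that assumption the product matches the 16.5 hours in the title.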

Real-Time Database Monitoring: Solving Database Latency with Zero-Code eBPF Tracing

In high-throughput database environments, a latency spike is rarely a simple story. Modern data layers are distributed, stateful, and constantly changing as shards move, nodes rebalance, caches warm, queries evolve, and connections churn. In practice, spikes usually come from one of three places. For many SRE and Platform teams, the real challenge is disconnected tooling. As one engineering lead recently shared during a technical workshop: “It’s all disconnected.

What Is SNMP? Gain Real-Time Insights Into Network Performance (2026)

SNMP is the universal protocol for monitoring network infrastructure, but its real value depends on which version you run, how you secure it, and how well your monitoring tool handles the OID work for you. SNMP (Simple Network Management Protocol) is the standard protocol IT teams use to monitor and manage network devices.

Stop ECS Containers From Collapsing Into One Service in OpenTelemetry

Why ECS containers collapse under service.name = aws_ecs and how to fix it for both EC2 launch type and Fargate, including the resource-vs-log-record pitfall that quietly breaks log filtering. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

April 2026 Early Warning Signals

April saw widespread disruptions across SaaS platforms, developer tools, and cloud services, with login failures, pipeline issues, and general service outages among the most common problems. StatusGator’s Early Warning Signals consistently identified these incidents ahead of official provider updates. In several cases, the lead time was significant. Bitbucket pipeline failures were detected 1 hour 17 minutes before acknowledgment, while Claude performance issues surfaced 59 minutes early.

Telemetry Talks ep 4: Retroactive sampling and OpenTelemetry

This episode of Telemetry Talks explores the evolution of an OTLP/gRPC tracing pipeline for VictoriaTraces within OpenTelemetry and VictoriaMetrics, including a shift from standard gRPC-Go to a simplified HTTP/2-based implementation to reduce complexity and improve flexibility. Together with our guest, Jiekun, we revisited the VictoriaMetrics KubeCon talk ideas on tail-based and retroactive sampling, and their impact on the broader OpenTelemetry community.

When Dashboards Start Teaching the System: Why Selector's Natural Language Querying Matters

Operations teams have lived with the same frustrating tradeoff for years: the data exists, but getting to the right answer often takes too much time and too much expertise. Engineers are expected to know platform-specific query languages, navigate layers of dashboards, and understand exactly where the right visualization lives before they can even begin troubleshooting. That approach can work in smaller environments, but as infrastructure grows more distributed and complex, it becomes a bottleneck.

ActiveMQ Slow Consumer: Detection, Strategy & Prevention Guide

One of the most counterintuitive failure modes in enterprise ActiveMQ deployments is this: a single application team deploys a new consumer for a high-volume market data topic. Their consumer is slow: maybe they added a database write on every message, or their processing thread pool is undersized.

Add dynamically updating context to logs with Reference Tables and Observability Pipelines

Security and platform engineering teams rely on context-rich logs to investigate threats, prioritize incidents, and meet compliance requirements. Context is often stored separately from applications that generate logs, in sources like threat intelligence feeds in Snowflake, asset lists in Amazon S3, ownership data in ServiceNow CMDB, and risk scores produced in Databricks.