Operations | Monitoring | ITSM | DevOps | Cloud

Introducing the StatusGator MCP Server

Your AI agents can now monitor, triage, and respond to cloud outages autonomously. The way enterprises manage cloud infrastructure incidents is changing. AI agents are no longer just chatbots answering questions — they’re becoming first responders in your incident management pipeline. Today, we’re launching the StatusGator MCP Server, giving AI agents direct, structured access to the full power of StatusGator’s cloud status monitoring platform.

New Beta feature: Google Cloud Private Status integration!

We are excited to announce that Google Cloud is the latest addition to our suite of Enterprise infrastructure integrations! While StatusGator has long monitored the public status of Google Cloud services, this new integration goes deeper. You can now monitor the personalized health of your specific Google Cloud projects directly within your StatusGator dashboard.
Sponsored Post

From Silos to Collaboration: How to Democratize Data in Product Analytics

Companies that develop software products generate massive quantities of product performance and user engagement data that can be analyzed to support decision-making about everything from feature planning and UX design to sales, marketing, and customer support. Leveraging product data throughout the enterprise represents a significant opportunity to achieve a competitive advantage, but challenges like siloed data systems, poor data literacy, and the complexity of data analytics in the cloud can prevent organizations from making full use of their raw data.

API Latency Monitoring: Metrics, Percentiles, and Alerting Best Practices

APIs power modern applications. Every login request, product search, payment authorization, and mobile app refresh depends on an API responding quickly and reliably. When latency increases, users feel it immediately. Pages stall. Transactions hang. Confidence drops. Most engineering teams measure API latency. Fewer truly monitor it. There is a difference. Many teams track average latency in dashboards and assume performance is healthy.
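To illustrate why an average alone can mislead, here is a minimal sketch comparing the mean with nearest-rank percentiles. The latency values and the `percentile` helper are hypothetical, made up for illustration, and are not taken from the article.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 90 fast requests, 10 very slow ones.
latencies_ms = [50] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)  # 245
p50_ms = percentile(latencies_ms, 50)    # 50
p95_ms = percentile(latencies_ms, 95)    # 2000

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

In this made-up sample the mean sits at 245 ms and the median at 50 ms, while one request in ten takes two seconds. Only the p95 figure exposes that tail.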

API Endpoint Monitoring: How to Ensure Reliability, Performance & Functional Accuracy

APIs sit at the core of modern digital infrastructure. From e-commerce checkouts and payment processing to SaaS platforms and mobile applications, APIs move the data that keeps systems running. But APIs do not operate as a single unit. They are made up of individual endpoints, and each endpoint represents a specific function or resource that users depend on. As organizations shift toward microservices, cloud-native applications, and third-party integrations, the number of endpoints increases rapidly.

What is MRO? Maintenance, Repair, and Operations Explained

MRO stands for maintenance, repair, and operations. It refers to the activities, supplies, and services that keep equipment, facilities, and infrastructure running safely and efficiently. Every industry that relies on physical assets depends on MRO, whether that means replacing a worn bearing on a production line, restocking safety gloves in a warehouse, or servicing an HVAC system in a hospital.

Pull Request Velocity as a Proxy for AI Usage for Software Development

While AI usage has been growing steadily for the last several years, LLMs noticeably improved around the end of 2025. Specifically, they became more viable for software development. We are seeing the results: feature and product delivery has picked up. One way to visualize this is by looking at the number of pull requests for your organization or software development teams. This chart shows the number of GitHub pull requests created by a team. Can you spot when AI usage increased?

Monitoring a Roborock Robot Vacuum With Healthchecks.io

I semi-recently bought a used Roborock S5 Max robot vacuum, and installed Valetudo on it. The installation process involves rooting the robot, and gaining SSH access to it. Which got me thinking, could I get the robot to ping Healthchecks.io at regular intervals? When the robot runs into a problem (closes a door after itself, gets stuck, chokes on a loose wire) and cannot return to the base, it eventually shuts down.
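As a rough sketch of the idea, a periodic ping to Healthchecks.io is an HTTP request to the check's ping URL on `hc-ping.com`, with `/fail` appended to report a failure. The code below is an assumption-laden illustration, not the author's setup: the function names and the check UUID are hypothetical, and a rooted robot would more likely use a cron job with a shell one-liner than Python.

```python
import urllib.request

PING_BASE = "https://hc-ping.com"


def ping_url(check_uuid, failed=False):
    """Build the ping URL for a check; '/fail' signals a failure."""
    url = f"{PING_BASE}/{check_uuid}"
    return url + "/fail" if failed else url


def send_ping(check_uuid, failed=False, opener=urllib.request.urlopen, timeout=10):
    """Send one ping and return True on HTTP 200.

    Network errors are swallowed so that a flaky connection never
    interrupts whatever job is doing the pinging.
    """
    try:
        with opener(ping_url(check_uuid, failed), timeout=timeout) as response:
            return response.status == 200
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return False


# Usage, run from a scheduler every few minutes (the UUID is a placeholder):
# send_ping("your-check-uuid")
```

If the pings stop arriving because the robot has shut down, the check goes late and Healthchecks.io raises the alert. No failure-side logic is needed on the device.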

Accelerate Your OpenTelemetry Migrations With Honeycomb's Agent Skills

Since releasing our hosted MCP server last year, we've been thrilled to see customers not just adopt it but build Honeycomb deeply into their agentic development and observability workflows. Users have embraced it, leveraging Honeycomb to stay in conversation with their code and understand how it runs in production.

AI Needs Better Inputs: Why Observability Is Becoming the Foundation of Enterprise AI Maturity

Organizations across industries are accelerating their investments in AI for operations, yet the path to meaningful impact is proving far more complex than early expectations suggested. Analysts at Gartner, Forrester, Deloitte, and McKinsey continue to highlight the same structural barrier. AI cannot produce accurate predictions or safe automation when the operational data feeding it is fragmented, incomplete, or inconsistent.

Mastering the Trace Drilldown: How to Reduce MTTR with Coralogix

Stop the "Scavenger Hunt" during incidents. In this video, we walk through the new Coralogix Trace Drilldown, now GA for all customers. Learn how to move from high-level trace views to deep span insights in a single, unified workspace, without ever losing context. Whether you're investigating a latency spike or a failing microservice, the Trace Drilldown helps you answer "Where is the bottleneck?" from three different perspectives in one frame.

Balancing personal brand, company goals and open source in DevRel can be tricky

DevRel often means juggling goals that feel completely opposite: building trust while driving adoption, serving developers while supporting business growth. In this short, we explore why these “contradictions” are actually the secret to great Developer Relations.

European Compliance Requirements 2026: Key Regulations and Implementation Steps

The European regulatory landscape in 2026 looks less like a single finish line and more like a marathon with multiple checkpoints happening simultaneously. Organizations that spent years preparing for GDPR now face overlapping deadlines for AI governance, digital accessibility, operational resilience, and supply chain due diligence all converging within the same twelve-month period.

Ep 36: Do not resuscitate: Legacy tech in modern medicine

In this episode of Masters of Data, we dig into the cybersecurity nightmare that is modern healthcare IT, from ransomware attacks shutting down entire hospitals to IoT medical devices running software older than some of our passwords. We explore why healthcare organizations make such attractive targets for cybercriminals, and why the combination of life-or-death stakes, skeleton-crew security teams, and Windows-95-era equipment is a recipe for chaos.

What is Error Tracking? A Beginner's Guide to Monitoring Errors in Production

Every app breaks eventually. A button stops working. A checkout flow throws an exception. An API returns a 500 error at 2 AM on a Saturday. The question isn't whether your app will have bugs; it's whether you'll find out before your users do. That's exactly what error tracking is for.

Digital Trading: Why "Healthy Systems" Still Lose Trades

Digital trading firms operate in environments where milliseconds determine profit and loss. During volatile market conditions, platforms can appear fully operational while execution quality quietly degrades. When prices shift so quickly, even a minor drift in your order-routing path means your competitors are exploiting the delta, while your platform appears perfectly green. For trading firms, observability is not just about uptime.

Incident Management in 2026: Best Practices, Tools Guide & More

When systems go down, every minute counts. You need more than just quick fixes. You need a solid system to spot problems early, take action fast, and learn from each incident to keep your users happy. That's what incident management is. In this guide, we'll walk through everything you need to know about incident management, from basic concepts to advanced strategies used by top DevOps teams.

Website Maintenance Plans: Checklist, Tools, ROI & Cost Breakdown (2026)

While most businesses invest heavily in website creation, many overlook the ongoing website maintenance plans needed to keep their digital presence performing at its peak. Data from recent studies reveals a harsh truth: 88% of online consumers won't return to a website after encountering technical issues or outdated information.

Incident Response Automation Guide: Cut MTTR by 33% in 2026

Every minute matters when you're dealing with a security incident. The longer a breach goes undetected and unresolved, the more damage it can cause to your systems, data, and reputation. But traditional incident response is plagued with challenges: alert fatigue, manual processes, skill shortages, and the sheer complexity of modern IT environments. Security teams are drowning in alerts while struggling to respond quickly enough to the threats that matter.

DevOps Workflow Strategy for Startups: 7-Step Guide (2026)

Reliability is the foundation of successful startups. Your product could have the most innovative features, but if it's plagued by downtime or performance issues, customers will eventually jump ship. Fortunately, creating an effective DevOps workflow strategy doesn't have to be complicated. This guide breaks down the essential components and implementation steps that startup DevOps and SRE teams need to focus on.

What is the Citrix License Activation Service (LAS)?

One of the hot topics from our recent Citrix-focused webinar was the Citrix License Activation Service (LAS). I had the chance to present alongside George Spiers, Citrix Expert and EUC Architect, and we walked through what LAS is, how it works, and what teams should be aware of.

Logging in Next.js is hard (But it doesn't have to be)

A typical Next.js deployment can execute code in up to three different runtimes: Edge, Node.js, and the browser. You may already be capturing logs from server-side code, but if you are not capturing the full request from middleware through server rendering to the browser, you are missing a lot of debugging info when things go wrong.

Grafana Cloud Demo in Under 5 minutes | Full Stack Observability and more

Overview and demo of how Grafana Cloud provides an end-to-end observability platform that empowers users who have adopted open standards to improve their systems' reliability, using a shift-left approach with performance testing while optimizing their observability costs.

OpenClaw Monitoring & Observability with OpenTelemetry and SigNoz

Learn how to implement monitoring and observability for OpenClaw systems using OpenTelemetry and SigNoz. In this video, we cover how to instrument OpenClaw, collect traces, metrics, and logs, and visualize everything in SigNoz for real-time insights into performance and reliability. You’ll see how to quickly identify bottlenecks, debug issues, and improve system stability in production.

Detecting, Investigating, and Responding to Threats: Best Practices | WhatsUp Gold

As the speed of cyberattacks accelerates through the use of generative AI, traditional static playbooks are no longer sufficient to maintain organizational resilience. This webinar provides a deep exploration of modern security operations center methodologies that unify detection, investigation, and response into a single, seamless motion. By focusing on practical strategies for reducing alert fatigue and closing visibility gaps at the edge, this session equips decision-makers with the technical criteria to evaluate solutions that offer true forensic clarity.

Analyzing round trip query latency

It’s an all too common scenario: You get paged for some queries timing out, but when you investigate, the database performance looks unchanged. Something must have changed, though. If the database doesn’t look overloaded, where are these timeouts coming from? The answer often lies outside the database itself. Round trip query latency includes every hop between your application and the database, including connection pools, load balancers, and proxies.
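One way to reason about this is to break a slow request down by hop and see where the time went. The sketch below uses hypothetical hop names and timings, invented for illustration, to show how database execution can be a small share of the round trip.

```python
def slowest_hop(hops_ms):
    """Return the name of the hop that contributes the most latency."""
    return max(hops_ms, key=hops_ms.get)


def share_of_total(hops_ms, hop):
    """Fraction of the round trip spent in one hop."""
    return hops_ms[hop] / sum(hops_ms.values())


# Hypothetical timings (milliseconds) for one slow query, measured per hop.
hops_ms = {
    "connection_pool_wait": 180.0,
    "load_balancer": 2.0,
    "proxy": 5.0,
    "network": 3.0,
    "database_execution": 12.0,
}

round_trip_ms = sum(hops_ms.values())  # 202.0
worst = slowest_hop(hops_ms)           # "connection_pool_wait"

print(f"round trip: {round_trip_ms} ms, slowest hop: {worst} "
      f"({share_of_total(hops_ms, worst):.0%} of total)")
```

With these made-up numbers the database accounts for 12 ms of a 202 ms round trip, so database-side dashboards would look unchanged while the application times out waiting on the connection pool.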

Real-Time Visibility, Orchestrated Deployments, and More

The latest VirtualMetric DataStream release brings a significant step forward in platform observability and deployment flexibility. Version 1.9.0 gives security and infrastructure teams direct visibility into what’s happening across their pipelines in real time while expanding support for cloud-native environments and broadening connectivity options. Here’s what’s new.

Enhancing our API for better agentic consumption

AI coding agents like Claude Code and Codex are becoming a real part of developer workflows. They don't just write code, they call APIs, interpret responses, and take action based on what they find. That means the quality of your API responses directly affects how useful an agent can be. We've shipped a series of improvements to the Oh Dear API with this in mind. Every change helps humans too, but we specifically optimized for how agents consume and reason about data.

When IT instability becomes a patient safety risk in healthcare

Inside hospitals and health systems, the performance of clinical technology underpins nearly every care workflow and directly influences the timeliness and quality of patient care. Electronic health records sit at the center of admissions, discharge, imaging, lab coordination, and prescribing, so even minor technology friction can become a patient safety and operational risk. At scale, reliability becomes a prerequisite for consistent care.

Solving the Ticket Noise Problem: What We Learned from Our ServiceNow Webinar

On March 18th, we hosted a session focused on a challenge that continues to undermine even the most mature IT operations teams: ticket noise. It’s easy to dismiss noise as just “too many alerts”. But as we explored in the webinar, the real issue runs deeper. Ticket noise is a symptom of something more fundamental — a lack of correlation, context, and shared visibility across the stack.

The Benefits of Historical Data for Network Monitoring

Your phone rings. A user is complaining that "the network was slow" or "had issues around 3pm." You run a speed test. Green across the board. No active alerts. Everything looks fine. So what do you tell them? If you don't have a continuous, time-stamped record of what your network was doing at 3pm, you can't tell them anything, not with confidence. You're stuck choosing between "I didn't see anything" and "I'll keep an eye on it," neither of which fixes the problem or satisfies the user.

Observability and Security for the AI Era

Datadog has always been driven by a broader vision of helping teams understand and operate complex systems. In this session, you’ll hear from Yrieix Garnier, VP of Product, and Hugo Kaczmarek, Senior Director of Product, as they share the latest updates across the Datadog product suite and discuss how that vision continues to shape the platform’s evolution and support the next generation of AI-driven applications.

Setting up NTP Check on Uptime.com

Welcome to Uptime.com! In this video, we'll guide you through the process of setting up and configuring an NTP Check. Learn how to log in, navigate the Monitoring section, and complete the setup, including intervals, contacts, locations, and advanced settings. Also, find out how to handle NTP alerts and verify time offsets.
Sponsored Post

Did we miss the end of LaMa?

In mid-2024, SAP announced the discontinuation of SAP Landscape Management ("LaMa"). This wasn't a huge surprise, as the 2027 end-of-support date aligns neatly with Solution Manager, Focused Run and standard support for ECC. SAP also terminated work on SAP Landscape Management Cloud, discontinuing that product immediately. Responses to the post were as expected: customers asking, "What now?" and even expressing a bit of dismay. One response from SAP was especially telling: "Moving the ERP system to the cloud hands over the tasks realized with SAP Landscape Management to SAP as cloud vendor.

The Observability Gap: Why Monitoring Data Should Drive Tests

Most teams already know a lot about production. They have dashboards. They have traces. They have alerts. They have enough telemetry to explain what happened after an incident and enough graphs to argue about it for the rest of the week. Then they go to test a change and start from scratch. The integration tests hit a hand-written mock that returns {"status": "ok"}. The load tests replay a CSV somebody exported months ago. Staging is close enough to production right up until it matters.

Observability Is Now a Boardroom Priority Even If Nobody Wants to Say It Out Loud

Executives rarely state the full truth publicly, but inside boardrooms the conversation has changed. Observability, once viewed as a technical capability deep within operations, has become a strategic requirement for understanding business performance. Leaders may not always use the term itself, yet they focus intensely on the outcomes it promises. Their environments have grown too fast, too fragmented, and too interdependent for traditional visibility approaches to keep pace.

Debunking the Myth of the Homogeneous Network

If you have been in network operations for more than a week, you know the dream of the single vendor shop is exactly that, just a dream. In the practical reality of your daily job, the network is a diverse, chaotic ecosystem. It is a complex stack in which layers of technology from different times and vendors coexist, often uneasily.

Monitoring Your App Without Running Your Own Prometheus Stack

Prometheus and Grafana are the default monitoring recommendations across DevOps blogs, Reddit, and Hacker News, and for good reason. Prometheus is open-source and backed by the CNCF, but it’s not actually a complete monitoring system. It’s more of a metric collection engine.

Beyond the Queue: Modernizing Legacy Middleware with Apache Kafka 4.x

Apache Kafka 4.x eliminates the final barriers to legacy middleware modernization. With KRaft mode removing ZooKeeper dependency and native queue semantics bridging the gap, enterprises can finally transition from point-to-point messaging to event-driven architectures.

Olivier Pomel and Alexis Lê-Quôc on Datadog's origin, AI, and more | This Month in Datadog

Get an insider’s view of Datadog from the people who built it. On a special episode of This Month in Datadog, co-founders Olivier Pomel and Alexis Lê-Quôc sit down for a rare, in-depth look at the challenge that inspired them to build the Datadog platform, what the company is working on today, AI, and more. This Month in Datadog brings you the latest updates on our newest product features, announcements, resources, and events.

What Are Containers? (And Why "It Works on My Machine" Finally Dies)

What are containers in DevOps—and why do they solve the classic “it works on my machine” problem? In this episode of Cloud Security in a Minute, Sysdig breaks down containers in simple terms: what they are, how they work, and why they’ve become the backbone of modern cloud applications. You’ll learn: Containers package everything an application needs—code, dependencies, and system tools—so it runs consistently anywhere: your laptop, the cloud, or at massive scale.

Monitor Nutanix clusters, hosts, and VMs with Datadog

Nutanix is a hyperconverged infrastructure (HCI) platform that combines compute, storage, and virtualization into a single software-defined stack. By collapsing traditional infrastructure tiers into one platform, Nutanix simplifies provisioning and operations for virtualized workloads. Clusters are managed through Prism Central, which provides visibility into health, performance, capacity, and operational activity across hosts and VMs.

Datadog achieves ISO 42001 certification for responsible AI

As AI-powered products and services become central to how organizations operate, the need for responsible AI governance has never been greater. Customers, partners, and regulators are seeking assurance that AI systems are built, managed, and monitored responsibly and effectively. Datadog is committed to the responsible use of AI, both in how we build our products and in how we help customers observe their AI workloads.

Introducing Bits AI Dev Agent for Code Security

As organizations adopt AI-assisted development and increase their release velocity, they are not only generating more code but also finding more vulnerabilities from static analysis. The traditional remediation workflow of manually triaging issues, creating tickets, and opening individual pull requests (PRs) cannot keep pace. Fixing tens of thousands of vulnerabilities one by one is not a viable remediation strategy.

Automate Your Monitoring and Incident Handling: How Agents Dominate the Checkly CLI

50% of Checkly's CLI users are already coding agents. We predict that agents will become dominant by the end of 2026. This video demonstrates an agentic workflow where an alert reports a broken Shopify store login flow, and Claude Code, using the installed Checkly Skill and the Checkly CLI, pulls monitoring results, identifies a Playwright test failure, investigates the codebase, finds and fixes a bug, and then updates a Checkly status page by creating an incident.

How to Reduce MTTR with AI

The quick download: AI reduces MTTR by helping teams detect issues sooner, pinpoint root causes faster, and resolve incidents with less manual effort. IT downtime costs organizations an average of $9,000 per minute. AI-powered observability can cut incident resolution time by up to 70%. Here’s what it takes to get there. Every minute an incident goes unresolved, the meter is running.

Checkly and the Agentic Software Layer

November 24th, the Opus 4.5 release turned around the entire tech industry. This was the moment when agents became capable. Capable enough to write solid staff-level code. Capable enough to reason about alerts, investigate root causes much faster than most engineers, and set up the reliability layer faster. For me, this feels like an iPhone moment on steroids; the adoption of AI is accelerating much faster than any adoption curve I’ve seen over the past few decades.

One CLI, Two Audiences: How We Built for Agents and Humans

Half of the Checkly CLI users are already coding agents. This is not a prediction — it's what the data shows today. Since February, more and more agents have been using the CLI to manage and configure their Checkly monitoring setups. Right now, we're at 50% human and 50% agentic CLI users. And we predict that by the end of 2026, it won't be humans using the CLI; the agents will have taken over. The terminal became the primary interface for AI agents doing real work in the Checkly ecosystem.

Telegraf Enterprise Beta is Now Available: Centralized Control for Telegraf at Scale

Telegraf is incredibly good at what it does: collecting metrics, logs, and events from just about anywhere and sending them wherever you need. But once Telegraf becomes part of your production telemetry pipeline, spread across environments, teams, regions, and edge locations, the hard part isn’t installing agents; it’s operating them. Configs drift. “Temporary” overrides linger. Rolling out changes across hundreds (or thousands) of agents becomes a careful, manual process.

Finding performance bottlenecks with Pyroscope and Alloy: An example using TON blockchain

Performance optimization often feels like searching for a needle in a haystack. You know your code is slow, but where exactly is the bottleneck? This is where continuous profiling comes in. In this blog post, we’ll explore how continuous profiling with Alloy and Pyroscope can transform the way you approach performance optimization.

Getting Scout Data Into Your AI Workflow

If you’ve spent any time in developer tooling lately, you’ve probably noticed a pattern: every product is rushing to add a chatbot, an AI summary, or some kind of “magic” button. We get it — it’s tempting. But at Scout, we’ve been deliberately taking a different approach. Instead of building AI into our product first, we’ve focused on making Scout’s data accessible to the AI tools you’re already using.

Securing the Future: Scaling AI, Sovereignty, and Resilience in ANZ ITOps

Enterprises in Australia and New Zealand are accelerating AI adoption, driven by strong digital trust frameworks. To remain competitive and compliant, the IT Operations (ITOps) landscape must evolve to manage hybrid complexity and persistent cyber risks. Join us for an exclusive, in-depth webinar as IDC and SolarWinds explore the strategic investments and unique challenges shaping future-proof ITOps across the ANZ region.

Scary Things Happen in Production. Context Helps You Find Them.

Production is a rowdy place of chaos, especially at scale. When you have millions of requests per second flowing through your system, weird things are always happening. Outliers, unusual request patterns, spikes and pulses of traffic from unknown sources, port scanning…it’s all there. To the naked eye, it looks like noise. If you know what you are looking for…patterns emerge. The night sky: every dot is a request. Without intent, it's an undifferentiated field of light.

A new Host Map for modern infrastructure

A host map is a visual representation of your infrastructure that displays hosts and related resources such as clusters, pods, and containers in a single, interactive view. We introduced the Datadog Host Map more than a decade ago to help you “know thy infrastructure” and answer critical questions: Does everything look healthy? Has anything changed? Does the shape of my environment match what I expect?

Monitor Juniper Mist in Datadog

From point-of-sale (POS) terminals to cloud-based applications and mobile devices, reliable connectivity is critical to business operations. Even brief disruptions can negatively impact user experiences, resulting in failed transactions, delayed application responses, or repeated attempts to reconnect. Juniper Mist is an AI-powered networking platform that provides insight into wireless environments, including access point performance and radio frequency health.

An Oh Dear skill for use in Claude Code or Codex

AI coding agents are getting good at calling tools. Claude Code, Codex, and others can run shell commands, parse JSON, and reason about the results. But they need to know what tools are available and how to use them. That's what skills are for. A skill is a small package of documentation that teaches an AI agent how to use a specific tool. We've built one for Oh Dear.

Smarter Alerts, Faster Root Cause, & Proactive IT Ops with SolarWinds AI Observability

Discover how AI is transforming IT operations with SolarWinds Observability. In this video, we showcase powerful new AI-driven features designed to help you detect issues faster, reduce alert noise, and stay ahead of performance problems across your entire stack. From applications and databases to networks, cloud infrastructure, and end-user experience, SolarWinds AI delivers deep insights where it matters most.

When Code Becomes Cheap: The New Reliability Constraint in Software Engineering

For most of the history of software engineering, the primary constraint was production. Code was expensive, skilled engineers were scarce, and shipping features required concentrated human effort. Velocity was limited by how fast people could reason, implement, test, and deploy. That constraint shaped everything from team size, architecture, release cadence, through to how we thought about technical debt. When production is expensive, you optimise for output. You remove friction from shipping.

From raw data to flame graphs: A deep dive into how the OpenTelemetry eBPF profiler symbolizes Go

Imagine you're troubleshooting a production issue: your application is slow, the CPU is spiking, and users are complaining. You turn to your profiler for answers—after all, this is exactly what it's built for. The profiler runs, collecting thousands of stack samples. eBPF profilers, including the OpenTelemetry eBPF profiler, operate at the kernel level, so they capture raw program counters: memory addresses pointing into your binary.

How to Measure MOS Score for VoIP (Step-by-Step)

Poor voice call quality isn't just annoying, it's a productivity killer. Dropped calls mid-negotiation, garbled audio on client meetings, and one-sided conversations where half the words don't make it through: these aren't random technical glitches. They're symptoms of network performance problems that haven't been identified, measured, or fixed. And when your business runs on VoIP, Microsoft Teams, or any cloud-based communication platform, unmeasured voice quality is a liability.

Beyond the Data Lake: Leading Cross-Domain Operational Intelligence

As we wrap up RSAC, one theme that repeatedly emerged in conversations with security leaders is that the modern enterprise has reached a critical inflection point where the velocity of machine-generated telemetry has outpaced the capacity of traditional architectures. This trend requires an approach that moves beyond the storage of information to the activation of it in ways that don’t simply exacerbate alert fatigue.

Migrating from ManageEngine OpManager to WhatsUp Gold: A Practical, No Nonsense Guide

If you’re planning to move from ManageEngine OpManager to the Progress WhatsUp Gold solution, this guide outlines key differences, recommended migration steps, and practical checks to help you transition with minimal disruption. It also includes an example script you can use to start monitoring imported devices in the WhatsUp Gold solution.

ROI of AI: How CIOs Measure Real Business Impact

Artificial Intelligence (AI) has become the buzzword for modern-day businesses. Its promised benefits have lured enterprises into investing heavily in the hope of getting ahead of their competitors. Yet many CIOs are still working out how to get the best ROI from AI for their businesses. While many pilot programs and proofs of concept show promise, in the long run they fail to deliver on it.

How to Automate Your Entire Cloud Deployment Lifecycle with IaC

In today's digital world, businesses depend on cloud infrastructure to run applications, manage data, and deliver services smoothly. However, managing cloud environments manually can quickly become complex and time-consuming. Teams often deal with repeated tasks, inconsistent setups, and unexpected errors.

Cribl Search Demo: Security Investigation

In this demo, Nate Zemanek, Staff Solutions Engineer, shows how Cribl Search runs fast investigations. As an open data platform, Cribl Search lets you pull data from multiple sources and query everything from a single pane of glass. You’ll see how to run fast queries with the new lakehouse engine, search historical data with a federated approach, and bring everything together for full context. Then, use Notebooks to collaborate and share findings across teams to understand what happened, faster.

How a Runtime Aware AI SRE Agent Transforms System Reliability

A runtime-aware AI SRE extends existing AI SRE approaches by moving beyond telemetry correlation into runtime-validated reliability. While the majority of AI SRE tools accelerate incident triage using logs, metrics, and traces, they cannot confirm execution behavior if critical runtime signals were never captured. By generating on-demand evidence inside running services, AI SREs can eliminate slow redeploy cycles, ensuring your distributed systems remain resilient under real-world traffic conditions.

Top Root Cause Analysis Tools Built for Runtime Context

Root cause analysis tools are designed to help engineering teams understand why failures happen in production and other remote environments. As modern systems become more distributed and input-dependent, many incidents cannot be reproduced outside live environments. The stakes are significant: high-impact IT outages cost organizations a median of $2 million per hour, with annual downtime costs reaching $76 million per organization.

The Hidden Tax of Complexity: Why Modern Environments Cost More Than Leaders Realize

Enterprises rarely notice the moment complexity begins to reshape their environment. Growth initiatives move forward. New cloud services are adopted. Modernization programs introduce new architectures. Business units implement tools that solve immediate problems. Acquisitions add their own ecosystems. Each change is logical in isolation. The cumulative effect becomes something else entirely.

AI, Anxiety & 400 Open Windows: GEOFF WRIGHT RETURNS

Geoff Wright returns to unpack the messy reality of work in the AI era. From having 400 windows open and feeling less productive, to explaining why AI should fuel curiosity rather than replace human judgment, Geoff brings his usual mix of optimism, humor, and hard-earned perspective. The conversation explores prompt engineering, digital overwhelm, enterprise adoption, and why “being human first” matters more than ever. It is a wide-ranging, thoughtful discussion on anxiety, complexity, and the promise of AI, with a surprisingly funny detour into why the robots might eventually just leave Earth for Pluto.

How to Protect Website Monitoring from Cloud Disruptions

The cloud is often spoken of as a separate realm where data exists safely away from the messy realities of the physical world. But as the events of March 2026 have reminded us, the cloud has a physical home, and that home is susceptible to the same disruptions as any other infrastructure. Here’s how diversified monitoring across independent data centers can keep visibility intact when cloud services go down.

Mastering DX NetOps Upgrade Automation

Upgrading a large DX NetOps environment with multiple components across distributed infrastructure can be a challenging endeavor. Network interruptions, time-consuming validations, and the need for detailed diagnostics have been persistent pain points for administrators. With the release of version 25.4.6 of the DX NetOps Upgrade Automation Tool, we've addressed these challenges head-on. This release introduces powerful new capabilities that fundamentally change how you approach upgrade operations.

Coralogix Earns 196 Badges in G2 Spring 2026 Reports Across 15 Categories

We’re proud to announce that Coralogix has earned 196 badges across 15 categories in the G2 Spring 2026 Reports, our strongest G2 performance to date. Placing in 369 reports, this represents a significant leap from Spring 2025, when we placed in 318 reports and earned 141 badges. These results are a direct reflection of the trust our customers place in Coralogix and their willingness to share honest feedback on the world’s largest software review platform.

The Role of Employee Monitoring in Securing Remote Teams: A Comprehensive Guide

How secure is your organisation when employees work from anywhere? Remote work has transformed how modern teams collaborate. It offers flexibility, broader talent pools, and improved productivity. Still, it has also introduced new cybersecurity challenges. 92% of IT professionals believe remote work has increased cybersecurity threats, even as organisations struggle to secure remote access points, home networks, and personal devices.

Observability Lessons From OpenAI

Writing code is moving from the good old IDE into the realm of autonomous AI agents. One example of this is OpenAI, which has been developing internally with 0 lines of manually written code. You can read about their workflow in their engineering blog: Harness engineering: leveraging Codex in an agent-first world. For me, the main takeaway of OpenAI’s article is how AI has rewritten the constraints equation.

Leveraging Cognitive Diversity to Tackle System Complexity

Most engineering leaders today understand that diversity matters. They've built teams that reflect a range of backgrounds, functions, and experience levels. They run postmortems, retrospectives, and architecture reviews that bring multiple voices to the table. They believe, not unreasonably, that this variety of perspectives leads to better decisions. But there's a problem hiding inside that assumption that can undermine everything: who people are is a surprisingly poor predictor of how they think.

Icinga Installation Guide - Part 2 - Installing Icinga Director and configuring your first objects

Take the next step with Icinga by adding the powerful configuration management tool Icinga Director to your setup. In this second part of our installation guide, we focus on simplifying and scaling your configuration using the Director. You’ll learn how to connect it to your existing Icinga 2 instance, create reusable templates, and start monitoring hosts and services through a more flexible, web-based interface.

Icinga Installation Guide - Part 1 - Getting started with a base Icinga Installation

Get up and running with Icinga 2 and Icinga Web in this step-by-step installation guide. In this video, we walk you through a complete base installation of Icinga, covering everything from setting up the database to accessing the web interface for the first time. This will help you get to the point of a working installation, especially if you're new to Icinga. We take you through the full process, including installing required components, configuring databases, enabling services, and completing the web setup wizard.

Internet Speed Monitoring - How to Proactively Test Your Internet Connections

Recent enhancements to eG Enterprise have added functionality to allow you to proactively test your internet speed with synthetic monitoring (“robot” tests that simulate real user activity). Using the new functionality, you can proactively monitor internet speeds 24×7 from any location. The performance and quality of an Internet connection play a major role in any IT environment. Use cases for this new functionality include.

How to Communicate the Value of DEX Across Your Organization

For many EUC and Digital Workplace leaders, the challenge with digital employee experience (DEX) isn’t the technology; it’s building alignment. You can see the data. You know where friction exists. You can quantify disruption, productivity loss, and inefficiencies. But you struggle to achieve your targets, because you need buy-in from other teams, and right now, they don’t want to hear anything about DEX. Security has different priorities. Application owners are focused on releases.

Monitor Oracle Fusion Cloud Applications with Datadog

Many organizations rely on Oracle Fusion Cloud Applications to run core business workflows across finance, HR, and supply chain operations. Because these SaaS-based applications run on Oracle Cloud Infrastructure (OCI), engineering teams have limited visibility into their performance. Without direct access to the underlying stack, they often lack the signals needed to detect regressions or investigate degraded user experience.

Explore Kubernetes with native OpenTelemetry data

Kubernetes environments generate a constant stream of signals across clusters, nodes, pods, and workloads. For teams that have standardized on OpenTelemetry (OTel), maintaining ownership of that data is critical. But in practice, many observability platforms require translation into vendor-specific data formats, leading to fragmented product experiences, blank dashboards, and uncertainty about data integrity.

Annotate traces to improve LLM quality with Datadog LLM Observability

LLM applications rarely crash. They degrade quietly. Once these applications are shipped to production, subtle quality failures become harder to catch with traditional signals. Tone shifts, hallucinated details, off-topic responses, and incomplete reasoning can emerge while latency and token usage look stable.

Autonomous IT: What It Is and How to Get Started

Autonomous IT is the operating model where systems detect, decide, and act so your engineers spend less time fighting fires and more time defining what ‘good’ looks like. On a typical day, a mid-size enterprise generates tens of thousands of alerts across on-prem infrastructure, multiple clouds, and AI workloads, including every endpoint. Most of them don’t need a human. A few of them do, and telling the difference, fast enough to matter, is where IT teams are losing ground.

How OpenRouter and Grafana Cloud bring observability to LLM-powered applications

Chris Watts is Head of Enterprise Engineering at OpenRouter, building infrastructure for AI applications. Previously at Amazon and a startup founder. As large language models become core infrastructure for more and more applications, teams are discovering a familiar challenge in a new context: you can't improve what you can't see.

Bridging the gap between mobile experience and technical reality

For mobile-first organizations, the distance between a “slow app” and a “resolved ticket” is often filled with guesswork. Mobile performance is notoriously difficult to capture because it lives at the intersection of device hardware, network stability, and local code execution. Today, we are closing that gap with the launch of Coralogix Mobile Performance.

Making encrypted Java traffic observable with eBPF

Coroot's node agent uses eBPF to capture network traffic at the kernel level. It hooks into syscalls like read and write, reads the first bytes of each payload, and detects the protocol: HTTP, MySQL, PostgreSQL, Redis, Kafka, and others. This works for any language and any framework without touching application code. For encrypted traffic, we attach eBPF uprobes to TLS library functions like SSL_write and SSL_read in OpenSSL, crypto/tls in Go, and rustls in Rust.
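
The idea of detecting a protocol from the first bytes of a payload can be illustrated in user space. The following is a minimal Python sketch, not Coroot's eBPF code: the function name `detect_protocol` and the two heuristics (HTTP request/response prefixes and the Redis RESP array header) are assumptions chosen for illustration, and a real detector would cover more protocols and edge cases.

```python
# Hypothetical illustration of first-bytes protocol detection.
# This is NOT Coroot's implementation; it only shows the general idea.

HTTP_PREFIXES = (
    b"GET ", b"POST ", b"PUT ", b"DELETE ",
    b"HEAD ", b"PATCH ", b"OPTIONS ", b"HTTP/1.",
)


def detect_protocol(payload: bytes) -> str:
    """Guess the application protocol from the leading bytes of a payload."""
    # HTTP/1.x requests start with a method; responses start with "HTTP/1."
    if payload.startswith(HTTP_PREFIXES):
        return "http"

    # Redis (RESP) commands are arrays: "*<element count>\r\n..."
    if payload[:1] == b"*" and b"\r\n" in payload[:16]:
        count = payload[1:payload.index(b"\r\n")]
        if count.isdigit():
            return "redis"

    return "unknown"


if __name__ == "__main__":
    print(detect_protocol(b"GET /health HTTP/1.1\r\nHost: example\r\n\r\n"))
    print(detect_protocol(b"*2\r\n$3\r\nGET\r\n$3\r\nkey\r\n"))
```

Only a short prefix is inspected, which is what makes this kind of check cheap enough to run on every payload.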

What is Virtana Application Observability and how is it different?

Application Observability, Built for Hybrid Reality. Modern applications don’t live in one place. A single transaction might span: Traditional APM shows you the trace. But hybrid reality doesn’t stop at the service layer. True application observability ties transactions to the infrastructure that actually delivered them across cloud, on-prem, and everything in between. Because in hybrid environments, the root cause rarely lives in just one tier.

Grafana Campfire - Release Pipelines - (Grafana Community Call - March 2026)

In this Campfire Community call, we'll be exploring Grafana's release pipelines - covering both our on-prem (public and private) artifact delivery and our Rolling Release Channels for building Grafana Cloud. We'll walk through the fundamentals of how our pipelines work, including how ICs can patch branches and manage their own core Grafana releases, and where we're headed in the future. Plus much more!

API Availability Monitoring: How to Measure True API Availability

APIs are no longer just integration layers. They power customer logins, payment processing, SaaS workflows, partner ecosystems, and mobile applications. When an API becomes unavailable, revenue stops, user trust declines, and service level agreements are immediately at risk. Yet many teams still define API availability in the simplest possible way. If an endpoint responds with a 200 OK, the API is considered available. Monitoring dashboards stay green. Alerts remain silent. Everything appears healthy.

API Error Monitoring: A Complete Guide to Detecting and Resolving API Failures

APIs power nearly every modern digital experience. From mobile apps and SaaS platforms to payment gateways and internal microservices, APIs handle authentication, transactions, content delivery, and system-to-system communication. When an API fails, users often experience broken features, slow responses, or complete service outages. In many cases, they leave before your team even realizes something is wrong. The business impact of API failures is significant.

Applications Manager now officially supports Podman monitoring!

As organizations shift away from traditional container engines to embrace Podman’s rootless and daemon-less design, visibility often becomes a challenge. Because Podman doesn't rely on a central background service, traditional monitoring tools can leave you in the dark. Applications Manager's new Podman monitoring feature bridges that gap, giving you total visibility into your Podman workloads without compromising the security model you worked so hard to build.
Sponsored Post

The AI Readiness Paradox: The Agentic Value Gap And The Agentic Operational Model

The disconnect between enterprise confidence and AI capability is real. MIT reports fewer than 5% of enterprises have achieved measurable ROI from AI, yet Cisco claims 13% feel ready. The gap isn’t about AI technology—it’s about organizational rigidity and change management. More importantly, most studies focus on business intelligence rather than operational use cases, which are far less risky and more measurable.

Benchmarking Kubernetes Log Collectors: vlagent, Vector, Fluent Bit, OpenTelemetry Collector, and more

At VictoriaMetrics, we built vlagent as a high-performance log collector for VictoriaLogs. To validate its performance and correctness under a real production-like load, we developed a benchmark suite and ran it against 8 popular log collectors. This post covers the methodology, throughput results, resource usage, and delivery correctness. Collectors under the test: We’ve made all benchmark configurations and source code public, so you can reproduce and verify the results independently.

What is Kubernetes? Explained in 2 Minutes

What is Kubernetes, and how do companies like Netflix handle millions of users without crashing? In this quick guide, we break down Kubernetes in simple terms — from containers to pods, nodes, and the control plane — so you can understand how modern cloud applications stay reliable and scalable. Kubernetes acts like an air traffic controller for your apps, automatically managing where they run, restarting them if they fail, and balancing traffic across machines. Whether you're new to cloud computing or brushing up on DevOps basics, this video gives you a clear, beginner-friendly explanation.

Instrument zero-code observability for LLMs and agents on Kubernetes

Building AI services with large language models and agentic frameworks often means running complex microservices on Kubernetes. Observability is vital, but instrumenting every pod in a distributed system can quickly become a maintenance nightmare. OpenLIT Operator solves this problem by automatically injecting OpenTelemetry instrumentation into your AI workloads—no code changes or image rebuilds required.

Monitor Model Context Protocol (MCP) servers with OpenLIT and Grafana Cloud

Large language models don’t work in a vacuum. They often rely on Model Context Protocol (MCP) servers to fetch additional context from external tools or data sources. MCP provides a standard way for AI agents to talk to tool servers, but this extra layer introduces complexity. Without visibility, an MCP server becomes a black box: you send a request and hope a tool answers. When something breaks, it’s hard to tell if the agent, the server or the downstream API failed.

Observe your AI agents: End-to-end tracing with OpenLIT and Grafana Cloud

In another post in this series, we discussed how to instrument large language model (LLM) calls. This can be a good starting point, but generative AI workloads increasingly rely on agents, which are systems that plan, call tools, reason, and act autonomously. And their non‑deterministic behavior makes incidents harder to diagnose, in part, because the same prompt can trigger different tool sequences and costs.

How to monitor LLMs in production with Grafana Cloud, OpenLIT, and OpenTelemetry

Moving a large language model (LLM) application from a demo to a production‑scale service raises very different questions than the ones you ask when playing with an API key in a notebook. In production, you have to answer: How much is each model costing us? Are we keeping latency within our service‑level objectives? Are we accidentally returning hallucinations or toxic content? Is the system vulnerable to prompt‑injection attacks?

Balancing Data Locality, Data Sovereignty, and Data Replication

Modern distributed systems must simultaneously respect where data must live, where it should live for performance, and where it needs to live for resilience. Data sovereignty and residency requirements increasingly affect technical design decisions, not only in regulated industries, but in any global product that must navigate regional expectations, latency constraints, cost structures, and operational realities.

Datadog Data Observability enables you to detect data quality and pipeline issues early

See our latest episode of This Month in Datadog for a spotlight on Datadog Data Observability, which enables you to detect data quality and pipeline issues early, as well as remediate those issues with end-to-end lineage. We also cover: This Month in Datadog brings you the latest updates on our newest product features, announcements, resources, and events.

Seer fixes Seer: How Seer pointed us toward a bug and helped fix an outage

Seer is our AI agent that takes bugs and uses all of the context Sentry has to find the root cause and suggest a fix. We use it all the time to help us improve Sentry. Seer fixes Sentry. More recently, Seer has been helping us fix itself — Seer fixing Seer. An upstream outage triggered a bit of an avalanche, revealing a bug that had been hiding away for months. When it came time to fix it, Seer pointed us exactly where we needed to look.

Error Monitoring for Elixir: Now in Scout APM

Elixir’s “let it crash” philosophy is one of the best ideas in modern software design. Supervisors restart failed processes, the system self-heals, and life goes on. It’s like having a really good immune system. The problem is that a really good immune system can also hide chronic conditions. A GenServer crashing and restarting is working as designed.

API Response Time Monitoring: Metrics, SLAs & Optimization Guide

Modern applications are powered by APIs. Every login request, checkout transaction, mobile interaction, and third-party integration depends on APIs responding quickly and reliably. When an API slows down, the entire user experience suffers. Even a one-second delay in response time can: For ecommerce platforms, fintech systems, SaaS products, and real-time applications, slow APIs do not simply create inconvenience. They directly affect revenue, customer retention, and operational stability.

API Observability Tools: Complete Guide to Platforms, Features & Use Cases (2026)

Modern software runs on APIs. Whether you are operating microservices, integrating third party services, or building customer facing platforms, APIs are the backbone of your architecture. As systems become more distributed, simply knowing whether an endpoint is up or down is no longer enough. Teams need deeper visibility into performance, reliability, latency, and behavior across environments. That is where API observability tools come in. API observability goes beyond basic health checks.

API Status Monitoring: Real-Time Health & Uptime Tracking

APIs sit at the center of modern digital infrastructure. Mobile applications, SaaS platforms, microservices, and third party integrations all depend on APIs to exchange data and execute business logic in real time. When an API becomes unavailable, slows down, or returns incorrect data, users feel it immediately. Transactions fail. Dashboards stop updating. Logins break. Revenue and trust are affected within minutes.

VirtualMetric DataStream + Splunk: Pre-Ingest CIM Normalization Without the TA Tax

Splunk is built around a deceptively simple premise: get your data in, search it, and act on it. In practice, the gap between “get your data in” and “data that actually works in Splunk ES” is where most of the engineering effort goes. CIM normalization is non-trivial. Technology Add-on development is slow. Volume-based licensing penalizes growth. And the combination means that as environments expand, Splunk becomes harder to operate efficiently.

What Engineers Want from AI in Observability... According to the 2026 Observability Survey Report

The results show strong interest in AI for forecasting, root cause analysis, onboarding, and generating dashboards, alerts, and queries. But when it comes to autonomous action, practitioners are more cautious — and 95% say AI needs to show its work to earn trust.

Unifying Telemetry in Battery Energy Storage Systems

Battery energy storage systems (BESS) play a critical role in modern energy infrastructure. Utilities rely on these systems to balance renewable generation, stabilize grid operations, and respond to changing electricity demand. As deployments scale in size and complexity, operators require continuous insight into battery health, system performance, and grid interaction. Operators rely on telemetry generated across several operational platforms.

Architecting Log Management for Privacy and Scale without the Headache

As companies grow, they inevitably hit a wall: observability data explodes while privacy requirements become stricter. For years, engineers have faced a painful tradeoff—either ship petabytes of sensitive data to a central cloud (incurring egress costs and compliance risks) or manage a complex self-hosted stack that is painful to scale.

The Cognitive Ceiling: Why Modern Environments Outgrew Human Interpretation

For more than a decade, organizations invested in tools and telemetry with the belief that more visibility would create more control. Monitoring expanded across cloud, application, network, and infrastructure layers. Observability platforms entered the mainstream. Automation tools promised faster detection and improved coordination. Yet despite these advancements, incidents are not easier to understand. War rooms still fill with conflicting interpretations. Signals generate more questions than answers.

You're probably overdue for a Sentry SDK upgrade

Session Replay. Structured logs. AI monitoring. Automatic OpenTelemetry tracing. Feature flag tracking. If you haven't seen these in your Sentry dashboard, your SDK version is probably the reason. Whether you're on @sentry/react, @sentry/nextjs, @sentry/vue, @sentry/angular, @sentry/sveltekit, or any other @sentry/* package, they all version together. When we say v10, we mean all of them.

How to Perform a Network Health Check: Step-by-Step Guide

Your apps are slow. Users are complaining. You're staring at a dashboard trying to figure out what broke and when. Sound familiar? This is the reality of reactive network monitoring. By the time someone opens a ticket, the issue has already been affecting performance for minutes, sometimes hours. A network health check flips that script. Instead of chasing problems after the fact, you're catching them before users ever notice.

Claude Code is running bash commands on your infrastructure. Here's how to watch it.

I’ve been staring at Claude Code telemetry for the past few weeks, and I keep noticing the same thing: most teams drop it into their environment, say “it’s amazing,” and have absolutely no idea what it’s actually doing at the system level. That’s fine for a personal dev tool. It’s not fine when you’ve rolled it out to 50 engineers.

Claude Code + Lightrun MCP: Your AI Agent Now Has Live Runtime Vision

Claude Code, Anthropic’s coding agent, now integrates with Lightrun through MCP. AI code assistants have been flying blind. Google DORA’s 2025 report found this is causing an almost 10% increase in code instability. Even with up to 1M tokens of context available in Claude, this powerful agent cannot see how the code it writes actually behaves inside a live system under real traffic, real dependencies, and a load of 10,000 requests per second.

How to Manage Icinga with Ansible Webinar

Managing monitoring environments shouldn’t be a manual chore. In this hands-on webinar, we show you how to fully automate your Icinga infrastructure using the Ansible Collection for Icinga. We take you step by step through everything from installing Icinga 2 to configuring master instances, setting up monitoring agents, building core objects, and integrating common components like Icinga Web, all driven by Ansible.

Bridging the Gaps in Modern Operations: How Real-Time Messaging Improves System Reliability

In modern IT environments, reliability is no longer defined solely by system uptime or infrastructure resilience. It is equally shaped by how effectively systems, teams, and processes communicate under pressure. As architectures become more distributed and operations more complex, the gaps between tools, teams, and data streams have become one of the most persistent challenges in maintaining consistent performance.

Cloud Migration Statistics for 2026

Cloud adoption has officially crossed a tipping point. In 2026, the conversation is shifting from whether companies are moving to the cloud to how complicated things are getting once they’ve moved. Hybrid architectures, multi-cloud strategies, AI workloads, and rising security pressure are turning “the cloud” into a web of interconnected environments. For IT and network teams, that creates huge opportunity—and plenty of room for chaos if visibility doesn’t keep pace.

Buy vs Build in the Age of AI (Part 3)

In Part 1, we looked at how AI has reduced the cost of building monitoring tools. Then in Part 2, we explored the operational and economic burden of owning them. Now we need to talk about something deeper. Because the real shift isn’t just economic; it’s structural. AI isn’t just helping engineers write code faster. It’s accelerating the entire software ecosystem; including how monitoring tools are built, maintained, and trusted.

Production Is Where the Rigor Goes

In early February, Martin Fowler and the good folks at Thoughtworks sponsored a small, invite-only unconference in Deer Valley, Utah—birthplace of the Agile Manifesto—to talk about how software engineering is changing in the AI-native era. They recently published a summary of key insights and themes from the summit, sorted into ten topical buckets.

AppSignal's MCP Server: Connect AI Agents to Your Monitoring Data

Your AI coding assistant already knows your codebase. Now it can know your production environment too. AppSignal's MCP server gives AI agents and AI code editors direct access to your monitoring data — errors, performance metrics, and more — so they can help you debug, investigate and resolve issues without switching context. And with our new public endpoint, getting started is simpler than ever.

Scaling Kubernetes workloads on custom metrics

The 2025 State of Containers and Serverless report found that 64% of organizations use the Kubernetes Horizontal Pod Autoscaler (HPA) to manage Kubernetes workload capacity. But only 20% of those deployments scale on custom metrics. The other four-fifths of organizations rely on resource metrics—CPU and memory utilized by their pods—to trigger autoscaling activity.
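
To make the difference concrete, the scaling rule documented for the Horizontal Pod Autoscaler is the same whether the input is CPU or a custom metric: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. Below is a minimal Python sketch of that rule. The function name, the queue-depth numbers in the usage lines, and the choice to model the tolerance band as a simple parameter are assumptions for illustration; the real controller also applies stabilization windows and min/max replica bounds that are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA rule: ceil(currentReplicas * current / target).

    `tolerance` models the band around a ratio of 1.0 inside which
    no scaling happens (assumed 10% here).
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do nothing
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Hypothetical custom metric: queued jobs per pod, target of 100.
    print(desired_replicas(4, 300, 100))  # scale out
    print(desired_replicas(4, 105, 100))  # within tolerance
    print(desired_replicas(4, 50, 100))   # scale in
```

A custom metric such as queue depth per pod can react to load that CPU and memory never reflect, for example when workers sit idle waiting on I/O while the backlog grows.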

How to design cloud environments for AI-powered threat analysis

Cloud environments generate high volumes of security signals every day. With each one, you have to determine if it’s benign, a clear false positive, or something worth investigating. The challenge is needing to make these calls continuously, often without knowing whether any single event is part of a larger attack. Spending too much time investigating benign activity reduces the ability to detect threats elsewhere, and missing a legitimate threat has clear consequences.

AI in observability in 2026: Huge potential, lingering concerns

The role of AI in observability is evolving rapidly, but the data from our fourth annual Observability Survey makes one thing abundantly clear: the potential is real, and so are the reservations. Practitioners overwhelmingly see value in using AI to help surface anomalies, forecast and spot trends, assist with root cause analysis, and get new users up to speed quicker.

Open standards in 2026: The backbone of modern observability

Open source software and open standards are now an essential part of how organizations maintain their systems. That's not to say they haven't always been important, but the fourth annual Observability Survey, brought to you by Grafana Labs, shows just how deeply the shift to open has taken hold, with 77% of respondents saying open source and open standards are important to their observability strategy.

Engineers Want AI in Observability - With One Catch: 4th Annual Observability Survey by Grafana Labs

Actually useful AI is welcome in observability. AI for the sake of AI is not. In this overview of Grafana Labs’ 4th annual Observability Survey, Marc Chipouras shares what 1,300+ respondents from 76 countries told us about the current state of observability — and what comes next. This year’s survey explores four major themes: The results show strong interest in AI for forecasting, root cause analysis, onboarding, and generating dashboards, alerts, and queries. But when it comes to autonomous action, practitioners are more cautious — and 95% say AI needs to show its work to earn trust.

The World's Best Infrastructure Teams Trust Kentik

Why do network and infrastructure teams at leading enterprises including Canva, Dropbox, Google, ConocoPhillips, and ServiceNow choose Kentik? In their own words, customers describe epic cost savings, dramatic return on investment, and blockbuster efficiency improvements that only Kentik can deliver. Learn why Kentik is the must-see network intelligence solution for any enterprise that depends on reliable connectivity.

Monitor schema health with engine.schema_fields: Structure, Drift, and Volatility

If you’ve worked with an observability pipeline, you’ve probably experienced schema problems: a field disappears, a type shifts from string to number, or a new label quietly appears. The causes are everywhere. Different teams adopt different naming conventions. A dependency upgrade changes the shape of a library’s log output. Over time, these small, reasonable decisions compound into schema sprawl: dashboards break, alerts misfire, and teams scramble to find out what happened.
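
The three failure modes named above (a field disappears, a type shifts, a new label appears) can be checked by comparing two snapshots of a schema. The following is a small Python sketch under stated assumptions: each snapshot is a plain mapping of field name to type name, and the function `diff_schema` is hypothetical. It does not use or represent the `engine.schema_fields` interface from the article title.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list]:
    """Compare two {field: type} snapshots and report drift."""
    removed = sorted(old.keys() - new.keys())   # fields that disappeared
    added = sorted(new.keys() - old.keys())     # fields that quietly appeared
    retyped = sorted(                           # fields whose type changed
        (name, old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    )
    return {"removed": removed, "added": added, "retyped": retyped}


if __name__ == "__main__":
    yesterday = {"status": "string", "latency_ms": "number", "host": "string"}
    today = {"status": "number", "latency_ms": "number", "region": "string"}
    print(diff_schema(yesterday, today))
```

Running a comparison like this on a schedule turns silent drift into an explicit report that can feed an alert before dashboards break.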

Flow State in an AI Workplace - Digital Friction 1:1 with Mike Lovewell

Tom welcomes Mike Lovewell to explore how digital friction continues to shape the modern workplace. From early days of low awareness to today’s complex, AI-influenced environments, Mike shares how friction has evolved in scale rather than cause. They discuss the growing importance of flow state, the measurable business impact of small disruptions, and why adoption—not just technology—is the key to success. AI emerges as both a solution and a new source of friction, depending on trust and usability.

Product Update - March 2026

IncidentHub's latest product updates focus on improving the public status page, adding integrations with ticketing systems, private status page ingestion, and making the notifications more useful to the end user. Some of these improvements are driven by user feedback. Feedback is what makes the product better, and I am personally grateful to all our customers who have shared their feedback with us.

Network Monitoring as Code

Dangling DNS, TCP handshake failures, packet loss: your network has blind spots that application-level dashboards miss. In this session, Daniel Paulus (VP Engineering, Checkly) sets up DNS, TCP, and ICMP monitors from scratch and deploys them as code using the Checkly CLI. You'll see how to import checks from the UI to a code project, use coding agents to build monitors, and debug network failures with Rocky AI, trace routes, and packet captures.

Free escalation procedure template (download & customize)

Your monitoring fires at 2 AM. The on-call engineer picks up but doesn't know who to call next, what information to include, or which Slack channel to use. Sound familiar? That's what happens when escalation procedures exist only in people's heads — or worse, don't exist at all. The fix isn't complicated: a documented escalation procedure that every team member can follow under pressure. The problem is building one from scratch takes hours.

Complete HTTP Status Codes List & Reference (2026)

This is a comprehensive reference of every HTTP status code defined in the HTTP specification (RFC 9110) and common extensions. Use it as a quick lookup when you encounter a status code in your browser, server logs, or API responses. For a beginner-friendly guide to the most common codes, see From 200 to 503: Understanding the Most Common HTTP Status Codes.

Bridge the DevSec divide: Using Grafana Cloud and Miggo for runtime protection

Note: This blog post is co-authored by Daniel Shechter, CEO and co-founder of Miggo Security. Modern runtime security is critical to understand complex systems and detect and protect against attacks, especially in rapidly evolving cloud native architectures. For many security teams, however, achieving deep visibility into runtime risks remains a moving target.

5 Database Monitoring Tips Every DBA Should Use to Reduce Firefighting

This is a guest post from udara.ratnakumara. In a recent webinar I hosted with my colleague Chris Hawkins, Inside a DBA’s Day: What Really Happens and How to Stay Ahead, we talked through the realities of a typical DBA day and the practical ways teams can stay ahead of issues rather than constantly reacting. For many DBAs, the day doesn’t start with coffee. It starts with an alert. A report is suddenly slow. An application query is timing out.

From Data Chaos to Results: The New Data Strategy for the Agentic Era

The world is generating data at a pace that defies the human ability to draw insights and comprehend. By 2028, we’ll reach almost 400 zettabytes of global data—with over 55% of it coming from machines talking to machines. For enterprises, this isn’t just a storage problem; it’s an existential challenge.

How Does Skylar Advisor Cut Alert Noise?

What if you could start your day without hundreds of alerts? Skylar Advisor transforms noisy event streams into a short list of prioritized advisories by grouping related alerts and signals together. It shows what is happening in your environment, explains why it matters, and provides clear next steps so instead of chasing alerts, IT teams get guidance focused on real operational impact.

How GDIT Automated Early Response to Preserve Critical Event Context

In this video, Jason Boig, Solutions Engineer at GDIT, shares how his team uses ScienceLogic to streamline network infrastructure monitoring and improve response times. Instead of relying on manual processes after an alert is triggered, ScienceLogic helps automate the initial response and capture critical data the moment an event occurs. This ensures nothing is lost as conditions change and gives teams immediate visibility into issues.

Fair Source Software in the AI age

Have you noticed AI recently? Yeah, us too. Generative AI is wreaking havoc on the software status quo, and that includes licensing, and that generates … opinions. Sentry has a long history of having opinions about software licensing. We started life as an unlicensed side project in 2008, then went through BSD, to BSL, to writing our own license, FSL.

The Hidden Crisis in Modern IT: Interpretation Risk

Technology leaders spent the past decade investing heavily in visibility. They expanded monitoring footprints, adopted cloud-native observability tools, integrated analytics dashboards, and layered on automation intended to streamline detection. Every addition promised deeper insight. Every initiative aimed to bring clarity to increasingly complex environments. Yet operations feel more chaotic, not less. Outages move faster. Incidents cross more boundaries. Signals appear without context.

Shifting Metrics Right

In the shift left era where it feels like we’re pushing everything as far to the start of the SDLC as we can, it may seem counterintuitive to shift anything right. That is, however, exactly what I suggest when it comes to generating metrics. How far you go to the right of the SDLC is a much more nuanced question and is dependent on a lot of factors, and on what metrics you’re talking about.

Instrumenting Rust TLS with eBPF

Coroot is an open source observability tool that uses eBPF to collect telemetry directly from applications and infrastructure. One of the things it does is capture L7 traffic from TLS connections without any code changes, by hooking into TLS libraries and syscalls. Works great for OpenSSL. Works for Go. Then rustls enters the picture and everything stops being obvious. With OpenSSL, everything is nicely wrapped: From eBPF’s point of view this is perfect: Everything happens inside one call.

Event Intelligence for Agentic IT Operations

Modern IT teams are experimenting with AI agents. But individual agents working in isolation are not enough. To truly achieve Agentic IT Operations, organisations need a platform, one that coordinates, governs, and contextualises AI-driven actions across the entire IT landscape. That’s where Interlink Software comes in.

Monitor your application and network load balancer logs

Load balancers are the primary entry points to distributed applications. By strategically directing the flow of incoming web traffic to specific endpoints, load balancers help optimize throughput and ensure the horizontal scalability of applications. In modern systems, load balancers often do more than their name suggests: Beyond basic load distribution, they analyze requests and route traffic based on a wide range of variables, such as client identity.

Cost Optimization in Action: How We Cut Amazon SQS Costs by 87%

JC, the Director of Software Engineering, Cloud at LogicMonitor, shares how Cost Optimization enabled his team to shift to Cost-Intelligent Observability and tackle an unexpected and growing cloud bill. As engineers, we live and breathe performance. We obsess over latency, reliability, and uptime, the hallmarks of a healthy system. But there’s another metric that’s becoming just as critical: cost.

A New Scale Tier for Amazon Timestream for InfluxDB

InfluxDB 3 on Amazon Timestream for InfluxDB now scales to 15-node clusters, unlocking higher ingestion, greater query concurrency, and real-time performance at scale. In this video, PM Pete Barnett breaks down what this means for high-resolution, high-velocity workloads, and how you can scale from Core to Enterprise with zero downtime or data migration.

Taming the Broker Network: Achieving Reliable Apache ActiveMQ Operations

Broker networks grow from success but often become fragile webs. A global retailer's journey from Apache ActiveMQ chaos to reliable operations shows how unified visibility, automation, and governed self-service transform messaging from liability to strategic asset.

Captur: Observability-First Mobile ML Inference for Better Customer Confidence

Captur builds a mobile SDK that brings real-time image recognition and actionable feedback directly into customers’ apps, running complex machine learning models entirely on device without cloud inference. This architecture delivers privacy and performance, but also creates unique challenges when it comes to observability and debugging, especially as crashes can originate from the host app rather than the SDK itself.

Episode 8 - The Rise of Autonomous Teams

In this episode of The Intelligent Enterprise, host Tom Stoneman takes us inside the evolving use-cases for AI across different enterprises. Digitate recently conducted a survey of over 600 IT decision makers from across North America. The aim was to get a better sense of how AI tools are being implemented across workplaces — and the results are fascinating.

DevEx Talks episode 2 - Women in DevRel: What Matters in Open Source?

In this DevEx Talks episode, Adriana Villela and Cortney Nickerson explore what truly matters in open source through the lens of women in Developer Relations and Community roles. From diverse career paths to navigating DevRel as women in tech, they share honest reflections on impact, feedback, and long-term motivation in cloud native ecosystems.

Role of Control Room Design in Improving Monitoring Accuracy

Monitoring mistakes rarely happen randomly. Most of them originate in control rooms where operators struggle with poorly positioned screens, awkward equipment placement, or lighting that makes critical data difficult to see. In high-stakes environments like power grids, security operations, transportation systems, and manufacturing plants, monitoring accuracy directly affects operational stability and safety. Even highly skilled operators can make mistakes when their workspace works against them.

How to Set Up Heatmaps on Your Website with Hotjar

Heatmaps are visual tools that help you see where people click, scroll, or spend time on your web pages. Hotjar is among the most popular tools for creating heatmaps, and it is not hard to install as long as you follow the steps below. At the conclusion of this guide, you will know.

A New Scale Tier for Time Series on Amazon Timestream for InfluxDB

When we first announced the availability of InfluxDB 3 Core and Enterprise on Amazon Timestream for InfluxDB last year, we set a new standard for managed time series on AWS. We gave developers a simple way to harness high performance at scale while removing the burden of infrastructure management. But as our customers have taught us, “at scale” is a moving target. Across Industrial IoT, physical AI, and real-time observability, data is growing in both volume and resolution.

Quickly go from exploration to action with new one-click integrations in Grafana Drilldown

The Grafana Drilldown apps give you a queryless, point-and-click way to explore your metrics, logs, traces, and profiles. But finding an insight is only half the job: you still need to act on it. Previously, that meant leaving Drilldown, manually copying queries, and navigating through Grafana's dashboards, Alerting, and "Explore" interfaces to pick up where you left off.

Choosing a JavaScript logging library: The 2026 definitive guide

With AI writing more and more of our code, properly monitoring and debugging that code has become an increasingly critical part of the development workflow that can't be ignored. Luckily, we have more time than ever to implement the right tools to do so. Implementing a production-ready logging solution is easy to do, and provides you and your LLM Agents with a wealth of debugging information from your app, across users and environments.

Architecting the Future: The evolution of Apache ActiveMQ for enterprise messaging and the path to mission control

Apache ActiveMQ is evolving from simple transport to intelligent fabric. Key shifts include replicated KahaDB for cloud-native resilience, Spring decoupling in v7, and OpenTelemetry observability—transforming messaging infrastructure for modern enterprise needs.

How to Spot Vulnerabilities in Your Supply Chain Quickly

Ensuring shipments are secure before leaving a warehouse is essential for preventing losses and delays. Essential checks before approving a shipment for dispatch include verifying documentation, inspecting packaging, and confirming that transport processes are properly followed. Completing these checks helps logistics teams detect potential problems before they escalate into costly issues. Supply chain vulnerabilities can disrupt operations, create financial risks, and damage a company's reputation. Taking proactive steps ensures that goods reach their destination safely and efficiently.

Golang memory arenas [101 guide]

Go 1.20 introduced an experimental arena package that lets you allocate many objects from a contiguous region of memory and free them all at once — bypassing the garbage collector entirely. The package remains experimental and its future is uncertain, but arenas are a valuable concept for understanding Go memory management and writing high-performance code. The arena package is experimental and on hold indefinitely. The Go team has made no guarantees about compatibility or its continued existence.
Sponsored Post

Top infrastructure monitoring mistakes (and how to avoid them)

Infrastructure monitoring is meant to simplify operations, not overwhelm teams with noise. Yet the average IT team receives more than 10,000 alerts every day. Despite this constant stream of notifications, critical issues still slip through the cracks. This volume of fragmented data creates a dangerous visibility gap across the infrastructure. As a result, teams can spend more time sorting through alerts than actually resolving issues.

The Obkio Story: Building a Network Observability & Monitoring Solution

In 2016, before Obkio existed, we ran a market audit. We interviewed banks, manufacturing companies, and service providers, and asked them one simple question: Why aren't you using a Network Performance Monitoring solution? The answer was unanimous: the tools were too complex, and nobody had the internal resources to run them full-time. If that was true for enterprises with dedicated networking staff, it was even more true for smaller businesses with generalist IT teams.

Reduce alert noise with Site24x7's Event Correlation

Alert fatigue remains one of the most underestimated problems in IT operations. Srinivasa Raghavan, director of product management, explains how event correlation addresses it. Event correlation is the process of grouping related alerts from across your infrastructure into a single, contextual incident to reduce the volume of noise during an outage or service degradation. In this short clip, Srinivasa walks through how the feature functions and why high-volume alert environments make this kind of signal-to-noise reduction operationally significant.
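
As a rough illustration of grouping related alerts into one incident, here is a minimal sketch. It is not Site24x7's algorithm: the `Alert` and `Incident` types, the `correlate` function, and the rule itself (same resource, no more than `window_seconds` since the previous alert) are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary start point
    resource: str
    message: str


@dataclass
class Incident:
    resource: str
    alerts: list[Alert] = field(default_factory=list)


def correlate(alerts: list[Alert], window_seconds: float = 300.0) -> list[Incident]:
    """Group alerts on the same resource whose gaps stay within the window."""
    incidents: list[Incident] = []
    latest_by_resource: dict[str, Incident] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = latest_by_resource.get(alert.resource)
        # Start a new incident if none exists or the previous alert is too old.
        if incident is None or alert.timestamp - incident.alerts[-1].timestamp > window_seconds:
            incident = Incident(resource=alert.resource)
            incidents.append(incident)
            latest_by_resource[alert.resource] = incident
        incident.alerts.append(alert)
    return incidents


# Hypothetical alert stream: four alerts collapse into three incidents.
alerts = [
    Alert(0, "db-1", "high latency"),
    Alert(60, "db-1", "connection errors"),
    Alert(90, "web-1", "5xx spike"),
    Alert(1000, "db-1", "high latency"),
]
incidents = correlate(alerts)
```

Real correlation engines typically also use topology and dependency data; a time window per resource is only the simplest possible grouping key.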

OpAMP for OpenTelemetry: Managing Collector Fleets and Introducing the New OpAMP Gateway Extension

Today, Bindplane is launching the OpAMP Gateway Extension in alpha — a new component that extends OpAMP fleet management into network-segmented and firewalled environments where direct agent-to-server connectivity is not possible. It also addresses fleet scaling by fanning many agent connections into a small upstream pool, reducing connection load on the OpAMP server. We also hope to donate the OpAMP Gateway Extension upstream to the OpenTelemetry project and welcome community contributions.

Cloud Observability Is Broken - Hybrid Operations Need a New Intelligence Model

Cloud adoption was supposed to simplify operations. Infrastructure would become programmable, scalability would become elastic, and distributed architectures would enable resilience at global scale. In practice, cloud has delivered extraordinary flexibility, but it has also introduced a level of operational complexity that traditional observability approaches were never designed to handle.

What is Industry 4.0? Everything You Need to Know in 2026

Industry 4.0 is the term used to describe the fourth industrial revolution, a name given to the integration of physical and digital systems, which includes the internet of things (IoT) and artificial intelligence that are transforming a huge number of industries. At a high level, its goal is to create an efficient, automated process for creating products or services that can be adapted quickly and efficiently to changing customer needs.

Digital Adoption + AI: The Secret Route to Zero Tickets

Generative AI has the potential to transform workplace productivity – but do organizations know how to deliver on that promise? New research shows that employees who use generative AI tools engage with them up to ten times per day, spending over three hours per week interacting with AI at work. And yet within the same organizations, large groups of employees have never meaningfully engaged with these tools at all.

4 Key DEXOps Process Improvements

Most IT organizations want to improve the digital employee experience. But good intentions alone rarely move the needle. The real shift happens when organizations evolve how IT operates. Traditional IT operations are built around reacting to incidents. But ticket-based operations, or operations based on poor data, lack the ability to create truly predictive ways of working.

Why the New Normal in Cyberattacks Demands Network Intelligence

As cyberattacks evolve into “machine-speed” disruption campaigns that span cloud, identity, and network planes, traditional monitoring is no longer enough to protect modern enterprise infrastructure. Shifting to a network intelligence model, powered by real-time telemetry and AI-driven reasoning, enables security teams to detect weak signals and automate defenses before an incident becomes systemic.

MCP and A2A: What They Are and Why They Matter for Autonomous IT

MCP and A2A are the two protocols that make agentic AI governable at enterprise scale. One controls how agents use tools, and the other controls how agents work together. AI in the enterprise is no longer confined to chat windows. It’s operating inside incident queues and automation pipelines. Increasingly, teams are using AI agents to take action: detecting incidents, executing remediations, updating tickets, coordinating across systems.

What is SSL Certificate Monitoring?

SSL Certificate Monitoring is the automated process of validating the integrity, trust chain, and expiration status of TLS certificates across network endpoints to prevent connection failures. SSL/TLS certificates are required for encrypted data transmission and server authentication. If a certificate is expired or fails validation (hostname, trust chain, issuer, etc.), properly configured clients will terminate the connection.
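
The expiration part of such a check can be sketched with Python's standard library. This is a minimal sketch under stated assumptions: the function names, the 14-day warning threshold, and the sample dates are hypothetical, and the `not_after` string is assumed to be in the format that `ssl.SSLSocket.getpeercert()` reports under the `notAfter` key. Hostname and trust-chain validation are not covered here.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Return whole days between `now` and the certificate's notAfter time."""
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).days


def expiry_status(not_after: str, now: datetime, warn_days: int = 14) -> str:
    """Classify a certificate as expired, expiring_soon, or ok."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring_soon"
    return "ok"


# Hypothetical check date and certificate expiry, nine days apart.
now = datetime(2026, 3, 1, tzinfo=timezone.utc)
status = expiry_status("Mar 10 00:00:00 2026 GMT", now)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.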

From signals to savings: Optimizing cloud costs with Grafana Assistant and MCP servers

In today's cloud-native environments, managing resource waste and optimizing costs can feel like a constant battle. Operators, along with their fearless FinOps teams, spend countless hours hunting down unused resources, deciphering complex telemetry data, and manually implementing code or configuration changes to try to reduce cloud costs. But what if you could automate the entire process, from identifying waste to implementing the fix, all based on actual production telemetry?

Why Miami Businesses Need IT Support That Sees Problems Coming

In Miami, downtime rarely stays small for long. A dropped connection in Brickell can stall a sales call. A failed backup in Coral Gables can turn into a compliance headache. A slow server in Doral can drag down an entire team before anyone even realizes what is happening. That is why more companies are moving away from reactive, break-fix support and looking for Miami-based IT services with proactive monitoring.

Adapting Your Mobile Device Management for Evolving Cyber Threats

You can reduce this risk with multifactor authentication, where users confirm their identity through a second step, such as a mobile notification or biometric verification. Even if credentials are compromised, attackers cannot easily gain access to your systems.

Observability vs Monitoring: Why the Difference Still Matters in Complex Systems

In modern infrastructure, the words observability and monitoring are often used as if they mean the same thing. That shortcut sounds harmless, but it creates real confusion inside technical teams and business discussions. The two ideas are connected, yet they solve different problems. In simple systems, the gap may feel small. In complex systems, the gap becomes impossible to ignore because the cost of misunderstanding it usually appears during failure, not during routine operation.

Improved Azure status integration

Monitoring Azure health across large environments should not require complicated setup. Until recently, connecting Azure to StatusGator required configuring access at the subscription level, which could become difficult for organizations managing dozens or even hundreds of subscriptions. We redesigned the Azure integration to make it simpler, more scalable, and easier to manage.

Apple Developer outage on March 10th

On March 10, 2026, developers around the world began experiencing issues with Apple Developer services that prevented apps from being verified or launched on physical devices. For many teams building and testing iPhone apps, the outage disrupted development workflows and blocked deployment to test devices. The issue appeared to involve Apple’s developer certificate verification systems.

Evaluating Observability Tools for the AI Era

Every observability vendor has an AI story right now. Most have an MCP. Many have a chatbot. All have a demo where the AI finds the root cause of an incident in thirty seconds and everyone in the room nods. In the context of a public demo, these tools look almost identical. Ask the AI a question, the tool returns an answer, and the engineer fixes the bug. Impressive. But if you buy based on the demo, you may end up with an AI layer that looks great on a call and disappoints in production.

Bindplane Community Call in March 2026

Tune in for the Bindplane Community Call in March to learn more about SSO going GA, a wave of new updates, connectors, sources, and destinations, including a VictoriaMetrics partner integration — and a preview of what we're building next. We'll also share details on meeting the Bindplane team at KubeCon + CloudNativeCon Europe in Amsterdam. As always, hands-on demos and a live Q&A at the end.

Why Generic AI Fails in Ops: What Trustworthy Actually Requires

Enterprise operations reached a point where complexity outpaced human interpretation and outgrew the capabilities of generic AI. As environments became more distributed and interdependent, every incident, anomaly, and degradation produced ripple effects across systems that require context, lineage, and reasoning. Yet most AI models were not built for this reality. They were trained for general knowledge tasks, not the deeply connected operational truths that define enterprise performance.

Mastering the Diagnostic pivot from Health Policy to Pod

In the world of modern microservices, scale is a necessary challenge. Enterprise service inventories start modestly with a handful of components, only to balloon to hundreds over time. Traditional monitoring approaches cannot support that weight. The more organizations build, the more work they create, often only to keep systems running.

Syncing LDAP Users & Groups with the Icinga Notifications Web API

If you’re running Icinga in a mid-to-large organization, chances are your users and teams are already defined in LDAP or Active Directory. Manually re-creating contacts and contact groups in Icinga Notifications Web is tedious and error-prone, but thankfully, it doesn’t have to be that way. The Icinga Notifications Web REST API gives you everything you need to automate this synchronization. In this post, we’ll walk through how to build a reliable LDAP-to-Icinga sync using the v1 API.

How to Reduce MTTR with AI-Powered Runtime Diagnosis

Reducing Mean Time to Resolution (MTTR) in production systems requires understanding failure behavior in real time. While AI code agents significantly accelerated software development and deployment, incident resolution has remained constrained by incomplete pre-captured telemetry. AI SRE tools improve signal correlation, but MTTR reduction requires runtime-verified diagnosis that confirms execution behavior directly in production systems.

How to Solve "Cannot Reproduce" Bugs That Cost Support Teams Hours

Support teams frequently face vague customer reports and incomplete data but need to offer fast resolutions autonomously without escalating to developers. In this article, learn how to equip support engineers with tools to diagnose root causes in minutes, increasing self-sufficient issue resolution. We explore eliminating the ‘Reproduction Tax’ for ‘cannot reproduce’ bugs using runtime context to achieve technical certainty at scale.

6 Key Roles Every DEX Team Needs

Digital employee experience doesn’t fail because of technology. It fails because of operating models. Many digital workplace leaders invest in visibility tools, dashboards, automation capabilities, and sentiment platforms. And yet, months later, they’re still stuck in reactive mode. Tickets are down slightly. Reporting is better. But the organization hasn’t fundamentally shifted.

Native OpenTelemetry inside Alloy: Now you can get the best of both worlds

We're big proponents of OpenTelemetry, which has quickly become a new unified standard for delivering metrics, logs, traces, and even profiles. It's an essential component of Alloy, our popular telemetry agent, but we're also aware that some users would prefer to have a more "vanilla" OpenTelemetry experience.

Monitoring Your Node.js App Health on Fly.io

The Node.js service has just been containerized and deployed with a single fly deploy command across continents. Everything seems to be alright, but then a week later, a user messages you saying the app is slow. You run the fly logs command and scroll through some logs, and find nothing out of the ordinary. The Fly.io dashboard says the app is running and healthy, but something behind the scenes is slowing down the app, and you have no idea what. You don’t even know where to start.

Mastering Root Cause Analysis with Monitoring & Traffic Insights

IT teams today face increasing pressure to resolve issues quickly, but hybrid environments, rising complexity, and endless alerts often slow everything down. In this expert‑led 30‑minute webinar, you’ll see how combining Progress WhatsUp Gold infrastructure monitoring with deep traffic analysis delivers the visibility needed to diagnose problems faster and significantly reduce time‑to‑resolution.

Let's Encrypt 45-Day Certificate Expiration: Monitoring & More

TLS certificate lifetimes are shrinking fast — and that changes how every organization handles renewals, validation, and outage prevention. Let’s Encrypt has confirmed it will move from 90-day certificates to 45-day certificates (with staged rollouts) and dramatically shorten authorization reuse windows. At the same time, the CA/Browser Forum’s Ballot SC-081v3 has adopted a broader industry schedule that ultimately caps public TLS certificates at 47 days by March 15, 2029.

Claude Agent SDK Monitoring & Observability with OpenTelemetry and SigNoz

Learn how to implement monitoring and observability for the Claude Agent SDK using OpenTelemetry and SigNoz. In this video, we walk through instrumenting your Claude-based agents, capturing traces, metrics, and logs, and visualizing everything in SigNoz for real-time insights. You’ll learn how to debug agent behavior, identify latency bottlenecks, and monitor performance in production environments.

Multi-Language Status Page Widgets: Customize Widget Messages in Any Language

If your product serves users in multiple regions, your status page widget shouldn't be stuck in English. A customer in São Paulo seeing "All Systems Operational" when they expect "Todos os Sistemas Operacionais" is a small friction, but small frictions compound. It signals that their language isn't a priority, and it adds cognitive load during the exact moment they're checking whether something is broken. Until now, IsDown widgets shipped with hardcoded English messages. That's changed.

Claude outage analysis: What happened on March 11

On March 11, 2026, users around the world began reporting problems with Claude, including login failures, API errors, and stalled responses. While the disruption did not affect every user, reports quickly showed that the issue was widespread. StatusGator began receiving outage reports at 13:56 UTC. Using its Early Warning Signals system, StatusGator detected the growing incident at 14:22 UTC. The provider officially acknowledged the outage later at 14:44 UTC.

Understanding Karpenter architecture for Kubernetes autoscaling

Karpenter is a fast, flexible Kubernetes autoscaler designed to improve cluster performance and cost efficiency. When the cluster doesn’t have capacity to schedule a pod, Karpenter requests additional compute from the cloud provider, specifying a right-sized instance that matches the preferences you’ve set (for example, instance family).

Key metrics for monitoring Karpenter

In Part 1 of this series, we explored how Karpenter’s architecture enables just-in-time provisioning and active node consolidation. Because Karpenter is constantly making infrastructure decisions based on real-time scheduling pressure, its metrics can give you early warning of provisioning slowdowns, cloud API throttling, and misconfigurations that prevent it from scaling the way you expect.

Tools for collecting metrics and logs from Karpenter

In the first two parts of this series, we explored how Karpenter’s architecture enables just-in-time provisioning and active node consolidation, and we identified the key Karpenter metrics you should track to keep your cluster performant and cost-efficient. In this post, we’ll look at vendor-agnostic tools you can use to capture these signals.

Monitor Karpenter with Datadog

In this series, we’ve explored Karpenter’s architecture, the key metrics that reflect its health and performance, and the vendor-agnostic tools for collecting and analyzing its telemetry data. In this final post, we’ll show you how Datadog helps you monitor and alert on Karpenter alongside your Kubernetes cluster and the infrastructure that runs it.

What your product data is actually saying

As tools such as AI agents become more integrated with the instrumentation, governance, and centralization of product analytics data, product managers (PMs) still own the meaning of those events and the connected outcomes. Knowing when to trust the data, forming strong hypotheses, and being able to act on the insights requires an expert in the loop.

Why DevOps and SRE Teams are replacing 3-4 monitoring tools with Atatus?

Your on-call engineer gets paged. A critical service is down. Error rates are spiking. They open Sentry for errors. Flip to Grafana for metrics. Pivot to Kibana to search logs. Then jump to Lumigo, but that only covers the Lambda functions, not the Node.js backend throwing the actual errors. Three tabs become five. Five become eight. Half the incident is gone and your team is still piecing together what happened instead of fixing it. Sound familiar?

Log Correlation for Security and Performance Monitoring

International travel comes with amazing sights, cultural experiences, and local delicacies. However, most travelers know that it also comes with differing economies and currencies that affect the value of their money. When people need cash, they have to convert the money in their wallets to the local currency, which means different coins and bills. Depending on the exchange rate, the currency’s value can change as the person moves from one country to another.

The future of Search is here: Faster, simpler, AI-driven

Do more with less. That’s the mandate we’re all hearing. AI has fundamentally changed how we work. Modern AI workloads generate 10-100x more queries than humans ever could, pushing legacy architectures past performance limits. And the audacity of it all? Legacy logging vendors continue to raise costs without delivering meaningful innovation. IT and security teams are still forced to choose between speed and retention. Investigations are still slow. Data onboarding is still painful.

Observability Where You Work: Introducing the Honeycomb Slackbot in Beta

Engineers are constantly context switching between tools, adding cognitive overhead on top of already complex work. You're deep in an investigation, you need to analyze some data, pull up a runbook somewhere else, and share findings back in Slack. Context gets lost in the shuffle, correlating across data sources becomes painful, and everything just takes longer. In high-pressure situations like incidents, that friction has a real cost to the business.

Honeycomb Metrics Is Now Generally Available

It’s Black Friday. Checkout latency is spiking. Your on-call engineer pulls up the dashboard and starts working through the list. Is it a regional issue? No, all regions look fine. A payment provider? Stripe, PayPal, Apple Pay all nominal. A bad deployment? Nothing shipped in the last six hours. All your infrastructure dashboards are showing green. But customers are complaining. Checkout is slow, carts are being abandoned and revenue is draining away.

Update Management, Content Hub Expansion, and KQL Support

The latest VirtualMetric DataStream release introduces several important capabilities across platform security, data management, and operational workflows. This update strengthens access protection, simplifies infrastructure management, and expands the ways security teams can work with live telemetry. It also extends platform connectivity and improves the user experience across many areas of the interface. Let’s take a closer look.

DNS Monitoring

You can now monitor DNS records directly from Hyperping. DNS issues are often invisible until your users start complaining. With DNS monitoring, Hyperping checks that your records resolve correctly from multiple locations and alerts you the moment something goes wrong. Head to your monitors dashboard to create a DNS monitor. You can also manage DNS monitors via the API. Questions? Reach out via in-app chat or email us at hello@hyperping.io.
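As a rough illustration of what a single DNS check involves, here is a minimal Python sketch that resolves a hostname and compares the result with the records you expect. It is not Hyperping's implementation or API: the function names `resolve_a_records` and `check_dns` are hypothetical, the check runs from one vantage point only (a hosted service would repeat it from several locations), and it uses the system resolver from the standard library.

```python
import socket
from typing import Callable, Iterable


def resolve_a_records(hostname: str) -> set[str]:
    """Resolve IPv4 addresses for hostname using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def check_dns(
    hostname: str,
    expected: Iterable[str],
    resolver: Callable[[str], set[str]] = resolve_a_records,
) -> dict:
    """Compare resolved addresses with the expected set (hypothetical helper)."""
    expected_set = set(expected)
    try:
        actual = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {
            "ok": False,
            "missing": sorted(expected_set),
            "unexpected": [],
            "error": str(exc),
        }
    return {
        "ok": actual == expected_set,
        "missing": sorted(expected_set - actual),
        "unexpected": sorted(actual - expected_set),
        "error": None,
    }
```

The `resolver` parameter is injectable so the comparison logic can be exercised without network access. An alerting loop would call `check_dns` on a schedule and notify someone whenever `ok` is false.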

Why Your NOC Will Ignore AI

Imagine you are driving to work and a yellow check engine light flickers on your dashboard. The car feels fine. It accelerates normally, there is no strange noise, and the temperature gauge is steady. What do you do? If you are like most people, you keep driving. You might make a mental note to look at it later, but you don't pull over on the highway and call a tow truck.

Shadow AI and the Coming Workplace Reckoning (w/ Kay Firth-Butterfield)

In this episode of The DEX Show, we’re joined by Kay Firth-Butterfield, the world’s first Chief AI Ethics Officer and former Head of AI at the World Economic Forum. From human rights law and human trafficking to Davos and large language models, Kay traces her remarkable journey into AI governance. We explore shadow AI, workplace “hallucinations,” AI companions, and the hidden risks leaders are underestimating. Kay shares why organizations need cross-functional AI governance, stronger guardrails, and far better training — and why the future of work may depend as much on the humanities as technology.

Buy vs Build in the Age of AI (Part 2)

In Part 1, we explored how AI has dramatically reduced the cost of building monitoring tooling. That much is clear. You can scaffold uptime checks quickly, generate alert logic in minutes, and set up dashboards faster than most teams used to schedule the kickoff meeting. So the barriers to entry have fallen. But there’s a quieter question that rarely gets asked in the excitement of building. Have you ever calculated what it would actually cost to replace your monitoring provider?

Unleashing Resilience: Why the Agentic Era Demands a Unified Data Fabric

Imagine starting your day with a dozen disconnected apps where your calendar does not sync with your reminders, your maps do not know your appointments, and your contacts are not linked to your messages. You would constantly be scrambling, missing key details, and reacting late to what matters most. In our personal lives, we depend on tight integration to keep pace with the world. In business, the stakes are even higher.

Rising Demand for Elderly Care: Why Skilled Workers are in High Demand

People are living longer lives, a trend that brings both joy and new logistical challenges. Families now face difficult decisions about how to support aging loved ones. A growing need for professional assistance is reshaping the job market and household budgets. Finding the right balance between medical needs and personal comfort is a major goal for millions.

Infrastructure Under Scrutiny: Turning Visibility into Cost Control

A practical discussion with infrastructure leaders on how visibility is shaping cost control, renewal planning, and financial accountability across hybrid environments. Runtime: 41:32. The conversation around infrastructure has shifted. IT teams are no longer measured only on uptime or performance.

The hidden reason your reports don't match

There is a quiet moment that sometimes happens right before a meeting begins. The slides are ready. Dashboards are open. The numbers look neat on the screen. But the revenue doesn’t match last week’s number. A trend line suddenly looks different. Someone says, "That’s strange." And the conversation shifts. Instead of talking about strategy or growth, the room starts trying to figure out what happened to the data. Moments like this rarely happen because someone made a mistake.

Technology in the Workplace Statistics for 2026

Workplace tech has officially entered high gear. AI is embedding itself into everyday operations, and the modern workplace is more distributed and demanding than ever. For network and IT teams, the upside is significant—but only with the visibility and control needed to keep everything running smoothly. Here are 20+ technology in the workplace statistics shaping 2026 that can give IT and network teams a glimpse into where we’re headed.

The best observability platforms for developers

At some point, logs stop being enough. As applications grow more distributed, understanding what's actually happening in production becomes harder. That's what observability platforms are built for. The hard part is figuring out which one is actually right for your application — and your budget. This guide covers some popular options: what they do well, where they fall short, and who they're for.

Unlocking the Power of SolarWinds Through Training - SolarWinds TechPod 107

In this SolarWinds TechPod episode, hosts Chrystal Taylor and Sean Sebring talk with Cheryl Nomanson, a SolarWinds Academy trainer with 14 years at the company. They discuss the importance of technical education for complex software and networks, exploring SolarWinds' comprehensive training offerings including the SolarWinds Academy with its on-demand courses, instructor-led virtual classes, and office hours format. Cheryl explains the SolarWinds Certified Professional (SCP) certification program and the newer SolarWinds Certified Instructor (SCI) program for training partners globally.

Olly for SREs: 3 ways I actually use it in production

There’s a moment after an alert where you’re not fixing anything yet. You’re trying to answer a much simpler question: Is it actually down? Sometimes it’s obvious. Sometimes it’s 20 alerts at once with no clear starting point. Sometimes it’s a small upstream degradation that might cascade. Sometimes it’s just a spike that resolves on its own. That first phase is orientation. Is the signal real or transient? Is it isolated or spreading? Root cause or symptom?

Expanding Uptime Monitoring Down The Stack: ICMP Monitors Are Now Available In Checkly

When we started building Checkly's uptime monitoring suite, the goal was to give engineering teams complete visibility across every layer of their stack, from application down to network, in one place. URL, TCP, DNS, and Heartbeat monitors covered a lot of that ground. But one fundamental piece was missing: the ability to simply ping a host and know if it's reachable.

When Your Plant Talks Back: Conversational AI with InfluxDB 3

No one wants to stare at a plant and guess if it needs water. It’s much easier if the plant can say, “I’m thirsty.” A few years ago, we built Plant Buddy using InfluxDB Cloud 2.0. The linked article is still a great guide for cloud-first IoT prototyping as it shows how quickly you can connect devices, store time series data, and build dashboards in the cloud with the previous version of InfluxDB. But this time, the goal was different.

Bring Clarity and Confidence Back to Ops: How Trustworthy Guidance Sets a New Standard

For years, enterprises have chased the promise of artificial intelligence as a remedy for growing operational complexity. It seemed logical that if environments were expanding faster than teams could keep up, smarter models could fill the gap. But early deployments of generic AI proved a difficult truth. Intelligence alone does not create operational clarity. It does not guarantee safety.

Release software with confidence using Datadog Feature Flags

In this technical product demo, see how Datadog Feature Flags helps teams release software with confidence by connecting every feature flag to real-time observability data. Configure progressive, multi-step rollouts with automated guardrails tied to APM, RUM, and Product Analytics so you can pause or roll back instantly if latency, errors, or key business metrics degrade.

The architecture advantage: Why the data layer decides the AI race

Dozens of startups are sprinting to build the next “agentic SIEM” that can autonomously detect, investigate, and respond to threats. They’re well-funded, well-marketed, but structurally hollow. Here’s what it usually looks like: an LLM layer on top of a thin orchestration engine on top of fragmented or customer-hosted data lakes. While it looks impressive in a demo, it quickly falls apart in production. Why? It’s not built on a strong foundation.

Root Cause Analysis in Software Testing: Methods, Techniques, and How AI Is Changing the Game

If you've ever fixed a bug only to watch it come back two weeks later, you already understand why root cause analysis matters. Patching symptoms feels productive - it's not. Getting to the actual cause is what prevents the same issue from eating your team's time over and over again. This guide covers everything you need to know about root cause analysis (RCA) in software testing: what it is, how to do it, which tools help, and where AI is taking it next.

What's New at Cribl 4.17: On release days, we wear teal.

In this episode, Leon runs through all the updates in Cribl release 2603, which includes a massive update to Cribl Search, the ability to detect PII and secrets in the background as part of Cribl Guard, and two cool enhancements to Cribl Packs - monitoring and enhanced routing. Try Cribl Now! Sandboxes let you get hands-on experience with Cribl without the fuss or friction.

What is Cribl Guard background detection?

Security and compliance teams need to know exactly what sensitive data is flowing through their environments and where it’s going. Because surprise PII is no one’s favorite kind of surprise. Meanwhile, upstream teams are shipping new apps, changing schemas, adding fields, and generally moving fast. However, you can only manage and protect the data you currently know of and expect. But sensitive data has a habit of showing up where no one expected it…

Meet the new Cribl Search: Faster investigations with AI

Get a quick look at the new Cribl Search experience, built to help teams investigate faster, onboard data easily, and get answers from their logs without complex query languages. In this quick overview, we show how Cribl Search helps you move from raw data to insights in minutes. The result? Faster investigations, simpler workflows, and powerful AI-assisted analysis across your telemetry. Learn how the new Cribl Search makes exploring and analyzing data easier for everyone, from experienced analysts to teams just getting started.

What is AI really going to bring to the table when it comes to migration?

Explore the real capabilities and limitations of AI in system and SIEM migrations. Learn where AI accelerates processes and where human review remains essential. About Elastic: Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform, the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Navigating Machine Data at Infinite Scale: Why the Modern Enterprise Demands a New Data Architecture

In the modern enterprise, data is no longer just a byproduct of business; it is the lifeblood. However, we have moved beyond the era of simple transactional data. We are now living in the age of machine data.

Why status pages suck

Cloud status pages were supposed to bring transparency to outages. Instead, they’ve become one of the most frustrating parts of incident response. When a cloud service fails, status pages are often slow to update, incomplete, or missing information. Crowdsourced platforms are noisy and misleading.

Improved SSO setup and logging

We’ve made several improvements to Single Sign-On (SSO) in StatusGator to make authentication easier to configure and easier to monitor. As a reminder, the StatusGator dashboard includes SAML-based SSO on all plan tiers, even our free plan. This update introduces a simplified SSO setup flow along with a new Audit logs tab that provides visibility into authentication activity.

SharePoint Online outage on March 6, 2026

On March 6, 2026, SharePoint Online experienced a disruption that prevented some users from loading sites, accessing files, or authenticating successfully. The incident did not affect every user, but reports came in from multiple regions including North America and Europe. StatusGator detected the problem early through user outage reports and triggered an Early Warning Signal before Microsoft officially acknowledged the issue.

Create a Custom Service Health Board With the Honeycomb MCP

Your software is sending data to Honeycomb. Now where is the dashboard you want? The best dashboard is one created just for your application, or your service, or your team. You can get that in minutes with the Honeycomb MCP. Open your coding agent in your IDE, or on the command line in your code repository. Configure the Honeycomb MCP and authenticate with Read and Write permissions. Now tell it what you want. You can be high-level: Make me a service health board for the frontend service.

Why You Should Automate Network Troubleshooting

It's 2 AM. The network is down. Where do you start? You get the call. Users can't connect. VoIP is choppy. Something is broken somewhere between your office and the cloud. You open your monitoring dashboard and it says something is wrong, but not where. Not why. Not since when. So you do what IT teams have done for decades. You open a terminal, run a traceroute, SSH into the router, pull up SNMP, check the firewall logs.

Top 5 Web Applications for Reverse Phone Lookup & Contact Verification

In today’s world, even a missed call can cause concern. Was it important? Who was trying to reach you? Or was it just another spammer? The question “Who called me from this phone number?” has become extremely relevant, not only for ordinary users but also for small businesses, security services, and companies that strive to maintain accurate contact databases. In response to this increase in unwanted calls and fraud, specialized web applications have emerged.

How to Design Competitor Monitoring Reports That Drive Strategic Decisions

Competitor monitoring reports often end up as data graveyards, filled with information nobody acts on. The difference between reports that gather dust and reports that drive decisions comes down to design choices made before the first data point gets collected. To get the most from your competitor monitoring, building a comprehensive and actionable report is key.

Approaching your observability migration with the right mindset

This guest blog post is authored by Nick Vecellio, Principal Engineer and Co-founder of NoBS, a Premier Datadog Partner specializing in hands-on Datadog migrations and optimizations. At NoBS, we help enterprises migrate their observability stack to Datadog. Teams often come to us after a migration has technically “worked,” but the new setup requires optimization tweaks to provide the clarity, reliability, or operational benefits they’re looking for.

Four ways engineering teams use the Datadog MCP Server to power AI agents

Since the Datadog Model Context Protocol (MCP) Server first launched in Preview, Datadog has experienced an overwhelming amount of interest and feedback from customers. We appreciate those who requested access to test our product, provided feedback, and shared their stories of how the MCP Server helped them overcome engineering challenges.

Apono integration for Grafana: Enabling Just-in-Time access for data sources

Ben Avner is the Head of Ecosystem and Strategic Alliances at Apono, where he leads the company’s global partner strategy and technology alliances. He focuses on building and scaling strategic partnerships that drive product innovation, partner-influenced pipeline, and long-term growth. A former founder and engineer, Ben brings a strong technical foundation and a builder’s mindset, combined with experience across marketing, product partnerships, and go-to-market strategy.

AI Systems Status Report - February 2026

This report covers the operational status of major AI systems during February 2026, including Anthropic, Cohere, DeepSeek, Google Gemini, Groq Cloud, OpenAI, Perplexity, Replicate, and xAI. The data includes official incidents reported on vendor status pages and unconfirmed incidents detected through IsDown's monitoring systems.

New API: Submit outage reports

We’ve added a new endpoint to the StatusGator API that allows you to submit outage reports for monitors on your board. With the new Outage Reports API, you can programmatically report issues you’re experiencing with a service. These reports help StatusGator detect outages faster and improve visibility for other users who rely on the same services.

Episode 6 - The evolution from automation to autonomy

Tom and Akhilesh unpack why automation alone will never deliver autonomy, and why intelligence means anticipating change rather than constantly reacting to it. They explore the role of people in enterprise transformation, the limits of technology without trust and context, and why the most powerful use of AI is freeing humans to focus on what they do best. Plus, Akhilesh makes the case for ping pong as a surprisingly effective way to reset when the pressure is on.

Accelerate Vulnerability Remediation with Atatus: From Detection to Secure Deployment

In microservices and cloud-native environments, vulnerabilities buried in transitive dependencies or runtime behaviors can go undetected for weeks. During that time, your attack surface keeps expanding and production systems remain exposed. The longer remediation is delayed, the greater the risk of exploitation, compliance failures, and operational disruption.

Best Rails APM Tools in 2026: A Developer's Guide

Rails applications have a specific set of performance challenges that make monitoring genuinely useful rather than just box-checking. ActiveRecord is convenient to use and also convenient to accidentally write N+1 queries with. Memory bloat in long-running processes, particularly when Sidekiq or Action Cable is involved, is a recurring production problem for a lot of teams. Background job performance tends to degrade quietly until it becomes noticeable.

How Autonomous Are Your IT Operations, Really?

This post introduces a six-level maturity model that defines what true autonomy looks like in IT operations, from basic AI chat interfaces to fully coordinated agent ecosystems. ITOps teams have more automation tooling than ever, and yet incident response still depends heavily on human judgment to hold it together. Alerts fire, engineers dig through dashboards, context gets assembled by hand, and someone at the end of the workflow makes the final call.

What is Agentic Observability?

Agentic observability is the instrumentation and correlation needed to explain and control agent behavior across multi-step workflows. Legacy observability focuses on runtime health and service behavior. You monitor metrics like CPU usage, memory, latency, and error rates to confirm that applications and infrastructure are functioning as expected. When a workflow degrades, the proximate cause is often a crash, timeout, permission error, or resource constraint.

Datadog Incident Response: One platform from alert to resolution

When incidents strike, speed and clarity are critical. Datadog Incident Response brings the full incident lifecycle into one platform so teams can move from detection to resolution with confidence. Operate from a single, unified view of your systems, coordinate across the tools your teams already use, and leverage AI that analyzes incidents in real time to surface context, guide decisions, and accelerate resolution.

Observability for Azure Virtual Desktop with SquaredUp

Managing Azure Virtual Desktop doesn’t have to mean jumping between portal blades, logs, and metrics trying to piece together what’s happening. In this webinar, you’ll learn how to design and implement a single, operational observability dashboard for Azure Virtual Desktop (AVD) using SquaredUp Cloud — transforming fragmented telemetry into clear, actionable insight. Whether you're responsible for performance, user experience, or operational stability, this session will give you a structured, repeatable framework for monitoring your AVD estate with confidence.

Trends in Mainframe Modernization: Fresh Insights from SHARE Orlando

Fresh insights from SHARE Orlando reveal mainframe modernization isn't about replacement—it's evolution. From hybrid architectures to AI-driven automation, enterprises are transforming legacy systems into agile, integrated platforms while preserving core reliability.

Full-Stack Observability Is Becoming a Business Imperative

As enterprises accelerate digital transformation, technology performance has become inseparable from business performance. Customer experiences, revenue streams, and operational efficiency increasingly depend on the reliability of complex, distributed systems. In this environment, full-stack observability is no longer a technical aspiration — it is a strategic necessity.

7 Tech Tools to Help Monitor Your Loved One's Safety

Staying connected with aging family members is a top priority for many households. Technology now offers many ways to keep tabs on health and safety without being intrusive. Choosing the right tools can provide comfort to both the senior and their caregivers. These devices help bridge the gap between independence and necessary support - creating a safer home.
Sponsored Post

Build vs Buy Monitoring: The Real Cost Breakdown for IT Teams

Every IT team eventually faces this question: should we build our own monitoring system or buy an existing solution? On the surface, building seems attractive. You get complete control, no vendor lock-in, and the illusion of "free" since you're using internal resources. But the math rarely works out that way. Let's break down what it actually costs to build, when building genuinely makes sense, and how to make the right decision for your team.

From Reactive to Predictive: Preserving BESS Uptime at Scale

Battery Energy Storage Systems (BESS) operate as revenue-generating grid assets that capture surplus electricity, deploy power during demand spikes, and support frequency control. By shifting energy across time, they stabilize grid conditions, enable renewable integration, and execute market dispatch commitments. When systems respond as designed, stored capacity becomes a flexible, monetizable supply. But BESS performance depends on precision and availability.

What Is LLMjacking? The New AI Cybercrime Stealing Cloud AI Compute

LLMjacking is a new cybercrime where attackers steal access to cloud-hosted AI models and use them for free — while the victim pays the bill. In this video, we break down what LLMjacking is, how attackers exploit compromised credentials and exposed APIs, and why security teams should treat AI infrastructure as a high-value attack target. Discovered by the Sysdig Threat Research Team, LLMjacking is quickly becoming the AI-era equivalent of cryptojacking — except instead of mining cryptocurrency, attackers run expensive large language models (LLMs) at scale.

Meet the new Bits AI SRE: Deeper reasoning, twice as fast

When we announced Bits AI SRE at DASH 2025, we introduced an autonomous SRE agent that investigates alerts the moment they trigger. Bits AI SRE reads the same telemetry data as your team, understands your architecture, and follows your runbooks to identify likely root causes before you even open your laptop. It’s your AI teammate that’s always on call.

How AI lets you talk to your company's data and get answers instantly

In this conversation recorded at Elastic’s New York office, three product leaders discuss how AI agents are transforming enterprise software. The discussion features Steve Kearns (general manager, Search solutions at Elastic), Mike Nichols (general manager, Security solutions at Elastic), and Baha Azarmi (general manager, Observability at Elastic). They explain how Elastic Agent Builder allows teams to interact with their data using natural language instead of complex queries.

How LLMs can help boost productivity

Learn how large language models (LLMs) are transforming productivity in business, coding, research, and daily workflows. Discover practical ways to use AI tools to automate tasks and improve efficiency.

Your Questions About AI-Assisted Development Answered

We recently hosted a webinar on AI-assisted development with DORA, and the audience had a lot of questions—far more than we could get to in an hour. I picked out six that get at the stuff people are wrestling with day to day. These aren't the easy questions, and I don't think there are necessarily easy answers, but I've spent the past year building and shipping with AI coding tools and observing (literally) what happens when that code hits production. Here's what I have.

Routing OpenTelemetry logs to Sentry using OTLP

If you've already instrumented your app with OpenTelemetry, you don't have to rip it out to use Sentry. Two environment variables and your logs start flowing into Sentry, no SDK changes, no re-instrumentation. Here's how to set it up in a sample app, and when the native Sentry SDK might be the better call.

Best Python APM Tools in 2026: A Developer's Guide

Last updated: March 2026. Python applications built on Django, Flask, FastAPI, and other frameworks have the same monitoring needs as applications built in any other language: you want to know which endpoints are slow, why the database is getting hammered, what errors are firing in production, and ideally all of that in a form that does not require three separate tools to reconstruct a single incident.

How Imperva Gets Traffic Answers in Seconds with Kentik

Imperva Network Architect, Wallace Lee, shares how Kentik helps teams drill deeper than traditional reporting tools to improve network and customer experience. Wallace shares how, during a live architecture review, Imperva’s Kentik power users answered a critical “are we safe?” traffic question in seconds. Kentik enables engineers to instantly understand prefix-level bandwidth and shows exactly which ASN and ISP traffic came from. Wallace also highlights how Kentik makes Anycast traffic visibility an “easy win,” helping teams move from questions to confident decision-making fast.

Preventing SLA Breaches With Proactive Monitoring as MSPs Move Toward Autonomous IT

AI-first hybrid observability with proactive monitoring helps MSPs protect SLAs as they move toward autonomous IT by getting engineers the right alerts before issues impact service. Managed services lives and dies on timing. The difference between a minor issue and a customer-facing incident often comes down to how early an engineer gets the right signal and how quickly they can act on it. That timing shows up in SLAs, service credits, escalations, and the trust you earn when customers feel taken care of.

SquaredUp vs Grafana: The Enterprise IT dashboard showdown

Modern enterprises operate across an increasingly complex mix of hybrid cloud services and productivity platforms. As environments scale, stakeholders need a single pane of glass (SPoG) to understand what’s happening across IT operations without jumping across dozens of disconnected tools.

How Race Communications Automates DDoS Mitigation with Kentik

Sorin Esanu, Director of Network Engineering at Race Communications, explains why deep, always-on network intelligence is essential when you have massive volumes of traffic moving in and out from many sources. After outgrowing an on-prem tool that required ongoing maintenance and didn’t deliver the analytics they needed, Race chose Kentik for richer visibility, daily traffic optimization, and improved security.

Avoid the Swivel-Chair Tool Stack: Conway Corporation on Why Kentik Wins

Everett Sinclair, Network Administrator at Conway Corporation, explains why Kentik became their “one pane of glass” for cloud-based network visibility, rapid troubleshooting, and smarter peering and caching decisions. With Kentik’s SaaS network intelligence platform, Conway gets updates automatically, avoids server rebuilds, and can deploy cloud agents remotely to run simple metric tests close to customer locations.

Continuous Security Monitoring: The Practical Guide for Modern Ops Teams

If you've ever been on call during a "nothing changed... except everything" incident, you already understand the real problem with traditional security checks: they're snapshots. And snapshots are useless the moment your infrastructure shifts, a new SaaS tool gets approved, a developer spins up a service in a different region, or a vendor quietly exposes an admin portal to the internet. Modern environments don't stay still. So security can't, either.

AWS Middle East data center strikes: 92 SaaS platforms report disruptions

StatusGator analysis identifies 92 cloud services that publicly acknowledged disruptions tied to the AWS Middle East incident. Over the weekend, Amazon confirmed that drone strikes damaged AWS facilities in the Middle East, disrupting cloud infrastructure across the region. The strikes affected AWS regions in the United Arab Emirates and Bahrain, causing outages and degraded performance across core cloud services including compute, storage, and databases.

Buy vs Build in the Age of AI (Part 1)

A few months ago, I spoke to an engineering manager who proudly told me they had rebuilt their monitoring stack over a long weekend. They’d used AI to scaffold synthetic checks. They’d generated alert logic with dynamic thresholds. They’d then wired everything into Slack and PagerDuty, and built a clean internal dashboard. “It used to take us weeks to prototype something like this,” they said. “Now it’s basically instant.” They weren’t wrong.

Introducing Rocky AI to General Availability

After months of being available in Beta for our app users, Rocky AI is now generally available to all users and plans. Rocky AI is Checkly’s AI agent that works around the clock, 24/7, to make sure your application’s reliability is optimal. In this first release, Rocky AI ships with the ability to run continual Analysis on test and check failures, giving your teams AI-powered root cause analysis, impact analysis, and more.

We Turned Our WireShark Wizard Into a Markdown File

Rocky AI, Checkly’s AI agent, is now Generally Available. We developed Rocky AI over the last ~6 to 8 months. This is an aeon in AI-years. During this period, we learned a ton. About AI, but mostly about how to fit it into an existing SaaS product, not just another chat widget. This is my ramble…

Public Sector Observability: Service Experience and Reliability Are Now Mission-Critical

Reliable digital services aren’t optional for public sector agencies. They’re essential to mission success. Across the U.S. public sector, service experience and reliability have moved from operational concerns to mission requirements. At a federal level, Executive Order 14058 makes improving service delivery and customer experience a federal priority, measured by real outcomes for the public. And for state and local governments, the bar is set by the private sector.

Use plain English to query your multi-cloud infrastructure in Resource Catalog

Modern cloud environments include thousands of resources across providers, teams, and accounts. Organizations need the ability to quickly locate the right resources so that they can manage resource compliance and troubleshoot issues. When engineers need to answer questions such as which databases are still on extended support or which storage buckets lack encryption, they often have to switch consoles, use provider-specific query languages, and know obscure version strings or configuration flags.

Generating metrics from traces with cardinality control: A closer look at HyperLogLog in Tempo

While tracing is a critical component of any observability strategy, metrics — especially RED metrics (request rate, error rate, and duration) — are widely considered the gold standard for monitoring service health. Tempo, the open source, easy-to-use, and highly scalable distributed tracing backend, is well known in the OSS community for storing and querying traces. It can also, however, generate RED metrics directly from those traces using the optional metrics-generator component.

7 Real Ways to Modernize NetOps with Kentik AI Advisor

Kentik’s AI Advisor acts as a virtual network engineer, helping teams of all skill levels troubleshoot, manage, and optimize their infrastructure with unprecedented speed and context. We explore seven practical NetOps use cases, from rapid incident triage and capacity planning to upcoming live-device command support, that demonstrate how using AI as a collaborative teammate dramatically reduces manual investigative work.

Skills vs. MCP: You're probably reaching for the wrong one

Everyone is adding Model Context Protocol (MCP) servers to everything right now. And I get it. MCP is clean. It’s standardized. You write a server, expose some tools, and suddenly your LLM can query your log platform, pull a dashboard, and fire an alert. It feels like the right abstraction. But I’ve watched teams at serious companies burn weeks building MCP integrations for workflows that should have been skills, and build skills for things that genuinely needed MCP.

The Spark Avengers Unite: Dispatches on the FUTURE of IT (w/ Matt, Moe & Denis)

Tom assembles the “Spark Avengers” for a deep dive into the most talked-about innovation in IT: Nexthink Spark, the personal AI agent for every employee. Joined by Moe Haidar, Denis Schertenleib and Matt Rose, the team unpacks how Spark evolved from early LLM experiments into an enterprise-ready, autonomous IT agent already delivering 70%+ first contact resolution. From printers and frozen cameras to complex root-cause analysis, Spark is transforming support from reactive to proactive.

How does AI enhance search?

Explore how artificial intelligence enhances search engines through semantic understanding, vector embeddings, and contextual retrieval. Learn how AI-powered search delivers faster and more accurate results. About Elastic: Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform, the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Centralizing Docker Logs for Observability and Security

Most people can remember the old game of telephone, the stream of whispered sentences or phrases across a group of kids. At each transmission, a different piece of information gets lost or misheard, leaving the last person with an incomplete or incomprehensible statement. Managing Docker logs can feel the same way, especially when an error message is lost or lacks context.
Sponsored Post

The art of software engineering management

Like any leadership role, leading an engineering team in a mature, compact company like Raygun comes with both honor and responsibility. Leading a major development project is a bit like conducting a symphony orchestra, where every individual plays a crucial role and has a great impact on the work they release to customers and end-users.

The Battle for Control: Introducing Avantra AIR

SAP operations teams are drowning. Every day is a battle against alert fatigue, complex root causes, and repetitive firefighting. And while vendor spin will tell you that moving to the cloud or adopting SAP RISE magically simplifies everything, the reality on the ground is entirely different. We call it the Hybrid Cloud Paradox: Different providers might own different parts of your critical business landscape, but you still own the business risk.

Did ChatGPT take down Claude?

On March 2, 2026, Claude experienced a widespread service disruption that affected users across North America, Europe, Asia, and Australia. The outage quickly drew significant media attention, with numerous technology news outlets reporting on user frustration and downtime. In the early hours of the incident, some commentators speculated that the disruption may have been caused by a sudden influx of new users migrating from OpenAI. However, there is no public evidence confirming that theory.

February 2026 product updates

February brought powerful new improvements to StatusGator – from better status page analytics and expanded API capabilities to smarter incident detection. We also published our latest Early Warning Signals report, highlighting major outages we detected before providers acknowledged them. Here’s everything that’s new.

What You Need to Know About Choosing a Data Center Location for SolarWinds Papertrail

When signing up for SolarWinds Papertrail, you’ll see an option to choose where your data is stored. What does this mean? What should you consider when choosing a data center location? In this blog, we’ll explore how you can determine where to store your data. First off, the region you choose is the physical location where your data is stored. Once you select a region, you can’t migrate data from it, so it’s important to choose carefully.

Simplifying troubleshooting across the user journey with Datadog Synthetic Monitoring

Every digital experience is a chain reaction. A customer logs in to an application, an API authenticates the request, a backend call retrieves data, a page loads, and somewhere along the way, something might break. When it does, teams often chase symptoms while the root cause remains hard to find. The more distributed the system, the more difficult it becomes to see how one small failure can cascade into a visible outage.

Announcing Automated Diagnostics: Reduce MTTR with Instant, Data-Driven Troubleshooting

Automated Diagnostics closes the gap between detection and diagnosis instantly. Every IT operations team knows the pressure. When an alert hits at 2 a.m., it’s a race against time to find the root cause before users feel the impact. But gathering diagnostic data such as logs, process stats, and thread dumps can eat up critical minutes. That manual lag is exactly what Automated Diagnostics eliminates.

How to create and manage secrets with Grafana Cloud Synthetic Monitoring

Observability isn’t just about collecting metrics and logs; it’s about proactively validating that your systems work as expected. Synthetic monitoring helps teams continuously test APIs, applications, and critical user journeys. But when those checks require the use of sensitive data, securely managing credentials becomes essential to maintain both reliability and security.

The Speed of Clarity: How Grounded Context Transforms Triage and Strengthens Operational Decision-Making

Modern operations move at a pace that leaves little room for ambiguity. When an incident emerges, teams must determine what is happening and how best to respond. Yet triage often slows under the weight of fragmented data, noisy alerts, and limited shared understanding across engineering groups. These conditions stretch routine issues into drawn-out investigations and delay action exactly when teams need to move with purpose.

Responsible transformation: Agentic AI for the public sector

The world is transforming, and artificial intelligence, especially agentic AI, is quickly becoming embedded across private and public sectors. For government agencies, law enforcement, and mission-critical organizations, embracing this new reality is uniquely challenging. On the one hand, agentic AI promises measurable improvements: modernized IT workflows, faster analysis, improved citizen services, and operational efficiency.

5 Essential Capabilities that Make Coralogix an Observability Powerhouse

Sometimes observability can feel like a second job. With many traditional tools, users must become experts in a proprietary language to ask a simple question. In these cases, developers or SREs can find themselves spending more time manually sifting through raw text, building complex data pipelines from scratch, and bouncing between fragmented dashboards than actually solving problems.

A Practical Guide to SCADA Security

Critical infrastructure is under siege. The systems that control our power grids, water treatment plants, and oil pipelines weren’t designed for a connected world. This post covers what security measures teams need to understand and how time series monitoring can help turn SCADA’s weaknesses into a security advantage.

Saved queries now support template variables | Grafana Cloud

In this video, Collin Fingar, Software Engineer at Grafana Labs, demonstrates how template variables can be used in saved queries, a feature that enables users to reuse queries they or others in their org have saved. You'll see how a query that contains variables can be reused, and how the variables can be replaced at the point of reuse.

Why Website Change Monitoring Matters for Modern Brand Management

A competitor quietly slashes their prices, and the sales team doesn’t find out until deals start falling through. A rogue plugin update changes the homepage headline to something nobody approved. These scenarios play out constantly for brands without visibility into what’s happening on their own websites and their competitors’ sites. Website change monitoring provides that visibility through automated tracking and alerts, turning potential blind spots into strategic advantages.

Enabling Proactive ITOps with Skylar Advisor

By continuously connecting signals across your IT environment, Skylar Advisor turns operational complexity into clear, prioritized guidance. It highlights potential impact, explains why it matters, and delivers clear next steps so IT teams can act early and stay ahead of alerts before they turn into issues.

When was the term artificial intelligence coined?

Discover when the term artificial intelligence was first introduced and how it shaped the future of AI research and machine learning. This video breaks down the origin of AI and its historical significance in modern technology.

Telemetry Talks ep.2 - How to use OpenTelemetry in VictoriaMetrics Cloud

Telemetry Talks – Episode 2 is here! In this episode, Diana and Jose introduce VictoriaMetrics Cloud, covering what it is, the problems it solves, and its pricing model, including how overages are handled. If you’re building or operating cloud-native systems and want a clearer, real-world understanding of OpenTelemetry and managed observability, this episode is for you.

What does investigation look like when data lives in multiple tools?

War rooms don’t fix fragmentation. They expose it. Incident hits. App checks traces. Infra checks hosts. Cloud checks dashboards. Network checks packets. Everyone sees their layer. No one sees the system. So we guess. Rollback. Add capacity. Freeze change. The noise stops. The constraint doesn’t. Modern failures don’t live in tools. They live in dependencies. If your platform can’t follow a transaction across hybrid and AI infrastructure — to the exact constraint — you don’t have observability.

Why Small Businesses Still Underestimate Endpoint Monitoring - And What MSPs Can Do About It

Small businesses tend to think of cybersecurity in terms of firewalls and antivirus software. If those two boxes are checked, the assumption is that the network is protected. But the threat landscape has shifted dramatically in the last few years, and endpoints - laptops, desktops, mobile devices, even printers - have become the primary attack surface. Most small businesses haven't adjusted their defenses accordingly.

Protecting sensitive PII data with effective log management

Organizations rely heavily on logs for tracking changes, troubleshooting issues, and addressing authentication attempts. Although these logs are essential for ensuring a smooth onboarding experience, they often contain users' personally identifiable information (PII), including names, email addresses, phone numbers, and sometimes location or device details. The following sample log illustrates this scenario: 2025-11-01 09:12:33 ACCOUNT_CREATED - New user registered: Name: Michael Scott, Email.

February 2026 Early Warning Signals

February 2026 saw another wave of impactful service disruptions across AI platforms, e-commerce infrastructure, developer tools, education providers, collaboration apps, and cloud services. Using StatusGator’s Early Warning Signals, we detected outages before providers publicly acknowledged them – and in several cases, providers never acknowledged them at all. Many services still lack transparent or timely status communication, leaving users with little visibility during critical incidents.

System Datasets: From Alert Fatigue to Optimized Notifications

Alert fatigue rarely begins as a single mistake. It grows as systems scale, teams grow, and “just in case” monitoring becomes the default. A few extra alerts, another threshold, and soon the on-call channel becomes overwhelmed. Engineers get interrupted for noise or stop trusting pages; either way, real signals get missed. Reliability drops, and productivity quietly declines. Most teams respond tactically: tune thresholds, change notifications, suppress noise.

Tech Talk | Application management with Targeted Application Install for Victoria Experience

Apps create endless opportunities to leverage the strengths of the Splunk Cloud platform. Until now, you could only install Splunk apps across every search head on a Splunk Cloud Platform Victoria Experience deployment. With TAI you now have fine-grained control over which search head groups will run which apps.

Grafana Alerting: faster rules, personalized filters, and an operations workspace

Alerts are only useful when you can quickly find and act on the right signal. That's why, over the past two years, we rebuilt Grafana Alerting’s UI to make it more reliable and efficient, especially at scale. The result: a faster, paginated alert rules page that handles tens of thousands of rules, with a powerful filter dropdown and saved searches so you can quickly get back to the views you care about most.

Why we open-sourced AURA: Infrastructure for production AI

Over the last year, I’ve talked to dozens of SRE teams about AI. The excitement is real, but conversations hit a wall when we get to production reality. How does an agent manage complex context without losing the plot? How does it avoid hallucinating relationships between signals? Who owns the orchestration logic that ties it all together? We realized the bottleneck wasn’t model intelligence. It was the lack of a reliable logic layer between the data and the model.