The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Geo Maps: See Where Your Infrastructure Lives

When your infrastructure is spread across regions, data centers, branch offices, or edge locations, knowing where a node is physically located matters more than people usually admit. During an incident, “the node in the Singapore POP” communicates faster than a hostname. When you’re planning capacity, seeing geographic clustering tells you something that a flat list of nodes doesn’t.

Avantra 26: A Breath of Fresh Multi-Tenant AIR

There’s a crackle and spark in the air at Avantra lately, and I’m so pleased to be writing this bit on what we’ve accomplished with the Avantra 26 release. Automated root cause analysis, multi-tenant management support for Cloud ALM, enhanced security operations, and financial operations monitoring for BTP – it’s all there, and more. It’s an exciting and innovative release for Avantra!

AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. Located in Northern Virginia, us-east-1 (or just “US East,” as it is known) is the original AWS region. It has been the subject of some of the internet’s most high-profile and destructive outages and remains Amazon’s least reliable region.

Multi-tiered Observability: A Practical Way to Handle Diverse Workloads

Observability in large companies is rarely one-size-fits-all. The VictoriaMetrics topologies guide shows why different deployment patterns are needed as scale, isolation, and reliability requirements grow. Different workloads require different trade-offs: some need long retention for audits and trend analysis, while others need higher resolution for debugging. Business-critical systems also demand dependable alerting and high availability, often with several 9s of reliability.

Retroactive sampling reduces trace traffic and costs

In this short, our software engineer Zhu Jiekun explains how retroactive sampling can reduce trace traffic and ingestion costs by sending minimal data for sampling decisions and retrieving full spans only when needed, at the cost of added system complexity.
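The idea described in that short can be sketched roughly as follows. This is a minimal illustration only, not the implementation discussed in the video: the `Agent` class, the `decide` function, the summary fields, and the 500 ms threshold are all hypothetical names and values chosen for the example. The sketch assumes that full spans are buffered where they are produced, that only a small per-trace summary is sent for the sampling decision, and that full spans are fetched only for the traces that are kept.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


class Agent:
    """Hypothetical agent that buffers full spans locally."""

    def __init__(self):
        self._buffer = {}  # trace_id -> list of full spans

    def record(self, span):
        self._buffer.setdefault(span.trace_id, []).append(span)

    def summaries(self):
        # Only the minimal fields needed for a sampling decision leave the agent.
        return [
            {
                "trace_id": trace_id,
                "max_duration_ms": max(s.duration_ms for s in spans),
                "error": any(s.error for s in spans),
            }
            for trace_id, spans in self._buffer.items()
        ]

    def fetch(self, trace_id):
        # Full spans are handed over only on request, then dropped locally.
        return self._buffer.pop(trace_id, [])

    def discard(self, trace_id):
        # Traces that were not selected never leave the agent.
        self._buffer.pop(trace_id, None)


def decide(summaries, slow_ms=500.0):
    """Assumed policy: keep traces that contain an error or a slow span."""
    return [
        s["trace_id"]
        for s in summaries
        if s["error"] or s["max_duration_ms"] >= slow_ms
    ]


if __name__ == "__main__":
    agent = Agent()
    agent.record(Span("t1", "GET /", 12.0))
    agent.record(Span("t2", "GET /checkout", 900.0))
    agent.record(Span("t3", "db.query", 30.0, error=True))

    all_summaries = agent.summaries()
    kept = decide(all_summaries)
    full = {trace_id: agent.fetch(trace_id) for trace_id in kept}
    for s in all_summaries:
        if s["trace_id"] not in kept:
            agent.discard(s["trace_id"])
    print(kept, {k: len(v) for k, v in full.items()})
```

The added complexity mentioned in the blurb shows up even in this toy version: the agent has to hold spans until a decision arrives, and a second round trip is needed to retrieve the traces that are kept.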

Monitor Unreal Engine Game Performance with Application Metrics

Your Unreal game can ship with zero errors and still not feel great. Stutters during combat, a frame-rate cliff on the big boss, rubber-banding in multiplayer: none of it shows up as a crash, and none of it shows up in Sentry, leaving you without any visibility into what your players are actually experiencing in the wild. Well, until now. Unreal Engine already gives you plenty of tools to measure game performance and collect runtime stats, but all that data stays on the dev’s machine.

The Journey to Production AI: Five Steps for SRE and Platform Teams

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

How Modern Ops Lost Their Bearings

Modern operations carry a quiet contradiction. Organizations have never had more data, more dashboards, or more instrumentation, yet teams increasingly struggle to gain a reliable sense of what the environment is actually doing. The problem is not the absence of information. It is the absence of bearings. This drift did not happen suddenly. It accumulated across years of transformation.

A Runnable Reference Architecture for Battery Energy Storage Systems on InfluxDB 3

A battery is a complex electrochemical system where safety and revenue are decided in milliseconds. Cell temperatures, voltages, and state of charge change in real time; dispatch decisions and thermal alarms must fire in real time. Anything in between (your data pipeline, your historian, your alerting layer) has to disappear into the background.