
AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. Located in Northern Virginia, us-east-1 (or just “US East,” as it is known) is the original AWS region. It has been the subject of some of the internet’s most high-profile and destructive outages and remains Amazon’s least reliable region.

Collective IQ Business: meet the artificial intelligence that transforms IT management

The employee digital experience (DEX) is no longer just a concept; it has become a concrete discipline supported by specialized tools. At the center of this transformation is Collective IQ, Almaden’s DEX solution, available in the Essential and Business editions. The Business edition includes AlmaAI, a family of generative AI capabilities that take IT management to a new level.

Operational Intelligence and the Hidden Structure in System Logs

Most IT teams do not suffer from a lack of data. They suffer from the amount of effort required to make sense of it. Every network device, application, cloud service, and infrastructure component generates a constant stream of machine output. Logs capture state changes, failures, retries, warnings, and thousands of other small signals about how systems behave. The problem is that raw logs are hard to use at operational speed.

Multi-tiered Observability: A Practical Way to Handle Diverse Workloads

Observability in large companies is rarely one-size-fits-all. The VictoriaMetrics topologies guide shows why different deployment patterns are needed as scale, isolation, and reliability requirements grow. Different workloads require different trade-offs: some need long retention for audits and trend analysis, while others need higher resolution for debugging. Business-critical systems also demand dependable alerting and high availability, often with several 9s of reliability.

A New Era of Linux Kernel Vulnerabilities

There have been TWO major kernel vulnerabilities announced this week. Copy-Fail (CVE-2026-31431) was announced on April 30th. Dirty Frag (CVE-2026-43284), also known as 'Copy Fail 2: Electric Boogaloo', was announced literally hours ago. Both have already been patched on Cycle, and our users can receive this update simply by restarting their nodes. The Linux patch was released less than two hours ago, and we're the first to get it to our customers.

Retroactive sampling reduces trace traffic and costs

In this short, our software engineer, Zhu Jiekun, explains how retroactive sampling can reduce trace traffic and ingestion costs by sending minimal data for sampling decisions and retrieving full spans only when needed, at the cost of added system complexity.

Monitor Unreal Engine Game Performance with Application Metrics

Your Unreal game can ship with zero errors and still not feel great. Stutters during combat, a frame-rate cliff on the big boss, rubber-banding in multiplayer: none of it shows up as a crash and none of it shows up in Sentry, leaving you without any visibility into what your players are actually experiencing in the wild. Well, until now. Unreal Engine already gives you plenty of tools to measure game performance and collect runtime stats, but all that data stays on the dev’s machine.

The Journey to Production AI: Five Steps for SRE and Platform Teams

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

How Modern Ops Lost Their Bearings

Modern operations carry a quiet contradiction. Organizations have never had more data, more dashboards, or more instrumentation, yet teams increasingly struggle to gain a reliable sense of what the environment is actually doing. The problem is not the absence of information. It is the absence of bearings. This drift did not happen suddenly. It accumulated across years of transformation.