Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Stop ECS Containers From Collapsing Into One Service in OpenTelemetry

Why ECS containers collapse under service.name = aws_ecs and how to fix it for both the EC2 launch type and Fargate, including the resource-vs-log-record pitfall that quietly breaks log filtering. Prathamesh works as an evangelist at Last9, runs SRE stories (where SRE and DevOps folks share their stories), and maintains o11y.wiki (a glossary of terms related to observability).

April 2026 Early Warning Signals

April saw widespread disruptions across SaaS platforms, developer tools, and cloud services, with login failures, pipeline issues, and general service outages among the most common problems. StatusGator’s Early Warning Signals consistently identified these incidents ahead of official provider updates. In several cases, the lead time was significant. Bitbucket pipeline failures were detected 1 hour 17 minutes before acknowledgment, while Claude performance issues surfaced 59 minutes early.

Telemetry Talks ep 4: Retroactive sampling and OpenTelemetry

This episode of Telemetry Talks explores the evolution of an OTLP/gRPC tracing pipeline for VictoriaTraces within OpenTelemetry and VictoriaMetrics, including a shift from standard gRPC-Go to a simplified HTTP/2-based implementation to reduce complexity and improve flexibility. Together with our guest, Jiekun, we revisited the VictoriaMetrics KubeCon talk ideas on tail-based and retroactive sampling, and their impact on the broader OpenTelemetry community.

When Dashboards Start Teaching the System: Why Selector's Natural Language Querying Matters

Operations teams have lived with the same frustrating tradeoff for years: the data exists, but getting to the right answer often takes too much time and too much expertise. Engineers are expected to know platform-specific query languages, navigate layers of dashboards, and understand exactly where the right visualization lives before they can even begin troubleshooting. That approach can work in smaller environments, but as infrastructure grows more distributed and complex, it becomes a bottleneck.

ActiveMQ Slow Consumer: Detection, Strategy & Prevention Guide

One of the most counterintuitive failure modes in enterprise ActiveMQ deployments is this: a single application team deploys a new consumer for a high-volume market data topic. Their consumer is slow: maybe they added a database write on every message, or their processing thread pool is undersized.

Add dynamically updating context to logs with Reference Tables and Observability Pipelines

Security and platform engineering teams rely on context-rich logs to investigate threats, prioritize incidents, and meet compliance requirements. Context is often stored separately from applications that generate logs, in sources like threat intelligence feeds in Snowflake, asset lists in Amazon S3, ownership data in ServiceNow CMDB, and risk scores produced in Databricks.

Notes from the Field: Keyboard mapping issues with IGEL Linux endpoints on Windows Server 2025 VDAs

New Windows Server versions often introduce subtle behavioral changes that only surface when interacting with different endpoint types. In mixed environments where both Windows and Linux-based endpoints are used, these differences can become more apparent. The following case highlights an issue encountered when using IGEL Linux thin clients against Windows Server 2025 VDAs, where keyboard input behaved differently compared to Windows endpoints.