
The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

Federated Search | From Silos to Insight | AWS S3 Schema Discovery with Splunk-Managed Tables

This walk-through shows how Splunk's crawler, available through the Data Management app, can discover the schema and partition keys of S3-backed datasets and create Splunk-managed catalog tables. Once the data is mapped, analysts can search AWS S3 data through Splunk and bring it into broader security, observability, and operational workflows.

Diagnose and resolve database performance issues faster with Database Investigator

When your database performance degrades, diagnosing the root cause is rarely quick or straightforward. Your existing tools might surface metrics like CPU utilization, wait events, and query duration, but then leave you to correlate the data and identify what went wrong. Worse, what first appears to be the root cause often turns out to be a downstream effect of multiple interrelated issues.

Zero-Code OpenTelemetry for Vert.x

Drop a JAR on the JVM. Get distributed tracing, RxJava context propagation, log-trace correlation, and Vert.x internal metrics. No code changes. No Maven dependency. Java 8–21. Inside the design of last9/vertx-opentelemetry v2.3.4. Prathamesh works as an evangelist at Last9, runs SRE Stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of all terms related to observability.

From noise to knowledge: How GenAI is revolutionizing log management and analytics

Focusing on GenAI and logs for IT efficiency

Efficiency is everything when managing today’s digital systems. Constantly transforming technology and expanding operations are driving an explosion in data, and data ingest and storage costs have soared as a result. But storage costs aren’t the only thing holding teams back. The challenge of managing all that observability data forces IT teams to choose between efficiency and the bottom line.
Sponsored Post

How to Reduce MTTR When Third-Party Services Go Down

Most MTTR guides assume the problem is in your infra. For modern apps, it often isn't: it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission, and the lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.

7 Best Practices to Improve Digital Employee Experience in Modern IT Environments

Digital employee experience isn’t just a nice-to-have anymore. In hybrid, SaaS-heavy IT environments, digital employee experience (DEX) is where productivity lives or dies. Employees don’t care whether the culprit is Wi‑Fi connectivity, CPU/RAM load, poor battery life, or a misbehaving cloud app. They just know work got harder.

Auvik Aurora and the Future of AI in IT Operations

We built something called Auvik Aurora, and before you scroll any further, I can already hear your thoughts: “Wait a second, Anto. Is this going to be another blog post giving me the hard sell on using AI?” Fair enough. I don’t think anyone would blame you, especially when we’re seeing AI adoption across nearly every industry, tool, hobby, workflow, or even ______. The blank is intentional: AI is everywhere, and chances are you already know that it matters.

Fixing JavaScript observability, one library at a time

Over the past few weeks, we have been driving a cross-ecosystem effort to replace the “monkey-patching” that powers all JavaScript APM tools today with something built into the runtime. Here is why, how, and where it stands. This applies to server-side JavaScript only (Node.js, Bun, Deno, Cloudflare Workers). Browsers do not have diagnostics_channel and lack the async context propagation primitives needed to polyfill it.

ActiveMQ Monitoring & Alerting Setup: The Complete 2026 Guide

Most ActiveMQ outages are not sudden failures. They are visible in the metrics for minutes, sometimes hours, before they become incidents. A memory usage graph climbing past 60%. A queue depth that isn't draining. An enqueue time that doubled after a deployment. A consumer count that dropped from 3 to 1 at 2 AM.