
Trace without traces

A customer emailed on a Tuesday: checkout hung for ten seconds. I opened our tracing tool, punched in the time window, and got nothing. The trace was sampled out. We keep 1% of traces, like most shops with real traffic do. The one request that actually mattered was in the 99% we threw away. I spent twenty minutes admiring our observability stack before admitting it couldn’t answer a first-grader’s question: what happened to this person? Here’s what I know now.
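To make the failure mode concrete, here is a minimal sketch of head-based sampling at the 1% rate mentioned above. It assumes the keep-or-drop decision is made from a hash of the trace ID at the start of the request; the function names and the hashing scheme are illustrative, not the behavior of any particular tracing tool.

```python
import hashlib

SAMPLE_RATE = 0.01  # keep 1% of traces, as in the story above


def keep_trace(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Decide up front whether a trace is kept (hypothetical helper).

    The decision depends only on the trace ID, so it is made before
    anyone knows whether the request will turn out to be slow.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def kept_fraction(n: int, rate: float = SAMPLE_RATE) -> float:
    """Fraction of n synthetic trace IDs that survive sampling."""
    kept = sum(keep_trace(f"trace-{i}", rate) for i in range(n))
    return kept / n
```

Because the decision ignores latency and outcome, a ten-second checkout has the same roughly 99-in-100 chance of being dropped as any healthy request.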

June 2026 Early Warning Signals

June 2026 saw major outages across ecommerce, AI, developer tools, and business applications. StatusGator’s Early Warning Signals surfaced many of these incidents before providers updated their official status pages. Of the 1,067 incidents detected by StatusGator in June, only 191 (17.9%) were eventually acknowledged by providers.

Introducing Relationships for Service Monitors

Understanding a service outage is easier when you can see what it’s connected to. That’s why we’re introducing Relationships for Service Monitors, one of the most requested features from StatusGator’s hundreds of enterprise IT teams. You can now explore related services directly from the Service Details page by opening the Relationships dropdown.

ACP vs MCP: What's the difference for agentic coding?

An AI coding agent holds many conversations at once. Not only is the user prompting it, the agent also talks to the IDE, showing diffs and asking before it touches a file. At the same time it talks to tools, pulling a failing build or querying a database. Two open protocols standardize those conversations. This guide compares ACP vs MCP in practical terms: what each protocol does and when each applies. ACP (Agent Client Protocol) connects a code editor to an AI coding agent.

Autoscaling Checkly Private Location Agents in Kubernetes with KEDA

Monitoring load is not always steady. A team might add a new batch of checks or run several ad hoc tests during a rollout. When that happens, your Private Location agents need to pick up more work at once. If there aren’t enough agents available during a burst, checks start piling up in the queue, which can delay or disrupt check execution. But solving this by running a high number of agents around the clock has the opposite problem: most of that capacity sits idle until the next busy period.
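The trade-off above can be sketched as a proportional scaling rule: size the agent pool to the queue, then clamp it between a floor and a ceiling. This is the general shape of queue-driven autoscaling, not the configuration from the post; the function name, the per-agent target, and the default bounds are all assumptions made for illustration.

```python
import math


def desired_agents(queued_checks: int, checks_per_agent: int,
                   min_agents: int = 1, max_agents: int = 10) -> int:
    """Return how many agents to run for the current queue depth.

    Hypothetical helper: checks_per_agent is the assumed number of queued
    checks one agent should absorb; min_agents and max_agents bound the pool.
    """
    if checks_per_agent <= 0:
        raise ValueError("checks_per_agent must be positive")
    # Enough agents to cover the queue, rounded up.
    wanted = math.ceil(queued_checks / checks_per_agent)
    # Clamp so the pool never drops below the floor or exceeds the ceiling.
    return max(min_agents, min(max_agents, wanted))
```

With these assumed defaults, an empty queue keeps a single agent running, a burst of 23 queued checks at 5 per agent asks for 5 agents, and a very large burst is capped at 10.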

Any Apple update can break our app. Here's how we find out first.

This is a guest post by Dan Mindru, a Frontend Developer and Designer who is also the co-host of the Morning Maker Show. Dan is currently developing a number of applications including PageUI, Clobbr, and CronTool. It feels like with every release, we are walking a tightrope. We need to keep our app lightweight, stable, and performant, all the while depending on APIs that can shift at any moment (without warning, too!).

Self-Healing ITOps: Close the Loop From Detection to Resolution

Self-healing ITOps helps restore services faster by combining AI-driven analysis, automation, and recovery validation. Organizations have invested heavily in monitoring, observability, and AIOps. These platforms are effective at identifying issues, but incident resolution is often still a manual process. Engineers still need to investigate alerts, determine the appropriate remediation, and verify that services have recovered.

When One Agent Plans and Another Executes, the Planner's View Decides Everything

Split network operations into a planning agent and an executing agent and you have an elegant design on paper. One agent reasons about what should change and validates it. The other carries it out. The elegance is real, and so is the structural consequence: the split puts the entire weight of judgment on the planner. A plan built on a partial view, then executed precisely and at machine speed, is more dangerous than a cautious human who would have hesitated at the part that did not add up.