
What's Special About MCP?

AI agents can interact with the world using tools. Those tools can be generic or specific. The most general ones, like “run a bash command” and “read and write files”, are built into the agent. More specific ones are provided through Model Context Protocol (MCP) servers. Every tool provided to the agent comes with instructions sent as part of the context.
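As a rough illustration of why every tool costs context, the sketch below renders two tool descriptors into the text an agent might receive. The field names (`name`, `description`, `inputSchema`) follow the shape MCP servers use when listing tools; the tools themselves, the `render_tool_context` helper, and the rendering format are assumptions made up for this example.

```python
import json

# Hypothetical tool descriptors. One generic tool and one specific tool,
# both invented for illustration.
TOOLS = [
    {
        "name": "run_bash",
        "description": "Run a bash command and return its output.",
        "inputSchema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "lookup_certificate",
        "description": "Return expiry details for a certificate by hostname.",
        "inputSchema": {
            "type": "object",
            "properties": {"hostname": {"type": "string"}},
            "required": ["hostname"],
        },
    },
]


def render_tool_context(tools):
    """Serialize tool descriptors into a text block for the agent's context."""
    lines = []
    for tool in tools:
        lines.append(f"Tool: {tool['name']}")
        lines.append(f"Description: {tool['description']}")
        lines.append(f"Input schema: {json.dumps(tool['inputSchema'], sort_keys=True)}")
        lines.append("")  # blank line between tools
    return "\n".join(lines).rstrip()


context = render_tool_context(TOOLS)
print(context)
print(f"\n{len(context)} characters of context for {len(TOOLS)} tools")
```

Each additional tool adds its name, description, and schema to this block, so the context grows with the number of tools offered.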

Installing TrackJS on CertKit

I recorded a video showing how to properly set up TrackJS for a new production website, specifically CertKit, our new certificate lifecycle management tool. The key to effective error monitoring isn’t just installing the tracking snippet; it’s configuring the system to surface real issues while filtering out the noise. In the video, I configure a forwarding domain (errors.certkit.io) to bypass ad blockers that might prevent error reporting.

Top Causes of Data Center Outages and How You Can Reduce Risk

Outages are less common than they once were, but when they happen, the impact is severe. According to the Uptime Institute Global Data Center Survey 2025, half of data center operators reported at least one impactful outage in the past three years, and one in ten of those caused a serious or severe disruption. The financial risk is just as significant: 20% of operators said their most recent outage cost more than $1 million when accounting for downtime, recovery, and reputational damage.

<100ms E-commerce: Instant loads with Speculation Rules API

In e-commerce, we all know that speed = money. I know it, you know it, Amazon knows it, eBay knows it, Shopify knows it, everyone knows it. In this article we’ll see how we can improve the perceived performance of our site’s critical pages, like the Product Details page, the Cart page, and the Checkout page. We’re going to use the Speculation Rules API (SRA) to prerender or prefetch them, and also explain how certain frameworks like Next.js offer their own prefetching mechanisms.
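For a sense of what such rules look like, here is a minimal sketch that builds a `<script type="speculationrules">` tag on the server side. The URL pattern, the page paths, and the `speculation_rules_tag` helper are assumptions for illustration, not taken from the article; the JSON keys (`prerender`, `prefetch`, `where`, `href_matches`, `urls`, `eagerness`) are the ones the Speculation Rules API defines.

```python
import json


def speculation_rules_tag(prerender_pattern, prefetch_urls):
    """Build a speculation rules script tag (hypothetical helper).

    Links matching prerender_pattern are prerendered on moderate eagerness
    (for example, on hover); the listed URLs are prefetched.
    """
    rules = {
        "prerender": [
            {"where": {"href_matches": prerender_pattern}, "eagerness": "moderate"}
        ],
        "prefetch": [
            {"urls": list(prefetch_urls)}
        ],
    }
    return (
        '<script type="speculationrules">\n'
        + json.dumps(rules, indent=2)
        + "\n</script>"
    )


# Assumed paths: product pages are prerendered, cart and checkout are prefetched.
tag = speculation_rules_tag("/products/*", ["/cart", "/checkout"])
print(tag)
```

Browsers that do not support speculation rules ignore the script tag, so the markup can be emitted unconditionally.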

Eliminating N+1 Queries with Seer's Automated Root Cause Analysis

When I was working at Shopify, Black Friday and Cyber Monday were our Super Bowl. We initiated a code freeze weeks before to make sure merchants wouldn't have any unexpected issues during one of the most important times of the year. Sometimes, though, you need to ship updates at the last minute. Picture this: It's Black Friday Eve, 11:47 PM. You've just deployed a new /sale page with 50+ products at discounted prices. Marketing is about to email 500,000 subscribers. Everything tested fine with your sample data.

Side-by-Side Variable Comparison for Snapshot Debugging

When you’re debugging a tricky issue in a distributed system, “what changed?” is often the most important question. You add logs, you capture data, you redeploy, and suddenly your browser is full of open tabs, copied JSON blobs, and screenshots of log lines. Comparing behavior between two requests, two users, or two releases turns into a manual, error-prone chore. Lightrun Snapshots were built to fix the data collection side of that story.

How continuous profiling cut our cloud spend

At Coralogix, we’re constantly looking to evolve the measurements we take to better understand the efficiency of our infrastructure. We regularly assess and investigate sources of cost in our cloud infrastructure to ensure we’re getting the best return on investment. This activity, often referred to as FinOps, is becoming a cornerstone of engineering teams.

Announcing a forthcoming integration with PagerDuty + Azure AI SRE Agent for faster incident response

The energy at Microsoft Ignite this year was electric. AI was everywhere, and the possibilities are limitless. As developers and operations teams explore what AI can do, one thing became clear: the future isn’t about switching between tools. It’s about intelligent agents working together to help humans solve problems faster. At PagerDuty, we’re building on that excitement.

The $8.8 trillion advantage: how open source software reduces IT costs

Open source software is known for its ability to lower IT costs. But in 2025, affordability is only part of the story. A new Linux Foundation report, The strategic evolution of open source, reveals that open source has evolved from a tactical cost-saving measure to a mission-critical infrastructure supporting enterprise-grade investments, and delivering stronger business outcomes as a result.

7 Observability Solutions for Full-Fidelity Telemetry

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives). This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.
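Tail-based sampling, mentioned above, can be sketched as a decision made only after a whole trace has been collected. The span fields (`trace_id`, `duration_ms`, `error`), the thresholds, and the `keep_trace` function are assumptions for this example, not the behavior of any particular product.

```python
import hashlib


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.05):
    """Decide whether to keep a completed trace.

    Traces with errors or a slow span are always kept; the remaining
    healthy traces are kept at baseline_rate, chosen deterministically
    from the trace ID so every collector makes the same decision.
    """
    if any(span["error"] for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) >= latency_threshold_ms:
        return True
    trace_id = spans[0]["trace_id"]
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < baseline_rate


# Hypothetical traces: one healthy, one containing an error.
healthy = [{"trace_id": "a1", "duration_ms": 40.0, "error": False}]
failed = [
    {"trace_id": "b2", "duration_ms": 35.0, "error": False},
    {"trace_id": "b2", "duration_ms": 12.0, "error": True},
]
print(keep_trace(failed))
print(keep_trace(healthy, baseline_rate=0.0))
```

Because the decision waits for the full trace, the interesting requests survive intact while routine traffic is thinned to the baseline rate.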