The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

What is MTTR, and how can agentic ITOps reduce it?

Jul 16, 2026 By BigPanda In BigPanda

Mean time to resolution (MTTR) measures the average duration to restore regular operation for an application, service, or infrastructure component. It’s a key performance indicator (KPI) for IT incident management. To tie MTTR directly to customer satisfaction, you first need to understand how it affects service and application reliability and availability. From there, you can make informed decisions, operate efficiently, and provide a seamless customer experience.
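
As a rough illustration of the definition above, MTTR can be computed as the mean of each incident's resolution time minus its start time. The sketch below is a minimal example under stated assumptions, not BigPanda's method: the incident records are made up, and the function name `mean_time_to_resolution` is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the average (resolved - opened) duration across incidents.

    `incidents` is a list of (opened, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident records, invented for illustration: (opened, resolved)
incidents = [
    (datetime(2026, 7, 1, 9, 0), datetime(2026, 7, 1, 9, 30)),     # 30 minutes
    (datetime(2026, 7, 3, 14, 0), datetime(2026, 7, 3, 15, 30)),   # 90 minutes
    (datetime(2026, 7, 8, 22, 15), datetime(2026, 7, 8, 23, 15)),  # 60 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Because the figure is a plain average, one long outage can dominate it, so teams often report a median or percentile alongside it.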

On call? Don't miss the next World Cup match.

Jul 16, 2026 By Derdack SIGNL4 In SIGNL4

Plans change, and your on-call schedule should be able to change with them. With SIGNL4, you can quickly arrange shift coverage from your smartphone, so your team stays fully staffed and everyone knows exactly who's on duty. Whether it's a World Cup match, a family event, or any other last-minute plan, SIGNL4 helps you manage stand-ins and shift handovers without phone calls, spreadsheets, or confusion.

Software Quality Beyond Code: The Three Dimensions of Reliable IT Operations

Jul 15, 2026 By SIGNL4 In SIGNL4

Many teams invest heavily in code quality, architecture and testing – yet still struggle with outages, slow response times, and unclear ownership. The reason is simple: software quality is about more than technology alone.

AI vs. AI: from alert fatigue to agentic cybersecurity

Jul 15, 2026 By Rootly In Rootly

AI is transforming cybersecurity on both sides of the battlefield. Attackers can now launch highly personalized phishing campaigns at scale and build malware capable of making autonomous decisions. At the same time, security teams are using AI agents to investigate alerts, reduce noise, and respond to threats faster. In this episode of Humans of Reliability, we speak with Nir Soudry, Head of R&D at 7AI, about the shift from alert fatigue to agentic cybersecurity.

PagerDuty Announces Arnaud Lagarde, Vice President of EMEA

Jul 14, 2026 By PagerDuty In PagerDuty

PagerDuty, Inc. announces the appointment of Arnaud Lagarde as vice president of EMEA. Lagarde will lead PagerDuty's next phase of growth in the EMEA region, bringing the entire incident management lifecycle to customers across EMEA to solve their biggest digital challenges.

How to lay the data foundation to support agentic ITOps

Jul 14, 2026 By Carlos Gutierrez In BigPanda

Agentic IT operations have arrived. It’s no longer a question of if enterprise IT departments will adopt agentic ITOps, but how quickly. Every year, IT environments grow more distributed, complex, and difficult to monitor with legacy tools and processes. At the same time, the pace of AI development is accelerating the volume of changes and incidents, straining teams that are still trying to manage them manually, reactively, and one alert at a time.

Stop Triaging in the Dark: Full Visibility Across Every IT Domain

Jul 14, 2026 By BigPanda In BigPanda

Alert correlation solved the noise problem. But noise was never the whole problem. Today’s most disruptive incidents cascade across networks, infrastructure, applications, and services simultaneously, with no clear view of the true root cause. As a result, L1 teams are left manually piecing together context from multiple dashboards and tools to find that root cause while SLA clocks keep ticking and end-user tickets pile up.

The Value of Preventive Maintenance in Modern Business Operations

Jul 13, 2026 By OpsMatters In OpsMatters

Preventive maintenance helps businesses reduce downtime, avoid costly breakdowns, extend equipment life, and maintain safer, more efficient operations. By addressing small issues early, companies can keep workflows running smoothly and protect productivity in a competitive business environment.

Where Status Pages Fit in a Modern Incident-Response Workflow

Jul 12, 2026 By OpsMatters In OpsMatters

An incident-response process has two audiences from the moment a service begins to fail. Engineers need evidence detailed enough to isolate the fault. Customers need a clear account of what is affected, what still works, and when they should expect another update. Trying to serve both groups from the same dashboard usually leaves each with the wrong information.

From BigQuery to ClickHouse: How we made our analytics 5× faster

Jul 10, 2026 By Aleksandr Meshcheriakov In iLert

For years, ilert has given our customers extensive analytics across their alerts, notifications, and on-call activity, a comprehensive overview of how their teams and services respond to incidents. These capabilities were backed by a separate analytical database running on Google BigQuery. It held the numbers behind every reporting dashboard in ilert, and for a long stretch it was perfectly fine. Then three problems grew too big to ignore.
