The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

SSL Certificate Monitoring: Best Tools and Practices

SSL certificate monitoring is the continuous process of checking whether your TLS certificates are valid, correctly configured, and not approaching their expiry date. When SSL monitoring is absent or inadequate, the first sign of trouble is often a browser security warning that blocks users from reaching your site. By then, the damage has already started.
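As a rough illustration of the expiry part of that check, here is a minimal Python sketch that uses only the standard library. The function names, the 30-day warning window, and the three status labels are assumptions made for this example, not something taken from the article or from any particular monitoring tool.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the certificate's notAfter string.

    The default context also verifies the chain and hostname, so an
    invalid certificate raises ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days from `now` (epoch seconds) until the notAfter timestamp."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def classify(days_left: float, warn_days: int = 30) -> str:
    """Map days remaining to a status; the 30-day window is an assumed default."""
    if days_left < 0:
        return "expired"
    if days_left < warn_days:
        return "expiring"
    return "ok"
```

A scheduled job could call `classify(days_until_expiry(fetch_not_after("example.com")))` for each monitored host and raise an alert on anything other than `"ok"`. A real monitor would likely also check chain completeness and protocol configuration, which this sketch leaves out.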

How to Assign Tasks to Slack Alerts Channels Guide

An alert fires in your Slack alerts channel. It sits there for four minutes while three engineers each assume someone else is going to respond. Nobody owns it. Nobody creates a ticket. By the time someone acts, the incident has escalated. This is the accountability gap that unstructured Slack alert channels create. Visibility without assignment is not enough.

How to Add On-Call Rotations to Google Calendar

Your on-call rotation lives in a scheduling tool or a spreadsheet. Your engineers' actual work schedules live in Google Calendar. When these two systems do not talk to each other, engineers are constantly context-switching to figure out who is on-call and when. They miss shift reminders. They schedule personal appointments during on-call windows. And handovers get messy because nobody has a single place to see the full picture.
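One common way to bridge the two systems is to export the rotation as an iCalendar (.ics) file, a format Google Calendar can import. The sketch below is illustrative only: the round-robin assignment, the function names, the `oncall.example` UID domain, and the sample engineers are all assumptions, and the article's own method may differ.

```python
from datetime import datetime, timedelta, timezone

ICS_TIME = "%Y%m%dT%H%M%SZ"  # UTC timestamp form used by iCalendar


def build_shifts(engineers, start, shift_length, count):
    """Assign `count` back-to-back shifts round-robin across `engineers`."""
    shifts = []
    for i in range(count):
        begin = start + i * shift_length
        shifts.append((engineers[i % len(engineers)], begin, begin + shift_length))
    return shifts


def _utc(moment):
    """Format a timezone-aware datetime as a UTC iCalendar timestamp."""
    return moment.astimezone(timezone.utc).strftime(ICS_TIME)


def to_ics(shifts, stamp):
    """Render shifts as an iCalendar document, one VEVENT per shift."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//oncall-sketch//EN",  # hypothetical product id
    ]
    for name, begin, end in shifts:
        lines += [
            "BEGIN:VEVENT",
            f"UID:{_utc(begin)}-{name}@oncall.example",  # hypothetical UID scheme
            f"DTSTAMP:{_utc(stamp)}",
            f"DTSTART:{_utc(begin)}",
            f"DTEND:{_utc(end)}",
            f"SUMMARY:On-call: {name}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    first_shift = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
    rotation = build_shifts(["ana", "bo", "chen"], first_shift, timedelta(days=7), 6)
    print(to_ics(rotation, stamp=datetime.now(timezone.utc)))
```

Writing the output to a file and importing it gives every engineer the shifts alongside their normal schedule. A static file has to be re-exported when the rotation changes, so a subscribed calendar feed or the scheduling tool's own integration may be a better fit for rotations that change often.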

The Follow-the-Sun Field Log: Running an SRE Rotation Across Lisbon, Singapore and Austin in One Quarter

Quick note before we start. At 03:17 on a Tuesday in Lisbon, a watch buzzes against a hotel pillow. Two seconds later a phone screen lights the ceiling: P1, payments-writer-secondary, error rate seventy-eight percent. The on-call lead is twelve thousand kilometres from her desk. The team's five-minute escalation service-level objective is already running. The next ninety seconds will decide whether this is a clean save or a long retro.

What IT Incident Management Can Teach Workplace Safety

In most modern enterprises, the playbook for a production outage is well understood. An alert fires. An on-call engineer responds within a documented service level. The incident is triaged, assigned a severity, and worked through to resolution by a team that has rehearsed the steps. Afterward, a postmortem is written. The root cause is identified, blameless analysis is performed, and the findings flow back into runbooks, monitoring rules, and training materials. The cycle is closed.

Replace Verizon Email-to-Text with OnPage's Paging / Critical Alerting Capabilities

It’s 2:00 AM on a Saturday. An energy company’s thermal storage system temperature violently spikes past safe operating thresholds. The monitoring system instantly fires off an emergency alert via a standard Verizon email-to-text gateway. But instead of waking the engineer, the message is delayed by the carrier network. By the time the on-call responder sees the text hours later, the equipment has failed, resulting in catastrophic downtime.

Slack outage on May 14, 2026

On May 14, 2026, users across multiple regions began reporting problems with Slack, including messaging failures, sign-in issues, and problems loading attachments and images. While the outage did not affect every user, reports quickly showed the issue was widespread enough to disrupt business communication for organizations around the world. StatusGator identified the incident through customer outage reports and triggered an Early Warning Signals alert at 14:21 UTC.

Problem Management vs. Incident Management

Why Fixing Incidents Is Only Half the Work

Fixing an incident is not the same as solving a problem. In enterprise IT operations, that distinction carries significant operational weight. Organizations that treat every disruption as a discrete, isolated event to be resolved and closed will continue to encounter the same disruptions, on the same infrastructure, from the same root causes. The cycle does not end because the underlying problem was never addressed.

Jira Notifications Management: The Enterprise Guide to Routing, Reducing Noise, and Closing the Loop

Jira is the system of record for engineering work at nearly every enterprise that runs agile delivery. It tracks epics, stories, bugs, sprints, releases, and the long tail of technical debt that keeps platform teams awake. What Jira was never designed to be is an alerting system.