Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

[SRE: From Theory to Practice] What's difficult about problem detection?

In this episode of FTTP, Kurt Andersen and Matt Davis are joined by Joanna Mazgaj and Laura Nolan to talk about the implications of and considerations for problem detection. Watch the full episode and hear them share personal stories about the types of challenges you might face. Ultimately, how do we explain and address the socio-technical concepts behind problem detection?

[SRE: From Theory to Practice] What's difficult about incident command?

Welcome back to our mini series of fireside chats with SRE experts talking about the realities of their day-to-day. Episode 2 gets intimate — What’s difficult about incident command? We invited Alyson van Hardenberg, Engineering Manager at Honeycomb.io, and Varun Pal, Staff SRE at Procore, to chat with Jake Englund and Matt Davis from the Blameless team. Watch the full conversation where they cover everything from methodologies and technical expertise to the human and social aspects of reliability engineering.

Using Tagging and Routing Rules in Squadcast I Incident Classification I Event Tagging I Squadcast

Event Tagging is a rule-based, auto-tagging system with which you can define customized tags based on incident payloads, that get automatically assigned to incidents when they are triggered. This video explains how to create Tagging rules for efficient Incident Classification.

Maximizing IT Company Success through Effective On-Call Support

Having your systems monitored by a reliable solution is important, but how do you ensure that the right people are informed about issues that arise? Identifying problems is the first step, but they also need to be routed to the appropriate individuals. Keep in mind that employees may not always be sitting in front of the dashboard. This means being available outside of normal working hours to quickly respond to emergencies and problems, including not only weeknights but also weekends and holidays.

Adding Incident Watchers in Squadcast | Incident Notifications and Updates | Squadcast

This video talks about Squadcast's Incident Watchers Feature. In Squadcast, any user/stakeholder can subscribe to an Incident and act as a Watcher for an incident. Incident Watchers can choose to receive notifications for all the updates of an incident. This allows any user/stakeholder to act as an observer of the incident, even if they are not active responders. You can customize your watch options for the incident and receive notifications only for those updates.

Common Incident Terminology

Operations, customer support, engineers and most groups use inconsistent language. This is a serious problem. Imagine NASA doing that with astronauts or a navy with ships talking to each other, but not using the same terms. Something very bad will happen. In our space of incident management, we use words like broke, failed, outage, doesn’t work, dead…all describing the same condition.