
Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

A guide to post-mortem meetings and how we run them at incident.io

You've just made it through a particularly tough incident. It was a short outage affecting a subset of customers, so not exactly the end of the world, but bad enough that it involved multiple people across a number of teams to resolve. Either way, the incident was well managed, and the dust has settled. Now what? Most guidance would say that putting together a post-mortem document is a good idea, given the severity of the incident. You've also done this, so what's next?

Three Ways to Better Appreciate your SREs and DevOps Engineers

DevOps engineers and Site Reliability Engineers are vitally important to the continued health of your product and business. We all know it’s true, and yet people in these roles often feel underappreciated and undervalued. This sort of work runs into the issue of “when process and infrastructure break, it gets shoved in the spotlight; but when everything works perfectly, no one notices.”

How AIOps modernizes CMDBs to drive accuracy and value

Keeping your Configuration Management Database (CMDB) accurate and fully updated while improving its performance is a frustrating and elusive goal for ITOps and IT leaders. Aiming for this ‘golden’ CMDB standard can feel like running on a treadmill: you put in a lot of work but remain as distant as ever from your goal. Can IT leaders ever catch up?

Bridging the ITIL vs DevOps Mindset: CI/CD Best Practices for ITIL Organizations

DevOps practices in software development have revolutionized the way updates are released. However, many companies entrenched in ITIL practices find it challenging to seamlessly integrate with the DevOps practice of Continuous Integration and Continuous Delivery/Deployment (CI/CD). This is because ITIL focuses on stability, which suits older systems, while DevOps is ideal for modern setups with its agile, automated practices.

Revolutionizing your Grafana setup with intelligent alerting

Once upon a time, in the bustling city of DataVille, lived a team of dedicated IT professionals tirelessly working to maintain the city’s digital heartbeat. Their mission was to ensure the smooth operation of their city’s digital infrastructure, which was not limited to the daytime operations but extended beyond business hours. They were the unsung heroes, the guardians of the city’s data. Their tool of choice? Grafana, a powerful open-source platform for observability.

What is HCAHPS: A Comprehensive Overview

In the realm of hospitals and healthcare organizations, the term “HCAHPS survey” is a recurrent presence: Hospital Administrator A: “The latest HCAHPS survey results just came out, and patients seem satisfied with…” Hospital Administrator B: “Some of our past patients participated in the HCAHPS survey, but they expressed disappointment with…” You might be left wondering, “What exactly is the HCAHPS survey?” Allow me to elucidate.

Unified Incident Management: Merits of Combined On-Call and Incident Response | Squadcast

In this session, we explore the crucial aspects of effective on-call management and incident response in product organizations. Squadcast combines On-Call and Incident Response into a single platform using automation capabilities for enhanced reliability, continuous learning, and better productivity.

Choosing the Right Career Path in Tech: Software Engineering vs. Site Reliability Engineering (SRE)

The tech industry is booming, and there are many different career paths. Two of the most popular and in-demand roles are Software Engineering (SWE) and Site Reliability Engineering (SRE). SRE blends elements of software engineering with IT operations, focusing on reliability. Software Engineering, on the other hand, involves designing, developing, testing, and deploying software applications.