The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Featured Post

After action reports: post-incident investigations

When something unexpected happens within the digital operations remit, software engineers put on their deerstalker hats and wax their fussy little moustaches, metaphorically speaking. It's their time to play detective as they unravel the evidence and create the reports to explain the recent IT incident. But unlike with a hat-wearing Sherlock Holmes or a hirsute Hercule Poirot, cliff-hanger endings are not encouraged in software engineering.

Understanding Kubernetes Logs and Using Them to Improve Cluster Resilience

In the complex world of Kubernetes, logs serve as the backbone of effective monitoring, debugging, and issue diagnosis. They provide indispensable insights into the behavior and performance of individual components within a Kubernetes cluster, such as containers, nodes, and services.

What Is Root Cause Analysis?

Root Cause Analysis (RCA) is a systematic process designed to uncover the fundamental, underlying issues that lead to IT incidents. These 'root causes' are often masked by surface-level symptoms, making them challenging to identify without a systematic approach. Root Cause Analysis serves as a metaphorical excavation, drilling past the initial problems to discover deeper, hidden issues.

Introducing powerful APIs and webhooks for Grafana Incident

Grafana Incident, Grafana’s powerful incident response tool, comes with a range of integrations out of the box, including Zoom and Google Meet spaces, GitHub and JIRA issues, and even a Google Doc template for post-incident review documents. However, every team has unique needs and workflows, and you may need to integrate with other systems not currently on our roadmap or even use your own in-house tools.

Unplanned, Episode 1: Damon Edwards Rages Against the Ticket Machine

In this, the inaugural episode of “Unplanned”, Dormain Drewitz talks to Damon Edwards about the “capacity conundrum”, where everyone is working so hard, but everything takes too long and costs too much. They discuss the “coordination overhead” costs of getting unplanned work done, how generative AI both adds complexity and offers to accelerate automating as much as you can, and four steps to creating capacity.

Proactive IT: Disaster Recovery Testing

In today's business environment, the continuity of IT systems is crucial to the success of an organization. Unforeseen disasters, such as infrastructure failures or cyber attacks, can severely impact the productivity of your organization. To mitigate these risks, IT departments must develop and implement robust disaster recovery (DR) plans. But how can you ensure that these plans work effectively in times of crisis?