The latest news and information on incident management, on-call, incident response, and related technologies.

After action reports: post-incident investigations

Jun 7, 2023 By Justyn Roberts, Senior Solutions Consultant In PagerDuty

When something unexpected happens within the digital operations remit, software engineers put on their deerstalker hats and wax their fussy little moustaches, metaphorically speaking. It's their time to play detective as they unravel the evidence and create the reports to explain the recent IT incident. But unlike with a hat-wearing Sherlock Holmes or a hirsute Hercule Poirot, cliff-hanger endings are not encouraged in software engineering.

Read Post

Understanding Kubernetes Logs and Using Them to Improve Cluster Resilience

Jun 6, 2023 By Ritika Bramhe In OnPage

In the complex world of Kubernetes, logs serve as the backbone of effective monitoring, debugging, and issue diagnosis. They provide indispensable insights into the behavior and performance of individual components within a Kubernetes cluster, such as containers, nodes, and services.

Read Post

What Is Root Cause Analysis?

Jun 5, 2023 By StatusCast In StatusCast

Root Cause Analysis (RCA) is a systematic process designed to uncover the fundamental, underlying issues that lead to IT incidents. These 'root causes' are often masked by surface-level symptoms, making them challenging to identify without a systematic approach. Root Cause Analysis serves as a metaphorical excavation, drilling past the initial problems to discover deeper, hidden issues.

Read Post

Introducing powerful APIs and webhooks for Grafana Incident

Jun 5, 2023 By Mat Ryer In Grafana

Grafana Incident, Grafana’s powerful incident response tool, comes with a range of integrations out of the box, including Zoom and Google Meet spaces, GitHub and JIRA issues, and even a Google Doc template for post-incident review documents. However, every team has unique needs and workflows, and you may need to integrate with other systems not currently on our roadmap or even use your own in-house tools.

Read Post

Incident Management Highlights | Jira Service Management | Atlassian

Jun 5, 2023 By Atlassian In Atlassian

Jira Service Management brings all of the context and data you need to resolve an incident quickly and efficiently. Incident Management helps teams escalate, bring in the right responders, swarm, and ultimately minimize downtime.

View Video

Slack And MS Teams As A Device - xMatters Support

Jun 5, 2023 By xMatters In xMatters

With xMatters, you can connect Slack or Microsoft Teams to your instance in just a few short steps, allowing you to use them as user messaging devices, just like email, SMS, voice, and push.

View Video

Unplanned, Episode 1: Damon Edwards Rages Against the Ticket Machine

Jun 5, 2023 By PagerDuty In PagerDuty

In this, the inaugural episode of “Unplanned”, Dormain Drewitz talks to Damon Edwards about the “capacity conundrum”, where everyone is working so hard, but everything takes too long and costs too much. They talk about the “coordination overhead” costs of getting unplanned work done, how generative AI both adds complexity and offers to accelerate automation, and four steps to creating capacity.

View Video

Proactive IT: Disaster Recovery Testing

Jun 5, 2023 By StatusCast In StatusCast

In today's business environment, the continuity of IT systems is crucial to the success of an organization. Unforeseen disasters, such as infrastructure failures or cyber attacks, can severely impact the productivity of your organization. To mitigate these risks, IT departments must develop and implement robust disaster recovery (DR) plans. But how can you ensure that these plans work effectively in times of crisis?

Read Post