Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Developer environments should be cattle, not pets

“Cattle, not pets” is a DevOps phrase referring to servers that are disposable and automatically replaced (cattle) as opposed to indispensable and manually managed (pets). Local development environments should be treated the same way, and your tooling should make that as easy as possible. Here, I’ll walk through an example from one of my first projects at incident.io, where I reset my local environment a few times to keep us moving quickly.

Admin Panel - General Settings - xMatters Support

You can define the details for a company using the General Settings page accessed via the Admin menu. Depending on your permission level, you may not be able to view the General Settings screen. In addition, the settings you see on this page depend on both your role permissions and the features available in your product plan.

Redundancy for IT resilience: The backup guide for a disaster-proof network

Around six years ago on a Wednesday morning, software professionals worldwide were startled by a tweet from GitLab stating that it had accidentally deleted its production data, taking its site offline. At that point, the open-source code repository giant had no idea that it would take another 36 hours to restore its systems, only to learn that 5,000 projects and 700 new user accounts had been affected along the way.

The Guide to SRE Principles

Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems.

The 7 IT Automations for Highly Effective Organizations: IT Incident Remediation | Web App Down

No organization is immune to outages, unplanned interruptions, or quality reduction of normal service. But having a streamlined response plan can ensure these situations are dealt with more effectively to restore normalcy. In a world where IT efficiency is being measured by mean time to resolution, triaging and remediating alarms can directly impact the business in a positive way.

Komodor + Squadcast Integration: Simplifying Kubernetes Monitoring & Incident Response

Kubernetes (K8s) is a powerful tool for container orchestration, but it presents unique challenges when it comes to monitoring and incident response. Managing K8s requires 360° visibility into your environment, proactive health monitoring, and the right incident management and suppression capabilities. In this article, we'll explore the benefits of integrating Squadcast with Komodor, two powerful tools that can help you overcome these challenges.

How metrics can make or break your IT operations strategy

IT people know that data is king, especially in optimizing IT operations. However, figuring out which metrics to collect and how to collect them can be challenging. IT teams have to factor in what IT directors, team managers, and the people overseeing operations want, what they’re concerned about, and what they consider important.

Managing Incidents in Energy and Utility Companies

Several challenges impact the customers and operations of utilities and energy companies, including aging infrastructure, cybersecurity threats, inclement weather, operational failures, and transmission interruptions. These challenges can cause prolonged service disruptions, potentially leading to customer attrition and irreversible damage to businesses. Responding quickly and efficiently to incidents is critical to minimize damage and contain potentially dangerous scenarios.