VictorOps

victorops

The Complete Template for On-Call Incident Response

Modern Agile practices and DevOps methodologies are leading to faster feature releases even though systems are becoming more complex. With high velocity comes more change, and more change leads to more alerts and incidents in applications and infrastructure. So, the only surefire way for DevOps and IT teams to build reliable services is through proactive testing and an efficient on-call incident response plan.

Tracking Application Service Ownership in a DevOps Organization

It’s 6 PM on a Friday and your database service just failed. You, the on-call engineer, are the sole team member left in the office. Your colleagues may or may not be paying close attention to Slack channels and email as they make their way to weekend destinations. How do you figure out who can help you solve the database issue? If you don’t know who wrote or maintains the code that powers it, you’ve found yourself in a tough spot indeed.

The Template for Humane Root Cause Analysis

In the traditional IT Infrastructure Library (ITIL) approach to IT service management (ITSM) and IT operations, root cause analysis is required for effective incident management. But, over time, DevOps and IT teams are learning that there’s rarely a single root cause. Sure, a single action (e.g., a new deployment) can result in one short-lived incident. But what about all the other actions leading up to that action?

Crafting a Comprehensive IT Monitoring Plan

Sysadmins, database admins and other IT professionals are constantly tweaking monitoring tools and trying to create more reliable systems. But, IT infrastructure and applications are constantly shifting underneath the people maintaining them – making it hard to maintain robust services. And, to top it all off, microservices, containerized applications, hybrid cloud infrastructure and faster deployment lifecycles are leading to more complex systems.

Cohesive Incident Management: Bringing Help Desks and Developers Together

Collaborative help desks and service desks are essential to both IT and customer support. Together, they give teams a way to respond to internal and external incidents and work cross-functionally to support reliable services for end-users. Whether incidents are detected via monitoring tools or through technical support help desks, the business needs a cohesive incident management plan to maintain uptime and keep customers happy.

Humane On-Call Alerting Improves ITIL Service Operations

Today, software developers and sysadmins alike are waking up to critical incidents at 4 AM. They’re collaboratively taking on-call responsibilities for applications, infrastructure and networks – working together to maintain uptime and availability of service operations. However, frequent alerts can easily lead to employee burnout and actually hinder service stability.

Analyzing a Fishbone Diagram for Incident Management

Ishikawa’s fishbone diagram is a method for visualizing and analyzing nearly any problem to find the root cause of an issue. According to TechTarget, the diagram was invented by Dr. Kaoru Ishikawa, a Japanese quality control expert. The methodology can be used both proactively and retroactively to help determine the cause and effect of a current problem or the potential of future problems.

The Guide to Troubleshooting With Runbooks

Runbooks and playbooks are maintained as a standardized set of instructions for identifying and resolving incidents in IT service management (ITSM) and DevOps. On-call responders in both traditional IT and software development teams can leverage automation and runbooks to improve the speed of incident response and remediation. By surfacing useful directions and wiki pages with context earlier in the incident lifecycle, on-call teams are instantly ready to jump into action.

Making Error Monitoring Actionable With Real-Time Incident Response

This is a guest article by Freyja Spaven from Raygun – an error, crash, and performance monitoring tool for web and mobile applications. The Raygun team are experts in comprehensive application monitoring and surfacing actionable incident insights.

You spent months redesigning your on-call schedule, researched best practices, and looked at strategies from the most prominent tech leaders.