May 2023

How to build organizational resilience: six proven steps

In today’s world, where natural disasters, terrorist threats, and cyberattacks are becoming increasingly common, business leaders must prioritize building resilience to ensure the long-term success of their organizations. However, the ways an organization can adapt, recover, and thrive in the face of adversity are often unclear.

Reduce MTTR and Address the Talent Gap with Logz.io Alert Recommendations

When our CEO and co-founder Tomer Levy delivered his “Observability is Broken” presentation at last year’s AWS re:Invent, he highlighted numerous challenges faced by today’s organizations as they seek to advance their observability practices. Of the six individual points that he noted, two specifically dealt with the current shortage of available engineering expertise, with another two focused on data overload.

Use incident cycle time to optimize your incident response process

Although the causes and solutions for incidents vary widely, most incidents follow a similar timeline from declaration to resolution. We call the time it takes to move from one phase or milestone of an incident to the next “cycle time.”
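As a rough illustration of the idea, cycle times can be computed as the gaps between consecutive milestone timestamps. This is only a sketch: the milestone names and timestamps below are hypothetical and do not come from the original post.

```python
from datetime import datetime, timedelta


def cycle_times(milestones):
    """Return the elapsed time between each pair of consecutive milestones.

    `milestones` is a list of (name, timestamp) tuples in chronological order.
    """
    result = {}
    for (prev_name, prev_ts), (next_name, next_ts) in zip(milestones, milestones[1:]):
        result[f"{prev_name} -> {next_name}"] = next_ts - prev_ts
    return result


# Hypothetical incident timeline; the milestone names are assumptions.
incident = [
    ("declared", datetime(2023, 5, 1, 9, 0)),
    ("acknowledged", datetime(2023, 5, 1, 9, 4)),
    ("mitigated", datetime(2023, 5, 1, 9, 40)),
    ("resolved", datetime(2023, 5, 1, 10, 15)),
]

for phase, elapsed in cycle_times(incident).items():
    print(f"{phase}: {elapsed}")
```

Comparing these per-phase durations across many incidents is one way to see which stage of the response tends to take the longest.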

SIGNL4 Onboarding: 3rd Party Integration: Webhook & Email

The SIGNL4 Onboarding series walks users through the processes of SIGNL4, from signup to alerts to settings. Today's video focuses on scheduling users for duty shifts. Learn how to create an app inside SIGNL4 to receive events from third-party systems, and then use those events to create alerts. This video is packed with helpful tips to help you get the most out of your account.

Getting started with Squadcast's On-Call Scheduling

We understand that everyone values a simple and straightforward approach when it comes to setting up schedules. We at Squadcast are fully aware of the difficulties involved in creating an on-call schedule from scratch or migrating it to a new platform. Hence we have come up with a blog to assist you in seamlessly setting up your on-call schedule using Squadcast. Our goal is to provide guidance and support to make the process as effortless as possible for you.

Prometheus Blackbox Exporter: Guide & Tutorial

Prometheus is a favored open-source monitoring system that collects, stores, and queries metrics from various sources. In Prometheus, an exporter is a component that collects and exposes metrics in a format Prometheus can scrape. The Prometheus Blackbox Exporter is designed to monitor “black box” systems with internal workings that are not accessible by Prometheus. It sends HTTP, TCP, and ICMP requests to the external systems and measures their response times and statuses.
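To make the black-box idea concrete, here is a minimal sketch of a TCP probe that records only what an outside observer can see: whether the connection succeeded and how long it took. It is not the Blackbox Exporter itself. The function name `probe_tcp` and the target host are hypothetical, and the printed labels merely echo the exporter's `probe_success` and `probe_duration_seconds` metric names.

```python
import socket
import time


def probe_tcp(host, port, timeout=2.0):
    """Attempt a TCP connection and return (success, duration_seconds)."""
    start = time.monotonic()
    try:
        # The connection is closed as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout):
            success = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        success = False
    return success, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical target; replace with a host and port you are allowed to probe.
    ok, seconds = probe_tcp("example.com", 443)
    print(f"probe_success={int(ok)} probe_duration_seconds={seconds:.3f}")
```

The probe needs no access to the target's internals, which is the property that distinguishes black-box monitoring from scraping metrics the application exposes about itself.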

Data Shows Outage Time & Costs are Increasing - 3 Solutions You Should Consider

The Uptime Institute recently released its Annual Outage Analysis 2023 report. Overall, the report highlights the increasing costs, frequency, and duration of outages, the prominent role of cloud and digital services in outages, the shortcomings of service providers, and the need to address human error and management failures. It also underscores the ongoing challenges of handling failures in complex distributed architectures.

10 Incident Management Best Practices

Before we dive into the nitty-gritty of incident management, let’s look a bit closer at the actual meaning of ‘incident.’ In the world of IT service management, the official definition for ‘incident’ is an “unplanned interruption to an IT service or reduction in the quality of an IT service.” Whether that means a slowdown in response time or a total system crash, you’re looking at an incident.

The Swedbank Outage shows that Change Controls don't work

This week I’ve been reading through the recent judgment from the Swedish FSA on the Swedbank outage. If you’re unfamiliar with this story, Swedbank had a major outage in April 2022 that was caused by an unapproved change to their IT systems. It temporarily left nearly a million customers with incorrect balances, many of whom were unable to meet payments.

Hello World

It feels great writing this. It's hard to believe that we have been working on Spike.sh full-time for 3 years now. It's been the most rewarding experience of my life. A big thank you to all of our users and your constant feedback, which has only made Spike.sh better month on month. Over the years, we have always kept our heads down and built. During this entire process, we have learned a great deal about incidents and how they are managed.

Debug State Capture for Traditional Infrastructure & Apps

In our previous blogs on Capturing Application State and using Ephemeral Containers for Debugging Kubernetes, we discussed the value of being able to deploy specific tools to gather diagnostics for later analysis, while also providing the responder to the incident the means to resolve infrastructure or application issues.

5 Immediate Business Benefits of Leveraging Domain-Agnostic AIOps

Legacy systems and point solutions are part of any business. And while they have their history and benefits, it’s critical to find a balance for your organization. IT teams have been acclimated to disparate event management and monitoring tools. Now, with massive and rapidly increasing data flow, this disconnect is slowing and paralyzing IT teams.

The Ultimate Guide to Automating and Mobilizing Your SecOps Processes with Derdack SIGNL4 and Microsoft Sentinel

The threat and security landscape is becoming increasingly cluttered. As incidents increase, so do alerts and notifications, leading to too many alerts and too few hours to address them. Many businesses work remotely, and with ever-present smartphones, we are always on the go. It is essential that security teams receive and prioritize meaningful threats, but that task is easier said than done.

Updating Your Tools for API Scopes

The PagerDuty REST API provides 200+ endpoints for users to programmatically access objects and workflows in the PagerDuty platform. Teams leverage these APIs to streamline creating and managing users, teams, services and other components for their environment. Up until now, access to the REST API has been authorized and authenticated via API Keys.

Sponsored Post

How Runbook Automation can Simplify CloudOps Use

Organizations in every industry continue their transition to cloud services, and while this may be a step forward in general, it does bring with it its own unique set of challenges. Cloud use, and in particular CloudOps, relies on a complex and intricate infrastructure which is difficult to manage and maintain, and it's a critical part of keeping a business's networks functioning. This makes finding a way to simplify the use of CloudOps a top priority for many businesses, but does a solution exist?

Exploring Key Concepts of Site Reliability Engineering (SRE)

Site Reliability Engineering is a process of automating IT infrastructure functions, including system management and application monitoring using software tools. It is used by businesses to guarantee that their software applications are reliable even when they receive frequent upgrades from development teams. SRE allows engineers or operations teams to automate the activities that are traditionally performed by operations teams manually to manage production systems and handle issues.

Why an incident response plan is a security must-have for every organization

“By failing to prepare, you are preparing to fail. Preparation prior to a breach is critical to reducing recovery time and costs.” (RSAConference) For 83% of companies, a cyber incident is just a matter of time (IBM). And when it does happen, it will cost the organization millions, coming in at a global average of $4.35 million per breach. The damage isn’t only financial, nor solely related to customer loyalty and brand equity.

PagerDuty Launches New Innovations to Reduce Tool Sprawl and Optimize Operations

The number of tools used by distributed teams to manage incidents has multiplied over the years, leading to a valley of tool sprawl. Throw in manual processes and you’ve got too much toil and multiple points of failure. Maintaining disparate tools and systems isn’t just unwieldy, it’s expensive. Our latest capabilities add to the PagerDuty Operations Cloud to make it easier than ever for teams to consolidate their incident management stack.

Our social resurgence: activating our social media presence to revamp Incident Management

Over the past year, Spike.sh social media activity has been null. As a bunch of shy nerds in a small team working remotely across the world, we really never bothered with social media and our presence on it. We always kept our heads low and maneuvered around it. But no more. As of today, we are coming back on social media channels like LinkedIn, Twitter, and Reddit as well.

Learn How PagerDuty Customers Save Money and Achieve Fast ROI

Saving time and money is always important, but these days, it’s a mission-critical business imperative. At PagerDuty, we help organizations realize transformational gains in efficiency that drive both immediate financial impact and long-term business success. PagerDuty delivers clear value for any organization at any stage of operational maturity. But you don’t have to take our word for it – the real-life experiences of our customers speak volumes.

How Helpdesks Facilitate Major Incident Management

Helpdesks serve as the initial line of defense for IT incidents, responsible for facilitating incident management, including logging, categorizing, and prioritizing incidents. In the event of a major incident, the helpdesk plays a crucial role in escalating the incident to the appropriate major incident management (MIM) team. The success of this process relies on the expertise of the helpdesk staff in providing situational context to expedite resolution.

Building A DevTools SaaS Company Today! Incidentally Reliable Podcast | Zenduty

Catch Rajesh Tilwani talking about Building a DevTools SaaS Company, and everything reliability only on the Incidentally Reliable Podcast, live now on all major platforms! About Zenduty: Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle. With the Zenduty API, you can supplement and deploy Zenduty in sync with other tools and services, allowing you to create and update incidents, users, teams, services, integrations, schedules etc. and automate your workflows using simple scripts.

Establishing Zero Trust out of the box at Enterprise scale

At most enterprises, CIOs are already multiple waves into enforcing Zero Trust policy across their processes, configurations and teams. As a DevOps Lead, being responsible for juggling user empowerment and adherence to your executive’s policy across many SaaS tools can be tricky. This problem is especially challenging in incident management where highly sensitive data is being shared, incidents rely on multiple different types of team members, and response teams fluctuate from incident to incident.

The fastest and most robust path to incident declaration from monitoring tools

Here’s a crazy question: why do we still require a human to manually declare an incident for the things that we know are incidents? If we have enough confidence to build SLOs and high-severity alert routes for these specific scenarios, why are we still asking a human to confirm it’s an incident and get the assembly process in motion? Isn’t that just another button to push when we could be problem solving instead?

Insights into Observability Tools: Commercial vs. Open-Source

Observability has become a critical aspect of modern software development and operations, allowing organizations to gain insights into the health and performance of their applications and systems. One of the key decisions when implementing observability is choosing between commercial or open-source tools. We spoke to several professionals who shared their experiences and insights on this topic, shedding light on the pros and cons of each approach.

Process Automation v4.12.0 and v4.13.0 Release Notes

Product Managers Jake Cohen and Forrest Evans are back to update us on what’s new in the 4.12.0 and 4.13.0 releases of PagerDuty Process Automation. New in these releases are features to support Kubernetes automation, managing resources in multiple AWS accounts, and a new plugin suite for Sensu.

Major Incident Management with Zenduty, Grafana, Slack and Zendesk

In the current fast-paced world, businesses are seeking methods to increase their efficiency and simplify their processes. But there are times when teams are unaware of an issue at the initial stage, leading to a bad customer experience. For example, suppose you are part of the infrastructure team, where your primary responsibility is to check resources and notify others when they reach their maximum capacity. Let's say that, due to an anomalous traffic load, CPU utilization on a resource goes above 90%.
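The CPU scenario above can be sketched as a simple threshold check. This is only an illustration under stated assumptions: the 90% threshold comes from the example, while the function name `over_threshold`, the requirement of three consecutive samples, and the sample values are hypothetical and not part of any Zenduty or Grafana API.

```python
CPU_ALERT_THRESHOLD = 90.0  # percent, taken from the example in the text


def over_threshold(samples, threshold=CPU_ALERT_THRESHOLD, min_consecutive=3):
    """Return True if the last `min_consecutive` samples all exceed `threshold`.

    Requiring several consecutive samples is an assumed design choice that
    avoids alerting on a single momentary spike.
    """
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


# Hypothetical CPU utilization readings, in percent, oldest first.
readings = [62.0, 71.5, 91.2, 93.8, 95.1]

if over_threshold(readings):
    print("ALERT: CPU utilization has stayed above 90%")
```

In a real setup the monitoring tool would typically evaluate a rule like this and forward the alert to the incident management platform, rather than relying on a person to watch the resource.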

7 Types of Incident Response Tools

Incident response tools are software applications or platforms designed to assist security teams in identifying, managing, and resolving cybersecurity incidents. Incident response is a crucial part of an organization’s cybersecurity strategy, making it possible to detect threats, analyze vulnerabilities, respond to attacks, and recover from security breaches. Incident response tools are vital for safeguarding organizations against evolving cyber threats.

Welcome To xMatters - Ep 2 - Organizing Your Teams

Even the most gifted and powerful people could do with a helping hand now and again. Thankfully, they are not alone in the multiverse! xMatters makes organizing your teams and creating a customized on-call schedule work as if by magic. This way, when help is urgently needed, the appropriate on-call individual will quickly join the team to save the day. To learn more about organizing your teams with xMatters, check out our tutorial videos on how to get started.

How Sony Interactive Entertainment drives better IT operations based on alert data

Sony Interactive Entertainment (SIE) is a multinational video game and digital entertainment company owned by global conglomerate Sony. SIE primarily operates the PlayStation brand of video game consoles and products.

Insights on Hiring Engineers with Different Tech Stacks

In the world of software engineering, the choice of programming languages, frameworks, and technologies is constantly evolving. As a result, hiring engineers who have experience in different tech stacks has become a common practice for many companies. However, this practice also raises questions and concerns about the potential challenges and advantages of hiring engineers who work in predominantly different stacks.

Learning from incidents is not the goal

Learning from incidents has become something of a hot topic within the software industry, and for good reason. Analyzing mistakes and mishaps can help organizations avoid similar issues in the future, leading to improved operations and increased safety. But too often we treat learning from incidents as the end goal, rather than a means to achieving greater business success. The goal is not for our organisations to learn from incidents: it’s for them to be better, more successful businesses.

Status page best practices

Although some organizations may hesitate to publicly announce when they have an incident — afraid that acknowledging outages will scare customers away — the opposite is often true. When you proactively communicate with your customers, even during bad times, you have the opportunity to not only build trust but also buy grace during the incident.

Admin Panel - Location Settings - xMatters Support

In xMatters, sites and region settings represent physical locations like street addresses or geographic coordinates. Every user in the system belongs to a single site, which controls some default settings on their profile page, such as their language and time zone. Let’s take a dive into xMatters location settings.

Our Opsgenie integration is now available

When we detect a problem with your site, we can notify you via mail, a Slack message, a webhook, or any of our other notification channels. For most of our users this is enough, but those who work in larger teams often need more flexibility. Today, we are launching our integration with Opsgenie, a modern incident management platform.

Automate your DevOps processes, and let go (a little)

As the demand for instant innovation and real-time delivery of mission-critical processes continues to grow, your organization risks falling behind if it can’t adapt to an automation-centric strategy. To be successful, managers have to loosen the reins and enable teams to automate their DevOps processes. Automating DevOps processes isn’t an all-or-nothing decision, and implementing automation processes slowly can let teams adapt to the changing environment and let go, little by little.

Squadcast's Improved Slack (V2) Integration | Better Collaboration & Incident Management | Squadcast

This video will give you an overview of the latest improvements supported by the Squadcast-Slack integration, which we hope will help in better collaboration and Incident Management.

Why Incident Management is an Essential Part of Risk Management

In any operation or activity, unforeseen happenings can derail progress. The job of a good manager is to try their best to make the hitherto unforeseen visible and planned for. It’s all too easy to find yourself reacting to occurrences that can throw you and the company into turmoil, with frantic fixing on the back foot being the result. The best managers can make it look like they don’t do much.

See Global Event Orchestration End-to-End

Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities. Check out this live demo from Principal Product Manager Frank Emery.

Assembly time is where you have the most control of an incident

The FDNY EMS Command responds to more than 4,000 calls per day. They range from car accidents to building fires to cats stuck in trees, and responses vary accordingly. Sometimes they might take hours, sometimes they take just a few minutes. With such unpredictable conditions, the FDNY focuses on improving what they call “response time.” That’s the amount of time between a 911 call being made and emergency responders arriving on the scene. This might sound familiar.

Trust shouldn't start at zero

How often have you heard the phrase “trust is earned” in life? While well-meaning, I think this can actually lead to some strange behaviour at work, especially when you’re on a fast growing team. Startups experience a lot of chaos and unknowns your teams need to navigate, so it’s vital to know you can trust the people around you. As you grow, how you set expectations around trust as people join your team can impact your ability to hire, onboard, ship and ultimately, survive.

How to Manage Customer Support Channels in Slack: A Step-by-Step Plan

As more and more teams transition to remote work, collaboration tools like Slack have become increasingly popular. Slack's chat-based communication platform makes it easy to keep teams connected and informed, but it can also create challenges when it comes to managing support channels. In this post, we'll explore different approaches to building a Slack-based support system and provide some tips for success.

10 Mistakes to avoid when framing your IT Incident Management Strategy

An IT incident is an unplanned disruption that negatively impacts an IT service. As the importance of IT to the business has increased, the impact of IT incidents has become greater. IT incidents can result in revenue loss, loss of employee productivity, SLA financial penalties, government fines, and more. An effective IT incident management strategy is now essential in every organization. For a business like Amazon whose entire business relies on IT, a single second of slowness can cost over $15,000.

Four steps for organizations to proactively address chronic hazards

Global climate change continues to have a profound impact on businesses worldwide, with chronic hazards such as flooding, wildfires, and extreme weather conditions posing a significant risk to industries. As organizations continue to operate in an increasingly interconnected world, they face a growing range of challenges. One such challenge is the impact of chronic hazards on their operations.

How to get started with incident management metrics

Tracking incident metrics can help you discover patterns in the causes and costs of incidents and help you understand brittle parts of your organization. We've seen them help teams zero in on specific problem areas. But it can be intimidating to get started. Do you really need metrics if you're a small team or just beginning to formalize your incident management program? I say yes. The key is to start with something manageable and grow.

How Abbott transformed its incident management process with Workflow Automation

Eliminating errors and streamlining the incident management process are top priorities for many ITOps, NOC, SRE, and DevOps teams. With organizations using multiple tools in their IT stack, manually finding the right information at the right time becomes crucial during incident triage. By automating tasks and workflows, businesses can eliminate manual tasks that are time-consuming, repetitive, and prone to mistakes.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common, concrete example of this is an application running in a container and the container is redeployed—perhaps to a prior version or the same version—simply to solve the immediate issue.

The Rise of ServiceOps: Unifying IT Service Delivery

With the complex and steadfast growth of IT service delivery processes, organizations and their internal teams have come to rely on several tools in their toolbox to deliver best-in-class products and services. The use of AIOps, AI/ML, and overall automation has shaped modern delivery methods, but what we call this process, and how we grow to advance it, has yet to find a definition that’s universally recognized.

Reflecting on one of the biggest incidents in our history

We have to come clean. During KubeCon, we experienced an incident that we weren’t ready to discuss until now. This incident caused quite a disruption and, had it been left unresolved, would have had a massive snowball effect. At the time, we didn’t want to raise any alarms, so we kept it quiet while our team rallied to resolve it. And to be honest, most folks probably didn’t even realize that it happened since we moved so quickly.

It's time to rethink the way you do external comms

April was a month to remember at incident.io. Not only did we attend our second conference ever with KubeCon in Amsterdam, but we also very subtly released our brand-new Status Pages product. OK, it probably wasn't subtle. Both moments required months of preparation, feedback loops, iteration, and so much more behind-the-scenes work to get right. So if you ran into us at KubeCon, thank you for stopping by and meeting with our team.

Mastering IT Response Time

In today’s fast-paced digital landscape, businesses heavily rely on their IT departments to ensure smooth operations and deliver exceptional customer experiences. When it comes to IT support, one critical metric stands out: response time. A prompt and efficient response can be the difference between a satisfied customer and a frustrated one. In this blog post, we will explore strategies to improve IT response times, enhance customer satisfaction, and optimize overall productivity.