March 2021

How to configure services in Squadcast: Best practices to reduce MTTR

With the rise of digital platforms, IT infrastructure has grown exponentially more complex, to the point where multiple application interdependencies coexist with varied architectures and on-call team types. This blog looks at how you can model your infrastructure in Squadcast to reduce the time it takes to respond to and resolve incidents.

5 AIOps Trends for 2021

Recently, there has been a steep rise in the research and utilization of Artificial Intelligence (AI). While AI once seemed like nothing more than a fantasy from a sci-fi movie, AI technology is now very much a reality in our everyday lives. Artificial intelligence and machine learning are involved in many of our daily tasks, from search engines that finish your thought, to pulling up directions in Google Maps, and how your Facebook and other social feeds are so perfectly catered to your interests.

Four Ways to Reduce Patient Churn in Healthcare

Maximum patient satisfaction is achieved through an organization’s ability to provide effective and timely care. Healthcare staff realize that poor clinical care leads to dissatisfaction, frustration and, ultimately, patient churn. To reduce patient churn, hospitals must focus on what matters most: effective care team communication, collaboration and decision making. Patient loyalty and positive word of mouth ensure that an organization continues to generate revenue.

How to Analyze Incidents Better with the Right Metrics

An important SRE best practice is analyzing and learning from incidents. When an incident occurs, you shouldn’t think of it as a setback, but as an opportunity to grow. Good incident analysis involves building an incident retrospective. This document will contain everything from incident metrics to the narrative of those involved. These metrics aren’t the whole story, but they can help teams make data-driven decisions. But choosing which metrics are best to analyze can be difficult.

Optimizing Alert Policies with Dynamic Destinations

Targeted, reliable notifications are the core of any alerting solution. Blasting out emails may be good for quantity, but Enterprise Alert focuses on quality, which means notifying the right people at the right time. We often see monitoring and ticketing solutions creating an incident and then relying on the emailed recipient not only to identify and handle the incident but also to close out the ticket that is raised.

Runbooks: What They Are and Why You Need One Yesterday

Let’s talk about The Legend of Zelda: A Link to the Past, and how it relates to DevOps. The game tasks our hero with finding three pendants, which unlock a Master Sword he can use to travel to an alternate realm and ultimately take down the bad guy. The US version of this SNES masterpiece came packaged with a fairly detailed instruction manual that contained an optional guide at the end to help locate the three pendants.

SRE Thought Leader Panel: SRE Adoption as Organizational Transformation

SRE adoption can be difficult. It’s more than just new tooling; it requires a change of process and mindset as well. So how can we go about convincing our organizations that SRE is worthwhile? How can we drive this change? Learn from experts who have done this in our latest SRE Thought Leader Panel, “SRE Adoption as Organizational Transformation.” Panelists include: Kurt Andersen, SRE Architect at Blameless; Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs; Tony Hansmann, Former Global CTO at Pivotal Software, Inc.; and Chris Hendrix (Host), Staff Software Engineer at Blameless.

Adding Rich Content to Alerts, Work Orders or Service Requests

When you send alerts, work orders or service requests to your workers in the field, on the shop floor or on campus, it is essential to provide them with all the information they need to complete the task. This prevents misunderstandings, avoids wasted work and time spent searching for information, and thus increases productivity and facilitates effective, timely incident resolution.

Import and Export for OnCall Times

On-call planning is one of the most popular features in Enterprise Alert and is widely used by users, team managers and administrators. However, in our discussions we keep finding that it is not simply done with 5 minutes of planning. Scheduling often depends on external systems. This can range from a simple Excel form provided to HR all the way to a comprehensive billing system such as SAP. As a result, it takes quite a bit of time to transfer the planned shifts to third-party systems.

Why do I need to switch to Firebase?

Apple announced some time ago that the Apple Push Notification (APN) will be deactivated for sending push messages as of March 31, 2021. To ensure that push messages can still be sent to iOS devices, we have already implemented push delivery via Firebase in Enterprise Alert 2019. Unfortunately, the change could not be made automatically and requires manual intervention.

Global bank transforms incident alert management & communications

Customer Profile: One of the top 10 largest financial services companies in the world, with 200,000+ employees worldwide, serving tens of millions of customers. With operations in more than 60 countries, the Interlink Incident Alert Management app serves an audience of thousands of service owners and business stakeholders across 20+ global markets.

How to Scale for Reliability and Trust

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well. Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

What's New: Improvements to On-Call Schedule Exceptions

We’re excited to present a feature update to the OnPage platform. The new update will bring more flexibility and resiliency to a team’s on-call management workflow. With the new scheduling capabilities, OnPage system administrators can create exceptions to configured, recurring on-call schedules.

Phoenix Project: Sometimes you have to look back to look forward

It has been eight years since The Phoenix Project was published and a lot has changed since then! I started to think about what we’ve learned in that time. It starts with the theory of constraints. I still see it all the time. Organizations take actions which are merely temporary, putting out fires but not solving for the underlying causes of those fires.

Mattermost Incident Collaboration now includes improved communication, automation, and history for incident response teams

Teams are always looking for a speed advantage, and that comes from planning, crisp execution, and teamwork. To this end, we’re excited to release new enhancements to Incident Collaboration to help make life easier for DevOps teams during incident response. The Mattermost platform includes built-in Incident Playbooks with predefined response plans and task lists. Playbooks can be customized to your environment and specific use cases.

Say goodbye to guessing: Introducing Automatic Incident Triage by BigPanda

Low MTTR is the much-desired nirvana-state in IT Operations. One of the most painful parts of the incident management lifecycle, which prevents the achievement of this nirvana, is triage: the time it takes first incident responders to determine the next action when facing a barrage of IT incidents. Why?

PagerDuty for AIOps & Automation: Innovate & Automate Faster

We continue to improve our AIOps and machine learning capabilities to help customers reduce noise, quickly identify root cause, and automate the resolution of critical, business-impacting issues. This will help organizations further increase cost savings, reduce mean time to resolution (MTTR), and preserve people hours. The following capabilities empower responders to gain control, deliver critical context for faster root cause identification, assess impact, and automate actions with minimal configuration.

PagerDuty Enterprise Collab & Communication, Cloud Migration, & Customer Service: New Integrations

We continue expanding our ecosystem of native integrations to help teams bridge the communication gap between customer service and engineering teams, embrace full-service ownership, and better manage cloud migration initiatives.

IT Incident Response is Improved with a Corporate Status Page

To understand the impact that stovepipes have on incident response, one need look no further than the 9/11 terrorist attacks that occurred in the United States. The CIA, DoD, and FBI all knew about the Al Qaeda terror threats before the planes hit the World Trade Center, but the 9/11 Commission found that a lack of data and intelligence sharing among the agencies limited each agency’s understanding of the looming terrorist threat, thereby limiting their incident response.

How to Analyze Contributing Factors Blamelessly

SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation. Learning everything you can from incidents is a challenge.

Introduction to on-call schedules

An on-call schedule tells you and everyone in the team who will be the first responder when an issue happens in production. The on-call team member is responsible for investigating the issue, either fixing the issue herself or adding other people who can help fix it. Having an on-call schedule is important for building reliable systems because making someone responsible for production issues makes sure that they're not ignored.
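
As a rough illustration of the idea above, an on-call schedule can be thought of as a function from a date to a first responder. The sketch below assumes a fixed weekly rotation; the roster names, the start date and the `on_call_for` helper are all hypothetical and not taken from any particular product.

```python
from datetime import date

# Hypothetical roster and rotation start, for illustration only.
ROSTER = ["alice", "bob", "carol"]
ROTATION_START = date(2021, 3, 1)  # a Monday
SHIFT_DAYS = 7                     # each person is on call for one week


def on_call_for(day: date) -> str:
    """Return the first responder for `day` under a fixed weekly rotation."""
    if day < ROTATION_START:
        raise ValueError("day is before the rotation started")
    # Count how many whole shifts have elapsed, then wrap around the roster.
    shifts_elapsed = (day - ROTATION_START).days // SHIFT_DAYS
    return ROSTER[shifts_elapsed % len(ROSTER)]
```

Real schedules usually also need overrides, time zones and escalation layers, which this sketch deliberately leaves out.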

How to get mobile push notifications from any service

Love 'em or hate 'em, mobile push notifications can be very useful. They are not as intrusive as a phone call and have better information formats and control than text messages. Which is why it can be very frustrating to not get push notifications for your favorite product because it doesn't have a mobile app. In this post, we will see how to get mobile push notifications from any service, even if it doesn't have a mobile app.

What's New: Updates to Event Intelligence, Compliance and Reporting, and More!

We’re excited to announce a new set of updates and enhancements to the PagerDuty platform! These updates are designed to help organizations accelerate cloud migration, provide premium levels of customer service, streamline collaboration and communication, and deliver a seamless customer experience in the moments that matter most.

How to speed up incidents with a lot of cooks in the kitchen

In one of our recent webinars we discussed a substantial challenge IT Ops teams face in today’s complex IT environments: defining and clearly communicating incident/operational roles and processes, in an effort to create a well-coordinated incident management lifecycle. This lifecycle is essential for restoring service as quickly as possible when disruptions occur. Following are the highlights of that discussion, also recently published in an ApmDigest article.

9 Barriers to DevOps Implementation

The DevOps model unites development and IT operations to create a powerful organizational culture to achieve business goals more efficiently. Formerly siloed teams can now collaborate continuously to build more robust products, with increased confidence, and achieve business goals faster. The model has the power to transform operations, but there are barriers to DevOps that must be overcome first.

Why Your APIs Should Fly First Class

Picture yourself flying first class. You board the plane first, you get champagne, and you feel as though you’re the most important. Why not treat your APIs the same way? In this talk, FireHydrant CEO and Co-Founder, Robert Ross (a.k.a @bobbytables) shares why putting your APIs first can be a game-changer for your business and how this mindset shaped the way FireHydrant was built.

How to Build an SRE Team with a Growth Mindset

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow, and encourage this mindset across the organization.

How to get mobile push notifications from Spike.sh

When an issue happens in your software in production, the channel to send the alert on depends on multiple factors. If it's a critical issue requiring immediate attention, you should alert the team member via phone call. But not all issues require a phone call, and in fact it may become annoying if your phone keeps ringing for minor issues. This is where other channels like SMS, Slack and mobile push notifications come in.
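
The routing idea described above can be sketched as a small lookup from severity to notification channels. The severity names, channel names and the `pick_channels` helper below are assumptions made for illustration; they do not reflect how Spike.sh or any other tool is actually configured.

```python
# Hypothetical severity-to-channel mapping, for illustration only.
ROUTES = {
    "critical": ["phone_call"],   # needs immediate attention
    "major": ["sms", "push"],     # urgent, but not worth a ringing phone
    "minor": ["slack"],           # can wait until someone checks the channel
}


def pick_channels(severity: str) -> list[str]:
    """Return the channels to alert on for a given severity.

    Unknown severities fall back to the loudest channel, on the assumption
    that a missed critical alert costs more than an unnecessary call.
    """
    return list(ROUTES.get(severity.lower(), ROUTES["critical"]))
```

For example, `pick_channels("minor")` returns only the Slack channel, so minor issues never ring a phone.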

Alert Fatigue and Your Health

As an on-call engineer, you might deal with the day-in, day-out occurrence of alerts. These alerts may come from your alerting provider (PagerDuty, OpsGenie, etc.), Slack notifications telling you the site is down, or the ever-concerning text message "Hey, is the site down?". These alerts elicit reactions that range from "shit" to "again?" and, in many cases, both.

How We Built and Use Runbook Documentation at Blameless

Even if you don’t notice, you are executing runbooks every day, all the time. When you have an incident in your day-to-day operations, you follow a series of ordered and connected steps to solve it. For instance, if you lose your internet connection, you will follow a series of steps to resolve that issue. The steps could differ depending on your method, but you get the idea.

6 Automations to Accelerate IT Operations

The role of IT teams continues to expand and evolve as digital transformation accelerates. Technologies such as cloud, virtualization, edge computing, microservices, and containers have now entered a phase of mass adoption and are being implemented at unprecedented rates while staffing has remained flat for most IT teams. Overburdened IT organizations are struggling to keep up with the scale of their infrastructure and the diversity of the technologies they support.

IT Trends You Don't Want to Miss

The COVID pandemic has redefined the workplace and accelerated the process of digitization for many. Organizations are migrating to systems that are flexible, distributed and resilient. Per Gartner, IT spending will reach $3.9 trillion worldwide in 2021. IT teams will be channeling investments into enterprise software as remote work becomes essential. Systems that support remote work will see a growth of 8.8 percent this year.

Why we went passwordless on our new product

Passwords are dying. The cost of creating and maintaining passwords is becoming untenable, as can be seen in the rise of users logging in with social products and developers outsourcing their pain to Auth0 and the like. We decided to sidestep password-based authentication and went passwordless on our new product. Read on to see how you can go passwordless too.

Using OnPage to Deliver Exceptional Customer Support

The OnPage Customer Support team consists of knowledgeable, friendly technicians that offer 24/7 assistance. Support recognizes the importance of client relationships and always aims to achieve maximum customer satisfaction. The OnPage incident management system is at the center of Support’s quality service delivery. OnPage triggers instant, critical mobile alerts to technicians whenever customer-initiated tickets are created.

Introducing Incident Timer

We’re excited to announce Incident Timer - a “days without an incident” timer for software teams to keep track of major engineering incidents. As the people behind Spike.sh, we keep discussing how to build a culture of reliability with our customers. We loved the idea of safety/accident timers in factories which kept track of major accidents. It's a simple and elegant way to keep safety on everybody’s minds.
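
At its core, a "days without an incident" counter is a date difference. The sketch below is a minimal illustration under that assumption; the `days_without_incident` function is hypothetical and is not the Incident Timer implementation.

```python
from datetime import datetime, timezone


def days_without_incident(last_incident: datetime, now: datetime) -> int:
    """Return the number of whole days elapsed since the last major incident."""
    if now < last_incident:
        raise ValueError("last_incident is in the future")
    # timedelta.days counts only complete 24-hour periods.
    return (now - last_incident).days


# Example usage with made-up timestamps (timezone-aware, in UTC).
last = datetime(2021, 3, 1, 9, 30, tzinfo=timezone.utc)
today = datetime(2021, 3, 15, 12, 0, tzinfo=timezone.utc)
streak = days_without_incident(last, today)
```

Resetting the timer then amounts to recording a new `last_incident` timestamp whenever a major incident is declared.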

What is DevOps?

DevOps is a term for a cluster of concepts that has become a movement, “a cross-disciplinary practice dedicated to the study of building, evolving and operating, rapidly-changing resilient systems at scale.” (Jez Humble) The definition of DevOps is not agreed upon by everyone because of the complex processes attached to the term; however, the benefits to teams are universally agreed upon.

SRE as Organizational Transformation: Lessons from Activist Organizers

In the software industry’s recent past, the biggest disruptive wave was Agile methodologies. While Site Reliability Engineering is still early in its adoption, those of us who experienced the disruptive transformation of Agile see the writing on the wall: SRE will impact everyone. Any kind of major transformation like this requires a change in culture, which is a catch-all term for changing people’s principles and behaviors.

Accelerate your logs investigations with Watchdog Insights

If you’re investigating an incident, every minute means degraded performance or even downtime for customers. The causes of an issue often come from parts of your systems and applications that you would not think to check, and the sooner you can bring these to light, the better.

SRE2AUX: How Flight Controllers were the first SREs

In the beginning, there were flight controllers. These were a strange breed. In the early days of the US Manned Space Program, most American households, regardless of class or race, knew the names of the astronauts. John Glenn, Alan Shepard, Neil Armstrong. The manned space program was a unifying force of national pride. But no one knew the names of the anonymous men and, later, women who got the astronauts to orbit, to the moon, and most importantly, got them back to earth.

6 incident management hacks to implement using ServiceDesk Plus

Ever wondered how enterprises like Zoho, with over 50 SaaS applications and more than 180,000 customers, handle the spectrum of IT incidents they face? Download this free e-book now to get an insider look into the incident response and management processes that Zoho has perfected over the years.

6 incident management hacks to implement using ServiceDesk Plus Cloud

Ever wondered how enterprises like Zoho, with over 50 SaaS applications and more than 180,000 customers, handle the spectrum of IT incidents they face? Download this free e-book now to get an insider look into the incident response and management processes that Zoho has perfected over the years.

What Our Customers Say About the PagerDuty Platform

As noted in this blog a couple of weeks ago, we recently commissioned IDC to interview PagerDuty customers to quantify the business value they gain from our platform. The study found that, on average, the 14 PagerDuty customers interviewed gained annual benefits of $3.48 million, a three-year ROI of 795%, and a payback period of just over two months.