June 2022

Sponsored Post

Top Five Pitfalls of On-Call Scheduling

On-call schedules ensure that there's someone available day and night to fix or escalate any issues that arise. Using an on-call schedule helps keep things running smoothly. These on-call workers can be anyone from nurses and doctors required to respond to emergencies to IT and software engineering staff who need to fix service outages or significant bugs. Being on-call can be challenging and stressful. But with the proper practices in place, on-call schedules can fit well into an employee's work-life balance while still meeting the organization's needs.

Why More Incidents Are Better

Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be “zero.” After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing. Reducing actual incidents by as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number one enemy.

Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend

The past few years have led to fundamental business and cultural shifts for both companies and employees. Covid-19 has brought opportunities for companies who invested early in digital operations, while others struggled to maintain the status quo. The latter gave rise to record employee burnout, and what is now commonly referred to as the Great Resignation.

3 mistakes I've made at the beginning of an incident (and how not to make them)

The first few minutes of an incident are often the hardest. Tension and adrenaline levels are high, and if you don’t have a well-documented incident management plan in place, mistakes are inevitable. It was actually the years I spent managing incidents without the right tools in those high-tension moments that inspired me to build FireHydrant. I built the tool I wished I’d had when I was trying to move fast at the start of incidents.

Better Data for Public Health: How Nexleaf and PagerDuty are Monitoring Healthcare

Having a reliable power source is something many of us take for granted. It is particularly important for healthcare facilities to have a consistent, reliable power source so that care for vulnerable patients – specifically those who rely on electricity to sustain their lives – is not disrupted. In rural Sub-Saharan Africa, however, it’s estimated that only about 28% of hospitals have reliable electricity.

What It Means to Be an Incident Commander

Leadership is essential in an organization. Establishing a leadership hierarchy helps teams avoid getting confused about who to turn to with questions and concerns, allowing them to focus their efforts where needed. High-quality leadership is vital to success but becomes even more important when the pressure to resolve an issue with minimal downtime is turned up.

Engineering Manager from a non-STEM background?

There is a long list of requirements a hiring manager looks at before hiring an Engineering Manager. There needs to be a balance between technical and leadership skills to perform well in the position. Engineering Manager roles differ from company to company, so it is hard to list what a day in an engineering manager’s life looks like.

Uncovering the mysteries of on-call

For the vast majority of organisations, some form of round-the-clock cover is critical to successful business operations. On-call is an essential part of an effective incident response process, yet there is no commonly accepted playbook on how to most effectively structure and compensate on-callers. We ran a survey to uncover the mysteries of how on-call works in organisations of different shapes and sizes around the world.

Everbridge Live: Future Proofing Physical Security Management

Everbridge Control Center correlates events from disparate safety and security systems into a common operating picture to focus people’s attention on what really matters. The platform provides users with actionable alerts, next step actions, and automated reporting to better manage risks, ensure compliance with operating procedures, and support your business continuity. Automated workflows ensure rapid, consistent responses, reducing the risk of human error. It also facilitates device activation to ensure you are always in operational control and protecting your people. Dynamic reports and dashboards provide real-time actionable insights for your operations teams and senior executives.

What is Live Call Routing?

If there’s one essential thing we’ve learned from being in the business of digital operations for more than 13 years, it’s that every business has a unique approach to building resilience with its bespoke tech stacks and processes. Many PagerDuty customers around the world are starting to provide direct access to their on-call teams with Live Call Routing (LCR).

xMatters Service Intelligence Keeps Your Services Running!

Organizations spend heavily on digital services and business applications, with the expectation they deliver reliable value streams. When an issue occurs, the fear of losing revenue, damaging customer relationships, and upsetting employees can put a tremendous amount of pressure on incident resolvers. With xMatters Service Intelligence, organizations can visualize incidents in real-time, gain greater insight into their root cause, and remediate issues faster with service-centric automation.

Sponsored Post

Best Practices for Communicating with Customers During an Outage

Incidents are unavoidable when running a business. When an incident does inevitably occur, communication is critical while your teams are working to minimize the impact and expedite a solution. For technical resolvers, the first steps during an incident are to look for any leads that point to the source of the issue. Customer service and communications teams, however, must prioritize establishing effective communication with impacted users. Both teams have the right frame of mind, but they need to be aligned. This becomes more complicated when such an incident is an outage.

What is an incident, how to handle it, and tips for good incident management

Customer retention is critical. Studies show that acquiring a new customer is five to 25 times more expensive than retaining an existing one. On top of this, a marginal increase in customer retention can yield increases in revenue up to 95%. Customers spend a lot of time interacting with businesses online and their user experience can have a major impact on how they view a company. One bad user experience can send a customer into the arms of a company's competitor.

Lightstep Notebooks helps speed troubleshooting for SREs and developers

Digital business is an imperative for 21st-century companies. Increasingly, organizations are directing investments toward technologies that deliver outcomes fast and enable more resilient digital business models. In this landscape, incidents such as software bugs, power outages, or downed networks have major consequences that affect both revenue and customer loyalty.

June releases: discover a faster and more intuitive FireHydrant

It’s been a busy month at FireHydrant. We’ve had our heads down shipping loads of improvements across the platform, and I want to take you on a quick tour of the changes. At the core of all these updates is a common theme: things are now a heck of a lot more intuitive. There’s a lot to digest here; read the full roundup of June releases below or follow us on Twitter for a bite-size demo each day this week.

New Features: Custom Hold Music for Call Routing, Conditional Alert Actions, Company-wide Private Status Pages, MFA

This post highlights some of the features and improvements that we have released in the last 3 months. If you want to submit your own ideas or vote on existing feature requests, you can now use our new public roadmap at roadmap.ilert.com.

Know Instantly When Kubernetes Violations Occur - Your First PagerDuty and Shipa Alert

Imagine having the ability to instantly know when a Kubernetes compliance or security violation occurs. Now you can with Shipa Insights. Coupling Shipa Insights with the robust notification and alerting capabilities of PagerDuty makes this possible. Shipa can send fine-grained events to external systems such as PagerDuty, and with Shipa Insights you can alert on policy violations. Let’s take a look at getting started.

Words matter: incident management versus incident response

I recently published a couple of blog posts about what happens when you invest in a thoughtful incident management strategy and three first steps to take to do so. What I’m getting at in these posts is that we need a shift toward proactivity in the software operators community. I’d wager most of the world is responding to incidents as they happen, and nothing more.

Developing a Data Breach Incident Response Plan

With cybersecurity boundaries going beyond the traditional walls of an office and attack surfaces constantly expanding, data breaches are inevitable. Managing risks from data breaches requires organizations to develop a comprehensive incident response plan – an established guideline that facilitates incident detection, response and containment, and empowers cybersecurity analysts to secure a company’s digital assets.

How to Standardize Service Ownership at Scale for Improved Incident Response

Service ownership is a DevOps best practice where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle. This level of ownership brings development teams much closer to their customers, the business, and the value being delivered. Service owners are the subject matter experts (SMEs) for their services – and in a service ownership model, they are also responsible for responding to any production issues.

Product Roundup: New Blameless Features in June 2022

Summer means things are heating up. And things are definitely heating up at Blameless! We’ve been hard at work delivering new features and capabilities to our customers, so today I wanted to share a quick summary of all the latest. Here are 4 exciting product updates that enhance the way teams manage incidents and deliver reliable products to their customers.

A Day in the Life with PagerDuty (2022)

Learn about the day in the life with the PagerDuty Operations Cloud to be ready for anything in a world of digital everything. Watch as the platform helps an organization face increasing digital complexity and dependencies and leverages PagerDuty to transform their operations from manual, rigid, and ticket queue-based, to a continuously improving system that focuses on outcomes and customer experience, delivers operational speed AND resilience, and is heavily automated and augmented by machine learning and AI.

How IT Operations can demonstrate business value with Unified Analytics

As an IT Ops exec, imagine your jubilation upon learning that after a year of hard work across your NOC, DevOps and SRE teams, you are able to automate incident response by 25%. You’re elated as you enter your CTO’s office to share this information, and their response is.

3 Effective Ways to Enhance Patient Safety with EHR Alerts

Hospitals that adopt electronic health records (EHR) to optimize clinical workflows face the decision of how to integrate EHR alerts into their workflows. The rationale is to surface actionable data from EHR systems and present healthcare providers with this information to supplement their day-to-day clinical decisions.

Cloudflare outage? The Domino Effect!

This day started a bit abruptly, with several services experiencing outages due to a Cloudflare outage that began at approximately 06:34 AM UTC. Check the official announcement. What came next was a domino effect through many popular services across the internet, including major services like GitLab, Notion, HubSpot, DigitalOcean, Monday, Recurly, and many more. We registered incidents from 230 services between the time the outage was published and the time it was marked as resolved.

Choosing the Right Incident Notification Tool for Your Incident Response Plan

Is your IT team ready to respond to an increasing volume of data security incidents? According to the 2021 Annual Data Breach report from the Identity Theft Resource Center, 2021 saw a record number of data breaches, representing a 68% increase from the year prior. The most recent Cost of a Data Breach report from IBM shares the Ponemon Institute’s finding that the average data breach is a $4.24 million expense, up 9.8% from the previous year.

Why Crisis Management Preparedness Matters

Almost 70 percent of leaders have dealt with a corporate crisis in the last five years, PricewaterhouseCoopers (PwC) found in its 2019 Global Crisis Survey. And, according to management consultancy McKinsey, between 2010 and 2017, the name of a Forbes-recognized top 100 company appeared in headlines together with the word “crisis” 80 percent more often than in the previous decade — and those are just the organizations that made the news.

Introducing xMatters New Integration with Everbridge Signal

When Russia invaded Ukraine on February 24, 2022, it sent ripples through many markets. Ukrainian car factories which supplied Europe were interrupted, oil and gas supply from Russia was throttled, and the supplies of steel, sunflowers, corn, and wheat were affected. Prices of sugar and petroleum surged, a threat of long-lasting high inflation emerged, and social unrest began to foment, with cyber-attacks coming both out of and going into Russia.

Mattermost Playbooks How-to: Release Management

Releasing software to users has become a sophisticated and intricate process that requires high levels of consistency and coordination. A release has to be built, brought together, documented, tested and deployed, which requires coordination of at least four separate teams and a generous handful of pipelines and other tools. Without a well-documented process things can get messy very quickly, causing stress for everyone involved.

Mattermost Playbooks How-to: Incident Resolution

Whether you’re part of a team managing SaaS products or a high-security digital workspace, sometimes Things Go Wrong and must be addressed with extreme care, professionalism, and predictability. For outages, data breaches, vulnerabilities and more, you and your team are juggling a variety of tools, processes, and rigid incident management systems. When the on-call pager goes off at 3 am almost no one has the ability to remember every step needed to kick off all the response workflows.

The Cost of Downtime: How Much Does an IT Outage Cost Your Business?

Life in the world of managed IT services is not without its unpleasant surprises. Although we’re an industry of system builders dedicated to facilitating the smoothest operations possible, downtime still happens. An unexpected system or network failure is not uncommon. In fact, it's inevitable. Even some of the world’s biggest companies can’t get away without painful outages.

GrafanaCONline 2022 Day 1 recap: Grafana 9 release, Grafana OnCall open source, Grafana and Grafana Loki in space, and more!

GrafanaCONline 2022 is off to a great start with exciting news from around the Grafana-verse and a jam-packed day filled with dashboards showcasing how Grafana is used in space, in industrial IoT, at live events, and even in an effort to prevent food waste.

Mattermost Playbooks How-to: Software Feature Development

For teams that follow a structured build and release cycle, having a reliable, shared workflow makes the difference between chaos and consistency. With every new feature in development the team needs to know what the specs are, how it fits in the roadmap, what the customer feedback was, where to find the repository, who is responsible for each step, and so much more.

Four key takeaways from our recent webinar: BigPanda picks up where Netcool left off

For years, Netcool has been omnipresent in many IT Operations organizations. That, combined with the sheer utility it once brought to the table, sometimes gave it a special sort of nostalgic reverence in IT Operations circles. But with all due respect to Netcool, there’s also little doubt the platform’s real-world utility has waned in the era of cloud and hybrid ops.

Introducing Grafana OnCall OSS, on-call management for the open source community

Last November, we announced the launch of Grafana OnCall, an easy-to-use on-call management tool that helps reduce toil through simpler workflows and interfaces tailored for developers. Born out of Grafana Labs' acquisition of Amixr Inc., Grafana OnCall began as a cloud-only solution that became generally available to all Grafana Cloud users, on both paid and free plans, in February.

5 Ways to Reduce IT Incidents Before Your Team Succumbs to the Ticket Backlog

If you talk to any Service Desk agent, they will agree there has been an explosion in IT tickets since the transition to remote and hybrid work. Even now, growing challenges prevent them from reducing IT incidents. Average ticket volume has risen by 16% since the pandemic, stressing already overtaxed help desk agents. This increase in tickets has led to wasted resources, poor IT service delivery and frustrated employees.

Squadcast Product Demo | Incident Management | On-call | SRE | Status Page | SLO Tracker | Runbooks

This video explains why Squadcast is a feature-rich solution for SRE, DevOps, and Engineering teams in general. With the ability to help teams quickly mobilize response teams during critical incidents, easily manage on-call schedules, and track SLOs for better SRE, Squadcast is a multi-purpose platform with numerous capabilities. This short video covers everything the product is capable of.

Setting up Route 53 Health Checks

We live in an age where the internet and digital data drive modern day markets, which results in huge amounts of data being generated and consumed. Hence, it has become very important for online platforms to manage this traffic and serve their customers more efficiently. In this blog we will explore the Amazon Route 53 service and see how it addresses domain name system routing and health check problems.

Crossing "The Last Mile" with an Incident Response System

Delivering dependable and high-performing IT services in 2022 requires coordination and collaboration across different workflows, areas of expertise, and even time zones. Whether serving in-house colleagues or external clients, there is immense pressure on IT management to create seamless experiences 24/7/365. Seconds matter when critical systems break down, and slow incident resolution can have costly ramifications on customer experience and employee productivity.

Driving Effective Communication in Nursing

Effective communication in nursing is central to providing top-quality patient care. Nurses communicate with patients to understand their health issues, and they provide them with the care and compassion needed for recovery. Effective communication with patients directly impacts patient health outcomes, and it has far-reaching implications when carried out ineffectively. As such, effective communication in nursing drives patient-centered care.

3 ways to improve your incident management posture today

Too many of us are still playing whack-a-mole when it comes to incidents: an incident is declared, the on-call engineer is paged, the incident is resolved and then forgotten — until next time. It’s time to start thinking in terms of proactive incident management, not just reactive incident response.

Calling all Reliability Practitioners: Participate in the SRE Survey 2022

For the past four years, Catchpoint and various partners have been running a yearly SRE Survey. This year, Blameless is excited to partner with Catchpoint for the fifth annual survey. We want to hear from you if you are in a DevOps or SRE role or even if you work on reliability with some other title or role. There are tremendous, valuable learnings when we listen closely to practitioners.

Receiving PagerDuty alerts from MetricFire

One of the most critical aspects of monitoring your digital assets is getting a timely alert when something goes wrong. Even if you have built a monitoring stack and exposed metrics on a beautifully designed dashboard, your monitoring system does not serve its purpose if you fail to notice abnormal behavior and take pre-emptive or follow-up action swiftly.

Summit Recap: How to adapt to a "Digital Everything" World

Every interaction with our customers, partners, and employees is special – but this year’s PagerDuty Summit went far beyond my wildest dreams. Together we committed to helping you learn and grow in how you manage business critical operations – in other words, getting you ready for anything in a world of Digital Everything.

Minimize MTTR to Mitigate Impact of Change Management

In the first blog of this demo series, we showed you how to use Restorepoint to remediate after a network breach. In the second blog of this three-part series, we walk you through a change management instance, showing how to speed problem resolution and how to mitigate the impact of poor change management to minimize MTTR.

Ready for Anything with the PagerDuty Operations Cloud

In a world of digital everything, teams face increasing complexity. Ever-growing dependencies across systems and processes put customer and employee experience, not to mention revenue, at risk. There is simply too much data to sift through and correlate for humans to understand what is important and know when something is going wrong.

A "Single Source of Truth": New Tools for Fast, Efficient Customer Service

Customer-facing teams have their hands full doing whatever they can to address customer issues quickly. At PagerDuty, our goal is to ease the burden of these teams by giving them the tools and access they need to deliver excellent customer experiences. Over the last year, we have deepened our integration with Salesforce Service Cloud, allowing users to work directly within the platform, reducing the need to context switch.

The Future of Incident Response is Automated, Flexible, and Proactive

We know our customers rely on PagerDuty as the backbone of critical real-time operations, so we want to make sure each and every enhancement helps streamline incident response. How can we help our customers spend less time firefighting and more time innovating? One of PagerDuty’s values is Champion the Customer – and we take this very seriously. When building and improving features, we aim to keep a pulse on what’s going on with our customers: what’s keeping them up at night?

How the unicorn got its horn: a tale of market opportunity and technical innovation

Insight Partners is a leader in working with scale-up companies that have existing product/market fit and can use our help establishing best practices for their businesses. But my specific focus is in developer-driven companies. I look for the best technical teams that are building products that developers love and adore.

Improved Design Interface. Less Code. Runbook Studio 5.0 Makes Runbook Automation a Cinch

Kelverion Runbook Studio V5.0 makes it even easier for organizations to automate IT service desk requests and reduce IT burden. In its fifth iteration, The Runbook Studio has undergone a significant design overhaul. The Studio’s technical capabilities have always been exceptional and now it has a user interface to match. On top of that, this version takes Kelverion’s low code/no-code design environment to the next level.

Declare early, declare often: why you shouldn't hesitate to raise an incident

My first incident.io-incident happened in my second week here, when I screwed up the process for requesting extra Slack permissions, which made it impossible to install our app for a few minutes. This was a bit embarrassing, but also simple to resolve for someone more familiar with the process, and declaring an incident meant we got there in just a few minutes. Declaring the first incident when you start a new job can be intimidating, but it really shouldn’t be.

What is Automated Diagnostics and Why Should You Care?

A lot of people in technology talk about the cost of an incident solely from the perspective of downtime, or the number of customers and employees impacted. And on the surface, that is often a fair angle to take. It makes the headlines, and customer reputation and trust are critical to the success of any business, obviously.

Evaluating xMatters Alternatives

The cost of IT downtime is enormous, as service breakdowns impact both top-line and bottom-line growth. As the digital ecosystem grows more complex and organizations continue to adopt additional tools and systems to scale their businesses, it’s imperative that they are equipped with incident response tools that can help accelerate incident response and mitigate expensive downtime.

Squadcast + OSNexus QuantaStor Integration: Making Incident Management & Alerting more effective

Storage systems are an integral part of IT infrastructure. Given that modern markets are highly competitive and demanding, businesses strive for 24/7 availability. This in turn sets higher expectations for storage systems to be operational all the time. But just like other IT components, storage systems are prone to incidents. Hence, it is important to have an efficient communication process to manage alerts during system failures and disasters.

Real Talk webinar recap: analytics and reporting maturity

MTTR, or mean time to resolve, is an important key performance indicator for incident response teams to track, but it’s rarely useful for technological stakeholders or customers. To really make use of the data at their disposal, decision-makers must tailor the info they provide—and understand the scope and granularity of the data they have when they deploy an AIOps platform like ours. That’s the gist of our latest Real Talk webinar on analytics and reporting maturity.

5 Reliability Insights That Immediately Transform Your SRE

As infrastructure engineers, there’s so much you can learn from studying past incidents. Luckily, Blameless Reliability Insights helps you find patterns that better equip you to deal with incidents to come. If you’ve never used it before and you’re curious what it looks like, you can watch a video demo here! These statistical insights give you the power to learn everything you can when something goes wrong.

Zenduty's Commitment to Security: SOC 2 Type 2

Security is a major requirement when dealing with SaaS companies across the globe. As a service provider to leading companies globally, YellowAnt is fully committed to providing best-in-class security compliance. In line with that commitment, we became SOC 2 Type II compliant on May 31, 2022. Maintaining our customers’ trust by keeping their data safe and secure is integral to what we do.

How To Build an Escalation Policy for Effective Incident Management

Regardless of your organization’s size, industry, or security measures, you will inevitably face IT incidents. But what do you do if an incident affects a critical system and your on-call responders can’t resolve it? Does your team have a set of clearly outlined next steps they should take to handle the issue? Answering these questions can be complicated, even more so for large organizations that rely on cloud-based services to fuel their IT environment.