Operations | Monitoring | ITSM | DevOps | Cloud

January 2021

What the Big Brother Approach to IT Monitoring and Incident Management May Be Missing

We asked in a recent poll which popular TV show your IT team resembles the most. Big Brother came out on top, with almost 40% of respondents saying that their incident resolution process most resembled this show. Would you compare your incident management process to an episode of Big Brother? If so, it's likely that your IT environment is highly monitored, but incidents still seem to slip through the cracks.

SLA vs SLI vs SLO: Know the differences between them.

SLA stands for Service Level Agreement, a formal agreement between you and your customer. It describes the reliability of your product or service. For example, an SLA might state that your product will be online 99 percent of the time annually, and that if you fail to achieve that objective you will give 30% of the customer's annual license fee back. In other words, SLAs write penalties into the contract.
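To make the 99 percent example concrete, here is a minimal Python sketch of the arithmetic behind such an agreement. The function names, the measured uptime, and the 10,000 annual license fee are hypothetical values chosen for illustration; only the 99 percent target and the 30% refund come from the example above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_target: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    return HOURS_PER_YEAR * (1 - uptime_target)


def sla_refund(measured_uptime: float, uptime_target: float,
               annual_fee: float, refund_rate: float) -> float:
    """Refund owed to the customer if measured uptime misses the SLA target."""
    if measured_uptime < uptime_target:
        return annual_fee * refund_rate
    return 0.0


if __name__ == "__main__":
    # A 99 percent annual target leaves roughly 87.6 hours of downtime.
    print(round(allowed_downtime_hours(0.99), 1))
    # Hypothetical customer: 10,000 annual fee, service measured at 98.5% uptime.
    print(round(sla_refund(0.985, 0.99, 10_000, 0.30), 2))
```

Real contracts usually define measurement windows, exclusions, and tiered credits, so treat this only as an illustration of the basic idea.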

The U.S. COVID Vaccine Distribution Plan: Challenges and Solutions

As coronavirus (COVID-19) continues to spread and new virus strains emerge, the public is frantically looking for answers regarding the U.S. government’s vaccine distribution plan. A sound vaccine distribution plan is especially crucial in times like these. States from coast to coast are experiencing a vast number of COVID-related deaths and hospitalizations. The dire situation underscores the importance of having an effective, accelerated vaccine delivery process.

New Feature: Incident types

Incidents are inevitable, and the reality is that some of them are going to repeat themselves. FireHydrant has always strived to make the entire incident response lifecycle smooth, but up until today, common incident types were slightly burdensome for our customers. We decided it was time to make declaring incidents easy with reusable templates, which we’re calling Incident types.

Who Else Wants to Increase Development Velocity?

Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team's workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.

OnPage Corporation Continues To Grow Despite the 2020 Pandemic

WALTHAM, Mass., Jan. 25, 2021 — OnPage Corporation, a Boston-based incident management and pager replacement company, today unveiled its fiscal 2020 year in review. OnPage delivered another year of strong results considering the uncertain situation brought upon the world with COVID-19. Past year results were driven by current customers that rely on OnPage for critical notifications and had to enlarge their deployment.

Have a Cloud Transition you can be Proud Of

In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. These servers can be more reliable than in-house servers for several reasons. However, as with all things, cloud providers present their own risks and challenges as well. Teams will want to take advantage of the benefits while accounting for these limitations.

How to build your own incident management process

IT incident management is a fundamental operational process designed to ensure rapid service restoration. This process is typically assigned to the help desk but is also very much entrenched in the day-to-day of DevOps. When incident management goes right, service is restored quickly and the impact on productivity, continuity, and customer satisfaction is minimal.

7 Tips On Building And Maintaining An SRE Team In Your Company

In today's "always on" world, Reliability is a primary business KPI. Plant the culture of Reliability by implementing these 7 simple tips to build a solid SRE team in your organization. Many of today’s hottest jobs didn’t exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were never heard of before. Another relatively new job role in demand is that of a Site Reliability Engineer or SRE. The profession is quite new.

4 Essential Types of MSP Tools (in 2021)

Managed service providers (MSPs) need the right tools to get the job done quickly and securely. MSP tools dictate control over everything from virtual machine (VM) management and database administration to application and server monitoring. They can also help MSPs oversee IT infrastructure. MSP tools are valuable, but not all tools are created equal.

2021 is the Year of Reliability

There’s no better time than now to dedicate effort to reliable software. If it wasn’t apparent before, this past year has made it more evident than ever: People expect their software tools to work every time, all the time. This shift in the way end users think about software was inevitable as everyday applications entered our lives, much as water and electricity entered our homes.

The Key Differences between SLI, SLO, and SLA in SRE

To incentivize reliability in your platform, there should be shared goals across your team to measure and quantify the capabilities of your product or service along with the customer experience. Define the path to "Always-On" services by understanding a few key SRE fundamentals and their implications: SLIs, SLOs, and SLAs. Framing SRE metrics for building or scaling a product is quite a daunting task.

Why AlertOps is the best PagerDuty alternative

We will compare AlertOps to PagerDuty in 3 broad areas. On-call management: Whether your on-call management needs are basic or complex, AlertOps has a solution for you. Creating on-call schedules is simple whether there is one person on-call, two or more people on-call, or even multiple teams on-call. Escalations: Automatic escalations based on your on-call schedules. Expand the possibilities with Workflows and Escalation Rule.

The Secret of Communicating Incident Retrospectives

In the world of SRE, incidents are unplanned investments in reliability. Why? Because they are valuable opportunities to learn and grow. This perspective can be difficult to communicate to other stakeholders. Some may be upset about the cost incurred or the affected customers. Others might not understand why incidents happen in the first place. It is important to show how the lessons of an incident are relevant to each stakeholder role.

Top Reliability and Scaling Practices from Experts at Citrix, Greenlight Financial, and Incognia

Downtime costs more than dollars. It also costs customer happiness and trust. So how do teams maximize for reliability while scaling? Tooling, communication, observability, and more all play into a complete reliability strategy. In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed best practices for responding to incidents, scaling for reliability, and how to engineer with the customer in mind.

OnPage Recognized in Gartner's Market Guide for Emergency Mass Notification Solutions

Gartner’s Market Guide for Emergency Mass Notification Solutions (EMNS) is a trusted report for security and risk management leaders. It provides insight into effective crisis communication procedures and identifies solutions that help perfect emergency management plans. The EMNS Market Guide has a large, loyal readership in several industries, including state and local government, healthcare, IT support, and higher education.

Best Practices for Incident Management: A Checklist

If productivity is the engine that helps optimize how a business operates, then being proactive is the oil, and knowing how to effectively maintain productivity is regularly checking and replacing said oil. Whenever a service outage occurs, it throws a wrench into the whole process and can put an entire organization in flux, mainly because the outage.

The True Cost of Building your Own Incident Management System (IMS)

Is your organization on the lookout for an incident management tool? If so, you may wonder: am I better off building my own? Our latest blog outlines some of the key factors to consider when choosing whether to build or buy incident management software.

Incident Communications With Alina Anderson

Incidents happen. They’re disruptive, they can be stressful, and if they aren’t managed well, they can cause chaos on your team. How your team manages incidents is only half the battle. How you let other stakeholders know what is going on is the other half. Alina Anderson from Smartsheet joined the Community team in our booth this year at PagerDuty Summit to talk about Incident Communications, and we’ve shared that conversation as an episode of our Page It to the Limit podcast.

What's in store for IT Ops in 2021? Top execs from leading enterprises share their predictions

2020 is (finally) over, and it’s safe to say that this very challenging year taught us once again that (as the old Danish proverb says) it’s difficult making predictions, especially about the future. Who would have imagined in January 2020 that we would find ourselves where we are today… And yet, as Tim Harford once wrote in the Financial Times, predictions are like Pringles: nobody thinks that there’s any great virtue in them but we find them hard to resist.

A look back at 2020

2020 was, needless to say, not the best. Looking on the brighter side, in December, FireHydrant turned 2, and in spite of it all, we grew quite a bit. We raised our $8M Series A in May, our team grew nearly 4x in size, and we added some amazing features, such as making FireHydrant Runbooks even more powerful with conditions, as well as great integrations, which you can find here. But even better, we got to work with all of you!

5 Steps to Building a Robust Incident Response Plan for your MSP

Today’s organizations face ransomware, malware, and other cyber attacks, and managed service providers (MSPs) need an incident response plan (or “IRP”) to mitigate these threats. In a recent survey of 200 MSPs, 74% of respondents said they have suffered a cyber attack, and 83% noted their small and medium-sized business (SMB) customers experienced one as well. Yet, with an IRP, MSPs can protect themselves and their customers against cyber attacks.

This Is the Most Underappreciated Skill for SREs

Delivering great software and sustainable systems is a team sport. Without the support of all stakeholders, adoption initiatives often fail. In successful initiatives, SREs are responsible for bringing together all resources and team members to help resolve reliability-related issues. But getting together these resources takes much more effort than people think. SREs engage in lots of glue work to ensure these collaborative efforts happen.

Building and Scaling Your SRE Team

Building Site Reliability Engineering (SRE) teams is hard! There are so many articles and explanations of what SRE means, it’s easy to get lost. Going beyond understanding what the individual SRE role is into building and scaling a team of SREs is more of a challenge. It’s important to find the right information that will help you take your SRE team to the next level.

Seamless CMDB Provisioning Gives Responders the Data They Need to Respond Faster

We knew that the most loved feature in our ServiceNow 7.0 release would be the CMDB features. And in our ServiceNow 7.5 release (available now), we’ve expanded our CMDB capabilities even further—based on your feedback—around the importance of reducing the effort it takes to re-create the same services within PagerDuty.

2020 Year in Review: OnPage Continues to Grow Despite the Pandemic

2020 was an unpredictable year that presented several challenges, such as the outbreak of the coronavirus (COVID-19) pandemic. As part of the “new normal,” the world has adopted infection prevention procedures. The 2020 calendar year was defined by face coverings, constant sanitization and physical distancing. At its core, the year was an exhausting, surreal 12-month period for many.

Better incident management while working remotely: The Squadcast way

As the pandemic wears on, remote incident management has become the norm worldwide for businesses. Here we share some best practices that helped us to address remote incidents and make on-call less stressful. With the onset of remote work due to Covid-19, remote incident management has become the norm for businesses worldwide. Organisations that were earlier used to having war rooms now find themselves having to coordinate teams through Slack, MS Teams or other collaboration tools.

Four key metrics for responding to IT incidents and failures

If you’re a veteran in this space, you probably understand the many incident response metrics and concepts, along with the many (at times exasperating) acronyms. For those new to the space, or even those with years of experience, the terminology is often overwhelming. If you’re one of those people who’s struggling to navigate through the world of DevOps metrics, we’ve created this article for you.

G2 Recognizes Squadcast as Momentum Leader in Incident Management

We are thrilled to begin the year on a high note! Squadcast has received awards in the Incident Management and IT Alerting category in G2's Winter Report 2021. “We are honoured to be recognised as a Momentum Leader in the IT Incident management category by G2. We have always strived to create the fastest and easiest Incident Response experience for Engineering and DevOps teams that enables organisations to better monitor their IT infrastructure and applications.

Leverage MSP Automation to Drive Profitability (in 2021)

Managed service providers (MSPs) require automation so they can deliver fast, efficient IT services that meet customer expectations. But MSP automation can be difficult, and the longer it takes an MSP to automate IT service management (ITSM), the further it falls behind its competitors. Today’s MSPs face several challenges relative to automation, including: 1. Complex Scripting Language: IT technicians may need to learn a complex scripting language to leverage an ITSM platform.

Little Known Ways to Better Use Your Error Budgets

One of the most versatile and foundational SRE tools is the SLO, or service level objective. The SLO is a threshold set for key reliability metrics. When incidents push the metric over the threshold, a response launches to prevent further damage. Conversely, as long as you meet your SLO, you can continue to ship new code. The space you have before you breach this threshold is the error budget.
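As a rough illustration of that idea, the sketch below computes how much error budget remains for a request-based SLO. It assumes the reliability metric is the fraction of successful requests; the function names and the sample numbers are hypothetical and not taken from the article.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return total_requests * (1 - slo_target)


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = allowed_failures(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / budget


def can_ship(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Simple policy: keep shipping new code while some budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLO, one million requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget remains")
    print("OK to ship" if can_ship(0.999, 1_000_000, 250) else "Freeze releases")
```

A time-based SLO (minutes of downtime per month, for example) would follow the same pattern with minutes in place of request counts.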

Incident Ready: How to Chaos Engineer Your Incident Response Process - FireHydrant

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time? In this video, FireHydrant CEO, Robert Ross, will share how FireHydrant customers leverage best practices to break, mitigate, resolve, and fireproof incident processes. We’ll show you how to use chaos engineering philosophies to stress test 3 critical parts of a great process.
Sponsored Post

Boost IT Savings with CloudReady and Incident Workflow

Companies love data. Aggregating data from multiple sources makes decision-making easier and brings new depth to the conversation in business meetings. But all of this is at the management level. IT managers and administrators also search for data from multiple sources to ensure that the ecosystem works. Companies demand the continued maintenance and availability of mission-critical applications. Without a framework or incident workflow, revenue can suffer and customers can churn if the company does not proactively address problems that arise in its infrastructure.

Modern Operations Best Practices from Engineering Leaders at New Relic and Tenable

As reliability shifts left, more companies are adopting SRE best practices. These best practices don’t only include conducting incident retrospectives. The heart and soul of these best practices are a blameless culture and a desire to grow from each incident. In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed how teams can embrace SRE best practices and make cultural shifts towards blamelessness.

Segment and SIGNL4: Know your Customer's Actions, Anywhere and Anytime

Do you have a website, app, online shop, or SaaS offering? Then you have plenty of user actions, such as visiting a certain page, signing up for a service, or canceling a subscription. Wouldn’t it be great to know in real time when an important customer action takes place? This would allow your sales, customer service, or technical teams to act immediately, no matter where they are.