December 2021

Looking back at our journey through 2021!

As we step into another year, it's time to reflect on our most memorable moments & milestones that tell the story of Squadcast in 2021. 😇 The last 12 months have been nothing short of a spectacular journey for us as a company. We raised funding (Yaay!🙌), launched an open-source tool called SLO Tracker, helped organizations globally improve their reliability, and made on-call shifts in general less stressful. Here’s how our year went by.

Outage Alert: Top 10 Downtime Incidents of 2021

2021 has been an eye-opening year for both businesses and consumers who use popular websites and applications. We have all seen notable increases in the frequency and severity of outages as dependency on internet infrastructure grows – with no signs of slowing down. With our reliance on automation and connectivity expected to increase in 2022 – let’s review some of the top internet outages and website downtime incidents of 2021.

What to Expect From xMatters in 2022

With only a few days left of 2021, we all know what that means: making New Year’s resolutions. While some love the tradition of laying out their goals for the coming 12 months, others loathe it with a passion. And with approximately 80% of people failing to achieve their resolutions, it’s easy to see why there’s so much resentment towards this common habit. At xMatters, we plan to—and often do—beat those odds.

SRE Predictions 2022 | Blameless SRE

As the new year approaches, we at Blameless like to ponder the future of Reliability Engineering. For 2021, we predicted that the practice of site reliability engineering (SRE) would continue to grow in terms of adoption, we would see adoption increase faster among smaller organizations, and SRE practices would get more attention to drive adoption compared to hiring. We’re sure you’ll agree that these trends have indeed strengthened in the last year.

On-Call Escalations

With the AlertOps ServiceNow integration, you can use automatic escalations for on-call schedules and create custom escalations. Automatically escalate to a level 2 or level 3 team and notify management and stakeholders. Set each escalation to use the notification channel you choose (email, voice, SMS, mobile app, and chat). Set your escalations to trigger reminders when a response SLA or a resolution SLA has been breached or is approaching the deadline.

Tips & Tricks: Keeping Track of Event-Processing Delays

A couple of weeks ago our partner Rok Ponikvar from S&T contacted me about an issue one of his customers faced. His customer complained that Enterprise Alert was not alerting on current issues and that even when he created a test ticket in his OBM system, no alert went out. After a little back and forth we concluded that Enterprise Alert was still processing historic data from an event storm in OBM earlier that day.

Common Security related Questions and Answers

In light of the recent news about yet another reported Zero-Day Exploit and the accompanying discussions about security, let’s touch on the topic of security audits and how Enterprise Alert can be configured to avoid or at least minimize potential security impact. First, let’s establish what we mean by security audit.

How to Measure Uptime SLOs Using Pingdom and Nobl9

Do you find yourself asking, “What should our first service-level objective (SLO) be?” The simplest way to get started if you have a website is to measure uptime SLOs. The SLO will measure your uptime and how your site compares to your reliability goals. By following the steps outlined here, you can get up and running with your first SLO in minutes. To get started, you’ll need to set up an account on SolarWinds® Pingdom®.
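As a rough illustration of what an uptime SLO measures, here is a minimal Python sketch. It does not use the Pingdom or Nobl9 APIs; the function name `uptime_slo`, the 99.9% target, and the sample check data are all assumptions made for the example.

```python
def uptime_slo(checks, target=0.999):
    """Summarize uptime checks against an SLO target.

    checks: list of booleans, True when a probe found the site up.
    target: desired fraction of successful checks (99.9% is an assumed value).
    """
    if not checks:
        raise ValueError("need at least one check")
    total = len(checks)
    good = sum(1 for ok in checks if ok)
    failures = total - good
    availability = good / total
    # Error budget: how many failed checks the target tolerates in this window.
    allowed_failures = (1 - target) * total
    return {
        "availability": availability,
        "slo_met": availability >= target,
        "budget_remaining": allowed_failures - failures,
    }


# Hypothetical window: 10,000 one-minute checks, 5 of which failed.
report = uptime_slo([True] * 9995 + [False] * 5, target=0.999)
print(report)
```

A positive `budget_remaining` means the site can still absorb that many failed checks in the window before it misses the target.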

Oracle's Cerner Acquisition Will Drive Smarter Care Decisions

Oracle is gearing up to execute the largest deal in its entire history – the company has agreed to buy Cerner, a leading electronic health records vendor, for $28.3 billion. The Cerner acquisition is slated to be an all-cash deal of $95/share and is expected to complete early next year. Cerner is a healthcare technology firm that streamlines health information and facilitates its accessibility for modern clinical teams.

Enhanced Enterprise Alert Reporting with Power BI

The benefits of using the correct reporting, analytics and information delivery capabilities can transform an organization. Having access to timely data, reporting, and analytic capabilities helps to ensure the ability to get the right data to the right users at the right time. Having the ability to pull any information that your business needs at any given time allows for the flexibility to get the information for your business when and where it is needed.

Using context.Context to mock API clients

We've found a pattern to mock external client libraries while keeping code simple, reducing the number of injection spots and ensuring all the code down a callstack uses the same mock client. Establishing patterns like these is what makes test suites great, and improves developer productivity when writing tests. Here's how it works.

Use Microservices to Modernize IT Operations

Many organizations are experiencing the need to modernize their IT systems to keep pace in an increasingly digital world. Adopting DevOps helps companies implement and initialize the modernization processes. At xMatters, our path to IT modernization has included implementing DevOps, but we have done it a little differently to ensure we are using agile processes.

Why leading healthcare organizations recommend OnPage

Adrienne, a Family and Nurse Practitioner at a leading healthcare organization, recommends OnPage to anyone looking to adopt a clinical communication and collaboration solution. Keep watching to learn how her organization adopts OnPage to enhance their after-hours call paging workflows.

We've successfully completed our SOC 2 audit

We're very pleased to announce that incident.io is now SOC 2 compliant, having successfully completed our Type I audit. Put simply, this means an external auditor has looked at how the company is operating, and how our software is managed and operated, and confirmed that we meet a set of high security standards.

Managed IT Service Provider, BDNet Corporate Networking Recommends OnPage

In this video, Brian Domschke, CEO of BD Net Corporate Networking, recommends OnPage for on-call management. Keep watching to learn how his organization leverages OnPage's digital fail-safe scheduling capabilities and alerting system to notify on-call staff after hours. OnPage continues to empower Managed Service Providers of all sizes to accelerate incident remediation for clients and provide exceptional IT services.

On-call by default

Like many SaaS businesses, we have an on-call rota to enable us to provide 24x7 cover if there are problems with incident.io. We have a 'pager' which will alert the relevant person if something unexpected happens in our app, so that they can investigate and fix it if needed. Note: This was adapted from an internal document we wrote about how we think about on-call at incident.io.

What does a DevOps Engineer do? We analyzed 29 job postings to find out.

As all companies become software driven, DevOps is becoming an important practice in enterprises and startups across the world. DevOps is about bringing velocity to delivering tech products and services, so you can delight customers and meet business goals. To achieve this velocity, development (dev) and operations (ops) teams work closely together across the software lifecycle - from planning to release. And this has led to a new role in engineering teams - DevOps Engineer.

PagerDuty for Facilities and Crisis Response

Jason Flint, Senior Manager of Facilities and Crisis Response at PagerDuty, joins the stream to chat about how PagerDuty the company uses PagerDuty the platform to meet the needs of an increasingly distributed workforce. His team keeps track of everything from extreme weather events to political unrest that might impact PagerDuty employees.

The Best Tools for System Monitoring

It takes a lot to run a modern business. From websites to technical solutions and everything in between, it’s no surprise we need better monitoring systems to make sure everything is operational. With multiple gears turning at once on any given platform, incidents are inevitable—especially for companies that are constantly growing and innovating. And the impact of incidents can affect user services, operations, and even business reputation.

How Disaster Ready are Your Backup Systems, Really?

In SRE, we believe that some failure is inevitable. Complex systems receiving updates will eventually experience incidents that you can’t anticipate. What you can do is be ready to mitigate the damage of these incidents as much as possible. One facet of disaster readiness is incident response - setting up procedures to solve the incident and restore service as quickly as possible. Another strategy involves reducing the chances for failure with tactics like reducing single points of failure.

Breaking down complex projects into smaller, shippable increments

Building a complex new product can be scary. What if no-one gets value from it? What if it doesn't work? What if it's hard to change? One way to mitigate these risks is to break down the product into smaller shippable increments, allowing you to capture feedback early and confirming the most important assumptions before fully committing to a solution.

Automating Work in Real Time Through the PagerDuty Operations Cloud

Earlier this fall, we announced a significant evolution in the IT process automation portfolio at PagerDuty—the general availability of PagerDuty Rundeck Actions and early access for Rundeck Cloud. These new offerings reflect our vision to enable companies to take real-time actions by democratizing access to automation. In other words, to quickly and safely delegate automated IT processes to the IT users (and APIs) that need them to get work done.

Sponsored Post

The Principles of DevSecOps

As a Solution Architect here at xMatters, an Everbridge Company, and through my 30-year career in the IT industry, I've seen many frameworks offering bold new ideas. CMMI, ITIL, Prince 2, Agile, Scrum, and most recently, DevOps. These frameworks come and go, offering huge improvements in the way we deliver and manage our IT capabilities, but never lasting long enough to act on those promises. That's not to say they haven't made a marked difference in the IT space, or that they haven't been hugely impactful for organizations around the globe. They become launching off points for a new framework, and now there's a new term that's appeared, DevSecOps.

What Does ROI Really Mean?

ROI might be one of the most popular business acronyms in recent memory, and from business to business, the definition remains the same: return on investment. No matter the industry, leaders are concerned with ROI and ensuring that every dollar spent is used in the best interest of the organization. But in practice, what does ROI really mean? Let’s discuss!

Shhh... we have Private Incidents

We’re excited to announce that private incidents are now available on FireHydrant. For the first time, incidents can have their visibility limited so that only permissioned users are able to see them. This is a great solution for security and compliance teams who need to collaborate with their engineering counterparts to resolve incidents. The nature of the incidents these teams work on differs dramatically from operational incidents.

Uncovering the Importance of Mean Time Between Failures

In the IT world, application service providers (ASPs) build customer trust by ensuring the continuous, uninterrupted availability of their services and software. Service availability allows customers to operate normally and generate revenue without being directly impacted by their providers’ system failures. Though providers work to ensure system uptime, they are often challenged by unexpected technical issues that impact customer-facing systems.
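For readers unfamiliar with the metric in the title, mean time between failures is commonly computed as operational time divided by the number of failures. The Python sketch below follows that common definition; the function name `mtbf_hours` and the sample figures are hypothetical and not taken from the article.

```python
def mtbf_hours(total_hours, downtime_hours, failures):
    """Mean time between failures: operational hours divided by failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    if downtime_hours > total_hours:
        raise ValueError("downtime cannot exceed the observation window")
    operational_hours = total_hours - downtime_hours
    return operational_hours / failures


# Hypothetical 30-day window (720 hours) with 3 failures and 6 hours of downtime.
print(mtbf_hours(720, 6, 3))
```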

Monthly Moo Update | December 2021

What a year 2021 has been for us all. We are extremely proud of the continuous innovation and delivery of new features and functionality we have provided throughout the year, all while maintaining enterprise scale and uptime that could win awards. We’ve heard success story after success story from our brilliant customers, each unique in their own way. We couldn’t have had the successful year we’ve had without you, and it’s been our honor to be part of your success.

BigPanda's ServiceNow integration just got better

ServiceNow is widely used across Fortune 1000 and Global 5000 enterprises, so it’s no wonder that the majority of BigPanda customers use ServiceNow and integrate with it to streamline their ticketing requests. BigPanda’s AIOps Event Correlation and Automation Platform provides context-rich incidents to IT Ops teams relying on ServiceNow and helps them gain end-to-end real-time visibility into their operations.

What we learned from AWS's us-east-1 outage

In case you missed it, for several hours on December 7, 2021, AWS's us-east-1 region had an outage impacting multiple AWS APIs, taking out various websites across the internet. According to our own monitoring at OnlineOrNot, the outage started at 2021-12-07 15:32 UTC and began to recover well at 2021-12-07 22:48 UTC (with minor signs of life for a few minutes around 2021-12-07 20:08 UTC). Had we relied solely on AWS to update their status page before reacting, we would have been waiting a while.

Modernize Your Operations with Automated Incident Response

PagerDuty helps developers and IT professionals adopt full service ownership to ensure that those who go on call are 1) only interrupted by an alert when necessary, and 2) equipped with tools to remove the toil from managing incident response. Automating incident response increases developer and IT staff productivity, improves customer experience from service interruptions and unplanned downtime, and improves responder morale. Learn from PagerDuty customer Guidewire how Automated Incident Response can do all this for your teams.

SRE Incident Management: Overview, Techniques, and Tools

In the world of a site reliability engineer (SRE), failure is not only an option, but also expected. Systems, web applications, servers, devices, etc., are all prone to performance issues and unexpected outages at some point. It is an unavoidable fact. These unexpected failures can lead to huge revenue losses, loss of customer trust and, depending on the industry, possibly fines. Fortunately, SRE incident management is one of the core practices used to limit the disruption caused by unexpected issues.

Incident Review - AWS Outages Crash Major Online Services - Including Amazon

The following is an analysis of the Amazon Web Services incident on 12/07/2021. Millions of users were affected by an Amazon Web Services outage that took down major online services such as Amazon, Amazon Prime, Amazon Alexa, Venmo, Disney+, Instacart, Roku, Kindle, and multiple online gaming sites. The outage, which originated in the US-EAST-1 region on Dec. 7, 2021, is still ongoing at the time of blog publication.

Space Made Simple: How PagerDuty Enabled Loft Orbital to Achieve Incident Response Lift Off

The next great space race is on. Today, there are multiple companies competing to earn their slice of a global space industry set to be worth more than $1 trillion by 2040. However, launching a satellite into space still isn’t an option for most organizations due to the prohibitive costs and complex engineering required.

Why automation is the incident response 'easy button' MSPs & IR firms have been waiting for

The managed security services market is booming. Coming in at $22.8 billion in 2021, it is projected to nearly double in just five years and grow to $43.7 billion by 2026. Moreover, cloud-based managed security services are poised to be the major growth driver for the broader MSP market, coming in at $219.59 billion in 2021, and expected to reach $557.10 billion by 2028. As we can see, providing robust security services is a key competitive differentiator for the lucrative MSP market.

The Cultural Shift to Modern IT Operations

In the world of always-on services, many organizations have taken the path to modernize their IT operations to provide greater agility, lower cost, and most importantly, to deliver frictionless digital customer experiences. Is your DevOps team deploying more frequently than operations can support? Are you struggling to keep up with the maintenance issues associated with aging software? Modernizing your IT operations can be the key to overcoming these complexities.

What's New: Updates to Runbook Automation, Event Intelligence, Partner Integrations, and More!

We’re excited to announce a new set of updates and enhancements to the PagerDuty platform. The product team has been hard at work making updates from Event Intelligence, Runbook Automation, and Applications with Monitoring Tools, to PagerDuty and PagerDuty Community Events.

Reimagining Retail Incident Response for the Holidays

The holiday season is here, and global retailers are prepared for the biggest retail event of the year. The decrease in new COVID-19 cases, coupled with a rise in vaccination rates, provides a glimmer of hope for shoppers looking to spend for friends and family. Holiday spending is expected to break previous records this year, growing up to 10.5 percent over 2020.

Best Practices to implement in Incident Management

There are roughly five stages of an incident:
1. Assess impact
2. Inform customers (status page)
3. Identify the issue
4. Mitigate the issue
5. Resolve the incident
Then there’s follow-up and further work. It is also important to note that step 2 should be ongoing as you progress. The status page should be updated at reasonable intervals, e.g. every 15-20 minutes unless you specify otherwise.

What can SREs do to make holiday season's peak traffic less chaotic?

Holiday season's peak traffic is the most challenging period for SREs and on-call engineers. In this blog, we have highlighted the things that SREs can do to make the holiday season less chaotic. The recently concluded Black Friday weekend could have potentially been the most challenging shift for on-call engineers working in the Retail or E-Commerce sector. Since such peak-traffic events push the system to the limits, engineering teams are engulfed in a lot of tension preparing for it.

Dashboard Fridays: Sample PagerDuty Alerting dashboard

Adam Kinniburgh is back with another Dashboard Fridays episode, this time joined by Ashley Thompson as they showcase this example PagerDuty Alerting dashboard. This dashboard gives an overview of alerting sent to PagerDuty from any source, even external sources like Pingdom.

DevOps Workflow | A Complete Guide & Best Practices

Curious about DevOps Workflow? We explain the DevOps process, how automation relates to workflow, and best practices for workflow design. DevOps is a methodology that involves Development and Operations working together during the development process. Workflow is the sequence in which tasks occur. DevOps workflow relies heavily on automation and involves: Using DevOps, teams can increase collaboration and improve processes to create more stable and manageable processes.

December 2021 Update - On-duty board, Manual Signls and Azure Sentinel update

Our December update brings a ‘Who is on duty’ board displaying current team members on duty with contact information. In addition, we have simplified the manual sending of Signls and improved the integration with Azure Sentinel. As always, you can find all the details in this article.

Workflows: your process, automated

After many weeks of work, we're delighted to announce the latest feature of the incident.io platform: Workflows. Configure your processes once, and we'll make sure you follow them, every time ✨ A little while ago, I was asked the question: “what makes a good incident response?”. Whilst there’s infinite nuance in the answer, mine was pretty straightforward. The best incidents are founded on principles of communication, coordination, and clear roles and responsibilities.

How to Reduce Noise, Resolve Faster, and Automate More Often with PagerDuty

When we asked how technology leaders are feeling about increased pressure on digital services, they reported that, unsurprisingly, their investments in digital have grown. In fact, 72% are ramping up digital transformation efforts. Yet while the C-suite is interested in AIOps and automation to help their teams, it’s not always clear what their approach should be and how this technology can be applied to solve problems for their teams today.