September 2021

Everbridge to Showcase Latest Innovations in Countrywide Public Warning, AI for Public Safety, and Emergency Response at European Emergency Number Association (EENA) Conference 2021

As a market leader for population alerting and public safety solutions, Everbridge powers the national Public Warning system for eight countries in Europe, and more countries than any other provider across the Americas, EMEA, and APAC regions. Everbridge will present on key trends, best practices, and technologies related to artificial intelligence (AI) for public safety, Next Generation 112, and helping governments adhere to the EU mandate requiring member countries to have a population-wide alerting system in place by June 2022.

Reliability is not an engineering metric

If you're an engineer reading this, you might be wondering what I mean by the title. You might be a Site Reliability Engineer whose primary responsibility is to maintain the reliability of your company’s product/solution. You might be a software builder, a programmer responsible for building new capabilities and shipping them to production. All of these are important for any business to remain competitive.

Then and Now: Distributed Systems Alerting and Monitoring

Distributed systems are everywhere. Although many teams don’t think of their applications as distributed systems, if they’re developing using container-based microservices and serverless functions instead of a monolith, they’re creating a distributed system. This change also means that monitoring needs are becoming more complex.

Integrating Cyber and Physical Security Can Better Protect People and Save Enterprises Time and Money

Cyber security and physical security grew up at different times and in different neighborhoods. In fact, long before digital transformation was even a concept, physical security had staked out its corporate territory and was on the job protecting the company’s people, buildings, and other assets. Then, as the business world grew increasingly reliant on information technology, digital security started flexing its muscles on its own turf.

From Metrics to Valuable Insights: Incident Post-Mortem Reports

IT organizations, such as managed service providers (MSPs), deploy incident alerting and on-call management solutions to accelerate software delivery and ensure seamless customer experiences. Incident alert management platforms orchestrate the distribution of alerts to ensure that technicians continue to maintain system uptime and minimize service disruptions.

Troubleshooting Outages at 3 AM with Alert Response

Imagine you are an on-call engineer who receives an alert at 3 AM informing you that customers are experiencing high latency on your website and are unable to shop. Being an incident response coordinator myself at Sumo Logic, I can tell you, I don’t envy being that engineer. If this alert fired, this is what would likely follow: The biggest challenge is how to gather this information quickly, so you can decide whether to jump out of bed or go back to sleep.

What's New: Updates to Runbook Automation, Partner Integrations, and More!

As we welcome Fall and such a transformational time of the year, we’re excited to announce a new set of updates and enhancements to the PagerDuty platform. From updates to Runbook Automation, ChatOps and Customer Service Ops Applications, to PagerDuty Community Events, users, and customers can.

Android App Update: Mute and enhanced 'Do not disturb' override

With our latest Android app update (3.1., build 242) you will never miss a critical SIGNL4 alert again. Even if your phone is muted or in do-not-disturb mode, SIGNL4 can now make a lot of ‘noise’ and wake you up reliably when a major or critical incident occurs. Here is how it works….

3 Things to Consider When Investing in On-Call Scheduling Software

On-call scheduling software modernizes the way healthcare administrators assign responsibilities to care team members. The software helps create an equitable workforce among care teams and eliminates manual errors during the on-call scheduling process. Administrators can set up digital schedules to contact the right clinicians at the right time. This ensures that on-call providers quickly resolve patients’ issues to improve patient experience.

PagerDuty goes global with national preparedness month: Preparing our workforce for crisis

The effects of climate change mean we’re increasingly seeing black swan weather events impacting our working lives. From wildfires and hurricanes to the ever-present threat of earthquakes, 2021 has seen its share of crises. This obviously raises serious questions for companies about the safety of their workforces. As a global company, PagerDuty has employees across the world. When a disaster strikes, everyone needs to have the necessary training, resources and tools to act.

Winning on Black Friday - IT Incident Response Made Simple

Even with all the changes in consumer behavior due to COVID-19, Black Friday and Cyber Monday are here to stay. Social distancing measures that limited in-store shopping in 2020 have only led more people to shop online, and this trend is expected to continue in 2021. Preparing your e-commerce website and business for the seasonal business surge around Black Friday and Cyber Monday 2021 is crucial.

Why Net at Work employees are sleeping soundly again

Net at Work is a German IT company with over 100 employees that provides its customers with solutions and tools for digital communication and collaboration. Their product NoSpamProxy offers reliable protection against spam and ransomware, legally compliant email encryption and more. Customers of Net at Work are using it as a SaaS solution, and it is being monitored with the agentless network monitoring software PRTG Network Monitor from Paessler AG.

Modern SRE Practices for Incident Management

At VMware, we make use of modern development and site reliability engineering (SRE) practices on a regular basis. And those of us who work on the VMware Tanzu Observability product marketing team regularly get exposure to various SRE teams that implement modern practices with the observability technology we create.

SRE Back-to-School Checklist

Whether it's in classrooms or on Zoom calls, the kids have headed back to school! Bright-eyed students are gearing up to study new subjects and test their brains. Hopefully on their report cards, failure isn’t inevitable. Before the first day, parents load up their kids’ backpacks with everything they’ll need. Being well equipped with good supplies is the best way to stay focused and educate “reliably”. Likewise, SREs need the right tools and practices for the job.

How retailers are improving productivity, transforming incident response, and empowering teams with PagerDuty

For retailers, uptime is money and issues can cost thousands of dollars per minute. With infrastructure comprising complex services such as payment gateways, inventory, and mobile applications, maturing digital operations is vital for ensuring services are always on and customers get the best experience.

Divisions of Family Practice Adopts OnPage to Enhance Clinical Communication

Effective healthcare communication requires proper software and processes to ensure that the right person receives timely messages. Unfortunately, Divisions of Family Practice (DoFP), a large community-based network of physicians located in British Columbia, Canada, relied on a third-party answering service to connect long-term care facilities (LTCFs) with on-call providers.

What is expected in the SRE role? We analyzed 30 job postings to find out.

In 2016, Google released the definitive book on Site Reliability Engineering (SRE) - a practice that had originated in the company to take care of a monumental problem - how to keep the Google services running with high reliability. Over the years, SRE has been widely adopted by dev teams across the globe and is a popular role at startups and enterprises alike. Here is a look at how search for SRE has trended over the years.

How Do I Add a Major Incident Response to an Existing Integration? - Ask Adam

When we receive an alert, the obvious choice is to accept responsibility for the issue and start resolving it ourselves. But what happens when the incident is far more serious than we thought? With xMatters, you don't have to scramble to find who else is on call; you can configure the platform to help find other responders for you.

A Migration That Paid Tech Dividends

TL;DR: Old, deprecated code/infrastructure is a challenge that every engineer will come across. Remedy what you can and remember that some extra effort can go a long way. It can uncover issues that, when addressed, will save you in the future. Part of the challenge of software development is maintaining legacy code and infrastructure. When you ignore or neglect these, issues start to pop up and your reliability suffers, causing pain for your customers. The trick here is to actively steward each project.

3 Ways to Use the xMatters and Microsoft Azure Monitor Integration

For a number of years, the debate on DevOps vs. ITIL has divided many technology teams. On the surface, the two practices seem at odds with one another: DevOps harnesses the power of human collaboration and communication to support innovation, while ITIL utilizes a more systematic and structured approach to deliver service quality and consistency. But if we take a deeper look, we’ll find that not only can DevOps and ITIL co-exist, they can even complement each other.

Best practices for writing incident postmortems

After you have stopped an incident from affecting your customers, you need a more thorough investigation in order to prevent similar incidents in the future. Postmortems record the root causes of an incident and provide insights for making your systems more resilient. At the same time, postmortems can be difficult to produce, since they require deeper analysis and coordination between teammates who are busy with the next development cycle.

How Organizations Handled Incidents Before and After Deploying AIOps - Part 1

Organizations are always looking for new ways to innovate, reduce costs, and allocate resources more efficiently. In this blog post, we will look at how enterprises handled incidents before and after deploying AIOps.

Best Practices to Reduce DevOps Burnout

As software development teams struggle with spotty, siloed software delivery cycles, the DevOps approach provides relief by unifying stakeholders to achieve faster, collaborative and continuous software delivery. However, the DevOps methodology fails if it does not address the issue of DevOps burnout. In this post, we’ll uncover strategies that DevOps teams can use to better manage their work environment.

The doctor is in: why domain agnostic AIOps is a necessity for diagnosis

Gartner recently identified two different high-level categories of AIOps: domain-centric and domain-agnostic. Elik Eizenberg, CTO at BigPanda, explains the difference and why you would need the latter to gain an overall view and understanding of your IT Ops.

What's New: Introducing the PagerDuty App for Salesforce Service Cloud

In today’s world of digital everything, where customers are increasingly demanding instant updates when problems occur, it’s more important than ever to take immediate action. Seconds matter, and teams need to be empowered to proactively solve customer-impacting incidents as quickly as possible.

A Developer's Perspective: Lessons from Open Source with FireHydrant and Backstage

We’re proud to announce that our front-end FireHydrant plugin has been open-sourced as part of Backstage, an open platform for infrastructure tooling, services, and documentation created at Spotify. We introduce FireHydrant’s incident management and analytics in Backstage, where you can quickly and efficiently manage your incidents.

New integrations: Amazon EventBridge, ServiceNow, Zendesk, Zammad, Splunk, and More

Our ecosystem continues to grow: we have added 10 new integrations in recent months. Integrations are the bridge between alert sources and on-call teams and have always been a top priority at iLert. They are one of the reasons why iLert is so easy to adopt for small and large companies alike.

PagerDuty Integration Spotlight: Buildkite

PagerDuty’s Change Events are a powerful way to collect information from your service ecosystem. To maintain velocity as your application deployments scale, every second counts. Integrating Buildkite with PagerDuty ensures you have all the information you need, when you need it. After you install the integration from the PagerDuty Service Directory, you’ll be able to configure your Buildkite pipelines to send change events to your services whenever a build completes, pass or fail.

3 Ways to Use the xMatters and Google Operations Suite Integration

Not too long ago, you would have needed development experience to oversee the delivery of scalable and reliable software. But with the rise of low-code and no-code tools, that requirement is now obsolete. What used to be hours of coding has turned into a few minutes of dragging and dropping.

10 questions teams should be asking for faster incident response

2019 and 2020 were worlds apart. Our entire ways of working, living, socializing, and learning were changed almost overnight. Over the last 18 months, technical teams have had to double down on all their digital efforts to help their customers adapt to the new normal. At the same time, teams were responsible for more unplanned work than ever as incidents steadily rose. For the first time, we’ve created the State of Digital Operations Report which is based on PagerDuty platform data.

Midwifery Care Communities Trust OnPage

The OnPage clinical communication and collaboration (CC&C) system is universally adopted by midwifery care communities across the United States and Canada. OnPage is proud to provide a real-time, secure collaboration platform that allows midwives to improve patient experience. This article examines the continued widespread adoption and implementation of OnPage’s industry-leading CC&C system by midwifery care communities.

The Role of Public Safety Communication in Local Communities

Public safety communication is vital to all local communities. At APCO 2021, Everbridge featured new products and advanced functionality of our Critical Event Management platform to showcase our ability to keep towns, cities, states, and other public agencies running, faster. Our partnership with RapidSOS and the corresponding showcase of Everbridge 911 Connect is one solution that organizations can use to better serve the public.

Automatic Alert Export to Third-Party Systems

In the SIGNL4 web portal you can manually export historic alert reports as .csv files. In some cases it might be useful to export alert data programmatically. For example, you can forward all alerts, including specific parameters, to InfluxDB and show the alert history in Grafana to recognize peaks, trends and abnormalities over time. You can even use AIOps to recognize certain trends automatically. By using the SIGNL4 REST API it is possible to export alert data automatically.

CheckMK and Enterprise Alert - a scripted heartbeat check

A few days ago I received an inquiry about a scripting problem from one of our longtime partners, to be exact our DCP Marc Handel from IT unlimited AG. In the exchange with Marc I realized that his idea to use the Enterprise Alert Scripting Host, the Windows Task Scheduler and CheckMK to implement round-trip monitoring could be interesting for the whole community, especially for all our CheckMK customers.
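The general idea behind a heartbeat check can be sketched in a few lines: a scheduled job records when the last heartbeat arrived, and the check fails once that timestamp is older than an allowed age. The snippet below is only a generic illustration of that logic, not Marc's script or anything specific to Enterprise Alert or CheckMK. The function name and the five-minute default are assumptions.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(
    last_seen: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """Return True when the last heartbeat is older than max_age."""
    return now - last_seen > max_age


now = datetime(2021, 9, 1, 3, 0, tzinfo=timezone.utc)
print(heartbeat_is_stale(now - timedelta(minutes=10), now))  # True
print(heartbeat_is_stale(now - timedelta(minutes=1), now))   # False
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test.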

Introducing our open source SLO Tracker - A simple tool to track SLOs and Error Budget

One of the tools we use internally at Squadcast for SLO and Error Budget tracking is now open-source. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker. We made it open-source so that the SRE community can use it too. Looking forward to getting your feedback, suggestions and patches :)
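For readers new to the term, the arithmetic behind an error budget is simple: the budget is the share of the SLO window in which the service is allowed to be unavailable. The sketch below shows that calculation for a time-based availability SLO. It is not taken from the SLO Tracker, and the class and method names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float    # e.g. 0.999 for a 99.9% availability SLO
    window_minutes: int  # length of the SLO window, in minutes

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.window_minutes <= 0:
            raise ValueError("window_minutes must be positive")

    @property
    def total_minutes(self) -> float:
        # Allowed downtime = (1 - SLO target) * window length.
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Negative values mean the budget has been overspent.
        return self.total_minutes - downtime_minutes

    def consumed_fraction(self, downtime_minutes: float) -> float:
        return downtime_minutes / self.total_minutes


# A 99.9% SLO over a 30-day window allows about 43.2 minutes of downtime.
budget = ErrorBudget(slo_target=0.999, window_minutes=30 * 24 * 60)
print(round(budget.total_minutes, 1))
print(round(budget.remaining_minutes(downtime_minutes=10.8), 1))
print(round(budget.consumed_fraction(downtime_minutes=10.8), 2))
```

In the example, 10.8 minutes of downtime consumes a quarter of the budget and leaves about 32.4 minutes for the rest of the window.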

Severe Weather Preparedness: Managing Severe Weather Events During Other Crises

The novel coronavirus wasn’t the only history-making event of 2020. Last year cataloged the highest number of storms in the United States, as well as the worst wildfire season, with 9.5 million acres burned just in the western half of the country.

Essential Tools for Site Reliability Engineers

Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover five leading tools that SREs can use to drive the reliability and stability of computing systems. We’ll also examine how SREs can use the tools to improve operations tasks and infrastructure processes.

3 Ways xMatters Can Ease Healthcare Incidents

Many organizations use xMatters to keep their services running and reliable. From technology businesses to complex enterprises, one particular industry that has overwhelmingly benefited from the use of xMatters is healthcare. In healthcare, speed and effectiveness are vital. Incidents are critical, and quality patient care is the highest priority.

Bridging The Digital and Physical

The world of the Internet of Things (IoT) is here, with an expected 75 million connected devices by 2025. The internet powers everything from healthcare devices to household appliances to banking systems, and thus more and more of our physical systems have a direct dependency on their digital components. Historically, physical and digital environments have been considered standalone entities, and a case needs to be made for a unified risk management center across both environments.

Closing The Customer Experience Gap with Continuous Automation

Every organization has processes in place to respond to incidents, but these often result in missed SLAs, which lead to poor customer experience. By establishing workflows and automating them, we can resolve incidents faster and iterate on processes rapidly. Rapid delivery of services at scale leads to good customer experiences, which is the most important metric for all organizations.