Reliability is not an engineering metric

If you're an engineer reading this, you might be wondering what I mean by the title. You might be a Site Reliability Engineer whose primary responsibility is to maintain the reliability of your company’s product/solution. You might be a software builder, a programmer responsible for building new capabilities and shipping them to production. All of these are important for any business to remain competitive.

Then and Now: Distributed Systems Alerting and Monitoring

Distributed systems are everywhere. Although many teams don’t think of their applications as distributed systems, if they’re developing using container-based microservices and serverless functions instead of a monolith, they’re creating a distributed system. This change also means that monitoring needs are becoming more complex.

From Metrics to Valuable Insights: Incident Post-Mortem Reports

IT organizations, such as managed service providers (MSPs), deploy incident alerting and on-call management solutions to accelerate software delivery and ensure seamless customer experiences. Incident alert management platforms orchestrate the distribution of alerts to ensure that technicians continue to maintain system uptime and minimize service disruptions.

Troubleshooting Outages at 3 AM with Alert Response

Imagine you are an on-call engineer who receives an alert at 3 AM informing you that customers are experiencing high latency on your website and are unable to shop. As an incident response coordinator at Sumo Logic myself, I can tell you that I don’t envy that engineer. If this alert fired, this is what would likely follow: The biggest challenge is how to gather this information quickly, so you can decide whether to jump out of bed or go back to sleep.

Android App Update: Mute and enhanced 'Do not disturb' override

With our latest Android app update (3.1., build 242) you will never miss a critical SIGNL4 alert again. Even if your phone is muted or in do-not-disturb mode, SIGNL4 can now make a lot of ‘noise’ and wake you up reliably when a major or critical incident occurs. Here is how it works…

What's New: Updates to Runbook Automation, Partner Integrations, and More!

As we welcome Fall and such a transformational time of the year, we’re excited to announce a new set of updates and enhancements to the PagerDuty platform. From updates to Runbook Automation, ChatOps and Customer Service Ops Applications, to PagerDuty Community Events, users, and customers can.

3 Things to Consider When Investing in On-Call Scheduling Software

On-call scheduling software modernizes the way healthcare administrators assign responsibilities to care team members. The software helps create an equitable workforce among care teams and eliminates manual errors during the on-call scheduling process. Administrators can set up digital schedules to contact the right clinicians at the right time. This ensures that on-call providers quickly resolve patients’ issues to improve patient experience.

PagerDuty goes global with national preparedness month: Preparing our workforce for crisis

The effects of climate change mean we’re increasingly seeing black swan weather events impacting our working lives. From wildfires and hurricanes to the ever-present threat of earthquakes, 2021 has seen its share of crises. This obviously raises serious questions for companies about the safety of their workforces. As a global company, PagerDuty has employees across the world. When a disaster strikes, everyone needs to have the necessary training, resources and tools to act.

How retailers are improving productivity, transforming incident response, and empowering teams with PagerDuty

For retailers, uptime is money and issues can cost thousands of dollars per minute. With infrastructure comprising complex services such as payment gateways, inventory, and mobile applications, maturing digital operations is vital for ensuring services are always on and customers get the best experience.

Winning on Black Friday - IT Incident Response Made Simple

Even with all the changes in consumer behavior due to COVID-19, Black Friday and Cyber Monday are here to stay. Social distancing measures that limited in-store shopping in 2020 have only led more people to shop online, and this trend is expected to continue in 2021. Preparing your e-commerce website and business for the seasonal surge around Black Friday and Cyber Monday 2021 is crucial.

Why Net at Work employees are sleeping soundly again

Net at Work is a German IT company with over 100 employees that provides its customers with solutions and tools for digital communication and collaboration. Their product NoSpamProxy offers reliable protection against spam and ransomware, legally compliant email encryption and more. Customers of Net at Work are using it as a SaaS solution, and it is being monitored with the agentless network monitoring software PRTG Network Monitor from Paessler AG.

Divisions of Family Practice Adopts OnPage to Enhance Clinical Communication

Effective healthcare communication requires proper software and processes to ensure that the right person receives timely messages. Unfortunately, Divisions of Family Practice (DoFP), a large community-based network of physicians located in British Columbia, Canada, relied on a third-party answering service to connect long-term care facilities (LTCFs) with on-call providers.

What is expected in the SRE role? We analyzed 30 job postings to find out.

In 2016, Google released the definitive book on Site Reliability Engineering (SRE) - a practice that had originated in the company to take care of a monumental problem - how to keep the Google services running with high reliability. Over the years, SRE has been widely adopted by dev teams across the globe and is a popular role at startups and enterprises alike. Here is a look at how search for SRE has trended over the years.

How Do I Add a Major Incident Response to an Existing Integration? - Ask Adam

When we receive an alert, the obvious choice is to accept responsibility for the issue and start resolving it ourselves. But what happens when the incident is far more serious than we thought? With xMatters, you don't have to scramble to find who else is on call; you can configure the platform to find other responders for you.

3 Ways to Use the xMatters and Microsoft Azure Monitor Integration

For a number of years, the debate on DevOps vs. ITIL has divided many technology teams. On the surface, the two practices seem at odds with one another: DevOps harnesses the power of human collaboration and communication to support innovation, while ITIL uses a more systematic and structured approach to deliver service quality and consistency. But if we take a deeper look, we find that not only can DevOps and ITIL co-exist, they can even complement each other.

Best practices for writing incident postmortems

After you have stopped an incident from affecting your customers, you need a more thorough investigation in order to prevent similar incidents in the future. Postmortems record the root causes of an incident and provide insights for making your systems more resilient. At the same time, postmortems can be difficult to produce, since they require deeper analysis and coordination between teammates who are busy with the next development cycle.

Best Practices to Reduce DevOps Burnout

As software development teams struggle with spotty, siloed software delivery cycles, the DevOps approach provides relief by unifying stakeholders to achieve faster, collaborative and continuous software delivery. However, the DevOps methodology fails if it does not address the issue of DevOps burnout. In this post, we’ll uncover strategies that DevOps teams can use to better manage their work environment.

The doctor is in: why domain agnostic AIOps is a necessity for diagnosis

Gartner recently identified two different high-level categories of AIOps: domain-centric and domain-agnostic. Elik Eizenberg, CTO at BigPanda, explains the difference and why you would need the latter to gain an overall view and understanding of your IT Ops.

What's New: Introducing the PagerDuty App for Salesforce Service Cloud

In today’s world of digital everything, where customers are increasingly demanding instant updates when problems occur, it’s more important than ever to take immediate action. Seconds matter, and teams need to be empowered to proactively solve customer-impacting incidents as quickly as possible.

A Developer's Perspective: Lessons from Open Source with FireHydrant and Backstage

We’re proud to announce that our front-end FireHydrant plugin has been open-sourced as part of Backstage, an open platform for infrastructure tooling, services, and documentation created at Spotify. We introduce FireHydrant’s incident management and analytics in Backstage, where you can quickly and efficiently manage your incidents.

New integrations: Amazon EventBridge, ServiceNow, Zendesk, Zammad, Splunk, and More

Our ecosystem continues to grow: we have added 10 new integrations in recent months. Integrations are the bridge between alert sources and on-call teams and have always been a top priority at iLert. They are one of the reasons why iLert is so easy to adopt for small and large companies alike.

PagerDuty Integration Spotlight: Buildkite

PagerDuty’s Change Events are a powerful way to collect information from your service ecosystem. To maintain velocity as your application deployments scale, every second counts. Integrating Buildkite with PagerDuty ensures you have all the information you need, when you need it. After you install the integration from the PagerDuty Service Directory, you’ll be able to configure your Buildkite pipelines to send change events to your services whenever a build completes, pass or fail.

3 Ways to Use the xMatters and Google Operations Suite Integration

Not too long ago, you would have needed development experience to oversee the delivery of scalable and reliable software. But with the rise of low-code and no-code tools, that requirement is now obsolete. What used to be hours of coding has turned into a few minutes of dragging and dropping.

10 questions teams should be asking for faster incident response

2019 and 2020 were worlds apart. Our entire ways of working, living, socializing, and learning were changed almost overnight. Over the last 18 months, technical teams have had to double down on all their digital efforts to help their customers adapt to the new normal. At the same time, teams were responsible for more unplanned work than ever as incidents steadily rose. For the first time, we’ve created the State of Digital Operations Report, which is based on PagerDuty platform data.

Midwifery Care Communities Trust OnPage

The OnPage clinical communication and collaboration (CC&C) system is universally adopted by midwifery care communities across the United States and Canada. OnPage is proud to provide a real-time, secure collaboration platform that allows midwives to improve patient experience. This article examines the continued widespread adoption and implementation of OnPage’s industry-leading CC&C system by midwifery care communities.

Automatic Alert Export to Third-Party Systems

In the SIGNL4 web portal you can manually export historic alert reports as .csv files. In some cases it might be useful to export alert data programmatically. For example, you can forward all alerts, including specific parameters, to InfluxDB and show the alert history in Grafana to recognize peaks, trends and abnormalities over time. You can even use AIOps to recognize certain trends automatically. By using the SIGNL4 REST API it is possible to export alert data automatically.

CheckMK and Enterprise Alert - a scripted heartbeat check

A few days ago I received an inquiry about a scripting problem from one of our longtime partners: our DCP Marc Handel from IT unlimited AG. In the exchange with Marc I realized that his idea of using the Enterprise Alert Scripting Host, the Windows Task Scheduler and CheckMK to implement round-trip monitoring could be interesting for the whole community, especially for all our CheckMK customers.

Introducing our open source SLO Tracker - A simple tool to track SLOs and Error Budget

One of the tools we use internally at Squadcast for SLO and Error Budget tracking is now open-source. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker. We made it open-source so that the SRE community can use it too. Looking forward to getting your feedback, suggestions and patches :)
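For readers new to the terms, an error budget is simply the amount of failure an SLO target permits over a window. The sketch below shows that arithmetic for a request-based SLO; it is a generic illustration, not code from the SLO Tracker, and the function and field names are made up for the example.

```python
from typing import NamedTuple


class ErrorBudget(NamedTuple):
    allowed_bad: float   # failures the SLO permits in the window
    remaining: float     # failures still available before the SLO is breached
    consumed: float      # fraction of the budget already used (1.0 = exhausted)


def error_budget(slo_target: float, total_events: int, bad_events: int) -> ErrorBudget:
    """Compute the error budget for a request-based SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_bad = (1 - slo_target) * total_events
    remaining = allowed_bad - bad_events
    # With no traffic there is no budget to spend; report zero consumption.
    consumed = bad_events / allowed_bad if allowed_bad > 0 else 0.0
    return ErrorBudget(allowed_bad, remaining, consumed)


if __name__ == "__main__":
    # 99.9% target over 1,000,000 requests, with 250 failed requests so far.
    budget = error_budget(0.999, 1_000_000, 250)
    print(f"allowed: {budget.allowed_bad:.0f}, "
          f"remaining: {budget.remaining:.0f}, "
          f"consumed: {budget.consumed:.0%}")
```

A negative `remaining` value means the window has already breached its SLO, which is typically the signal to slow down releases and prioritize reliability work.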

Essential Tools for Site Reliability Engineers

Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover five leading tools that SREs can use to drive the reliability and stability of computing systems. It also examines how SREs can use the tools to improve operations tasks and infrastructure processes.

3 Ways xMatters Can Ease Healthcare Incidents

Many organizations, from technology businesses to complex enterprises, use xMatters to keep their services running and reliable. One industry that has benefited overwhelmingly from xMatters is healthcare. In healthcare, speed and effectiveness are vital. Incidents are critical, and quality patient care is the highest priority.