Operations | Monitoring | ITSM | DevOps | Cloud

February 2021

Integration with 3rd Party Systems

Integrating third-party systems with Enterprise Alert: what is possible? In my work with new and existing customers, I keep coming across the assumption that Enterprise Alert cannot be integrated with certain third-party systems to receive and process their events and fault messages. First of all, we have to say: we can integrate anything that communicates digitally in any way.

Preparing to Fail Fast so You can Recover Faster

The principle of fail fast is either the best thing since the transistor or nothing but hot air. It depends on the size of your organization and the cohesiveness of your teams. If your team members have a strong working relationship, and dev is well integrated with everyday work company-wide, you already have a good foundation for this particular agile thinking. Most companies that have grown beyond startup-size, and even some startups, may find this idea a bit jarring.

Announcing Updated Analytics Filters to Dive Even Deeper into your Historic Incident Data

After successfully implementing a conditional evaluation engine into Runbooks, we started looking at other places in FireHydrant that would be improved with this engine. After hearing a lot of feedback from you, we’ve implemented conditions into our Analytics page. Let’s dive in and see what new things are possible with this new filtering.

Product Updates: Creating a New Runbook Just Got Easier with Templates

Starting out with runbooks can be daunting, so we've built a way to package our best practices into a runbook that can be implemented in a single click. On top of this, there are now even more ways to attach runbooks to your incidents and a much easier way to test the runbook that you're currently working on.

6 Ways Retailers Can Maximise Value With Creative Engineers

“Engineering” and “creativity” aren’t often considered synonymous. However, in today’s world, where the online experience is at the forefront of virtually all business transactions and experiences, the creative engineer is finally getting the recognition they deserve. These individuals are quite literally building the virtual world we live in.

Can enterprises move fast without breaking IT?

In one of our recent webinars we discussed a challenge in digital transformation that is top of mind for many IT Ops leaders: how to actually transform with the least amount of pain… No matter how tired people are of the term “digital transformation”, it still represents an imperative strategy for enterprises wishing to survive in today’s dynamic business environment, let alone see growth and increased market value.

Overview of Incident Lifecycle in SRE

Incidents that disrupt services are unavoidable. But every breakdown is an opportunity to learn and improve. Our latest blog is a deep dive into best practices to follow across the lifecycle of an incident, helping teams build a sustainable and reliable product, the SRE way. As the saying goes, “Every problem we face is a blessing in disguise”.

AlertOps Expert Guidance

At AlertOps, we believe our job doesn’t end when you complete the procurement of our software service. It has just begun. Our support team is only a call, email or chat away, ready to guide you toward your goal. Our Customer Success team will also be in constant touch, helping you with your issues at hand, guiding you on your usage patterns, recommending any under-leveraged options and highlighting new features and services that we have rolled out.

What is IT Monitoring?

IT monitoring involves the use of a combination of technologies to simultaneously ensure IT equipment performs as expected and resolve any identified IT problems. The capabilities of IT monitoring technologies vary; some technologies can perform a basic assessment of equipment across an IT environment, while others can automate the identification and remediation of equipment issues. Your business can leverage monitoring technologies, but optimizing their value requires careful evaluation.

AlertOps Automation

AlertOps is built for today’s fast-paced enterprise. Managing an incident can often be chaotic, so a streamlined workflow that automatically routes to the next step in your business process helps resolve the issue at hand quickly. End-to-end automated workflows, aided by a rules engine, allow you to optimize your processes and manage tasks efficiently.

The fault line: How to communicate in a crisis

If there’s one universal constant in the world of business, it’s that things will go wrong. Probably at the most inconvenient of times and in the most inconvenient of ways. It’s Murphy’s law, or, if you’re from England, the much more fun “Sod’s law”. These moments can define your business more than any other. Unfortunately, far more than the usual day-to-day ever will.

AlertOps Flexibility

We believe that our customers should not have to make compromises in their business process to implement and use AlertOps. AlertOps offers total flexibility, meaning it is highly configurable, legitimately addressing your pain points. This is one of our core tenets: from the ideation stage of our product roadmap, we think through the various design aspects to allow the user maximum flexibility to configure the software to their needs.

Q&A: Datadog Expands Monitoring Reach with Moogsoft Observability Cloud

Nobody will dispute that a common goal of DevOps pros and SREs, and really any company today, is to delight their customers more by disappointing them less. This was the theme of a recent live webinar focused on announcing a new game-changing partnership between Datadog and Moogsoft. The live session combined remarks by Moogsoft CEO Phil Tee and CTO Dave Casper on bringing together the best of these two technologies with a new seamless integration.

IT Operations Glossary 2021

With increasing complexity and workloads, the world of IT operations is constantly evolving to meet the needs of digital-first organizations. Automation, AI and DevOps are intersecting today like never before. A constant influx of new technologies means new terms. Here's our take on the meaning of leading words and phrases in the space right now.

Getting Started as an SRE? Here are 3 Things You Need to Know.

We live in the era of reliability. The most important feature for a service is how dependable it is in the eyes of a user. Companies are hiring with this in mind. In a 2019 LinkedIn article, site reliability engineers were listed as the 2nd most promising career in the United States. But how do you get started as an SRE? In this blog post, we’ll look at: SRE is a multifaceted role. You will contribute to an organization's code base, policy, culture, and more.

FAANG proofing your Job Applications

There is one thing that hurts more than being rejected by a hiring manager - being rejected because you’re not ex-FAANG. This was not always the case though - FAANG’s combined engineering workforce is currently at 330,000+ and growing at an astounding 20% YoY. This means that at any given point in time, there are tens of thousands of FAANG engineers active in the job market vying for spots in great up-and-coming companies.

4 Things you Need to Know about Writing Better Production Readiness Checklists

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. Checklists owe this long and widespread adoption to their usefulness. Checklists can also help limit errors when deploying code to production. In this blog post, we’ll cover: Production checklists should be holistic.

IDC Value Assessment Tool: How Much Value You Could Get With PagerDuty

Many IT vendors claim to provide value and help organizations strengthen their digital operations, but we wanted to go a step further and quantify our true business value. We recently commissioned a study with leading analyst firm IDC to capture and measure what kinds of results our customers are able to achieve by using our platform.

February 2021 Update: Copy duty slots, password change and 2-way integration with ServiceNow and Checkmk

With our February 2021 update, we have added more features to our account management portal. A new copy mode in the duty and shift calendar provides an even better shift and duty scheduling experience. You can now also monitor your personal profile and change your password in the account portal. Finally, we have added 2-way integration capabilities for ServiceNow and Checkmk.

Alert Escalation in Enterprise Alert

Enterprise Alert® is the leading enterprise-class software for automated, targeted, and traceable alerting. But what does “escalation” mean in this case? In the course of my work, be it with new customers, existing customers or even prospects, I repeatedly find that the term escalation is often defined very differently. Therefore, I would like to clarify in the context of this blog what we mean by escalation.

Maximize IT Investment by Integrating with CloudReady and ServiceNow

Businesses around the world are striving to accelerate digital transformation and increase IT visibility, but tool incompatibility and API integration can be obstacles. This white paper shows how Exoprise CloudReady can be easily integrated into tools such as ServiceNow, Splunk, PagerDuty, Moogsoft, and Slack to streamline incident management.

How to create a Status Page for your business in under 3 minutes?

In this video, you’re going to learn exactly how to create a Status Page for your business in just 3 minutes! To be clear: Creating a Status Page takes hard work. But with this video tutorial, you’ll have a proven process that you can use to create a Status Page in under 3 minutes. All in all, you save a lot of time and can relax more.

Top MSP Trends to Look out For in 2021

Unlike the previous year, more managed service providers (MSPs) are embracing digitalization in 2021. Many organizations have adopted a remote-first mentality and are investing in technologies that are strategically aligned with this new way of life. Despite the 3.2 percent decline in IT spending in 2020, Gartner projects that spending will surge and reach $3.9 trillion worldwide in 2021. More IT organizations are investing in enterprise software as remote work becomes essential.

Five Healthcare IT Trends to Watch in 2021

Healthcare information technology (healthcare IT) trends focus heavily on process improvements and clinical efficiencies. Providers can use advanced, emerging technologies to deliver quality care and overcome the challenges of today’s global health crisis. Trendspotting allows healthcare organizations to stay prepared for disruption and ensures they continue to innovate every year.

New Ops Guide: Best Practices for On-Call Teams

The always-on, always-available expectations of digital services have increased the requirements of technical teams to be ready and provide response around the clock. For teams new to this concept, introducing on-call can be stressful and complex. As part of PagerDuty’s main platform, on-call management is key to our business, but the non-technical aspects are also important for teams to consider.

Streamlining IT Operations with BigPanda and ServiceNow

Does the following sound familiar? You have a complex, hybrid and dynamic IT stack – with your cloud infrastructure changing by the minute and your container infrastructure changing by the second. Your monitoring and observability tools provide excellent visibility into your infrastructure, your applications and your services, but the dynamic environment in which they operate causes them to generate large volumes of heterogeneous machine data, with thousands of alerts a minute.

How to Improve Your Building Management System

A building management system (BMS) lets your business monitor and control mechanical and electrical equipment across one or more buildings. Heating, ventilation, and air conditioning (HVAC), security, and other systems linked to a BMS usually represent 70% of a building’s energy usage. So, proper configuration of your BMS is key; a poorly configured system can negatively impact your building’s efficiency, maintenance, security, and safety.

4 Tips on Preparing for a [Great] Failure

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

What are MTTR, MTBF, MTTF, and MTTA? A guide to Incident Management metrics

In today’s fast-moving digital world, it has become critical for businesses to measure and track their service delivery performance, especially the incident management metrics that monitor the uptime of systems, downtime due to outages, and how quickly and efficiently issues are resolved. Even a slight glitch in the system can disrupt business processes and cost millions of dollars.
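As a rough illustration of how such metrics can be derived, the sketch below computes MTTA, MTTR, and MTBF from a list of incident timestamps. It assumes the commonly used definitions (MTTA as the mean time from alert to acknowledgement, MTTR as the mean time from alert to resolution, MTBF as total uptime divided by the number of failures). The `Incident` record, the sample data, and the observation window are hypothetical and not taken from the article. MTTF is left out because it is usually applied to components that are replaced rather than repaired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime       # when the alert fired / the outage began
    acknowledged: datetime  # when a responder acknowledged the alert
    resolved: datetime      # when service was restored


def mean(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    """Mean time to acknowledge: alert fired -> acknowledged."""
    return mean(i.acknowledged - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve: alert fired -> resolved."""
    return mean(i.resolved - i.started for i in incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window / number of failures."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


# Made-up sample data: two incidents in a ten-day observation window.
window_start = datetime(2021, 2, 1)
window_end = datetime(2021, 2, 11)
incidents = [
    Incident(datetime(2021, 2, 2, 10, 0),
             datetime(2021, 2, 2, 10, 5),
             datetime(2021, 2, 2, 11, 0)),
    Incident(datetime(2021, 2, 6, 14, 0),
             datetime(2021, 2, 6, 14, 15),
             datetime(2021, 2, 6, 16, 0)),
]

print("MTTA:", mtta(incidents))  # 10 minutes on average
print("MTTR:", mttr(incidents))  # 90 minutes on average
print("MTBF:", mtbf(incidents, window_start, window_end))  # 118.5 hours
```

Teams define these metrics differently (MTTR in particular can mean repair, recovery, response, or resolution), so the start and end timestamps used here should be adjusted to match whatever definition your organization reports against.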

Using BigPanda and ServiceNow to prevent and resolve outages

BigPanda augments ServiceNow and helps IT Ops teams work more efficiently in modern IT stacks, reducing MTTR by 40% or more. By using BigPanda and ServiceNow together, IT Ops teams get real-time service mapping for dynamic infrastructures, can easily reduce and automate ServiceNow ticketing, and are able to surface the root cause changes affecting their continuous delivery.

Customer Devotion: How We're Bringing OneDuty to Life

It’s been almost a year since the world changed overnight and industries across the world quickly adapted to living, working, and learning fully virtually. While the world seemed to stop in an instant, many businesses saw an increase in demand and new challenges. PagerDuty was no different.

Communication Tool Down? Here are 3 Ways to Handle it

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: that failure is inevitable.

How to get a phone call when your API fails

Learn how you can get a phone call alert when your API fails. Spike.sh sends you alerts via phone call, SMS message, email and Slack when you have any issues in production. Spike.sh integrates with your infrastructure, performance monitoring, error tracking, uptime monitoring, API monitoring and cron job monitoring tools. Our integrations include AWS, Google Cloud, Datadog, Grafana, Prometheus, New Relic and many more.

How to get a phone call when your cron job fails

Learn how you can get a phone call alert when your cron job fails. Spike.sh sends you alerts via phone call, SMS message, email and Slack when you have any issues in production. Spike.sh integrates with your infrastructure, performance monitoring, error tracking, uptime monitoring, API monitoring and cron job monitoring tools. Our integrations include AWS, Google Cloud, Datadog, Grafana, Prometheus, New Relic and many more.

The unattainable promised land of tool consolidation

It’s on the agenda of almost every CIO, COO and CFO, and sounds like a great idea in general: tool rationalization, often trying to standardize on top of a single vendor. It can reduce costs and provide a streamlined IT Ops process through data consistency, a single pane of glass and a single source of action.

A reliable and secure on-premises alerting solution for NRB Group

“We wanted to have something directly on-prem and exchange our physical servers for virtual ones. Enterprise Alert ensures higher internal flexibility for our technical teams. The product was easy to implement and just as easy to use.” Gregory Beterams, NRB Group

New IDC Study Highlights PagerDuty's Multi-Million Dollar ROI

Dependence on digital business skyrocketed in the last year, with customers expecting seamless, always-on access to applications and digital services from any device, anywhere. This trend has placed developer and IT teams under more pressure than ever before to not only deliver these digital experiences, but keep them up and running at all times.

What is IT Infrastructure Management (IM)?

Effective IT Infrastructure Management or “IM” is crucial. If your business prioritizes infrastructure management, it is well-equipped to keep its software applications and networks running at peak levels. Plus, your business can avoid downtime, outages, and other costly, time-intensive IT problems that put your operations and stakeholders at risk. How you manage your infrastructure can have far-reaching effects on all aspects of your business.

Actionable Insights - Faster Incident Resolution with Datadog and Moogsoft Observability Cloud

Context is king, they say, and anything you can do to improve context both makes decisions and assessments more reliable and speeds up the decision process. A new, bi-directional integration between Moogsoft Observability Cloud and Datadog does just that. Many SRE teams rely on Datadog to provide comprehensive information about their application stacks.

What is the Difference between SLAs and OLAs?

In traditional IT environments, services to customers are delivered and supported by the organization. A Service Level Agreement (SLA) is created with details such as what the availability of the service will be, how reliable the service will be, what penalties can be charged in case of downtime, etc. The internal teams, such as the network administration team, development team, and IT service desk, would then draw up Operational Level Agreements (OLAs) to support the SLA.

"I'm Just Doing my Job," An SRE Myth

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

Stay Alert to Security With Xray and PagerDuty

When it comes to securing your software development against open source vulnerabilities, the earlier action occurs — by the right person — the safer you and your enterprise will be. Many IT departments rely on the PagerDuty incident response platform to improve visibility and agility across the organization.

Incident Communication Is a Key Part of Resolving Network Issues

You’ve just received a notification—a major network issue has occurred. Hoping it’s a false positive, you complete an initial triage. Dang it! It’s the real thing. If you’re like me, your mind likely turns to one thing: fixing the issue as fast as you can. But hold on! Before you turn completely to fixing it, there’s another important aspect to any incident that you can’t forget, and that’s incident communication.

Carrefour Bank Uses PagerDuty and Rundeck to Automatically Self-Heal Incidents

With the mission of transforming the customer experience for financial services, Carrefour Bank offers a wide portfolio of financial products created to meet and satisfy different customer needs. Learn how Carrefour Bank leverages PagerDuty and Rundeck to automatically self-heal incidents to keep customers happy and resolution times down.

PagerDuty's Ops Guides Get a Fresh New Look

The Community and Advocacy Team here at PagerDuty recently spruced up our library of ops guides, and we’re excited to share them with you. If you’re not familiar with the ops guides, they are an open-sourced collection of long-form documents that cover a variety of topics related to real-time operations and incident management. We’ve given them some spiffy new headers, cleaned up some sneaky errors, and added a new section titled “Next Steps.”