June 2021

Threat Stack and Squadcast Integration Streamlines Alerts with Greater Context

This is a guest post collaboration between Squadcast & Threat Stack. The move to the cloud has rapidly expanded the cyber threat surface of modern cloud apps. This blog, in partnership with Threat Stack, outlines how you can stay on top of your game with the help of context-rich alerting and resolve security incidents rapidly, along with a few best practices to follow for faster incident response.

Wiley Relies on PagerDuty as the World Moves Towards Digital Learning

John Wiley & Sons, Inc., commonly referred to as Wiley, is a global publishing company founded in 1807 that focuses on academic publishing and instructional materials. Sean Mack, CIO and CISO of Wiley, discusses how PagerDuty is empowering teams to own and support services 24/7/365 as digital learning becomes more prevalent.

xMatters Lunar Lander Release - New Product Features - xMatters Demo

xMatters Lunar Lander release is here! Join Sr. Director of Customer Success, Kerin Munro, and Product Manager, Daniel Reich as they discuss some of the latest and greatest product features that went live with the Lunar Lander release. These updates include new possibilities in xMatters Flow Designer with a create alert step and an incident severity step, updates to Event Flood Control, and more!

3 Steps For A More Strategic Approach to Incident Reduction

When an IT incident negatively impacts employee experience, IT teams rush to remedy the issue – understandably, as a widespread incident can have major effects on employees’ productivity, security, and overall experience. Yet, so many IT teams find themselves drowning in support tickets even as they continue to resolve top call drivers (the incidents that affect the most employees and drive the most support requests).

How Integrations Lead to Easier, Quicker and Better Decision-Making

Whether from a monitoring tool such as Datadog, a collaboration tool such as Slack, an automation tool such as Chef or a ticketing tool such as ServiceNow or JIRA, AIOps seamlessly integrates data from all of your IT sources. A robust AIOps solution with integrations can help your DevOps and SRE teams know where to begin fixing problems, resolving incidents before they affect services and reducing downtime.

Why You Need Real-Time for Faster MTTR

“If you ain't first, you're last.” While that famous one-liner from Ricky Bobby (Will Ferrell) in the cult hit Talladega Nights is more joke than catchphrase, it hits home for those of us in the world of DevOps and Observability. Faster is better. And in our technology-driven world of online transactions and complex environments, faster isn’t just better — it’s crucial.

Have You Herd? Episode 2 | Getting into the DevOps Culture

Join the Moogsoft Engineering team for our second episode of Have You Herd?! In this episode, we talk about how you can get into the DevOps culture, covering questions like: How do you contribute to a DevOps culture as an individual contributor? What pipelines and tools should a company have set up before embarking on the DevOps journey? What kind of skills should you have to market as a DevOps leading engineer?

Solarisbank Banks on PagerDuty to Keep Financial Services Online

Solarisbank is Europe’s leading Banking-as-a-Service platform that enables any business to offer their own financial services. Satyajit Ranjeev, Daria Kameneva, and Jens Hermann discuss how PagerDuty helps teams implement a “you build it, you own it” model and reduce incident response times.

Can Emails Initiate xMatters Workflows? - Ask Adam

You’ve spotted an incident, but how do you get your team to start working on it? xMatters workflow expert Adam can show you how. Email triggers in xMatters are fast and effective, and a great way to get workflows going with minimal fuss. There are a few steps to getting them configured right, so let's go through it from the beginning.

How to Introduce Automation to Incident Response with Slack and PagerDuty

Major-incident war rooms are synonymous with stress. Pressure from executives, digging for a needle in a haystack, too much noise—it’s all weight on your hardworking technical teams. Incident responders clearly need a more effective way to collaborate across various technical teams. A method that both minimizes interruptions and keeps stakeholders up to date while ensuring everyone has the right level of context to do their job.

Reliable ticket and incident alerts with ConnectWise and SIGNL4

With SIGNL4 your on-call teams and field services engineers will never miss a critical ticket. And they won't suffer alert fatigue, either. SIGNL4 adds reliable mobile alerting by push, text and voice call, event filtering, duty scheduling and much more to ConnectWise within a few minutes.

Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

PagerDuty Summit21 Keynote: DigitalOps Now: Go Digital First with Modern Digital Ops Management

To succeed in a world of digital first customer experiences, operations must also be digital first. Join PagerDuty CEO Jennifer Tejada & CPO Sean Scott as they share the latest PagerDuty innovations and our vision for the future of work. Don't miss exclusive fireside chats with Fox Corporation executives Paul Cheesbrough, CTO & President of Digital and Jeff Dow, EVP for Media and Broadcast, as well as Kim Hammonds, Investor and Board Member at Zoom, Box, Tenable and UiPath and The Goldman Sachs Group, Inc.

Leverage Observability With OpenTelemetry to Understand Root Cause Quickly

An observability solution should help any incident responder understand what changed and why. A lot has been written on the difference between monitoring and observability, but an easy way to understand how both are integral to incident response is to consider how customers use PagerDuty—with both monitoring and observability tools—to get to the right answer.

SREview Issue #14 June 2021

Hoping you're headed towards a fun summer season and some time without masks. Let's avoid a new kind of tan-line! This newsletter shares useful industry content and an exciting Blameless product announcement. Find our fave tweets and events in the SRE and resilience engineering community. We're hiring! Check out the job openings here.

xMatters Makes Workflow Automation as Simple as Drag and Drop

xMatters’ low- to no-code integrations make creating automated workflows that align your team and processes as simple as drag and drop. With just a few clicks, your teams can be building workflows that integrate, automate and accelerate your incident response and resolution capabilities. Best yet, xMatters is free to use and you can get started today at xmatters.com/free!

Red Canary says 43% Lack Readiness to Notify Customers of a Security Breach

The phrase “stakeholder management” assumes that stakeholders are truly informed by alerts. However, managers can only send communications out; they cannot force people to address them. To ensure your stakeholders are engaged during an incident, it is vital to set up a defined communication process. Yet, a recent Red Canary report found that 43% of surveyed participants lack readiness to notify the public and/or their customers in the event of a security breach.

Everything You Need to Know About Emergency Risk Management

Emergency risk management (ERM) is the process of identifying potential threats and minimizing the impact of disasters on business operations and people. The process requires leaders within an organization to determine how they will keep stakeholders informed and safe during critical events. Leaders must also craft disaster recovery plans to quickly remedy the effects of a catastrophic event on communities, government agencies and organizations.

Monthly Moo Update | May 2021

Goodbye May, Hello June! It’s summertime in the northern hemisphere and the sun is shining bright, along with the updates we’ve got for you this month. The team at Moogsoft is working on a few big items that will be sure to put a smile on your face. But let's not forget some of the smaller items that help you day in and day out.

Manage incidents on the go with the Datadog mobile app

The Datadog mobile app enables you to check your alerts and dashboards from anywhere, so you can triage issues—and stay up to date—regardless of whether you have access to a laptop. You can now be even more productive when responding to issues while away from your keyboard by declaring incidents and notifying responders directly from your mobile device.

WEX Automates the Triage Process and Delivers a Better Services Experience - xMatters Demo

Does your internal triage process keep you up at night, literally or figuratively? WEX used to have triage and onboarding issues that got in the way of their success too, but with xMatters, they’ve found a better way. Join James Molchanoff (JT), Information Systems Engineer at WEX, John Kallio, Information Systems Engineer at WEX, Will Derksen, Product Advocate at xMatters, and Zoe Na, Customer Success Manager at xMatters, as they discuss how WEX has embraced xMatters to reduce triage and call-out time and simplify their onboarding process.

Maximize Collective Knowledge to Deliver Patient Care

Medical practitioners must move beyond their own expertise to make informed patient care decisions. This can be achieved by normalizing team collaboration, encouraging providers to access information gathered by other specialists along the patient’s continuum of care. However, healthcare is plagued with fragmented communication due to archaic technology. There is also a lack of accountability when establishing communication roles and responsibilities across care teams.

7 key processes for running a top performing NOC

Much of the fuel for today’s business organizations is comprised of cloud computing and digital and SaaS applications. So, if something goes wrong with them, there will be a grave impact on productivity, customer satisfaction and even loyalty, as well as on the costs required for resolving the incident, remediating damage, and getting back to business.

Complete Guide to Service Level Objectives (SLOs) That Work

Wondering what Service Level Objectives (SLOs) are? In this article, we will explain service level objectives and how they relate to SLAs, SLIs, and error budgets. A Service Level Objective (SLO) is a reliability target that is measured by a Service Level Indicator (SLI) and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.
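
To make the SLI/SLO relationship concrete, here is a minimal sketch in Python. It assumes the SLI is a simple ratio of good events to total events (for example, successful requests over all requests); the function names `sli` and `meets_slo` and the numbers in the usage note are hypothetical and do not come from the linked article.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator as the fraction of events that were good."""
    if total_events == 0:
        # Assumption for this sketch: no traffic counts as fully healthy.
        return 1.0
    return good_events / total_events


def meets_slo(good_events: int, total_events: int, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo_target
```

For instance, with an assumed 99.9% target, `meets_slo(999, 1000, 0.999)` holds, while `meets_slo(998, 1000, 0.999)` does not.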

The MTTR that matters

“Mean time to X” is a common term used to describe how long, on average, a particular milestone takes to achieve in incident response. There’s mean time to detect, acknowledge, mitigate, etc. And then there’s the elusive “mean time to recover,” also known as “MTTR.” MTTR, a hotly debated acronym and concept, measures how long it takes to resolve an incident on average. The problem with MTTR, though, is that it doesn’t matter.
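
As a small illustration of the arithmetic behind any "mean time to X" figure, the sketch below averages a list of incident durations. The function name `mean_time_to_recover` and the sample durations are hypothetical, and how a team defines the start and end of an incident is left as an assumption.

```python
from datetime import timedelta


def mean_time_to_recover(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of incident durations (detection to recovery)."""
    if not durations:
        raise ValueError("need at least one incident duration")
    # Summing timedeltas requires a timedelta start value.
    return sum(durations, timedelta()) / len(durations)


# Three hypothetical incidents lasting 30, 60, and 90 minutes.
incidents = [timedelta(minutes=30), timedelta(minutes=60), timedelta(minutes=90)]
mttr = mean_time_to_recover(incidents)
```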

Here's what SLIs AREN'T

SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics that are monitored from the system. SLIs transform lower level machine data into something that captures user happiness. Your organization might already have processes with this same goal. Techniques like real-time telemetry and using synthetic data also build metrics that meaningfully represent service health.

Press Release: iLert achieves Amazon RDS Ready designation

Cologne, Germany – iLert GmbH, a SaaS company for alerting, on-call management, and uptime monitoring, announced today that it has achieved the Amazon RDS Ready designation, part of the Amazon Web Services, Inc. (AWS) Service Ready Program. This designation recognizes that iLert has demonstrated successful integration with Amazon Relational Database Service (Amazon RDS).

Faster Incident Resolution with Context Rich Alerts

Labelling your alert payloads, although simple, can significantly reduce the time it takes for your team to respond to incidents. In this blog, learn how Squadcast's auto-tagging feature can be a game changer by enabling intelligent labelling & routing of alerts to ultimately reduce your MTTR. A frequent problem faced by on-call engineers when critical outages occur is pinpointing the exact point of failure.

AIOps as a modern cockpit, and why that matters

Our human capacity for ingesting information and acting on it is constant. As the systems we operate grow more complex, we need to make sure we use technology that presents us with only the relevant information we need, exactly when we need it. In aviation, this lesson was learned long ago, and now IT Ops is catching up.

5 Steps to Building an Effective Clinical Communication Plan

Organizations require a well-crafted clinical communication plan to streamline workflows across care teams. The communication plan must include processes, hardware and software that improve how providers perform. An effective communication plan eliminates barriers across departments and ensures that all providers are informed of patient-related incidents. High-level healthcare administrators are responsible for designing, managing and launching the clinical communication plan.

Chapter 7: In Which Sarah Experiments with Observable Low-Code

This is the seventh chapter in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our DevOps Engineer, Sarah, experiments with low code and Moogsoft in her team’s DevOps toolchain to rush a new feature out the door to keep up with a competitor.

Streamline incident management with BigPanda's offering in the Datadog Marketplace

BigPanda is a domain-agnostic AIOps platform that helps organizations detect and resolve incidents in their complex IT environments. By unifying and correlating data from monitoring, change, and topology tools, BigPanda enables teams to quickly pinpoint the root cause of issues and prevent costly outages.

Are you an MS Teams shop? We've got you Covered with Blameless Incident Resolution

We have an exciting announcement. Blameless is providing early access to our Microsoft Teams integration. SRE and engineering teams can now resolve incidents faster without leaving the comfort of their favorite messaging tool. With the Blameless incident resolution product, Microsoft Teams users can now reduce toil in routine incident response processes through automation, codify processes with checklists, and craft retrospectives with the ‘add to timeline’ command.

Have You Herd? Episode 1 | DevOps vs SRE

Join the Moogsoft Engineering team for their inaugural stream as we tackle the big questions - How do we define DevOps? And as it becomes more mainstream - will the roles of development & ops combine forever into super powered developers, or does the complexity of our systems require further specialization between the two roles?

Who is on standby? Simple question, simple answer.

In our feature session for the current Enterprise Alert release, we were asked if it was possible to make the on-call page available to every employee regardless of whether they have a user account in Enterprise Alert or not. This option has existed in Enterprise Alert for a long time, but admittedly it is not very well documented. So I would like to take this opportunity to show you what the on-call overview can offer you and how to share the on-call page.

Copy and Paste Multi-Team Schedules

With the release of Enterprise Alert 9, not only have our capabilities for tighter integration with almost any source system imaginable been massively expanded, but our front end has also received some much requested updates. Among them are our multi-team schedules. These allow, especially for international companies, simple and clear on-call planning for several teams across different time zones.

Integration of Enterprise Alert 9 with Azure Monitor

Our Azure Monitor connector provides seamless 2-way integration of Enterprise Alert 9 with Azure Monitor. Once added to your Enterprise Alert instance, the connector will read your Azure Monitor alerts fully automatically and trigger alert notifications, e.g. to your team members on duty. It also synchronizes the alert status from Enterprise Alert 9 to Azure Monitor so that if alerts are acknowledged or closed, this status is also updated on the corresponding alert in Azure Monitor.

Error Budgets Explained (And How to Make One for Your Team)

Wondering what error budgets (EBs) are and how they are useful? We explain what they are, how they are defined, and how they can help your team. An error budget is the amount of acceptable unreliability a service can have before customer happiness is impacted. If a service is well within its budget, the developers can take more risks in their releases. If not, developers need to make safer choices.
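
The idea of an error budget can be sketched in a few lines of Python. This sketch assumes the budget is counted in events (the share of requests allowed to fail under a given SLO target); the function names and the figures in the usage note are hypothetical and are not taken from the linked article.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measurement window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Budget left after subtracting the bad events observed so far."""
    return error_budget(slo_target, total_events) - bad_events


def within_budget(slo_target: float, total_events: int, bad_events: int) -> bool:
    """True while budget remains, i.e. releases can afford more risk."""
    return budget_remaining(slo_target, total_events, bad_events) > 0
```

With an assumed 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests. After 400 failures about 600 remain and `within_budget` returns `True`; after 1,500 failures it returns `False`, which is the signal to make safer choices.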