April 2021

SRE Leader Panel: Business Agility is what matters, SRE can help you get there

Apr 29, 2021 By Blameless In Blameless

Ready for another SRE Thought Leader Panel? This one’s theme is “Business Agility is what matters, SRE can help you get there.” We’re chatting about topics like the value of crisis during incident response, the best and worst tech transformations we’ve seen, how reliability impacts the flow of value, and more. This panel is hosted by Chris Hendrix, staff software engineer at Blameless, and features guests.

View Video

Blameless

Incident Response Alert Routing

Apr 29, 2021 By John Hasinsky In PagerTree

You have identified a data breach, now what? Your Incident Response Playbook is up to date. You have drilled for this, you know who the key players on your team are and you have their home phone numbers, mobile phone numbers, and email addresses, so you get to work. It is seven o’clock in the evening so you are sure everyone is available and ready to respond, you begin typing “that” email and making phone calls, one at a time.

Read Post

PagerTree

7 Ways SRE Is Changing IT Ops And How To Prepare For Those Changes

Apr 29, 2021 By Squadcast Community In Squadcast

SRE best practices are disrupting and catalyzing change in the ways organizations approach IT Operations. In this blog we look at 7 ways SRE is driving this transition. Site Reliability Engineering is a new practice that has been growing in popularity among many businesses. Also known as SRE, the new activity puts a premium on monitoring, tracking bugs, and creating systems and automations that solve the problem in the long term.

Read Post

Squadcast

How Kubernetes Can Both Help and Hinder Incident Management Teams

Apr 29, 2021 By Quentin Rousseau In Rootly

Kubernetes makes it easier in certain ways to manage reliability. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.

Read Post

Rootly

4 Major Capabilities of Automated Incident Management

Apr 29, 2021 By Christopher Gonzalez In OnPage

Automated incident management ensures that critical events are detected, addressed and resolved in a fast, efficient manner. Automation allows incident management tools to integrate with each other and fosters instant communication across the systems. Automation tears down barriers across IT operations (ITOps) teams and ensures all departments are on the same page. Teams gain full visibility into incident status to verify that incidents are addressed by the relevant groups.

Read Post

OnPage

Introducing Enterprise Alert 9's New 2-Way Rest

Apr 28, 2021 By Derdack In Derdack

A brief introduction to the powerful new 2-Way Rest introduced with the new Enterprise Alert 9

View Video

Derdack

Pragmatic Incident Response: Lessons learned from failures by Robert Ross Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

Incident response is overwhelming. So where do you start? There's a lot of advice out there, but it's mostly theories that aren't taking reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder, Robert Ross, will share quick stories from his experience as an SRE and what tips he’s learned along the way.

View Video

Gremlin

JFrog and PagerDuty Extend Ecosystem Integration

Apr 28, 2021 By Juan Perez In JFrog

JFrog and PagerDuty have deepened their technology integration to further boost IT operators’ and developers’ visibility into the software development lifecycle and accelerate incident resolution. The latest integration, which involves the JFrog Pipelines DevOps pipeline automation solution, simplifies and streamlines how to identify faulty builds that impact production environments.

Read Post

JFrog

Introducing 2-way REST capabilities with Enterprise Alert 9

Apr 28, 2021 By Derdack In Derdack

The REST API in Enterprise Alert 9 has now been extended with 2-way functionality. This allows webhooks or REST endpoints of third-party systems to be called on alarm status changes (acknowledge, close). Thus, in Enterprise Alert 9, it becomes child’s play to establish a 2-way integration with almost any REST-enabled third-party system.
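
As a rough illustration of the outbound half of such a 2-way integration, the sketch below builds (but does not send) an HTTP POST when an alarm changes status. It is not Enterprise Alert's actual API: the function name, the payload fields, the status strings, and the target URL are all hypothetical.

```python
import json
import urllib.request

# Hypothetical status names, modeled on the "acknowledge, close" examples above.
NOTIFY_ON = {"acknowledged", "closed"}


def build_status_webhook(alarm_id, new_status, target_url):
    """Return a POST request describing an alarm status change, or None
    when the new status is not one we want to report to the third party."""
    if new_status not in NOTIFY_ON:
        return None
    # Assumed payload shape; a real integration defines its own schema.
    body = json.dumps({"alarm_id": alarm_id, "status": new_status}).encode("utf-8")
    return urllib.request.Request(
        target_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_status_webhook(
    "A-1001", "acknowledged", "https://example.invalid/hooks/alarm"
)
# Sending is left out on purpose; urllib.request.urlopen(req) would perform the call.
```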

Read Post

Derdack

What is Site Reliability Engineering [Simple Intro to SRE]

Apr 26, 2021 By Emily Arnott In Blameless

Wondering what SRE is all about? We will explain what it is, how it works, why it was developed, and how it can help your organization. So what is SRE (Site Reliability Engineering)? SRE is a methodology that fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems. Site Reliability Engineering (SRE) was developed by Google engineer Ben Treynor Sloss in 2003. Google’s goal was to increase the reliability of its sites and services.

Read Post

Blameless

Enterprise Alert 9 - Launch Webinar April 2021

Apr 23, 2021 By Derdack In Derdack

All new features of Enterprise Alert 9

View Video

Derdack

New Gartner AIOps Platform Market Guide Shows More Use Cases for Ops and Dev Teams

Apr 23, 2021 By Richard Whitehead In Moogsoft

Gartner jumps right into it, describing a reorientation of a tool that has previously focused on IT service management and automation. AIOps is now also enabling a variety of new observability use cases for DevOps and Site Reliability Engineering (SRE) teams. This blog presents the guide’s major findings and a link so you can read the report for more details.

Read Post

Moogsoft

FireHydrant April 2021 Product Updates: Incident Tags & Customizable Slack Incident Modals

Apr 23, 2021 By FireHydrant In FireHydrant

We're excited to announce the release of two new features this month: customizable Slack incident modals and Incident Tags. Keep reading to learn more about how they can help your teams manage incidents better!

Read Post

FireHydrant

Creating Chaos to Achieve Reliability

Apr 22, 2021 By JJ Tang In Rootly

How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.

Read Post

Rootly

5 typical mistakes in alerting and how to avoid them

Apr 22, 2021 By Matt In SIGNL4

A good alerting strategy is an important prerequisite for successful operations management and the availability of mission-critical systems, and also for employee satisfaction. It’s not just about sending out alerts on critical conditions, problems and failures, but more importantly about how it is done. Here are the 5 most typical mistakes, their consequences and how to avoid them.

Read Post

SIGNL4

Join the Dark Side with Enterprise Alert

Apr 22, 2021 By Derdack In Derdack

In recent times, dark viewing mode for websites has gained a lot of popularity with users worldwide. This applies not just to your favorite sites, but also to the applications that you rely on day in and day out. Enterprise Alert is no exception. We have heard your requests and are happy to announce that Enterprise Alert 9 now has a dark mode! In the footer of the Web Portal, there is now a Dark toggle. It instantly switches your viewing experience between the Classic and Dark themes.

Read Post

Derdack

SOC 1 or SOC 2, which should you comply with and why?

Apr 21, 2021 By Eyal Katz In Exigence

Organizations today are more vulnerable than ever to cyberattacks and data breaches. Whether the attack is executed by an external actor or an insider, the unauthorized intrusion comes at a great cost. This cost may differ, depending on several factors. These include the cause of the breach, the actions taken to remediate the incident, whether there is a history of data infringements, what data was compromised, and how the organization aligned with the authorities and regulators.

Read Post

Exigence

AIOps in 2021 and Beyond: 5 Trends You Should Be Aware Of

Apr 21, 2021 By Srinivas Miriyala In CloudFabrix

As businesses become increasingly digital, IT operations now deal with more extensive and more complex data than before. Traditional tools and strategies might no longer be enough to help them cope with their growing workload. Hence, many organizations are tuning in to the various AIOps trends. AIOps is short for Artificial Intelligence (AI) for IT Operations: the use of Machine Learning (ML) to enhance and automate IT functions.

Read Post

CloudFabrix

Introducing Dark Mode in Enterprise Alert 9

Apr 21, 2021 By Derdack In Derdack

A brief introduction to the new Dark Mode inside of Enterprise Alert 9.

View Video

Derdack

Introducing Dark Mode in Enterprise Alert 9

Apr 21, 2021 By SIGNL4 In SIGNL4

A brief introduction to the new Dark Mode inside of Enterprise Alert 9.

View Video

SIGNL4

The true cost of IT Ops, the added value of AIOps

Apr 21, 2021 By Jason Walker In BigPanda

Today’s IT landscape is complex, hybrid, and fast-moving, and the adoption of multi-cloud infrastructure, applications, and new digital transformation initiatives is accelerating. IT operations teams, playing a vital role in enabling the delivery of uninterrupted services and creating business value for enterprises, are finding they need to constantly grow their resources to manage all the moving pieces in their IT stack. This can get expensive … but how much are they spending?

Read Post

BigPanda

A Day in the Life: James the IT Ops Guy Learns How to Connect All that Data

Apr 21, 2021 By Helen Beal In Moogsoft

“Morning, mate,” I greeted Dinesh as he walked into the office. “Nice get up for the big day!” He was wearing a pressed shirt, rather than his usual hoodie. “Thought I’d make an effort, you know,” he grinned. We’d been planning intensely for this moment for the last week or so – our meeting with Charlie, the CIO, to present the results of our Moogsoft experiments and ask for permission to extend the rollout across the enterprise.

Read Post

Moogsoft

SREview Issue #12 April 2021

Apr 20, 2021 By Blameless Community In Blameless

Spring is here! We have rain! We have flowers! We have allergies! We also have some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

Read Post

Blameless

Using Coralogix + StackPulse to Automatically Enrich Alerts and Manage Incidents

Apr 20, 2021 By Jonathan Brown In Coralogix

Keeping digital services reliable is more important than ever. When something goes wrong in production, on-call teams face significant pressure to identify and resolve the incident quickly – in order to keep customers happy. But it can be difficult to get the right signals to the right person in a timely fashion.

Read Post

Coralogix

PagerDuty Take the Lead- Jeff Lawson, CEO of Twilio

Apr 20, 2021 By PagerDuty In PagerDuty

"The right communication with the right person at the right moment can change someone's life," says Twilio CEO Jeff Lawson. Listen in on the recent #TakeTheLead conversation between our CEO Jenn Tejada and CEO of Twilio Jeff Lawson! #TakeTheLead #AMA #Firesidechat

View Video

PagerDuty

Coffee Break Webinar Series: Under the Covers of AIOps

Apr 20, 2021 By Moogsoft Team In Moogsoft

Last week DevOps Institute’s Chief Ambassador, Helen Beal, and Moogsoft’s own Chief Evangelist, Richard Whitehead, continued to follow the exploits of DevOps Engineer Sarah and her journey towards AIOps and Observability enlightenment.

Read Post

Moogsoft

Should You Be an SRE or a DevOps Engineer?

Apr 15, 2021 By Quentin Rousseau In Rootly

SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.

Read Post

Rootly

PagerDuty for Cloud Migration & Amazon CloudWatch Integration Workflow Demo

Apr 15, 2021 By PagerDuty In PagerDuty

Watch this video to see a PagerDuty Amazon CloudWatch Integration workflow demo and to learn more about how PagerDuty helps organizations accelerate cloud migration, embrace service ownership, and deliver better customer experiences.

View Video

PagerDuty

New Splunk Synthetic Monitoring Features Help Integrate Uptime and Performance Across the Entire Splunk Platform

Apr 15, 2021 By Mat Ball In Splunk

For teams that build or maintain modern applications with their end-users in mind, the acquisition of Rigor means that Splunk now offers the most comprehensive synthetic monitoring solution on the market. Rigor, now Splunk Synthetic Monitoring and Web Optimization, provides best-in-class synthetic monitoring capabilities enabling IT Ops and engineering teams to detect and respond to uptime and performance issues within incident response coordination and throughout software development lifecycles.

Read Post

Splunk

Creating Custom Slack Commands

Apr 15, 2021 By FireHydrant In FireHydrant

Site Reliability Engineers are expected to know everything that’s happening, all of the time. That’s a lot of things! To help you sift through the noise, we’ve developed a feature that lets you find accurate data about your organization on-demand. You can do this by sending custom-designed commands to FireHydrant directly from your integrated Slack account.

Read Post

FireHydrant

Accelerate Incident Resolution By Benchmarks-enriched On-call Contexts

Apr 15, 2021 By Bashyam Anant In Sumo Logic

In a recent experiment, I polled my colleagues with the following question: “What would you do if the lights went out as you worked at night?” Besides identifying the funny and who-you-want-in-case-of-an-emergency responses, most of my colleagues checked to see if the problem might be broader than their own home.

Read Post

Sumo Logic

Alerting of Service Technicians in Facility Management

Apr 15, 2021 By Ronald In SIGNL4

In buildings today, there are numerous systems that require regular maintenance or that need attention as quickly as possible if problems are detected. This applies, for example, to heating systems, air conditioning, cooling, ventilation, elevators or fire alarm systems. Modern facility management systems are able to reliably monitor such systems.

Read Post

SIGNL4

Incident triage: a key element in your MTTR

Apr 14, 2021 By Yoram Pollack In BigPanda

One of the key performance indicators for IT Ops is MTTR (Mean-Time-To-Resolution). MTTR essentially measures the length of your incident management lifecycle: from detection; through assignment, triage and investigation; to remediation and resolution. IT Ops teams strive to shorten their incident management lifecycle and lower their MTTR, to meet their SLAs and maintain healthy infrastructures and services. But that’s often easier said than done.
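
To make the definition concrete, here is a minimal sketch of an MTTR calculation, assuming each incident is recorded as a (detected, resolved) pair of timestamps. The function name and the sample data are illustrative and do not come from the post.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) across incidents, as a timedelta."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


# Illustrative incidents: one resolved in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2021, 4, 1, 9, 0), datetime(2021, 4, 1, 9, 30)),
    (datetime(2021, 4, 2, 14, 0), datetime(2021, 4, 2, 15, 30)),
]
mttr = mean_time_to_resolution(incidents)  # one hour on average
```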

Read Post

BigPanda

What are MTTx Metrics Good For? Let's Find Out.

Apr 13, 2021 By Emily Arnott In Blameless

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process. Yet, MTTx metrics rarely tell the whole story of a system’s reliability.
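
One reason a mean can mislead is that a single long incident dominates it. The sketch below uses made-up resolution times (in minutes) to compare the mean with the median. It illustrates the general statistical point and is not an analysis taken from the post.

```python
from statistics import mean, median

# Made-up resolution times in minutes: four routine incidents and one long outage.
minutes_to_resolve = [12, 15, 14, 13, 240]

avg = mean(minutes_to_resolve)    # pulled upward by the single 240-minute outage
mid = median(minutes_to_resolve)  # closer to what a typical incident looks like
```

Here the mean is 58.8 minutes while the median is 14, so reporting only the mean would hide the fact that most incidents are resolved quickly.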

Read Post

Blameless

PagerDuty - Time to Value (Extended 3.75 min.)

Apr 13, 2021 By PagerDuty In PagerDuty

When critical incidents do arise, you need the best, most accurate solution for real-time work. Even if you choose to weather the long implementation and cost, when compared to ITSM tools, PagerDuty can provide response times more than 10 times faster when it matters most.

View Video

PagerDuty

The Cost of IT Downtime: An Overview

Apr 13, 2021 By PagerDuty In PagerDuty

As the adoption of cloud computing continues to encourage innovation across industries, high-performing and resilient systems have become a necessity in order to keep pace with the competition and meet internal/external SLAs (service level agreements). In terms of customer expectations, a minute of downtime can mean thousands of dollars in lost opportunity and a soiled customer relationship. So what exactly is downtime?

Read Post

PagerDuty

Enterprise Alert 9 is now officially available!

Apr 13, 2021 By Derdack In Derdack

We are excited to announce the release of the 9th generation of our signature alerting product, Enterprise Alert! Release 9 contains exciting new features and improvements. Read about all the details in this blog article.

Read Post

Derdack

Having On-call Nightmares? Runbooks can Help you Wake Up.

Apr 12, 2021 By Harry Hull In Blameless

You aren't sure how long you've been here, but the view outside the window sure is soothing. Before you can fully take in your surroundings, a siren rips you back into the conscious world. Slowly, you begin to piece together that you exist, and you are on call. The ringing, much louder now, pierces through your skull as you begin to open your bleary eyes. You turn over your pillow, grab your phone, and click through the PagerDuty notification.

Read Post

Blameless

Automatic Service Now ticket creation with Enterprise Alert

Apr 12, 2021 By Derdack In Derdack

How to use Enterprise Alert to automatically create tickets inside of Service Now.

View Video

Derdack

Automatic Service Now ticket creation with Enterprise Alert

Apr 12, 2021 By SIGNL4 In SIGNL4

How to use Enterprise Alert to automatically create tickets inside of Service Now.

View Video

SIGNL4

Using Remote Actions to Create ServiceNow Incidents

Apr 12, 2021 By Derdack In Derdack

Recently we have received a lot of requests for Enterprise Alert to not only alert on critical situations but to also take a proactive approach to initiate, record and track those situations through ITSM tools such as ServiceNow and BMC Remedy. This post will center around what happens when critical systems fail and tickets are not being created in ServiceNow due to a break in the workflow.

Read Post

Derdack

Automatic Service Now ticket creation with Enterprise Alert

Apr 9, 2021 By Derdack In Derdack

How to use Enterprise Alert to automatically create tickets inside of Service Now.

View Video

Derdack

Key Learnings from the Facebook Status Page

Apr 9, 2021 By Eduardo Messuti In Statuspal

Yesterday, April 8th 2021, at around 22:00 UTC, Facebook experienced a major outage in which Facebook, Messenger, WhatsApp web and Instagram were down for as long as 3 hours. This was reported on Facebook’s status page, which was a good example of how to communicate an incident.

Read Post

Statuspal

Monthly Moo Update | March 2021

Apr 9, 2021 By Adam Frank In Moogsoft

Here we are a full quarter into 2021, a year that took off in a huge way for us, and the momentum continues to grow strong. March was a monumental month, and now it’s a wrap. We released significant updates across the board in almost all areas of Moogsoft, including pushing innovation to newfound levels when it comes to the ease of integrating your metric and event data.

Read Post

Moogsoft

Reduce Toil with Better Alerting Systems

Apr 8, 2021 By Biju Chacko In Squadcast

If not tackled early, increasing toil can affect the morale and productivity of your SRE team. In this blog we look at some of the ways you can counter toil with the help of better alerting systems in place. Are you an SRE or On-call engineer struggling to manage toil? Toil is any repetitive or monotonous activity that can lead to frustration within an incident management team. Also at the business level, toil doesn't add any functional value towards growth and productivity.

Read Post

Squadcast

A Day in the Life: Sarah the DevOps Engineer and the Beauty of AIOps

Apr 8, 2021 By Helen Beal In Moogsoft

This is the fourth in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this post, Sarah and company discover how AIOps gives them "the time to save time!"

Read Post

Moogsoft

Just call us "Major Incident Software Innovation of the Year"

Apr 8, 2021 By FireHydrant In FireHydrant

We won an award! We're excited to share that we were named the Major Incident Software Innovation of the Year 2020 at the MIM Awards. Our CEO, Robert Ross (better known as Bobby), accepted over video on our behalf (watch the video below). A lot happened for us in 2020 -- not only winning new business, but also growing as a team and maturing our product. We're excited that MIM felt the same way about us and we're honoured to receive this award!

Read Post

FireHydrant

Digital Transformation in Banking: Transforming Financial Services With Incident Management

Apr 7, 2021 By Vivian Chan In PagerDuty

Financial services institutions have been facing pressure to modernize their operations for years. But legacy architecture and processes—along with compliance regulations—have made rapid innovation difficult to achieve. Adding to this pressure are new, digital-first competitors who accelerate the need for financial services to deliver better digital customer experiences both more consistently and at scale.

Read Post

PagerDuty

Three fundamental tips for an effective event filtering in SIGNL4

Apr 7, 2021 By Matt In SIGNL4

Event and alert filtering matters because alert fatigue is one of the most crucial issues in alerting and alert management. SIGNL4 implements a lightweight and effective way of filtering events. The overall process is based on alert categories. Alert categories are applied using a keyword search across the entire payload of incoming third-party events. But assigning alert categories, e.g. for alert augmentation, is not filtering.
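
The keyword-based category assignment described above can be sketched roughly as follows. This is not SIGNL4's implementation: the category names, keywords, and payload shape are hypothetical, and the search is a plain case-insensitive substring match over the serialized payload.

```python
import json

# Hypothetical categories and keywords, for illustration only.
CATEGORIES = {
    "storage": ["disk", "volume"],
    "network": ["link down", "packet loss"],
}


def categorize(payload, categories):
    """Return the names of all categories with at least one keyword
    appearing anywhere in the serialized event payload."""
    text = json.dumps(payload).lower()
    return [
        name
        for name, keywords in categories.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]


event = {"title": "Disk full on db-01", "details": {"host": "db-01"}}
matched = categorize(event, CATEGORIES)
```

As the excerpt notes, assigning a category is not the same as filtering; deciding which events to drop would be a separate step on top of `matched`.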

Read Post

SIGNL4

How Would an SRE Conduct a Postmortem on the Suez Canal Incident?

Apr 7, 2021 By JJ Tang In Rootly

The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.

Read Post

Rootly

Introduction to SLO, SLI and SLA

Apr 7, 2021 By Pruthvi In Spike

When you start researching how to improve the reliability of your software, you will soon run into terms like SLOs and SLAs. It can sound intimidating, but it's quite straightforward to understand. In this post, we will introduce these terms, the differences between them and how to start using them to make your systems more reliable.
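
As a small worked example of how these terms fit together, the sketch below treats the SLI as the fraction of good events and the SLO as a target for that fraction, then derives how much of the resulting error budget is left. The function names and the numbers are assumptions made for illustration and are not taken from the post.

```python
def sli(good_events, total_events):
    """Service level indicator: the fraction of events that were good."""
    return good_events / total_events


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of bad events the SLO tolerates:
    (1 - slo_target) * total_events.
    """
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Illustrative numbers: a 99.9% SLO over one million requests, 500 of which failed.
measured = sli(999_500, 1_000_000)
remaining = error_budget_remaining(0.999, 999_500, 1_000_000)
```

With these numbers the SLO allows about 1,000 bad requests and 500 occurred, so roughly half of the budget remains.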

Read Post

Spike

SRE Leaders Panel: SRE Adoption as Organizational Transformation

Apr 6, 2021 By Blameless Community In Blameless

Blameless recently had the privilege of hosting SRE leaders Kurt Andersen, SRE Architect at Blameless, Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs, and Tony Hansmann, Former Global CTO at Pivotal Software, Inc.

Read Post

Blameless

Shifting Security Left: Tools and Best Practices

Apr 6, 2021 By OnPage Corporation In OnPage

Software development pipelines typically cycle through four key processes: design, development, testing, and software or update releases. Traditional pipelines perform quality and security tests only after completing the development phase. Since there is no such thing as perfect code, there are always issues to fix. However, if significant architectural changes are needed, fixing them at the end of the process can be highly expensive.

Read Post

OnPage

So you Want an SRE Tool. Do you Build, Buy, or Open Source?

Apr 5, 2021 By Emily Arnott In Blameless

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project? This is a big decision. Switching methods half-way through adoption is costly and can cause thrash.

Read Post

Blameless

Behind the redesign of Spike.sh On-call

Apr 5, 2021 By Rajni Reddy In Spike

We recently released the biggest overhaul to one of the core features of Spike.sh - On-call schedules. Software teams use on-call schedules to designate first responders who will handle issues when they occur.
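
For readers new to the idea, a simple on-call schedule can be modeled as a fixed rotation. The sketch below assumes weekly shifts and a hypothetical three-person team. It says nothing about how Spike.sh implements its schedules.

```python
from datetime import date


def on_call_for(day, rotation, start, shift_days=7):
    """Return who is on call on `day`, given a rotation that began on `start`
    and hands over every `shift_days` days."""
    if day < start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


# Hypothetical team and start date (a Monday).
rotation = ["ana", "ben", "chi"]
start = date(2021, 4, 5)

first = on_call_for(date(2021, 4, 5), rotation, start)     # first shift
second = on_call_for(date(2021, 4, 12), rotation, start)   # one week later
wrapped = on_call_for(date(2021, 4, 26), rotation, start)  # rotation wraps around
```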

Read Post

Spike

5 Ways Unplanned Work Is Disrupting Your Business

Apr 5, 2021 By Steve Barrett In PagerDuty

Unplanned work is rising, with consequences ranging from unhappy customers and lost revenue, to employee churn and burnout. So what is the true business cost of wasted time? In this blog, we will explore how one employee’s wasted time can impact the whole company—from operations, to development and beyond.

Read Post

PagerDuty

Product Update: Upgrade to Exporting your Retrospectives

Apr 2, 2021 By Blameless Community In Blameless

Blameless is excited to announce an enhancement to our Incident Retrospective tool! The Export feature now allows for customizable retrospectives.

Read Post

Blameless

How SREs Can React to COVID-19's Impact on Incident Management

Apr 2, 2021 By Quentin Rousseau In Rootly

By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.

Read Post

Rootly

Enabling Customer Service With Full Visibility Into Customer-Impacting Issues

Apr 2, 2021 By Inga Weizman In PagerDuty

We are delighted to announce a new Status Dashboard for the Zendesk Customer Service integration. The dashboard enables customer service agents to have real-time visibility into major incidents that are impacting their customers within the Zendesk tool suite, so they can proactively update customers when an incident occurs.

Read Post

PagerDuty

Strategies to Reduce Alert Fatigue in Your SOC Team

Apr 2, 2021 By Ritika Bramhe In OnPage

In a SOC (security operations center), alerts originating from hundreds of systems compete to get attention. What ensues is a security analyst’s battle to beat alert fatigue while effectively defending their organization from cybersecurity threats. Alert fatigue is a major challenge faced by security operations center (SOC) teams. The stakes are even higher since they take on the enormous responsibility of maintaining networks and data systems.

Read Post

OnPage
