How to Structure an IT Help Desk

Managed service providers (MSPs) need an IT help desk to address and answer the technical questions of clients. In the modern MSP environment, the IT help desk is the primary source of contact between customers and knowledgeable, responsive support personnel. Successful help desks are customer oriented and encourage clients to report IT incidents when they occur.

Monthly Moo Update | September 2021

This has been quite the summer to remember as we continue to witness our customers achieve remarkable efficiencies through automation. Examples include deep integrations with change pipelines that suppress alerts during maintenance windows, and alert correlation that creates incidents with dynamic, evolving descriptions that dramatically improve incident management processes.

Has the firefighting stopped? The effect of COVID-19 on on-call engineers

With digital becoming the primary channel for work, education, shopping, and entertainment in the last 18 months, it’s no surprise that workloads for technical teams and on-call engineers have increased. Data from PagerDuty’s inaugural platform insights report, The State of Digital Operations, highlights this reality. As of July 2021, the average number of events managed daily by PagerDuty is 37 million, with 61,000 of those being critical incidents.

New feature: Templates for Incident Management

At Spike.sh, we are obsessed with making incident management more accessible to dev teams everywhere. With this goal in mind, we are always looking for ways to reduce friction when setting up the Spike.sh platform. When we saw customers asking for our advice on creating effective on-call schedules and escalations, we knew we had to do more than just good documentation - we needed a way to share best practices with our customers in the product itself.

MTBF Is an Integral Part of Business Operations - Here's Why

In today’s fast-paced digital world, your customers expect your services to be available 24 hours a day, seven days a week. If your services are unreliable, these customers will likely take their business elsewhere — and spread the word. To retain their business, you must understand and optimize your service and system health to ensure your services are reliable. Gauging your service and system health requires much more than knowing whether they’re on or off.

What's new: Updates to Event Intelligence, mobile, and more!

As we near the end of the Summer season, we’re excited to announce a new set of updates and enhancements to the PagerDuty platform. These updates will help our users and customers: Make sure to view the latest PagerDuty Pulse or learn more from our community team and developer advocates who have launched new programs to help you learn more about our latest products and best practices.

Call Handling - Relieve the burden of your service desk and on-call staff

These days, I keep encountering inquiries from various customers on the topic of call handling. Due to the current transformation, triggered by the increased use of home offices, it is becoming more and more important to make on-call staff more accessible. Often the already overloaded service desk is used for this purpose. Of course, this leads to a) a deterioration in the quality of the service desk and b) delays between the receipt of the problem and the start of problem resolution.

Automate your LogDNA + PagerDuty Incident Workflow

LogDNA integrates with your PagerDuty instance to help trigger incidents based on log data coming in from your ingestion sources. This allows your teams to quickly understand when there are issues with your application, and where in the logs you can investigate to understand root cause. To help further accelerate your team’s ability to understand the state of your applications, we are introducing the ability to automatically resolve those PagerDuty Incidents directly from LogDNA.

Best practices to help retailers make the grade for the holiday season

It’s hard to believe we’re already talking about the return to school, but it’s set to be a big one. In fact, this year promises to be the biggest in the last five years. The National Retail Federation expects back-to-school spending to reach $37.1B, up from $33.9B last year. Back-to-college spending is also expected to rise, reaching $71B this year. This increase is buoyed by parents and students gearing up for their first in-person classes after a year of virtual learning.

Introducing the Spike.sh Alert Reliability Engine

At Spike.sh, our mission is to help dev teams understand and resolve production issues faster. At the core of this is our Alert Reliability Engine, whose job is to make sure that a team member always gets an alert on their preferred channel. Currently, we support 7 channels - phone call, SMS, mobile push notifications, email, Slack, Microsoft Teams and Discord. We wanted to give you a peek into how we achieve high deliverability across these channels.

How MBTA modernized incident response to reduce alert fatigue and improve collaboration

Citizens utilize mobile and consumer-facing applications in everyday life, so it’s no surprise that they demand seamless access and high availability of government services online. Whether it’s making payments or applying for benefits, citizens and constituents alike expect these services to be available around the clock.

How Squadcast Benefits On-call Engineers - Part 1

It is difficult to stay completely reliable in an always-on world. So it's very important to choose the right Incident Management solution that can solve your problems. In this blog, we have highlighted the benefits of Squadcast and why you should adopt it. “Being on-call sucks!” Often incident response teams use this phrase when talking about their on-call experiences. Despite using best practices for managing infrastructure, incidents do occur from time to time.

Dynatrace and xMatters Make Seamless Efficiency Possible - xMatters Demo

How can organizations integrate their tools into a platform that maximizes uptime and simplifies operations? Is it possible for the tools you already rely on to be more efficient? With Dynatrace and xMatters in tandem, the answer is yes! Join Rob Jahn, Technical Partner Manager at Dynatrace, Eric Maxwell, Solution Architect at xMatters, and Rutuja Rajwade, Partner Marketing Manager at xMatters, as they discuss how Dynatrace and xMatters can work together to make incident management and development processes more efficient.

How the technology you choose influences CloudOps maturity

As the world becomes increasingly digital-first, it’s more important than ever for organizations to keep services always-on, innovate quickly, and deliver great customer experiences. Uptime is money, so it’s no surprise that many have made the shift to cloud in recent years in order to make use of its flexibility and scale—while controlling costs. And while 2020 wasn’t easy for any organization, those that are thriving have embraced the digital mindset.

WIRES and xMatters: Efficient Collaboration On a National Scale

An update on how the xMatters service reliability platform is improving animal rescue response times through WIRES in Australia. We are extremely grateful for xMatters' support and are excited to share this update with the xMatters community. We have made so much progress with our wildlife rescue response systems since the devastating bushfires of 2019 and 2020, despite the continuing challenges of COVID-19.

Managed Service Provider - How AlertOps Helps MSP Scale Digital Transformation Initiatives.

In an era where speed, productivity, and user experiences matter most, what incident management capabilities do managed service providers need most to grow, transform, and mature their digital operations and processes, and to serve more organizations faster and more efficiently? Many of today’s enterprises still have operations that are largely manual and reactive, and they lack the in-house resources and expertise to undertake a digital transformation initiative.

What's New: Introducing Delay Notifications to Control Alert Fatigue

The OnPage team is pleased to announce a new feature to the enterprise web console: Delay Notifications. With this new addition, organizations have the option to queue messages for specific time periods, delivering messages at the end of the Delay Notification schedule. The latest feature is designed to alleviate alert fatigue and improve work-life balance for incident respondents.
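
As a rough illustration of the queue-then-deliver behavior described above, here is a minimal Python sketch. It is not OnPage's implementation: the names `DelayWindow` and `DelayedNotifier`, and the choice of a daily time-of-day window, are assumptions made purely for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import List


@dataclass
class DelayWindow:
    """Hypothetical daily window during which messages are held back."""
    start: time
    end: time

    def contains(self, moment: datetime) -> bool:
        t = moment.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        return t >= self.start or t < self.end


@dataclass
class DelayedNotifier:
    """Holds messages that arrive inside the window; delivers them afterwards."""
    window: DelayWindow
    queued: List[str] = field(default_factory=list)
    delivered: List[str] = field(default_factory=list)

    def submit(self, message: str, now: datetime) -> None:
        # Inside the window the message is queued; outside it goes straight out.
        if self.window.contains(now):
            self.queued.append(message)
        else:
            self.delivered.append(message)

    def flush(self, now: datetime) -> None:
        # Release everything that was held, but only once the window has ended.
        if not self.window.contains(now):
            self.delivered.extend(self.queued)
            self.queued.clear()


if __name__ == "__main__":
    notifier = DelayedNotifier(DelayWindow(time(22, 0), time(6, 0)))
    notifier.submit("disk usage warning", datetime(2021, 9, 1, 23, 30))
    notifier.flush(datetime(2021, 9, 2, 6, 0))
    print(notifier.delivered)
```

A real system would likely let high-priority alerts bypass the queue; that rule is left out here to keep the sketch short.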

The Top 4 Key Levers to Build Towards Long-Lasting Digital Operations Maturity

Digital operations maturity is a journey. The first step is to understand where you are, where you want to get to, and what’s keeping you from getting there. Only then can you make strategic decisions and lay out a plan for how to approach any hurdles and land where you want your organization to be. For many organizations, upleveling operational maturity requires investment in driving cultural change with fundamental shifts to operating models.

Full-cycle observability with the Elastic Stack and Lightrun

An application running in production is a difficult beast to tame. Most experienced developers–ones who spent enough late nights or Saturday mornings trying to break apart a nasty production bug–will try and create the clearest possible picture for their later selves while writing their code, so that they could understand what’s actually going on in the system during an incident.

Chapter Ten: In Which Sarah Resigns from Animapanions and Heads Off to Start Up a Competitor

This is the tenth chapter in The Observability Odyssey, a book exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our DevOps Engineer, Sarah, throws in the towel at C&Js and moves on to build her own business.

Chapter Eleven: In Which James Speaks with the Industry Analysts

This is the eleventh chapter in The Observability Odyssey, a book exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our IT Ops Leader, James, speaks with the analysts about what’s happening in the AIOps space.

Getting Started with Site Reliability Engineering

Site Reliability Engineer (SRE) is one of the fastest growing jobs in tech, with LinkedIn reporting 34% growth YoY in 2020 and over 9000 openings in their Emerging Jobs Report. If you’re new to SRE and exploring it as a career path, understand that it can be a challenging but rewarding experience. Here are some quick tips on how you can get started with SRE and jump-start a rewarding career.

Strategies to Strengthen Nurse Mental Health and Safety

No job is easy, but the job of a nurse is even more challenging, especially during a global health crisis. Nurses are at a higher risk of developing burnout due to the psychological trauma and cognitive overload that comes with the nursing profession. The situation is further exacerbated when nurses take on more responsibility during a pandemic or other large-scale incidents.

SLOs, SLIs, and where to find them with Jacob Plicque III

Identifying the right Service-Level Indicators is mission-critical for any SRE team responsible for meeting Service-Level Objectives and reporting on them. Find out how to sift through mountains of metrics and fill gaps in your data in order to visualize SLIs that actually matter for effective error budget tracking and actionable alerts in Grafana. Presented by: Jacob Plicque III, Senior Engineer at Grafana Labs at Grafana East Coast Virtual Meetup - August 2021
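
For readers new to error budgets, the arithmetic behind them is short enough to sketch. The function below is a hypothetical helper, not something taken from the talk or from Grafana. It assumes an event-based SLI (good events divided by total events) and reports the fraction of the budget that is still unspent.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget left (1.0 = untouched, < 0 = overspent).

    slo_target is the objective as a fraction, e.g. 0.99 for "99% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")

    # The budget is the number of bad events the SLO tolerates in this period.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99% objective over 10,000 requests, 50 of which failed.
    print(error_budget_remaining(0.99, good_events=9_950, total_events=10_000))
```

A value near zero or below it is the usual trigger for an alert or a freeze on risky changes, though the exact policy is a team decision rather than part of the formula.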

Real-time digital operations management puts connected vehicles on the road to success

As technology advances and applications for the Internet of Things (IoT) continue to expand, industrial and manufacturing companies are embedding more digital systems into their operations. From smart factories and intelligent shipping to automation and 3D printing, Industry 5.0 has arrived.

How to Avoid the Executive 'Swoop and Poop' and Other Best Practices for Operational Maturity

We’re eating at restaurants again. We’re seeing family after too long apart. Some of us may even be returning to the office. But that doesn’t mean that the pressure is off for digital services, and growing in operational maturity remains top of mind. While digital transformations have been taking place for the last two decades, COVID-19 added pressure to speed up initiatives.

Are You Spending Enough on Cybersecurity?

Cybercriminals do not discriminate among the organizations, people, or industries they target. These actors look to exploit vulnerabilities in resources to intercept valuable data from small and medium-sized businesses (SMBs). Cyberattacks are inevitable, and organizations must have the right controls and information security systems to mitigate the impact of an attack.

Supercharging incident response with runbook automation

The global pandemic is estimated to have accelerated digital transformation by at least seven years—and it’s showing no signs of stopping. In fact, companies are investing even more into software-driven experiences. A recent Gartner forecast points to worldwide IT spending increasing 8.4% to $4.1 trillion in 2021, with much of that spend on mission-critical, customer-facing services.

We've raised a $23M Series B to help us get to a world where all software is reliable

At FireHydrant, we envision a world where all software is reliable, and we’re on a mission to help every company that builds or operates software get closer to 100% reliability. Today, we’re thrilled to announce that we’ve raised $23 million to help us further our goal.

Improving your team's on-call experience

Your engineers probably dislike going on-call for your services. Some might even dread it. It doesn't have to be this way. With a few changes to how your team runs on-call, and deals with recurring alerts, you might find your team starting to enjoy it (as unimaginable as that sounds). I wrote this article as a follow-up to Getting over on-call anxiety.

Timely Delivery with Enterprise Alert

Murphy’s Law states that anything that can go wrong, will go wrong. The challenge for most businesses is putting the right method of communication in place for when the inevitable happens. The only way to handle this is to expect the worst and then prepare for it. A key factor in choosing any alerting solution is whether your team can be notified properly when a major outage happens.

Announcing our $6M investment to double down on IT incident and Reliability needs

When Squadcast was founded back in 2018, we had a concise yet clear goal—we wanted to make it as easy as possible for companies to manage their IT incident and reliability needs. In the spirit of continuing that mission, today I’m excited to announce our $6M fundraise led by DNX Ventures and backed by Wipro Ventures, Nexus Ventures, and Chiratae Ventures. We’re also pleased to announce the addition of DNX and Q Motiwala to our Board of Directors.

What's the ROI? How Operational Maturity Improves Customer and Team Satisfaction

Are we looking at the new normal now? In the last 18 months, organizations all over the world were compelled to undergo a rapid digital transformation and mature their operations to support services that were under unprecedented strain. Digital transformation allows companies to embark on large-scale cloud migrations and adopt modern development methods like DevOps and Agile.

Demystifying DevOps and SRE

How different are DevOps and SRE? Are they related to each other? In this blog, James Samuel sheds light on the similarities & differences between SRE & DevOps followed by the possible ways to structure an SRE team in your organization. Two terms that people often find confusing are SRE and DevOps. People often ask, should I hire a DevOps Engineer or a Site Reliability Engineer? What is the difference between SRE and DevOps and which one do I need? In this post, I attempt to shed some light.

How PagerDuty Helps Manage Hybrid Infrastructure and Complex Ops Across Industries

If there’s one thing we learned from the 80+ sessions from Summit 2021, it’s that across the industries, companies are continuing to accelerate innovation in a bid to meet growing customer expectations of always-on services across all channels. In financial services, disrupting traditional banking or rethinking access to advisory services comes with operational and regulatory challenges.

Contextual Intelligence and Observability: Without the Former, You Really Don't Have the Latter

Observability is a hot term in the industry, but don’t let it fool you: having visibility into your organization's apps and services only gives you partial clarity into a system’s overall performance. To get a full understanding of your monitoring data, you need to apply contextual intelligence.

Hear From Product PagerDuty for Customer Service Operations Lightning Talk

Learn about what's new with PagerDuty for Customer Service Operations from the Summit 2021 Launch. Our Product team shares how you can benefit from our latest updates and enhancements and enjoy demos that were recorded live from Summit 2021 featuring the PagerDuty Salesforce Service Cloud Integration V3, New Customer Service SKU, and Round Robin Workflows (Round Robin Scheduling).

PagerDuty Pulse Q1 FY22 Full Webinar

In this edition of PagerDuty Pulse, you’ll get to view our most recent platform updates and enhancements (March 2021 – June 2021) that extend from AIOps and automation to a variety of new integrations. Teams must leverage PagerDuty and Modern Digital Operations to automate the day-to-day toil of repetitive tasks, master modern operations with full-service ownership, seamlessly collaborate across the organization, and accelerate enterprise-wide response by enabling customer service operations and business stakeholders.

Less is more: Incident management and monitoring in hybrid IT infrastructures

Many companies are continuously modernizing their infrastructure – but there is no standard path to the perfect IT infrastructure. Still, hybrid architectures have become the status quo in enterprises. Almost all organizations have migrated at least parts of their assets to the cloud or run applications as cloud services. At the same time, businesses want to dovetail their IT architecture with software development and are therefore embracing dynamic infrastructures.