May 2021

The 7 SRE Principles [And How to Put Them Into Practice]

Whether you're just adopting SRE or optimizing your current processes, we can help. We’ll explain the 7 key principles of SRE and how to put them into practice. So, what are the SRE principles? The fundamental SRE principles are: SRE is a method that operates through principles. Instead of prescribing specific solutions, it guides you with best practices. These SRE principles help organizations decide what's best for them. Once you understand the principles, you can apply them in many areas.

Polystream Stopped Implementing Technology and Implemented A Mindset - Customer Stories

Polystream is changing the way we think about video game streaming, and with xMatters, they know that incidents won't keep them from achieving their goals. In this customer chat, join Tracey McGarrigan, Chief Marketing Officer at Polystream, Cheryl Razzell, VP of Engineering, and xMatters' own Laura Meadows, VP EMEA Region, as they discuss Polystream's ongoing ambitions and how xMatters helps their growth. Plus, make sure you don't miss why Tracey would describe xMatters as Polystream's comfort blanket!

Deliver Real-Time Alerts From Facility Management Systems

Facility managers, including service technicians, are expected to operate their facilities safely to meet the expectations of customers. They focus on the smooth functioning and maintenance of many components that fall within the scope of their facility. Typical components include roads, pavements, HVAC and plumbing systems. As a facility manager, staying on top of these siloed and geographically dispersed systems can be challenging.

FireHydrant May 2021 Product Updates: The summer of integrations

With 50% of the US adult population vaccinated, there’s a lot to look forward to this summer. Life no longer feels like it’s on hold, and we’re fully embracing that. Get your fire hoses ready, 'cause extinguishing incidents just got easier. We’re rolling out a summer full of new integrations, product releases, events, and more.

Be ready for anything in a world of digital everything

PagerDuty is a digital operations management platform that empowers the right action, when seconds matter. With over 500 integrations and powerful automation capabilities, we make it easy to stay on top of urgent, mission-critical work and keep your digital services always on. For the developers and IT teams working in real-time operations, PagerDuty makes sure you can focus on what matters most. And stay ready for what’s next.

Four things to consider when evaluating incident management platforms

When you’re feeling the stress and pain around incidents, making the decision to find an incident management tool is a no-brainer. But how do you choose the one that will work for you, your team, and your business? You might be asking yourself: Where do I start? What do I need to know? What questions do I ask? What are the options? How can I be sure we’re choosing the right tool?

What do site reliability engineers do?

Are you considering adopting SRE? We will explain the roles and responsibilities of an SRE team within your organization, and how to start building one. So what does an SRE team do? An SRE team is responsible for building software that improves the resiliency of systems, implementing fixes, responding to incidents, and automating processes whenever possible. Site reliability engineering is a holistic practice that incorporates various types of work.

Blameless Runbook Documentation is Now Generally Available!

At Blameless, our mission is to provide teams with the tools they need to operationalize SRE and embrace a culture of resilience. We help teams automate toil and adopt best practices across integrated incident management, comprehensive retrospectives, service level objectives, reliability insights, and more. We are very excited to announce that Blameless Runbook Documentation is now generally available for all customers.

Discover Everbridge Digital Wayfinding for Higher Education

Creating a positive visitor experience is a key component of the administrative health of a school. Despite advances in technology, campus visits have remained mostly formulaic. Digital Wayfinding takes mobile mapping technology the public is used to and applies it to your school, creating an easy-to-use, attractive, interactive tool for your visitors.

ITSM Buyers' Guide: 7 Use Cases to Define Your ITSM Goals

Attempting an upgrade or switch to a new ITSM tool is obstacle-ridden for IT directors. From having to address fears surrounding the cost of switching vendors to assessing service management maturity, building a case around why and how an ITSM can advance the business can be a harrowing feat. Thankfully, Info-Tech pulled together this selection guide.

Single Sign-On Now Available on OnPage Enterprise-Level Accounts

Single sign-on (SSO) services provide a unified view into applications, logins and devices through a secure identity cloud. SSO allows users to access SaaS-based applications through one simple login process. We, at OnPage, are excited to announce that we’ve extended our integration catalog to include SSO services like Okta and OneLogin. Through a single sign-on process, OnPage enterprise-level users can access the OnPage dashboard from their Okta and OneLogin accounts.

New Integration: Declare FireHydrant Incidents from Checkly Alerts

Streamlining your incident management process is what we do best, and one of the ways we do that is by acting as the connective tissue across all of your applications. We’ve partnered with Checkly to bring you a new integration that empowers you to detect problems and resolve incidents faster.

Use Datadog's Notebooks API to programmatically manage your notebooks

Datadog Notebooks simplify the way teams across an organization find and share knowledge. By bringing together live data and rich Markdown text, Notebooks help teams create powerful, data-driven documents—from runbooks and support playbooks to incident postmortems and data reports. And with collaboration functionalities like real-time editing and commenting, team members can simultaneously make changes to a document and gather feedback along the way.

Resilience in Action Episode 7: Killing Ops with Tony Hansmann

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

If everyone is AIOps - which AIOps is right for you?

With so many IT vendors claiming they provide AIOps platforms, how do you understand the differences between them, and decide what flavor of AIOps to choose for your organization? Join us in a CTO Perspective discussion with Elik Eizenberg, CTO and co-founder at BigPanda, to find the answer. Read the skinny for a brief summary, then either lean back and watch the interview, or if you prefer to continue reading, take a few minutes to read the transcript. Enjoy!

SRE vs. DevOps [Understanding Differences & Similarities]

Site Reliability Engineering (SRE) and DevOps share a goal of building a bridge between development and operations. We'll explore and compare both approaches. Wondering to yourself, which is better for your company, SRE or DevOps? Neither SRE nor DevOps is “better,” exactly, since they’re similar yet different in a few key ways: SRE, or site reliability engineering, is a methodology developed by Google engineer Ben Treynor Sloss in 2003.

Make your Onboarding Experience Better with a Murder Mystery Game

Onboarding a new tool can be boring. Or stressful. Or both. When onboarding an incident response tool, it can be difficult to make sure that your team is getting the most from the experience. Do you opt for a run-of-the-mill meeting, or try to learn while in an incident? Neither option is ideal. That’s why Petal’s DevOps Engineer Michael Cole found a new way to get his team using Blameless for their incident response process.

SRE Availability Metrics

How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
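As a rough illustration of how an availability target translates into a downtime allowance, here is a minimal Python sketch. The function names and the 30-day default period are assumptions made for this example and are not taken from the chart in the linked article.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime permitted over the period at a given availability target.

    The 30-day default is an illustrative assumption; use whatever window your SLA defines.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


def availability_from_uptime(uptime_minutes: float, total_minutes: float) -> float:
    """Return availability as a percentage of the total observed time."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * uptime_minutes / total_minutes


if __name__ == "__main__":
    # Print the downtime allowance for a few common targets over 30 days.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per 30 days")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is why each additional "nine" is so much harder to reach.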

A Day in the Life: Intelligent Observability at Work with our SRE, Dinesh

When I asked Charlie for permission to attend this year’s AICon (virtual, natch) I thought it would be a shoo-in; learning’s part of my OKRs after all. But he never makes things easy and his ‘yes’ came with a caveat that’s typical when dealing with him. This time, he claimed he didn’t have the budget for the ticket (a likely story!) and I’d have to find another way to get one.

WTF is Incident Management? Post-Panel Wrap-Up

That's a wrap! We hosted "WTF is Incident Management" on May 12, 2021. We invited four very knowledgeable panelists to discuss how they define incident management, what changes they'd make if they could start again from scratch, how to manage team stress after an incident, and other subjects. Our panelists were: host Matt Stratton (Staff Developer Advocate at Pulumi), Emily Ruppe (Incident Commander at Twilio), Alina Anderson (Sr.

Enterprise Alert Alarm Center. A NOC's best friend.

Over time, Enterprise Alert continues to grow and more and more teams are starting to benefit from Enterprise Alert’s reliable alerting. As part of this process, Enterprise Alert almost always becomes a central component of the NOC and has practically trained the NOC admins. For this reason, here in support we rarely have the pleasure of presenting the features of our alarm center.

New Event Source - Website Monitoring

Enterprise Alert is constantly evolving to provide our customers with new ways to implement event sources and use new features. With version 9, several new features have been implemented that make it easier for customers to create alerts for specific processes and events. These include the new “Website Monitoring” event source.

Self-Service for Teams in Enterprise Alert

A few days ago I had an insightful conversation with one of our customers who inspired me to write this blog. He, like so many other customers, was facing the problem that his Enterprise Alert management overhead was increasing with each new team he added, as he had been managing resources such as event sources, notification channels and alert policies for the new teams as well. His question to us, therefore, was whether he could put these management tasks in the hands of the teams themselves.

Enhance NOC Alerts With Incident Management and Alert Automation

In a network operations center (NOC), alerts originating from hundreds of servers, application monitoring systems, emails and ticketing services compete to catch a NOC analyst’s attention. NOCs face many challenges in parsing through alerts to identify actionable notifications and mobilize the right response team into action.

Understanding a Microsoft Service Outage

Maintaining business continuity when an issue arises has proven to be a challenge many organizations struggle with. A global pandemic being thrown into the mix in Q1 of 2020 (one that many businesses are still navigating through) introduced a new set of problems for both service providers and businesses reliant on those services.

CareConverge: Secure Clinical Communication and Collaboration

Everbridge’s CareConverge speeds diagnosis and care, enabling time and resource-constrained providers to manage capacity and deliver quality patient care in less time, while exceeding healthcare compliance standards and patient satisfaction. Whether responding to a daily, non-emergent clinical case or a high-acuity clinical case, collaboration across the health system is seamless, reliable, and HIPAA-compliant.

What is Opsgenie?

Opsgenie is an on-call and alert management and incident response solution to keep services always on. It empowers Dev and Ops teams to plan for service disruptions and stay in control during incidents. With over 200 deep integrations and a highly flexible rules engine, Opsgenie centralizes alerts, notifies the right people reliably, and enables them to collaborate and take rapid action.

Celebrities Explain WTF is Incident Management

Our friends Felicia Day, Steve Wozniak, and Brian Baumgartner help us explain what the heck incident management is. FireHydrant is the only comprehensive incident management platform that allows you to create consistency for the entire incident response lifecycle to focus on fighting fires faster. From alert to retrospective, tracking, communicating, and reporting on results: FireHydrant will automate the process so you can focus on resolution. Visit firehydrant.io to learn how you can manage the mayhem.

SRE Leaders Panel: Business Agility is what matters, SRE can help you get there

Blameless recently had the privilege of hosting SRE leaders Garima Bajpai, Founder at Community of Practice - DevOps Canada, and Jason Fraser, Delivery Lead at VMware Tanzu, to discuss the value of crisis during incident response, the best and worst tech transformations they’ve seen, how reliability impacts the flow of value, and more.

Concrete Steps to Reducing MTTR

In today’s data-centric world, metrics or numbers define all performance benchmarks. The time between when an event starts and ends shows how well a system can handle and process such events. One such metric is MTTR. MTTR usually stands for Mean Time To Resolution, but it has held several meanings over the years. MTTR is a metric used to measure how well a system can bounce back from errors and provide long-lasting solutions.
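To make the metric concrete, the following Python sketch computes MTTR as the average of the time between each incident's start and its resolution. It assumes the "Mean Time To Resolution" reading mentioned above; the function name and the sample timestamps are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the (started_at, resolved_at) durations of a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved_at - started_at for started_at, resolved_at in incidents]
    if any(duration < timedelta(0) for duration in durations):
        raise ValueError("an incident cannot be resolved before it starts")
    # Sum the durations as timedeltas, then divide by the incident count.
    return sum(durations, timedelta(0)) / len(durations)


if __name__ == "__main__":
    # Hypothetical sample data: one 30-minute incident and one 90-minute incident.
    sample = [
        (datetime(2021, 5, 1, 9, 0), datetime(2021, 5, 1, 9, 30)),
        (datetime(2021, 5, 8, 14, 0), datetime(2021, 5, 8, 15, 30)),
    ]
    print(mean_time_to_resolution(sample))
```

Because a mean is sensitive to outliers, a single very long incident can dominate the figure, so it may be worth reviewing the individual durations alongside the average.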

Monthly Moo Update | April 2021

I don’t know about you, but April traveled at the speed of light. A blink and it happened. Our teams have been working at the same speed throughout one of our favorite months of the year. With an incredible amount of updates, we’ve made our product even more transparent and easier to use. It’s not just our world-class documentation that enables you, it’s also the in-product visualizations and enablement that help guide you without you even realizing it.

Top SRE Toolchain Used By Site Reliability Engineers

We have compiled a list of the most popular and sought-after tools (some you may have heard of) that SREs need in their toolkit at every phase of a production system. Site reliability engineering (SRE) practices help organizations by ensuring smooth functioning of their deliverables with utmost reliability and resilience. These can be achieved by a set of well-defined tools that are deployed at every phase of the production system to keep up with SRE best practices.

OnPage Recognized in Gartner's Latest Report on CC&C Systems

Gartner’s latest “Quick Answer” report discusses how clinical communication and collaboration (CC&C) systems can enhance pandemic-related provider and patient engagement. Modern healthcare delivery organizations (HDOs) invest in CC&C solutions to simplify communication among care teams consisting of physicians, nurses and critical support personnel. The OnPage team is pleased to be recognized as a vendor in Gartner’s latest CC&C publication.

Failover Conf 2021 Wrap-Up

That’s a wrap! Gremlin hosted Failover Conf 2: Fail Smarter on April 27, 2021. In attendance were over 500 SREs, developers, sales engineers, product managers, DevOps experts, C-level execs, and other reliability pros from around the globe! This year’s conference included discussions around the future of DevOps, strategies for building reliable teams, analyzing human error to create better systems, and more.

OnPage Showcased as One of Massachusetts' Top Messaging and Communication Companies

Cutting-edge messaging systems simplify communication and collaboration for organizations with complex communication needs. These systems are equipped with secure mobile messaging and a full suite of automation capabilities that can route notifications and voice calls across on-call teams. These platforms simplify on-call management through digital on-call schedules and escalation policies.

Ivanti Gives Voice to IT Incident Management Software

A protracted, exasperating customer service experience popped into my mind while reading this sentence in the Ivanti Voice data sheet: “One of the most frequent customer complaints about call centers is having to repeat information.” Ain’t that the truth. Here’s a brief personal experience.

Domain-agnostic and here to stay: Gartner outlines the current state and future of AIOps

Coined by Gartner in 2016, the term ‘AIOps’ refers to the combining of big data, AI and machine learning to automate and improve IT operations processes. Back then, this very broad definition led to some confusion, with different IT vendors characterizing AIOps differently, depending on what they were actually offering.

Webinar (UK) - Silence the Noise: Simplify Your Crisis Response

Silence the Noise: Simplify Your Crisis Response aims to educate you on simplifying the complexities of managing information during an incident. Since COVID, all organisations have experienced the cumbersome processes of managing long-term, ongoing incidents. This webinar will address how to simplify information management and apply these practices to a real-life scenario.

DevOps vs. Agile

DevOps is a term for “a cross-disciplinary practice dedicated to the study of building, evolving and operating rapidly-changing resilient systems at scale” (Jez Humble). There is no wall between development and operations, so they work simultaneously and without silos. The system focuses on uniting the development and operations teams in a continuous process. Agile is a software development strategy that focuses on responding to change with cross-functional team communication.

Improve your Reliability with Blameless SLOs, Now Generally Available

Blameless is excited to announce that our SLO Manager is now generally available! SLO Manager is a new service added to the Blameless platform. This service helps SRE and engineering teams proactively make data-driven decisions about reliability efforts. According to a survey Blameless conducted, over 80% of organizations use SLOs or will in the next 1-2 years.

SLOs: What, Why, and How?

What are SLOs, why are they important, and how can I start crafting them? We get these questions every day. In response, we’re hosting a webinar titled “SLOs: What, Why, and How?” on May 3, 2021, at 1 PM PDT. Kurt Andersen (SRE Architect), Dan Genzale (Director of Infrastructure), and Nicolas Philip (Director PM) will be speaking with one another in a fireside chat about SLO best practices.