When was the last time you read a good book? How about the last time you listened to an interesting podcast? For many, the latter is likely the more popular pastime. With routines disrupted and people housebound, podcasts exploded in popularity during the lockdown.
If you're an engineer reading this, you might be wondering what I mean by the title. You might be a Site Reliability Engineer whose primary responsibility is to maintain the reliability of your company’s product/solution. You might be a software builder, a programmer responsible for building new capabilities and shipping them to production. All of these are important for any business to remain competitive.
Distributed systems are everywhere. Although many teams don’t think of their applications as distributed systems, if they’re developing using container-based microservices and serverless functions instead of a monolith, they’re creating a distributed system. This change also means that monitoring needs are becoming more complex.
Cyber security and physical security grew up at different times and in different neighborhoods. In fact, long before digital transformation was even a concept, physical security had staked out its corporate territory and was on the job protecting the company’s people, buildings, and other assets. Then, as the business world grew increasingly reliant on information technology, digital security started flexing its muscles on its own turf.
IT organizations, such as managed service providers (MSPs), deploy incident alerting and on-call management solutions to accelerate software delivery and ensure seamless customer experiences. Incident alert management platforms orchestrate the distribution of alerts to ensure that technicians continue to maintain system uptime and minimize service disruptions.
As we welcome Fall and such a transformational time of the year, we’re excited to announce a new set of updates and enhancements to the PagerDuty platform. From updates to Runbook Automation, ChatOps and Customer Service Ops Applications, to PagerDuty Community Events, users, and customers can.
With our latest Android app update (3.1., build 242) you will never miss a critical SIGNL4 alert again. Even if your phone is muted or in do-not-disturb mode, SIGNL4 can now make a lot of ‘noise’ and wake you up reliably when a major or critical incident occurs. Here is how it works….
On-call scheduling software modernizes the way healthcare administrators assign responsibilities to care team members. The software helps create an equitable workforce among care teams and eliminates manual errors during the on-call scheduling process. Administrators can set up digital schedules to contact the right clinicians at the right time. This ensures that on-call providers quickly resolve patients’ issues to improve patient experience.
The effects of climate change mean we’re increasingly seeing black swan weather events impacting our working lives. From wildfires and hurricanes to the ever-present threat of earthquakes, 2021 has seen its share of crises. This obviously raises serious questions for companies about the safety of their workforces. As a global company, PagerDuty has employees across the world. When a disaster strikes, everyone needs to have the necessary training, resources and tools to act.
Net at Work is a German IT company with over 100 employees that provides its customers with solutions and tools for digital communication and collaboration. Their product NoSpamProxy offers reliable protection against spam and ransomware, legally compliant email encryption and more. Customers of Net at Work are using it as a SaaS solution, and it is being monitored with the agentless network monitoring software PRTG Network Monitor from Paessler AG.
At VMware, we make use of modern development and site reliability engineering (SRE) practices on a regular basis. And those of us who work on the VMware Tanzu Observability product marketing team regularly get exposure to various SRE teams that implement modern practices with the observability technology we create.
For retailers, uptime is money and issues can cost thousands of dollars per minute. With infrastructure comprising complex services such as payment gateways, inventory, and mobile applications, maturing digital operations is vital for ensuring services are always on and customers get the best experience.
Effective healthcare communication requires proper software and processes to ensure that the right person receives timely messages. Unfortunately, Divisions of Family Practice (DoFP), a large community-based network of physicians located in British Columbia, Canada, relied on a third-party answering service to connect long-term care facilities (LTCFs) with on-call providers.
In 2016, Google released the definitive book on Site Reliability Engineering (SRE) - a practice that had originated in the company to take care of a monumental problem - how to keep the Google services running with high reliability. Over the years, SRE has been widely adopted by dev teams across the globe and is a popular role at startups and enterprises alike. Here is a look at how searches for SRE have trended over the years.
For a number of years, the debate on DevOps vs. ITIL has divided many technology teams. On the surface, the two practices seem at odds with one another: DevOps harnesses the power of human collaboration and communication to support innovation, while ITIL utilizes a more systematic and structured approach to deliver service quality and consistency. But if you take a deeper look, you’ll find that not only can DevOps and ITIL co-exist, they can even complement each other.
SRE and DevOps are closely related concepts, and many businesses can benefit from embracing both of them. Nonetheless, there are important distinctions between SRE and DevOps.
After you have stopped an incident from affecting your customers, you need a more thorough investigation in order to prevent similar incidents in the future. Postmortems record the root causes of an incident and provide insights for making your systems more resilient. At the same time, postmortems can be difficult to produce, since they require deeper analysis and coordination between teammates who are busy with the next development cycle.
As software development teams struggle with spotty, siloed software delivery cycles, the DevOps approach provides relief by unifying stakeholders to achieve faster, collaborative and continuous software delivery. However, the DevOps methodology fails if it does not address the issue of DevOps burnout. In this post, we’ll uncover strategies that DevOps teams can use to better manage their work environment.
Yippie! Our September update adds live call routing as well as a voice mailbox with a notification feature to SIGNL4. All details can be found in this article.
In today’s world of digital everything, where customers are increasingly demanding instant updates when problems occur, it’s more important than ever to take immediate action. Seconds matter, and teams need to be empowered to proactively solve customer-impacting incidents as quickly as possible.
Quickly and efficiently manage your incidents with FireHydrant and Backstage!
We’re proud to announce that our front-end FireHydrant plugin has been open-sourced as part of Backstage, an open platform for infrastructure tooling, services, and documentation created at Spotify. We introduce FireHydrant’s incident management and analytics in Backstage, where you can quickly and efficiently manage your incidents.
Our ecosystem continues to grow: we have added 10 new integrations in recent months. Integrations are the bridge between alert sources and on-call teams and have always been a top priority at iLert. They are one of the reasons why iLert is so easy to adopt for small and large companies alike.
Not too long ago, you would have needed development experience to oversee the delivery of scalable and reliable software. But with the rise of low-code and no-code tools, that requirement is now obsolete. What used to be hours of coding has turned into a few minutes of dragging and dropping.
2019 and 2020 were worlds apart. Our entire ways of working, living, socializing, and learning were changed almost overnight. Over the last 18 months, technical teams have had to double down on all their digital efforts to help their customers adapt to the new normal. At the same time, teams were responsible for more unplanned work than ever as incidents steadily rose. For the first time, we’ve created the State of Digital Operations Report, which is based on PagerDuty platform data.
The OnPage clinical communication and collaboration (CC&C) system is universally adopted by midwifery care communities across the United States and Canada. OnPage is proud to provide a real-time, secure collaboration platform that allows midwives to improve patient experience. This article examines the continued widespread adoption and implementation of OnPage’s industry-leading CC&C system by midwifery care communities.
A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.
Public safety communication is vital to all local communities. At APCO 2021, Everbridge featured new products and advanced functionality of our Critical Event Management platform to showcase our ability to keep towns, cities, states, and other public agencies running, faster. Our partnership with RapidSOS and the corresponding showcase of Everbridge 911 Connect is one solution that organizations can use to better serve the public.
In the SIGNL4 web portal you can manually export historic alert reports as .csv files. In some cases it might be useful to export alert data programmatically. For example, you can forward all alerts, including specific parameters, to InfluxDB and show the alert history in Grafana to recognize peaks, trends and abnormalities over time. You can even use AIOps to recognize certain trends automatically. By using the SIGNL4 REST API it is possible to export alert data automatically.
GigaOm’s latest Radar for AIOps solutions has just been released and it makes for compelling reading for anyone trying to maximize organizational performance in our digital world. Particularly if you’re down with DevOps.
What do IT security, production monitoring and technical field service have in common? In all scenarios there is a need to notify the right people immediately in the event of technical malfunctions, urgent maintenance requests or emergencies to resolve the incident quickly and efficiently.
A few days ago I received an inquiry about a scripting problem from one of our longtime partners, to be exact our DCP Marc Handel from IT unlimited AG. In the exchange with Marc I realized that his idea to use the Enterprise Alert Scripting Host, the Windows Task Scheduler and CheckMK to implement round-trip monitoring could be interesting for the whole community, especially for all our CheckMK customers.
Although conversation about observability often ignores SREs, SREs have a central role to play in observability success.
The novel coronavirus wasn’t the only history-making event of 2020. Last year cataloged the highest number of storms in the United States, as well as the worst wildfire season, with 9.5 million acres burned just in the western half of the country.
Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover five leading tools that SREs can use to drive the reliability and stability of computing systems. We’ll also examine how SREs can use the tools to improve operations tasks and infrastructure processes.
Coming to this article you may be in one of two learning mindsets. You’re curious about building a service catalog and want to know some of the basics. Or you’re curious about FireHydrant’s philosophy around this growing space.
We've had a jam-packed year and it's only September. Here are some of the product releases we’ve had to date, from new features to updates for incidents, integrations, Runbooks, and more. Keep reading to see what’s new and improved with FireHydrant and what you can leverage for your team.
Many organizations use xMatters to keep their services running and reliable. From technology businesses to complex enterprises, one particular industry that has overwhelmingly benefited from the use of xMatters is healthcare. In healthcare, speed and effectiveness are vital. Incidents are critical, and quality patient care is the highest priority.