Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.
Another day, another drama! This one, though, is very much of my own making. I have been wanting to try my hand at a bit of chaos engineering for some time now but C&Js just hasn’t been ready. Sarah’s been up for it too, though, at Animapanions. And now that our CIO, Charlie, has seen MTTR drop across every single technology team, thanks to the rollout of Moogsoft and the new incident management system (kudos to James), it’s pilot day.
2020 heralded a year of increased complexity and customer demands, neither of which is going away. In this new normal, organizations will still be tasked with keeping up this break-neck pace. So, what did digital operations look like in 2020 compared to 2019?
Having a Status Page is like having a dog. A dog alerts you to an incident: sudden noise, approaching neighbor, squirrel… A dog sounds the alarm on an intruder. A dog even alerts you to maintenance by barking at every handyman, garbage truck, and gardener within sight. As a dog fetches the same stick over and over, so does a status page fetch the attention of your users – especially during a live incident – with each browser refresh as they wait for the status to change.
In this new world of digital everything, new application versions usually mean that you’re going to get bigger and better features, more capabilities, and an uplifted user experience, right? When I talk to customers, many can’t wait to upgrade the PagerDuty integrations that they depend on to test new features. If you’re a PagerDuty for Slack user, the next-generation version of our Slack integration will certainly be an exciting development.
You've joined a company, or worked there a little while, and you've just now realised that you'll have to do on-call. You feel like you don't know much about how everything fits together, so how are you supposed to fix it at 2am when you get paged? So you're a little nervous. Understandable. Here are a few tips to help you become less nervous.
Many sectors suffered during the COVID-19 pandemic, but the travel and hospitality industry was struck particularly hard as the world went into lockdown and governments urged us to stay home. According to the International Air Transport Association, global air passenger demand in 2020 was down a record 65.9% from the previous year, and the tourism industry saw an estimated loss of 100.8 million jobs worldwide.
Monitoring solutions are a vital component in managing an application’s environment. From the systems layer all the way up to the end user’s connection to the app, you want to find out how the platform is performing. Indicators like CPU, memory, the number of connections, and overall health help teams make informed decisions for guaranteeing uptime. Teams monitor metrics (short-term information) and logs (long-term information) mainly from a reactive perspective.
“SLO is a favorite word of SREs,” Grafana Labs Principal Software Engineer Björn “Beorn” Rabenstein said during his talk at KubeCon + CloudNativeCon NA 2019. “Of course, it’s also great for design decisions, to set the right goals, and to set alerting in the right way. It’s everything that is good.” So what happens when things go bad?
If you thought that the product announcements from PagerDuty’s largest event of the year, PagerDuty Summit 2021, were all we had in store for you, think again! We’re excited to announce that the July Release comes with a new set of updates and enhancements to the PagerDuty platform! You can learn about our latest capabilities via the Q1 PagerDuty Pulse or read below for the highlights.
While most businesses have an emergency preparedness plan in place, organizations have to wonder if their current plans are enough to defend against the growing list of major incidents and critical events affecting business. According to the 2020-21 Major Incident Management Annual report, an emergency preparedness plan isn’t enough to combat the growing threat landscape. To combat the rise in critical events, organizations must maximize operational resilience.
An effective monitoring system is paramount to smooth business operations. As the need for a fast, responsive software experience gains momentum, monitoring becomes an indispensable driving force. Monitoring systems enable IT teams to proactively observe the health and responsiveness of critical environments and applications. Without monitoring, organizations must depend on customers or internal departments to receive notice of system issues.
It's time to break down the silos separating SREs from security engineers.
CloudOps is on the up. This is in part due to the rapid acceleration of the shift to cloud that was caused by the pandemic. The shift allowed companies to innovate faster, enjoy greater flexibility and scalability, and become more cost efficient. Many organizations who rapidly adopted cloud or increased their usage now realize that they need to better manage their cloud investments in order to fully embrace these benefits.
The Best in Enterprise Resilience™ Certification program affirms your organization’s readiness to manage critical events across a number of domains.
Technical teams are under more pressure than ever to move faster, protect revenue and availability, and push mean time to resolve (MTTR) ever lower. However, teams frequently find themselves encumbered by complex, repetitive, and manual tasks, rather than innovating. When urgent incidents arise, organizations often have to wait for specific developers or subject matter experts (SMEs) to deploy a fix.
4 best practices for breaking down silos and establishing a culture of shared responsibility toward reliability.
We know commitment issues are the real deal, especially when it comes to significant and costly tech investments. Understanding how the market is performing and what’s up ahead is critical for investing in AIOps. Our crew is here to help you through the challenging decision-making days and offer up the best analyst guidance.
In my past experience as an SRE I’ve learned some valuable lessons about how to respond to and learn from incidents. Declare and run retros for the small incidents. It's less stressful, and action items become much more actionable. Decrease the time it takes to analyze an incident. You'll remember more, and will learn more from the incident. Alert on pain felt by people, not computers. The only reason we declare incidents at all is because of the people on the other side of them.
Software is eating the world. Digital Transformation is top of mind for companies looking to meet ever-growing consumer demands and digitize manual processes. This isn’t unique to the technology industry. Ecommerce, finance, healthcare, and other industries are all moving in this direction.
James Beard, the pioneer of television cooking shows, once asked, “Where would we be without salt?” Salt is often underrated, even though it is the ingredient that has the greatest impact on food and flavor in the modern world. It has its own taste, but also balances and enhances the flavor of other ingredients. Salt boosts sweetness and blocks bitterness, and it has scientifically proven capabilities to intensify flavor compounds that are too subtle to detect (i.e.
We hope June was as good to you as it was to us. Our latest updates, available now, will keep you relaxing poolside this summer knowing that your monitoring, event correlation, and incident workflows are all connected and automated through the cloud. If you’re not relaxing with a little cloud coverage keeping you cool, then come check out Moogsoft to see how you can keep your services available and your customers happy, so you can get to relax with a little more time in your day.
2020 broke records for both the number of cyberattacks and the resulting cost to organizations, which makes it clear that the state of cybersecurity readiness is not very encouraging, to say the least.
Healthcare institutions are increasingly implementing clinical communication and collaboration (CC&C) platforms to improve the productivity of care teams. Automated CC&C platforms perfect care orchestration plans to ensure providers have the means to satisfy the ever-changing needs of patients. Key features of CC&C platforms include real-time, secure mobile messaging and alerting; digital, intelligent on-call schedules; time-stamped message statuses; and automated alert escalations.
Rootly is on a mission to create a world where maintaining reliability is frictionless, delightful, and accessible to anyone, making resolving and learning from incidents every organization's superpower.
I have been working on a couple of monitoring ideas for Cherwell. I didn’t see anything with a quick online search, and I enjoy authoring MPs to monitor applications; it is the closest I’ll get to 007. I’ve hit a major hurdle and I need to ask for a hand from the community. We have a lab environment that’s worked great while developing the Cherwell integration for Connection Center. However, it is not a good simulation of an actual deployment.
The time has come! Users in SIGNL4 can now be a member of multiple teams. This allows for staff to be on duty in multiple groups or departments in parallel and to receive related alert notifications for incidents that occur in the different teams. In addition, you can now also send Signls to multiple teams. All details are available in this article.
From network problems to computer failures, a variety of incidents can disrupt operations for systems in outer space.
SIGNL4 offers powerful duty scheduling for routing alerts to the right people at the right time. In some cases, customers use other tools as the leading system for duty scheduling, e.g. SAP, Excel, etc. Here we describe how to import duty schedules from .csv files. If you use other tools or other formats, you can first export your schedules into a .csv file and proceed from there.
Recently we have received a lot of requests for Enterprise Alert to not only alert on critical situations but to also take a proactive approach to initiate, record and track those situations through ITSM tools such as ServiceNow and BMC Remedy. This post will center around what happens when critical systems fail and tickets are not being created in BMC due to a break in the workflow.
Yes, time travel is possible...through data. My ability to time travel began when I started coding at age 10. Back then, all of my code ran on my own little computer. Like many ten-year-olds, I coded to create and play games. I also coded cool graphics to accompany music to impress my friends and utilities for copying. I launched my first commercial website in 1996 and made 25 guilders, which was good money for a 15-year-old. Life was so easy.
It’s been a month since Dinesh and I humbly high-fived on leaving the meeting where Charlie and Lucia gave us the green light to roll Moogsoft out across the whole of C&Js, and I’m feeling a little weary. Change is hard. I’ve also made it harder on myself by persuading Charlie we should also migrate our service desk solution.