
Alerting

Drive continuous improvement with shareable postmortems in Opsgenie

It’s a given that customers expect software and IT services to be high-performing and always on. And, because incidents and downtime will always be a thing, we believe that how you respond can make or break the customer experience. We’ve learned this lesson firsthand while refining our own incident management process over the last decade.

It Came From Below

I’m going to assume most people who read this blog are familiar with PagerDuty. But just in case anyone isn’t, PagerDuty is a tool we use in IT to notify us if some predefined check has failed. Maybe a key process has died or maybe we’re not seeing our expected traffic volume or maybe our server has stopped responding to ping. Whatever it is, PagerDuty will relentlessly, remorselessly, and loudly notify whoever is on call that something needs attention.

Extending the Competitive Advantage in Telecom

The telecom industry has always seemed to navigate tech changes well. As the industry has evolved, it has managed to transform from landline to mobile carriers, then from voice calls to messaging and data-centric networks. In many developed markets, telcos are creating ecosystems for the data-driven economy. The next frontier is shaping up to be one driven by machine learning (ML) and artificial intelligence (AI).

Achieve Better Accountability With Full-Service Ownership

Software teams seeking to provide better products and services must focus on faster release cycles. But running reliable systems at ever-increasing speeds presents a big challenge. Software teams can have both quality and speed by adjusting the policies around ongoing service ownership. While on-call plays a large part in this model, advancement in knowledge, more resilient code, increased collaboration, and practice also mean engineers don’t have to wake up to a nightmare.

What is a post mortem incident? How can we monitor this?

In particular, I very much liked the article that our colleague Sara Martin wrote on the Pandora FMS blog about crisis management in information technology, which lays out the steps involved.

[Image: “Jack’s Lantern” (https://commons.wikimedia.org/wiki/File:Jack-o-lantern.svg)]

This article starts from point number five: when, after a certain recovery time, the crisis has been solved and becomes a post mortem incident. The term comes from Latin and means “after death”.