
That Rogers Outage is Going to be More Expensive Than You Think

On July 8, 2022, the Canadian telecom company Rogers Communications suffered a major outage that impacted most of Canada for almost two days. This wasn’t completely unprecedented (an outage in 2021 had disrupted their wireless services for several hours), but the breadth and severity of this one is going to end up costing them far, far more than it seems at first glance.

A preview of our upcoming redesign

Earlier this year, we announced that one of our goals for the year is to take the UI of Oh Dear to the next level. Behind the scenes, our team is working hard on a complete rewrite of our marketing website and app, and we're currently targeting the end of September to launch the redesign. In this blog post, we'd like to give you a preview of it.

New in Grafana 9.1: Trace to metrics allows users to navigate from a trace span to a selected data source

Traces, logs, and metrics provide inherently different views into a system, which is why correlating between them is important. With features like exemplar support, trace to logs, and span references, you can quickly jump between most telemetry signals in Grafana. With the release of Grafana 9.1, we’re improving Grafana’s ability to correlate different signals by adding the functionality to link between traces and metrics.

See the big picture with the Service Dependency Graph

Understanding the impact and scope of an incident when degradation occurs is critical for bringing your service back online. This requires modeling the many downstream and upstream relationships between your services. Our new Service Dependency Graph provides a shortcut: a way to surface dependencies quickly, understand the relationships between services, and determine the scope or impact of an incident.

How to Supercharge Your Website Monitoring in 5 Minutes or Less

I’m a recent entrant to the website monitoring game, but there is one thing I realized straight away: a monitoring tool is only as good as its configuration. Website monitoring is at its best when it’s reliable, informative, and efficient: when it gives you the information you need, when you need it, and the peace of mind to say “if I’m not being alerted, I know it’s still working.”

Using N-central for Server Hardware Monitoring

While it’s fair to say that recent years have seen a shift toward servers deployed in the cloud on Microsoft Azure or AWS, I’m sure that if you’re reading this, you still have a large percentage of physical servers under your management, including Hyper-V and ESXi hosts. N-central’s ESXi monitoring should automatically detect and monitor the hardware in these boxes, but what about the rest?

Top 5 IoT challenges and how to solve them

Enterprises in the IoT sector have a number of challenges to surmount, including achieving a short time to market, ensuring airtight security, building a versatile update mechanism for hardware and software, and mastering device management. The more planning and practical steps go into these key considerations, the faster an IoT project can get to market and make an impact on the world.

Shine Some Light on Your SNS to SQS to Lambda Stack

The combination of SNS to SQS to Lambda is a common sight in serverless applications on AWS, perhaps triggered by messages from an API function. This architecture is great for improving UX by offloading slow, asynchronous tasks so the API can stay responsive. However, it presents an interesting challenge for observability, because observability tools are not able to trace invocations through this combination end-to-end. In X-Ray, for example, the trace would stop at SNS.
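One way to keep some correlation across the hop, sketched here under stated assumptions rather than as the article's own approach, is to propagate an ID by hand. The sketch assumes the publishing API function sets a hypothetical `trace_id` String message attribute on the SNS message, and that raw message delivery is disabled on the SQS subscription, so each SQS record body is the JSON SNS envelope. The Lambda handler unwraps that envelope and logs the ID so the consumer's logs can be joined with the publisher's; it does not by itself stitch an X-Ray trace back together.

```python
from __future__ import annotations

import json
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)

# Hypothetical message attribute that the publishing API function is assumed to set.
TRACE_ATTRIBUTE = "trace_id"


def unwrap_sns_from_sqs(record: dict[str, Any]) -> tuple[str, Optional[str]]:
    """Return (message, trace_id) for one SQS record carrying an SNS notification.

    Assumes raw message delivery is disabled on the SNS subscription, so the
    SQS body is the JSON SNS envelope with "Message" and "MessageAttributes".
    """
    envelope = json.loads(record["body"])
    attributes = envelope.get("MessageAttributes", {})
    # Envelope attributes look like {"trace_id": {"Type": "String", "Value": "..."}}
    trace_id = attributes.get(TRACE_ATTRIBUTE, {}).get("Value")
    return envelope["Message"], trace_id


def handler(event: dict[str, Any], context: Any = None) -> list[Optional[str]]:
    """Lambda entry point for the SQS trigger; returns the trace IDs it saw."""
    trace_ids: list[Optional[str]] = []
    for record in event.get("Records", []):
        message, trace_id = unwrap_sns_from_sqs(record)
        # Logging the propagated ID lets you join this invocation's logs with
        # those of the upstream API invocation that published the message.
        logger.info("processing %r", message, extra={"trace_id": trace_id})
        trace_ids.append(trace_id)
    return trace_ids


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    envelope = {
        "Type": "Notification",
        "Message": "resize image 42",
        "MessageAttributes": {"trace_id": {"Type": "String", "Value": "abc-123"}},
    }
    print(handler({"Records": [{"body": json.dumps(envelope)}]}))  # ['abc-123']
```

If raw message delivery is enabled instead, the SQS body is the bare SNS message and the attributes arrive as SQS `messageAttributes`, so the unwrapping step would need to change accordingly.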

What is Chaos Engineering? A Guide on Its History, Key Principles, and Benefits

Many organizations invest in high availability and disaster recovery for their key applications. Too many of these organizations, however, forgo the most important part of this process: regularly testing failover. Whether gripped by the fear of downtime or of dreaded DNS problems, development teams are frequently hesitant to test what they’ve built in the real world.