Latest Posts

Quantifying WordPress Performance Improvements with circonus-logwatch

Deriving meaningful insights from third-party logs has always been a difficult yet necessary task. Most analysis occurs after the fact, when something has gone wrong. Very few tools allow real-time monitoring of logs, so SREs have become accustomed to backfilling log data into various analysis tools. Postmortem log analysis is the de facto standard, but should it be? Why shouldn’t you be able to monitor your server logs in real time?

A Guide To Service Level Objectives, Part 2: It All Adds Up

Statistical analysis is a critical – but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. However, we aim to provide enough information here to give you a basic understanding of the math behind a smart SLO – and why it’s so important that you get it right.

Monitoring DevOps: Where are we now? [Infographic]

Our first DevOps & Monitoring Survey was conducted at ChefConf 2015. This year, we’ve created an infographic based on the facts and figures from our 2018 Monitoring DevOps Survey. The infographic provides a visual representation of the prevalence of DevOps, how monitoring responsibilities are distributed, metrics usage, and various aspects of current monitoring tools.

A Guide To Service Level Objectives, Part 1: SLOs & You

Four steps to ensure that you hit your targets – and learn from your successes. Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals?

Air Quality Sensors and IoT Systems Monitoring

2017 was a bad year for fires in California. The Tubbs Fire in Sonoma County in October destroyed whole neighborhoods and sent toxic smoke south through most of the San Francisco Bay Area. The Air Quality Index (AQI) for parts of that area rose past the “unhealthy for sensitive groups” level (101–150) to the hazardous level (301–500) at certain points during the fire. Once word got out that N99 dust masks were needed to keep the harmful particles out of the lungs, the masks became a common sight.

Air Quality Sensors and IoT Systems Monitoring

Operating containerized infrastructure brings with it a new set of challenges. You need to instrument your containers, evaluate your API endpoint performance, and identify bad actors within your infrastructure. The Istio service mesh enables instrumentation of APIs without code changes and provides service latencies for free. But how do you make sense of all that data? With math, that’s how.

Less Toil, More Coil - Telemetry Analysis with Python

We kept hearing the same request from many customers: "How can I analyze my data with Python?" The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of advanced data analytics capabilities. Seamless integration with this environment is therefore important for our customers who want to make use of those tools.

Cassandra Query Observability with Libpcap and Protocol Observer

Opinions vary in recent online discussions regarding systems and software observability. Some state that observability is a replacement for monitoring. Others hold that they are parallel mechanisms, or that one is a subset of the other (not to mention where tracing fits into such a hierarchy). Monitoring Weekly recently provided a helpful list of resources for an overview of this discussion, as well as some practical applications of observability.

Effective Management of High Volume Numeric Data with Histograms

How do you capture and organize billions of measurements per second such that you can answer a rich set of queries effectively (percentiles, counts below X, aggregations across streams), and you don’t blow through your AWS budget in minutes? To effectively manage billions of data points, your system has to be both performant and scalable. How do you accomplish that? Not only do your algorithms have to be on point, but your implementation of them has to be efficient.