
The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Key Observability Scaling Requirements for Your Next Game Launch: Part III

So far in our series on scaling observability for game launches, we’ve discussed ways to 1) quickly analyze large volumes of telemetry data and 2) ensure high-quality telemetry data for more effective analysis at lower costs. These posts outline best practices for scaling observability on game launch day, which is necessary for high performance across all infrastructure components: no lag, no glitches, and no bugs.

Network Log Archiving = Perfect Backwards Visibility

Network monitoring is ideal for getting a real-time view of your connected environment, and with reports, you can look back in time too. Logs are key to this rear-view mirror look, as they contain all the data for all the elements you are monitoring. But without network log archiving, you can only look back so far. Did you know that according to an IBM/Ponemon study, it takes an average of 287 days to discover and contain a data breach?

How I monitor cloud application costs in one simple but powerful dashboard

Although there are many great tools out there to get on top of application monitoring, there’s one vital metric that’s often overlooked by us technical folks – cost. In the days of running apps on servers in private datacenters, the kit was a one-time purchase that the systems team had to deal with. But running apps in public clouds is a different story. Whether you’re running on VMs, containers in Kubernetes, or entirely serverless, execution of your code adds to the bill.

Code-level Application Monitoring for Every Developer

The monitoring, tooling, and observability space is crowded. It’s hard to keep track of what most tools in this category originally set out to do, but if we had to guess, they were probably built to support monolithic architectures with complex systems, to give Ops and IT a way to minimize the impact of an outage.

Inside the migration from Consul to memberlist at Grafana Labs

At Grafana Labs we run a lot of distributed databases. These distributed databases all make use of a hash ring in order to evenly distribute workloads across replicas of certain components. For a more detailed description of the architecture of our projects, check out our Mimir architecture docs.
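
To make the hash ring idea concrete, here is a minimal sketch of a consistent hash ring in Python. It is an illustration only, not the Grafana Labs implementation: the `HashRing` class, the `tokens_per_replica` parameter, and the `ingester-N` replica names are all hypothetical. Each replica registers several tokens on the ring, and a key belongs to the replica that owns the first token at or after the key's hash, wrapping around at the end.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring (hypothetical; for illustration only)."""

    def __init__(self, replicas, tokens_per_replica=64):
        ring = []
        for replica in replicas:
            # Several tokens per replica smooth out the key distribution.
            for i in range(tokens_per_replica):
                ring.append((self._hash(f"{replica}-{i}"), replica))
        ring.sort()
        self._ring = ring
        self._tokens = [token for token, _ in ring]

    @staticmethod
    def _hash(key):
        # First 8 bytes of SHA-256, read as an unsigned 64-bit integer.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, key):
        """Return the replica that owns `key`."""
        # Find the first token greater than the key's hash, wrapping
        # around to the start of the ring when we run off the end.
        idx = bisect.bisect_right(self._tokens, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["ingester-1", "ingester-2", "ingester-3"])
print(ring.owner("series-42"))
```

The useful property of this layout is that adding a replica only moves keys onto the new replica; keys never shuffle between the existing ones.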

Investigate critical alerts on the go with the Datadog mobile app

The Datadog mobile app provides real-time visibility into critical alerts, incidents, and application performance metrics across your entire environment, helping you troubleshoot directly from your mobile device. On-call engineers can quickly evaluate the conditions that triggered an alert, determine its urgency, and decide the next course of action—anywhere, anytime.

Defining and measuring your SLIs and SLOs

Customers expect online services to be available all the time. The truth is that outages happen to almost everyone, because providing 100% service availability is challenging and costly. Creating a reliable and profitable service means, amongst other things, finding the balance between application availability, costs, and time to market. Faster feature delivery can mean lower availability, as constant changes to production may cause issues and introduce bugs.
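
As a rough illustration of that balance, the sketch below computes a request-based availability SLI and the error budget implied by an SLO target. The function names, the 30-day window, and the 99.9% target are assumptions chosen for the example, not values taken from the article.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests served successfully (1.0 if there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Share of the error budget still unspent; negative means the SLO is blown."""
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime the SLO tolerates over the window, in minutes."""
    return (1 - slo_target) * window_days * 24 * 60


# Hypothetical numbers: 10,000 requests, 5 failures, 99.9% availability target.
print(availability_sli(9_995, 10_000))
print(error_budget_remaining(9_995, 10_000, 0.999))
print(allowed_downtime_minutes(0.999))
```

A team that has spent most of its budget might slow feature rollouts until the window resets, which is one way to make the availability versus time-to-market trade-off explicit.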

Getting Started with OpenTelemetry: Three Companies Check Into OTel Observability

Comprehensive observability starts with good instrumentation. OpenTelemetry, aka “OTel,” sets a unified standard, enabling you to instrument your applications once, then send that data to any backend observability tool of your choice. OpenTelemetry’s standard for generating and ingesting telemetry data is slated to become as ubiquitous as current container orchestration standards. Because of this, development teams are increasingly adopting OpenTelemetry in their applications.