SLOs with Prometheus done wrong, wrong, wrong, wrong, then right

Jan 10, 2024

We have Carson Anderson, Sr. DevOps Engineer at Weave HQ, talking about how they implemented SLOs using Prometheus, what went wrong, and how they fixed it.

This talk was given at the "Last9 of Reliability" Discord community on 13th December. Join the community here: https://discord.com/invite/Q3p2EEucx9

Talk Description:

First things first: yes, it really did take us 5 tries to implement our SLOs with Prometheus. While that may seem embarrassing, we are very happy to share our SLO journey so that we can hopefully help you avoid the same mistakes.

So why did it take us 5 tries? In a word: scale. We needed to handle 28 days' worth of data for over 400 microservices and still have responsive dashboards and alerts. Luckily, Prometheus provides some amazing features for dealing with large or slow queries. Unfortunately, many of our first attempts failed badly because we misunderstood and misused those features.

This talk will walk you through all the phases of our SLO rollout. By the end, we hope to help you see how to get the most value out of Prometheus, while also illustrating some common pitfalls and how to avoid them.

Slides - https://docs.google.com/presentation/d/1m8MkJCX091omAfq-ZurT0Cnst5hOecwWPztfWyz_ENQ/edit#slide=id.g16caaf6ca07_0_1231

You can find Carson here:

Twitter: https://twitter.com/carson_ops

LinkedIn: https://www.linkedin.com/in/carsonoid/

GitHub: https://github.com/carsonoid

Follow Last9:

Twitter: https://x.com/last9io
LinkedIn: https://www.linkedin.com/company/last9/
Blog: https://last9.io/blog