High Cardinality in ClickHouse at Scale: What Actually Breaks

ClickHouse swallows high-cardinality telemetry at ingest, then breaks at query time weeks later. Here is what fails, and how we keep it fast in production. Prathamesh works as an evangelist at Last9, runs SRE stories (where SRE and DevOps folks share their stories), and maintains o11y.wiki, a glossary of terms related to observability.

Klaudia Under the Hood: How We Built an AI SRE That Actually Earns Trust

In reliability engineering, being ‘mostly right’ is a liability. An AI SRE that sometimes misses the root cause or gives a confident, wrong answer at 2:17 AM has no place in an enterprise cloud environment. In this context, silence is better than noise. That’s the bar Klaudia is built to clear: genuine reliability that you can trust in production. The kind of reliability that earns a place alongside your best engineers. Getting there requires more than just a capable model.

ClickHouse LowCardinality: When It Helps and When It Hurts

ClickHouse LowCardinality cuts storage and speeds up queries on low-cardinality columns, but backfires on trace IDs. How to tell the difference.

Introducing the Rootly Agent

During an incident, ask the Rootly Agent anything and it'll respond (and act) based on context and your data. The Rootly Agent performs actions on your behalf, so it is bound by the permissions assigned to your user. It will also ask for confirmation before taking significant actions. Rootly admins can turn it on for their workplaces and start running incidents even more efficiently.

Should platform, SRE, and security merge into one function?

Platform, SRE, and security are three distinct functions in modern engineering orgs, each shaped by a different problem. SRE was the operations function's answer to scale: how to keep systems reliable when the systems get big. Platform answered a different problem: how to let developers ship without becoming infrastructure experts. Security drew the line on what could safely reach production.

Running AI at Enterprise Scale w/ Anthropic, Descope, Port, Rootly and Twingate

The debate about whether AI can write production code is over. Companies are handing work to fleets of agents, and at many of them, agents write most of the code that ships to production. The next challenge is everything that happens once an entire engineering organization runs this way, at full speed. Teams that generate code 10x faster still review it at human speed, and that mismatch is now the constraint. Code ownership is also becoming an issue, as developers learn to trust agentic processes a little too much. When an agent breaks production, who is responsible?