Overview of AI Evaluation (The Context Window #05)

Jun 25, 2026

Can you actually trust an AI agent? In this pre-recorded episode of The Context Window, Nicole van der Hoeven sits down with Yas Ekinci, an engineer on the Grafana AI team, to talk about evals — how Grafana measures the quality and reliability of the AI it ships.
They get into the difference between online and offline evals, why reviewing AI-generated code has become the real bottleneck, the "final answer problem" of plausible-but-wrong outputs, and o11y-bench, Grafana's open benchmark for observability agents. Along the way: pass@3 vs pass^3, the evaluation loop, fact-based rubrics, and a look at where AI evaluation still falls short as an industry.
Plus the usual round-up: Grafana Assistant landing on OSS/self-hosted, the new Unprompted community blog, Warp going open source, and ChatGPT's mysterious goblin problem.
Chapters and links below. Got an evals question? Drop it in the comments and we'll get Yas to answer.
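
New to pass@3 vs pass^3? Here is a minimal sketch of the two metrics, assuming the common definitions: pass@k is the chance that at least one of k attempts at a task succeeds, and pass^k is the chance that all k succeed. The function names and the example numbers are illustrative assumptions, not taken from o11y-bench or from the episode.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts passes.

    n = attempts recorded for a task, c = how many of them passed.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that k attempts drawn from the n are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k attempts pass (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that k attempts drawn from the n are all passes.
    return comb(c, k) / comb(n, k)


# Hypothetical agent that passes a task on 2 of 3 recorded attempts.
lucky = pass_at_k(n=3, c=2, k=3)      # at least one pass: looks perfect
reliable = pass_hat_k(n=3, c=2, k=3)  # all three pass: it never happened
print(f"pass@3 = {lucky:.2f}, pass^3 = {reliable:.2f}")
```

One failed run out of three leaves pass@3 at its maximum while pass^3 drops to zero, which is why the second metric rewards reliability over luck.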

TIMESTAMPS:

00:00 – Welcome + meet Yas Ekinci
05:40 – Announcements: Assistant on OSS, the Unprompted blog, AI Weekly
08:06 – Last month in AI: Warp goes open source, the goblin problem, new models
13:15 – Why it's hard to trust AI: the review bottleneck
18:26 – Process vs. outcome: the "final answer problem"
21:58 – What are evals? Offline vs. online
28:30 – Offline evals, a deep dive
38:35 – o11y-bench: the open observability benchmark
40:44 – pass@3 vs pass^3: reliability over luck
50:45 – The evaluation loop explained
59:09 – Bring your own model + where eval still falls short

Links/resources:

o11y-bench (benchmark): https://github.com/grafana/o11y-bench
o11y-bench leaderboard: https://o11ybench.ai/
o11y-bench announcement blog: https://grafana.com/blog/o11y-bench-open-benchmark-for-observability-agents/
Anthropic, "Demystifying evals for AI agents": https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Grafana Assistant on self-hosted/OSS (Nicole's how-to video/blog): https://youtu.be/0O0ZqrruGns
Grafana Assistant becomes available on-prem (announcement): https://grafana.com/whats-new/2026-04-21-grafana-assistant-becomes-available-on-prem/
AI Observability in Grafana Cloud: https://grafana.com/blog/ai-observability-for-agents-in-grafana-cloud/
Unprompted (Grafana Labs community blog on Medium): https://medium.com/grafana-labs
Matt Shumer, "Something Big Is Happening": https://shumer.dev/something-big-is-happening
Warp is now open source: https://www.warp.dev/blog/warp-is-now-open-source
OpenSpec: https://github.com/Fission-AI/OpenSpec
Harbor / terminal-bench (the framework o11y-bench is built on): https://www.harborframework.com/

Learn about Grafana Assistant: https://grafana.com/docs/grafana-cloud/machine-learning/assistant/
Check out our AI blog, Context Horizon: https://gra.fan/ch
Learn how we handle your privacy and security for Grafana Assistant: https://grafana.com/docs/grafana-cloud/machine-learning/assistant/privacy-and-security/
Check out the pricing page (Assistant is included in the free tier too): https://grafana.com/pricing/
Get started with the Grafana Cloud forever-free tier: https://grafana.com/g/cloud
Have a question? Ask Grot, your AI helper: https://grafana.com/grot/
Reach out in our community forums: https://gra.fan/communityyf

Thanks for watching!

👍 Was this video helpful? Like and subscribe to our channel for more videos.

Connect with Grafana Labs:
X: https://www.twitter.com/grafana
LinkedIn: https://www.linkedin.com/company/grafana-labs/
Facebook: https://www.facebook.com/grafana

#Grafana #Observability #assistant #ai #actuallyusefulai