Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar
By Lenny's Podcast
Summary
# 为什么 AI 评估是产品构建者最热门的新技能 | Hamel Husain & Shreya Shankar **视频信息** - **标题**: 为什么 AI 评估是产品构建者最热门的新技能 | Hamel Husain & Shreya Shankar - **描述**: Hamel Husain 和 Shreya Shankar 教授世界上最受欢迎的 AI 评估课程,并培训了 2,000 多名产品经理和工程师(包括 OpenAI 和 Anthropic 的许多团队)。在这次对话中,他们揭开了开发有效评估的流程,通过实际示例进行讲解,并分享有助于改进 AI 产品的实用技巧。 *您将学到:* 1. 评估(evals)到底是什么 2. 为什么它们已成为 AI 产品构建者最重要的技能 3. 创建有效评估的分步指南 4. 深入探讨错误分析、开放编码和轴向编码 5. 基于代码的评估 vs. LLM 作为裁判 6. 最常见的陷阱以及如何避免它们 7. 实施评估的实用技巧(初始设置后每周 30 分钟) 8. 对“感觉”与系统化评估之间辩论的见解 *由以下赞助:* Fin—排名第一的客户服务 AI 代理:https://fin.ai/lenny Dscout—UX 平台,可在每个阶段捕获见解:从构思到生产:https://www.dscout.com/ Mercury—简化财务的艺术:https://mercury.com/ *文字记录*:https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill *我的最大收获(付费通讯订阅者):* https://www.lennysnewsletter.com/i/173871171/my-biggest-takeaways-from-this-conversation *在哪里找到 Shreya Shankar* • X:https://x.com/sh_reya • LinkedIn:https://www.linkedin.com/in/shrshnk/ • 网站:https://www.sh-reya.com/ • Maven 课程:https://bit.ly/4myp27m *在哪里找到 Hamel Husain* • X:https://x.com/HamelHusain • LinkedIn:https://www.linkedin.com/in/hamelhusain/ • 网站:https://hamel.dev/ • Maven 课程:https://bit.ly/4myp27m *在哪里找到 Lenny:* • 通讯:https://www.lennysnewsletter.com • X:https://twitter.com/lennysan • LinkedIn:https://www.linkedin.com/in/lennyrachitsky/ *本期节目涵盖内容:* (00:00) Hamel 和 Shreya 简介 (04:57) 什么是评估(evals)? (09:56) 演示:检查房产管理 AI 助手真实轨迹 (16:51) 记录错误 (23:54) 为什么 LLM 无法取代人类进行初步错误分析 (25:16) 评估过程中的“仁慈的独裁者”概念 (28:07) 理论饱和:何时停止 (31:39) 使用轴向编码帮助分类和综合错误笔记 (44:39) 结果 (46:06) 构建 LLM 作为裁判来评估特定的故障模式 (48:31) 基于代码的评估与 LLM 作为裁判的区别 (52:10) 示例:LLM 作为裁判 (54:45) 将您的 LLM 裁判与人类判断进行测试 (01:00:51) 为什么评估是 AI 产品的 P.R.D.(产品需求文档) (01:05:09) 您实际需要多少评估 (01:07:41) 评估之后是什么 (01:09:57) 伟大的评估辩论 (1:15:15) 为什么“内部试用”(dogfooding)对大多数 AI 产品来说不够 (01:18:23) OpenAI 收购 Statsig (01:23:02) Claude Code 争议和语境的重要性 (01:24:13) 关于评估的常见误解 (1:22:28) 有效实施评估的技巧和窍门 (1:30:37) 时间投入 (1:33:38) 他们的综合评估课程概述 (1:37:57) 闪电战和最终想法 *LLM 日志开放代码分析提示:* _请分析以下 CSV 文件。其中有一个元数据字段,它有一个名为 z_note 的嵌套字段,其中包含我们正在进行的 LLM 日志分析的开放代码。请提取所有不同的开放代码。从 _note 字段中,提出 5-6 个可以从中创建轴向代码的类别。_ *引用的:* • 构建改进 AI 产品的评估系统:https://www.lennysnewsletter.com/p/building-eval-systems-that-improve • Mercor:https://mercor.com/ • Brendan Foody 在 LinkedIn:https://www.linkedin.com/in/brendan-foody-2995ab10b • Nurture Boss:https://nurtureboss.io/ • Braintrust:https://www.braintrust.dev/ • Andrew Ng 在 X:https://x.com/andrewyng • 进行错误分析:https://www.youtube.com/watch?v=JoAxZsdw_3w • Julius AI:https://julius.ai/ • Brendan Foody 在 X—“评估是新的 P.R.D.”:https://x.com/BrendanFoody/status/1939764763485171948 ...引用继续于:https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill *推荐书籍:* • 《弹子球》(Pachinko):https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935 • 《苹果在中国》(Apple in China: The Capture of the World’s Greatest Company):https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/ • 《机器学习》(Machine Learning):https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955 • 《人工智能:一种现代方法》(Artificial Intelligence: A Modern Approach):https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/ _制作和营销由 https://penname.co/ 负责。_ _有关赞助播客的咨询,请发送电子邮件至 podcast@lennyrachitsky.com。_ Lenny 可能是讨论中公司的投资者。 - **频道**: Lenny's Podcast --- ## 主要收获 以下是观众可以从视频中学到的关键课程: * **评估(Evals)是衡量和改进 AI 应用的系统性方法**:它们本质上是对 LLM 应用的数据分析,通过创建指标来衡量性能并指导迭代和实验,而不是仅仅依赖“感觉”或“直觉”。(04:57) * **错误分析是评估的首要步骤,并且需要人工参与**:通过审查实际用户交互日志(“轨迹”),识别并记录问题(开放编码),可以发现 AI 应用的盲点,这是 AI 本身目前无法完成的。(16:51) * **LLM 在合成和分类错误笔记方面非常有价值**:在人工进行初步的自由格式笔记(开放编码)后,可以使用 LLM 
将这些笔记归类到更广泛的类别(轴向编码)中,从而帮助识别最常见的故障模式。(31:39) * **LLM 作为裁判(LLM-as-judge)是一种自动化评估方法**:对于难以通过简单代码评估的复杂故障模式,可以使用 LLM 来评估其性能,但必须仔细构建提示并将其与人类判断进行校准,以确保准确性。(46:06) * **评估是 AI 产品的新型 P.R.D.(产品需求文档)**:它们是根据实际数据驱动的,能够识别出在产品开发初期可能未预料到的需求和故障模式,并需要持续迭代。(01:00:51) ## 智能章节 按发言时间顺序组织内容模块。每个章节包含一个一句话的简洁标题和描述。 * **00:00 - 04:57: 介绍嘉宾** 介绍播客主持人 Lenny Rachitsky,以及他的两位嘉宾 Hamel Husain 和 Shreya Shankar,他们是 AI 评估领域的专家,并教授着非常受欢迎的课程。 * **04:57 - 09:56: 什么是评估(Evals)?** 定义了 AI 评估,将其描述为衡量和改进 AI 应用的系统性方法,并将其与传统的软件工程单元测试进行区分。 * **09:56 - 23:54: 演示:房产管理 AI 助手评估** 通过一个房产管理 AI 助手的真实交互示例,演示了如何查看 AI 的日志(轨迹)并进行初步的错误分析和开放编码。 * **23:54 - 31:39: 人工错误分析与理论饱和** 讨论了为什么 LLM 目前无法完全取代人类进行初步的自由格式错误分析,并介绍了“仁慈的独裁者”概念以及何时停止收集错误笔记(理论饱和)。 * **31:39 - 44:39: 使用 LLM 进行轴向编码** 展示了如何利用 LLM 将人工记录的自由格式错误笔记(开放代码)进行分类和综合,生成更具可操作性的故障模式类别(轴向代码)。 * **44:39 - 54:45: 构建 LLM 作为裁判** 解释了 LLM 作为裁判的概念,以及如何构建用于评估特定故障模式的 LLM 提示,并强调了测试其与人类判断的一致性。 * **54:45 - 01:05:09: 评估作为 AI 产品的 P.R.D.** 讨论了评估如何成为 AI 产品的新型 P.R.D.,它们是根据实际数据驱动的,并且可以不断演变。 * **01:05:09 - 01:09:57: 评估的数量与后续步骤** 探讨了实际需要的评估数量,以及评估之后如何利用这些发现来改进产品,例如将评估集成到单元测试或生产监控中。 * **01:09:57 - 01:18:23: 评估的辩论与误解** 深入探讨了关于评估的争议,包括“感觉”与系统化评估的辩论,以及为什么内部试用(dogfooding)本身可能不足以进行全面的评估。 * **01:18:23 - 01:24:13: OpenAI 收购 Statsig 与 Claude Code 争议** 讨论了 OpenAI 收购 Statsig 的潜在影响,以及 Claude Code 团队声称不进行评估而依赖“感觉”的争议,并强调了语境的重要性。 * **01:24:13 - 01:33:38: 评估的常见误解、技巧与时间投入** 分享了关于评估的常见误解,如自动化工具的局限性,并提供了实用的技巧和关于实施评估所需时间投入的见解。 * **01:33:38 - 01:37:57: 综合评估课程概述** 介绍了 Hamel 和 Shreya 在 Maven 上提供的综合评估课程,包括课程内容、学习成果以及为学生提供的额外福利。 * **01:37:57 - 结束: 闪电战与最终想法** 进行了一个快速问答环节,嘉宾们分享了他们推荐的书籍、喜欢的节目、产品、生活座右铭以及对彼此的看法,并提供了最后的建议。 ## 关键语录 以下是视频中最具启发性/反常识性/令人难忘/影响力最大/发人深省的语录: * “为了构建出色的 AI 产品,你需要非常擅长构建评估。这是你能进行的最高投资回报的活动。” (00:00) * “评估(Evals)是一种系统地衡量和改进 AI 应用的方式。” (04:57) * “最常见的误解是:我们生活在 AI 时代,AI 能不能自己进行评估?但事实并非如此。” (01:24:13) * “目标不是完美地进行评估。目标是切实地改进你的产品。” (01:15:15) * “我最喜欢 Perk 是有一本 60 页的书,我们精心撰写,详细介绍了如何进行评估的整个过程。” (01:37:57) ## 故事和轶事 以下是演讲者分享的最有趣、最令人难忘和最令人惊讶的故事/轶事: * **房产管理 AI 的“幽灵”服务**:在审查房产管理 AI 助手的交互日志时,发现 AI 向用户承诺提供虚拟参观服务,但实际上该服务并不存在。这个例子突显了 AI 可能产生幻觉,以及人工审查的重要性,因为 AI 本身无法识别这种“产品负面气味”。(16:51) * **“感觉”与系统化评估的辩论**:Claude Code 团队声称他们不进行评估,只依靠“感觉”。但嘉宾们认为,这可能是因为他们依赖于底层模型的强大评估,并且可能在内部进行了某种形式的系统化错误分析,只是没有明确使用“评估”这个词。(01:09:57) * **AI 驱动的评估工具开发**:嘉宾们提到,他们的一些客户在认识到数据分析的重要性后,会花费几小时时间构建自己的简单 Web 应用来简化评估过程,这表明利用 AI 降低摩擦是可行的。(01:22:28) ## 提及的资源 * **Hamel Husain 和 Shreya Shankar 的 AI 评估课程(Maven)**: 世界上最受欢迎的 AI 评估课程,已培训超过 2,000 名产品经理和工程师。(00:00) * **Fin**: 排名第一的客户服务 AI 代理 (00:43) * **Dscout**: UX 平台,用于捕获每个阶段的见解 (00:51) * **Mercury**: 简化财务的艺术 (00:58) * **Nurture Boss**: 房产管理 AI 助手示例公司 (09:56) * **Braintrust**: 用于加载 AI 应用日志的工具 (11:46) * **Langsmith**: 用于加载 AI 应用日志的工具 (11:50) * **Andrew Ng**: 机器学习研究员,曾在 8 年前讨论过错误分析 (33:22) * **Julius AI**: 一个可以用于数据科学和 LLM 分析的笔记本工具 (44:39) * **Cloude (Claude)**: 用于分析 CSV 文件和生成轴向代码的 LLM (31:39) * **Gemini**: 用于将笔记分类到预定义类别的 LLM (44:39) * **Statsig**: AB 测试公司,被 OpenAI 收购 (01:18:23) * **Claude Code**: AI 编码代理,在评估辩论中被提及 (01:11:38) * **Codex**: OpenAI 的 AI 编码模型 (01:15:15) * **Pachinko (小说)**: Shreya 推荐的虚构类书籍 (01:37:57) * **Apple in China: The Capture of the World’s Greatest Company (书籍)**: Shreya 推荐的非虚构类书籍 (01:38:12) * **Machine Learning by Tom M. 
Mitchell (书籍)**: Hamel 推荐的机器学习教科书,强调奥卡姆剃刀原则。(01:38:31) * **Artificial Intelligence: A Modern Approach (书籍)**: Hamel 推荐的 AI 教科书,强调人类的独创性。(01:39:06) * **The Wire (电视剧)**: Shrea 和她的丈夫最近在观看的电视剧。(01:40:06) * **Frozen (电影)**: Hamel 在陪伴孩子时观看的电影。(01:39:37) * **Cursor**: AI 辅助编码工具,Shreya 喜欢使用。(01:40:48) * **Claude Code**: Hamel 喜欢的 AI 辅助编码工具,尤其赞赏其用户体验。(01:41:16) * **Lennybot.com**: Lenny 的 AI 助手,用于课程内容查询。(01:35:32)
Topics Covered
- Evals: The Highest ROI Activity for AI Products
- The Benevolent Dictator: One Domain Expert for Evals
- Why 'Vibe Checks' Fail: The Need for Systematic Evals
- Ground Evals in Data: Start with Error Analysis, Not Tests
- LLM Judges Must Be Binary: Yes/No, Not Rating Scales
Full Transcript
To build great AI products, you need to
be really good at building evals. It's
the highest ROI activity you can engage
in. This process is a lot of fun.
Everyone that does this immediately gets
addicted to it when you're building an
AI application. You just learn a lot.
What's cool about this is you don't need
to do this many, many times. For most
products, you do this process once and
then you build on it.
>> The goal is not to do evals perfectly.
It's to actionably improve your product.
>> I did not realize how much controversy
and drama there is around eval. There's
a lot of people with very strong
opinions. People have been burned by
evals in the past. People have done
evals badly, then they didn't trust it
anymore and then they're like, "Oh, I'm
anti- evals."
>> What are a couple of the most common
misconceptions people have with EVEL?
The top one is we live in the age of AI.
Can't the AI just eval it? But it
doesn't work. A term that you used in
your post that I love is this idea of a
benevolent dictator. When you're doing
this open coding, a lot of teams get
bogged down in having a committee do
this. For a lot of situations, that's
wholly unnecessary. You don't want to
make this process so expensive that you
can't do it. You can appoint one person
whose taste that you trust. It should be
the person with domain expertise.
Oftentimes it is the product manager.
Today my guests are Hamill Hussein and
Shrea Shankar. One of the most trending
topics on this podcast over the past
year has been the rise of evals. Both
the chief product officers of Anthropic
and OpenAI shared that eval are becoming
the most important new skill for product
builders. And since then, this has been
a recurring theme across many of the top
AI builders I've had on. 2 years ago, I
had never heard the term evals. Now,
it's coming up constantly. When was the
last time that a new skill emerged that
product builders had to get good at to
be successful? Hamill and Shrea have
played a major role in shifting evals
from being an obscure mysterious subject
to one of the most necessary skills for
AI product builders. They teach the
definitive online course on evals, which
happens to be the number one course on
Maven. They've now taught over 2,000 PMs
and engineers across 500 companies,
including large swats of the open AI and
anthropic teams along with every other
major AI lab. In this conversation, we
do a lot of show versus tell. We walk
through the process of developing an
effective eval, explain what the heck
evals are and what they look like,
address many of the major misconceptions
with eval, give you the first few steps
you can take to start building evals for
your product, and also share just a ton
of best practices that Hamill and Trey
have developed over the past few years.
This episode is the deepest yet most
understandable primer you will find on
the world of evals and honestly got me
excited to write evals. Even though I
have nothing to write eels for, I think
you'll feel the same way as you watch
this. If this conversation gets you
excited, definitely check out Hamill and
Shreas's course on Maven. We'll link to
it in the show notes. If you use the
code Lenny's List when you purchase the
course, you'll get 35% off the price of
the course. With that, I bring you
Hamill Hussein and Shrea Shankar. This
episode is brought to you by Finn, the
number one AI agent for customer
service. If your customer support
tickets are piling up, then you need
Finn. Finn is the highest performing AI
agent on the market with a 65% average
resolution rate. Finn resolves even the
most complex customer queries. No other
AI agent performs better. In
head-to-head bake offs with competitors,
Finn wins every time. Yes, switching to
a new tool can be scary, but Finn works
on any help desk with no migration
needed, which means you don't have to
overhaul your current system or deal
with delays in service for your
customers. And Finn is trusted by over
5,000 customer service leaders and top
AI companies like Anthropic and
Synthesia. And because Finn is powered
by the Finn AI engine, which is a
continuously improving system that
allows you to analyze, train, test, and
deploy with ease. Finn can continuously
improve your results, too. So, if you're
ready to transform your customer service
and scale your support, give Finn a try
for only 99 cents per resolution. Plus,
Finn comes with a 90-day money back
guarantee. Find out how Finn can work
for your team at f.ai. ai/lenny. That's
finn.ai/lenny.
This episode is brought to you by Doutt.
Design teams today are expected to move
fast, but also to get it right. That's
where Dout comes in. Dcout is the
all-in-one research platform built for
modern product and design teams. Whether
you're running usability tests,
interviews, surveys, or in the wild
fieldwork, Dout makes it easy to connect
with real users and get real insights
fast. You can even test your Figma
prototypes directly inside the platform.
No juggling tools, no chasing ghost
participants. And with the industry's
most trusted panel, plus AI powered
analysis, your team gets clarity and
confidence to build better without
slowing down. So if you're ready to
streamline your research, speed of
decisions, and design with impact, head
to dscout.com to learn more. That's
dscout.com.
The answers you need to move
confidently.
Hammel and Shrea, thank you so much for
being here and welcome to the podcast.
Thank you for having us.
>> Yeah, super excited.
>> I'm even more excited. Okay, so a couple
years ago, I had never heard the term
evals. Now it's one of the most trending
topics on my podcast essentially that to
build great AI products, you need to be
really good at building evals. Uh, also
turns out some of the fastest growing
companies in the world are basically
building and selling and creating evals
for AI labs. I just had the CF Merkore
on the podcast. So, there's something
really big happening here. Uh, I want to
use this conversation to basically help
people understand the space deeply. But
let's start with the basics. Just what
what the heck are EVALs? For folks that
have no idea what we're talking about,
give us just a quick understanding of
what an eval is. And let's start with
with Haml. Sure. Evals is a way to
systematically measure and improve an AI
application. And it really doesn't have
to be scary or unapproachable at all. It
really is at its core data analytics on
your LLM application in a systematic way
of looking at that data and where
necessary creating metrics around things
so you can measure what's happening and
then so you can iterate and do
experiments and improve. So that's a
that's a really good broad way of
thinking about it. If you go one level
deeper just to give people a very even
more concrete way of imagining and
visualizing what we're talking about
even if you have a example to show it
would be even better. What's a what's an
even deeper way of understanding what an
eval is? Let's say you have a real
estate assistant
you know application and it's it's not
working the way you want. it's not
writing emails to customers the way you
want or it's not uh you know calling the
right tools
or any number of errors and
before evals you would be left with
guessing you would maybe fix a prompt
and hope that you're not breaking
anything else with that prompt and you
might rely on vibe checks which is
totally fine and vibe checks are good
and you should do vibe checks
initially, but it can become very
unmanageable very fast because as your
application grows, it's really hard to
rely on vibe checks. You just feel lost.
And so eval help you create
metrics that you can use to measure how
your application is doing and kind of
give you a way to improve your your
application with confidence that you
have a feedback signal in which to
iterate against. So just to make it very
real. So imagining this uh real estate
agent maybe they're helping you book a
listing or go see an open house. The
idea here is you have this agent talking
to people. It's answering questions,
pointing them to things. As a builder of
that agent, how do you know if it's
giving them good advice, good answers?
Is it telling them things that are
completely wrong? So, the idea of eval
essentially is to build a set of tests
that tell you is how often are is this
agent doing something wrong that you
don't want it to do? And there's a bunch
of ways wrong you could define wrong. It
could be uh just making up stuff. It
could be uh just answering in a really
strange way. Uh the way I think about
eval and tell me if this is wrong just
simply is like unit tests for for code
and then you're smiling. You're like no
you idiot.
>> Oh that's not what I was thinking.
>> Okay. Okay. Tell me tell me how does
that feel as a metaphor?
>> So okay I like what you said first which
is we had a very broad definition. Evals
is a big spectrum of ways to measure
application quality. Now unit tests are
one way of doing this. Maybe there are
some non-negotiable functionalities that
you want your AI assistant to have and
unit tests are going to be able to check
that. Now maybe you also because these
AI assistants are doing such open-ended
tasks, you kind of also want to measure
how good are they at very vague or
ambiguous things like responding to new
types of user requests or you know
figuring out if there's new
distributions of data like new users are
coming and using your real estate agent
that you didn't even know would use your
product and then all of a sudden you
think like oh there's a different way
you want to kind of accommodate this new
group of people. So eval could also be
you know a way of looking at your data
regularly to find these new cohorts of
people. Evals could also be like metrics
that you know you just want to track
over time like you want to track people
saying yes thumbs up I liked your
message. Um you want to very very basic
things that are not necessarily AI
related but can go back into this
flywheel of improving your product. So I
would say on the end on overall right
unit tests are a very small part of that
very big puzzle.
>> Awesome. You guys actually brought an
example of inval just to show us exactly
what the hell we're talking about. We're
talking in these big ideas. So how about
let's pull one up and show people here's
here's what an eval is.
>> Yeah. Let me just set the stage for it a
little bit. So to echo what Shrea said,
it's really important that we don't
think of evals as just tests. It's a
common trap that a lot of people fall
into because they jump straight to the
test like let me write some tests and
usually that's not what you want to do.
You should start with some kind of data
analysis to ground what you should even
test. And that's a little bit different
than software engineering where you have
a lot more
expectations of how the system is going
to work. With LLMs, it's a lot more
surface area. It's very stochastic. So,
we kind of have a different flavor here.
And so, the example I'm going to show
you today, it's actually a real estate
example. It's a different kind of real
estate example. It's uh from a company
called Nurture Boss. I can share my
screen to show you their website just to
help you understand this uh use case a
little bit. So, let me share my screen.
So, this is a company that I worked
with. It's called Nurture Boss and it is
a AI assistant for property managers who
are managing apartments. And it helps
with various tasks such as inbound
leads, customer service, booking
appointments, so on and so forth, like
all the different sort of operations you
might be doing as a property manager. It
helps you with that. And so, you know,
you can see kind of what they do. It's a
very good example because it has a lot
of the complexities of a modern AI
application. So there's lots of
different channels that you can interact
through the AI with like chat,
text, voice, but also there's tool
calls, lots of tool calls for like
booking appointments, getting uh
information about availability, so on
and so forth. There's also rag
retrieval,
getting information about customers and
properties and things like that. So it's
pretty fullyfledged in terms of an AI
application
and so
they have been really generous with me
and uh allowing me to use their data as
a teaching example and so we have
anonymized it but what I'm going to walk
through today is okay let's create let's
do the first part of how we would start
to build evals for nurture boss like why
Would we even want to do that? So let's
go through the very beginning stage what
we call error analysis
which is let's look at the data of their
application
and first start with what's going wrong.
So I'm going to jump to that next and
I'm going to open an observability tool
and you can use whatever you want here.
I just happen to have this data loaded
in a tool called brain trust but you can
load it in anything you know it's not we
don't have a favorite tool or anything
in the blog post that we wrote with you
uh we
had the same example but in Phoenix
Arise um and I think Aman on your blog
post use Phoenix Arise as well and
there's also Langmith so these are kind
of like different tools that you can use
so what you see here on the screen. This
is logs from the application
and
let me just show you how it looks. So
what you see here is and let me make it
full screen. So this is one particular
interaction that a customer had with the
nurture boss application.
And what it is, it's a detailed log of
everything that happened. So it's it's a
it's called a trace and it's just an
engineering term for logs of a sequence
of events. It's been a the concept of a
trace has been around for a really long
time but it's especially really
important when it comes to AI
applications. So we have all the
different components and pieces and
information that the AI needs to do its
job and we are logged all of it and
we're looking at a view of that and so
you see here a system prompt. The
assistant prompt says you are an AI
assistant working as a leasing team
member at retreat at Acme Apartments.
Remember I said this is anonymized. So
that's why the name is Acme Apartments.
Your primary role is to respond to text
messages from both residents and
perspective uh both current residents
and prospective residents. Your goal is
to provide accurate helpful information
yada yada yada. And then there's a lot
of detail around guidelines of how we
want this thing to behave.
>> Is this their actual system prompt by
the way for this company?
>> It is. Yes. It's a real system prompt.
>> That's amazing because that's really
it's rare you see actual company
products system prompt. That's like
their crown jewels a lot of times. So
this is actually very cool on its own.
>> Yeah. Yeah. It's really cool. And you
know you see all these different sort of
features that they want to
or different use cases. So things about
tour scheduling, handling applications,
guidance on how to talk to different
personas, so on and so forth. And you
can see the user just kind of jumps in
here. It says asks, okay, do you have a
one-bedroom with study available? I saw
it on virtual tours. And then you can
see that the LM
calls some tools. It calls this get
individual's information tool and it
pulls back that person's information and
then it gets the community's
availability.
So it's, you know, it's querying a
database with the availability for that
apartment complex. And then finally, the
AI responds, hey, we we have several
one-bedroom apartments available, but
none specifically listed with a study.
Here are a few options.
Uh, and then it says, "Can you let me
know when one with a study is
available?"
And then it says, "I currently don't
have specific information on the
availability of a one-bedroom
apartment."
User says, "Thank you." And the AI says,
"You're welcome. If you have any more
questions, feel free to reach out." Now,
this is
an example of a trace, and this is we're
looking at one specific data point.
And so one thing that's really important
to do when you're doing data analysis of
your LLM application is to look at data.
Now you might wonder there's a lot of
these logs.
It's kind of messy. There's a lot of
things going on here. How in the hell
are you supposed to look at this data?
Do you want to just drown in this data?
How do you even analyze this data? So it
turns out there is a way to do it that
is completely manageable
and it's not something that we invented.
It's been around in machine learning and
data science for a really long time and
it's called error analysis. And what you
do is the first step in conquering data
like this is just to write notes. Okay?
So, you got to put your product hat on,
which is why we're talking to you
because product people have to be in the
room. Um, and they have to be involved
in sort of doing this. You know, usually
a developer is not suited to do this,
especially if it's not a coding
application.
>> And I'm just to mirror back why I think
you're saying that is because this is
the user experience of your product.
People talking to this agent is the
entire product essentially. And so it
makes sense for the product person to be
involved, super involved in this. Yeah.
So let's let's reflect on this
conversation.
Okay. A user asked about availability.
The AI said, "Oh, we don't really have
that. Have a nice day."
Now, for a product that is helping you
with
lead management,
is that good? Like, do you feel
like this is the way we want it to to
go?
>> Not ideal.
Yes. Not ideal. And I'm glad you said
that. A lot of people would say, "Oh,
it's great. Like the AI did the right
thing. It said we don't it looked it
said we didn't have available and it's
not available." But with your product
hat on, you know, that's not correct.
And so what you would do is you would
just write a quick note here. You would
say okay um you know you might pop in
here let me just and you can write a
note so every observability application
has ability to write notes and you
wouldn't try to figure out if something
is wrong in this applica you know in
this case it's kind of not doing the
right thing. Um but you just write a
quick note um should you know should
have handed off to a human
>> and as we watch this happening it's like
you mentioned this and you'll explain
more you're doing this this feels very
ma manual and unscalable but uh as you
said this is just one step of the
process and there's a system to this and
it's just the first part
>> and you don't have to do it for all of
your data you can you sample your data
and just take a look and it's surprising
amazing how much you learn when you do
this. Everyone that does this
immediately gets addicted to it and they
say, "This is the greatest thing that
you can do when you're building an AI
application." You just learn a lot.
You're like, "Hm, this is not how I want
it to to work." Okay. And so, um, that's
just an example. So, you write this note
and then we can go on to the next trace.
So, this is the next trace. I just
pushed a hot key on my keyboard. Let me
go back to uh looking at it.
>> And these tools make it easy to go
through a bunch and add these notes
quickly.
>> Yes. And so this is another one. Similar
system prompt. We don't need to go
through all of it. Again, we'll just
jump right into the user question. Okay.
I've been texting you all day. Maybe
it's is funny. Um
um
and
uh the user says please okay yeah this
one is you this one is just like an
error in the application where you know
um this is a text message application
and so
you know it's a tech the sorry the
channel through which the customer is
communicating is through text message
and it's just getting like really
garbled And you can see here that it
kind of doesn't make sense,
you know, like the words are being cut
off like in the meantime
and then the system doesn't know how to
respond because you know how people text
message, they like write short phrases,
they you know split split their sentence
across four or five different turns. So
in this case
>> you do with something like that.
>> Yeah. So this is a this is a different
kind of error.
>> This is more of hey we're not handling
this interaction correctly. This is more
of a technical problem.
um rather than hey the AI is not doing
exactly what we want. So we would write
down too like it's amazing you're
catching that too here otherwise you'd
have no idea this was happening.
>> Yeah you might not know this is
happening right and so you would just
say okay um you would write a note like
oh
conversation flow
is janky
because of text message and I like yeah
I like that I like that you're using the
word janky. shows you just how informal
this can be at this stage. Yeah, it's
supposed to be chill like just don't
overthink it. And there's some there's a
way to do this. So
the question always comes up, how do you
do this? Do you look at do you try to
find all the different problems in this
trace? What what do you write a note
about? And the answer is just write down
the first thing that you see that's
wrong, the most upstream error. Don't
worry about all the errors. just capture
the most the first thing that you see
that's wrong and stop and move on.
And you can get really good at this. The
first two or three can be very painful,
but you know, it doesn't we can, you
know, do a bunch of them really fast.
So, here's another one. And um let's
skip the system prompt again. And the
user asks, "Hey, I'm looking for a two
to threebedroom with either one or two
bats. Do you provide virtual tours?
and a bunch of tools are called
and it says, "Hi Sarah, currently we
have three bedroomedroom, two and a half
bathroom apartment available for $2,175.
Um, unfortunately we don't have any
two-bedroom options at the moment. We do
offer virtual tool tours. You can
schedule a tour blah blah blah. It just
so happens that there is no virtual
tour,
>> right? So um you know it is
hallucinating something that doesn't
exist and you would you kind of have to
bring your context as an engineer or
even your product content and say hey
this is kind of weird like you know we
shouldn't be telling person about
virtual tour when it's not offered. So
you would say okay uh you know offered
virtual tour
and you just you know you just write the
note.
So you can see there's a diversity of
different kinds of errors that we're
seeing and we're actually learning a lot
about your application
um in a very short amount of time.
>> One common question that we get from
people at this stage is okay I
understand what's going on. Can I ask an
LLM to do this process for me?
>> Great question. And I loved Hammel's
most recent example because what we
usually find when we try to ask an LLM
to do this error analysis is it just
says the trace looks good because it
doesn't have the context needed to
understand whether something might be
you know bad product smell or you know
not for example the hallucination about
scheduling the tour right I can
guarantee you I would bet money on this
if I put that into chat GBT and asked is
there an error it would say no did a
great
But Hamill had the context of knowing,
oh, we don't actually have this virtual
tour functionality, right? So, I think
in these cases, it's so important to
make sure you are manually doing this
yourself. Um, and we'll talk a we can
talk a little bit more about when to use
LLMs in the process later, but like
number one pitfall right here is people
are like, let me automate this with an
LLM.
>> Do you think they'll we'll get to a
place where where an agent can do this?
>> Oh, no, no, no. Sorry. There are parts
of error analysis that an LLM is suited
for which we can talk about later in
this podcast
>> but right now in this stage of free form
note takingaking is not the
>> place for an LLM.
>> And this is something you call open
coding this.
>> Yes, absolutely.
>> Uh another uh term that you used in your
post that I love and that's fits into
this step is this idea of a benevolent
dictator. Maybe just talk about what
that is and maybe Sha cover that.
>> Yeah. So Hamill actually came up with
this term.
>> Okay, maybe Ham will cover the answer.
>> No problem. And we'll actually show the
LM automation in this example because
we're going to take this example. We're
going to go all the way through.
>> Amazing.
>> And so and so um benevolent dictator is
just a catchy term for the fact that
when you're doing this open coding, a
lot of teams get bogged down in having a
committee do this. And for a lot of
situations that's wholly unnecessary
like
you know people get really uncomfortable
with okay you know we want everybody on
board we want everybody involved so on
and so forth. You need to cut through
the noise. Um in a lot of organizations
if you look really deeply especially
small mediumsiz companies there's really
like you can appoint one person whose
taste that you trust. Um, and you can
you can do this with a small number of
people and often one person. And that's
it's really important to make this
tractable. You don't want to make this
process so expensive that you can't do
it. You're going to lose out. So that's
the idea behind benevolent dictator is,
hey, you need to simplify this
across as many dimensions as you can.
Another thing that we'll talk about
later is when you goes to building an
LLM as a judge, you need a binary score.
You don't want to think about is this
like a one, two, three, four, five, like
assign a score to it. You can't. That's
going to slow it down. Just to make sure
this benevolent dictator point is is
really clear. Basically, this is the
person that does this note-taking and
ideally they're the expert on the stuff.
So, if it's law stuff, maybe there's
like a legal person that owns this. It
could be a product manager. Give us
advice on who this person should be.
>> Yeah, it should be the person with
domain expertise. So in this case you
know it would be the person who
understands the business of leasing
apartment leasing and has context to
understand if this makes sense. It's
it's always the domain expert like you
said okay for legal it would be a law
person for mental health it would be the
mental health expert whether that's like
a psychiatrist or you know someone else.
>> Cool.
>> Um oftentimes it is the product manager.
>> Cool. So the advice here, pick that
person. May not feel so super fair that
they're the one in charge and they're
the dictator, but they're benevolent.
It's going to go be okay.
>> Yeah, it's going to be okay. You're just
trying to It's not perfection. You're
just trying to make progress and
in get signal quickly so you have an
idea of what to work on because it can
become infinitely expensive if you're
not careful.
>> Yeah. Okay, cool. Let's go back to your
examples. Yeah, no problem. So this is
another example where we have
someone saying, "Okay, do you have any
specials?"
And the assistant or the AI responds,
"Hey, we have a 5% military discount."
User responds, "Can you," and it
switches a subject, can you tell me how
many floors there are? Do you have any
onebedrooms available or one bedrooms on
the first floor? And the AI responds,
"Yeah, okay. We have several one-bedroom
apartments available." And then the user
wants to confirm any of those on the
first floor. And how much are the
onebedrooms? And then also is is a
current resident. So it's they're also
asking, I need a maintenance request.
This is actually pretty like you could
see the messiness of the real world in
here. And the assistant just calls a
tool that says transfer call,
>> but it doesn't say anything. It just
abruptly does transfer call.
>> So it's pretty jank I would say like
it's just not you know another jank
>> another kind of jank a different kind of
jank. So you don't want to when you
write the open note you don't want to
say jank because what we want to do is
we want to understand what and when we
look at the notes later on we want to
understand like what happened. So you
just want to say um you know did not
confirm
call transfer
with uh with user.
It doesn't have to be perfect. You just
have to have a general idea of what's
going on.
>> Cool.
>> So okay. So let's say we do and we Treya
and I we recommend doing at least a
hundred of these. The question is always
like how many of this do you do? And so
there's not a magic number. we say 100
is because we know that as soon as you
start doing this once you do 20 of these
you will automatically find it so useful
that you will continue doing it. So we
just say 100 to mentally unblock you so
it's not intimidating like don't worry
you're only going to do 100
and there is a a term for that of so so
the right answer is keep looking at
traces until you feel like you're not
learning anything new
should talk about
>> yeah so there's actually a term in data
analysis and quant qualitative analysis
called theoretical saturation
So what this means is when you do all of
these processes of looking at your data
when do you stop? It's when you are
saturating or you're not uncovering any
new types of notes, new types of
concepts or nothing that will like
materially change the next part of your
process. Um, and this kind of takes a
little bit of intuition to develop. So
typically people don't really know when
they've reached theoretical saturation
yet. That's totally fine. When you do
two or three examples or rounds of this,
like you will develop the intuition. A
lot of people realize like, oh, okay,
like I only need to do 40. I only need
to do 60. Actually, I only need to do
like 15. I don't know. Like depends on
the application and develops like how
depends on how savvy you are with error
analysis. For sure.
>> And your point about you probably want
to you're going to want to do a bunch. I
imagine it's because you're just like,
oh, I'm discovering all these problems.
I got to see what else is going on here.
>> Exactly. And I promise at some point
you're like not going to discover new
types of problems.
>> Yeah. Awesome. So let's say you did a
100 of these. What's the next step?
>> Yeah. Okay. So you did 100 of these. Now
you have all these notes. So this is
where you can start using AI to help
you. Um you So the part where you looked
at this data is important. Like we
discussed, you don't want to automate
this part too much. Humans will still
have jobs. This is a takeaway here.
That's great.
>> Yes. Just reviewing traces. At least
there's one job left for now.
>> Yeah. So, yeah, exactly. Um, and so,
okay, you have all these notes.
Now, to turn this into something useful,
you can do basic counting. So, basic
counting is the most powerful analytical
technique in data science uh because
it's so simple and it's kind of
undervalued
um in many cases. And so, it's very
approachable for people. And so the
first thing you want you want to do is
take these notes and you can categorize
them with an LLM. And so there's a lot
of different ways to do that. Right
before this podcast, I took three
different
uh coding agents or you know uh AI tools
and had it categorize these notes. So
one is okay, I uploaded into a cloud
project. I uploaded a CSV of these notes
and I just exported them directly from
this interface. Um, there's a lot of
different ways to do this, but I'm I'm
showing you the simple stupid way, the
most basic way of doing things.
And so I dumped the CSV in here and I
said, "Please analyze the following CSV
file. There's and I told it there's a
metadata field that has a note in it."
But what I said is I used the word open
codes. I said, "Hey, I have different
open codes
and that's a term of art. That's um LMS
know what open codes are and they know
what axial codes are because it is a it
is a concept that's been around for a
really long time. So those words help me
shortcut like what I'm trying to do."
>> That's awesome. And the end of the end
of the prompt is telling it to create
axial codes.
>> Yes, creating a codes. So what it does
is
>> so maybe it's worth talking about what
are axial codes or like what's the point
here right you have a mess of open codes
right and you don't have 100 distinct
problems actually mo many of them are
repeats but because you phrased them
differently right and in that you
shouldn't have tried to create your
taxonomy of failures as you're open
coding you just want to get down what's
wrong and then organize okay what's the
most common failure mode so the purpose
axial code basically is just a failure
mode. It's like the label or category.
And what our goal is is to get to this
clusters of failure modes and figure out
what is the most prevalent. So then you
can go and run and attack that problem.
>> That is really helpful. Basically,
you're just synthesizing all these
categories
and themes.
>> Super cool. And we'll uh include this
prompt in our show notes for folks so
they don't have to like sit there and
screenshot it and try to type it out
themselves.
>> Yeah, great idea.
Um, and so Claude, you know, went ahead
and analyzed the CSV file, decided how
to parse it, blah, blah, blah. We don't
need to worry about all that stuff. But
it came up with a bunch of axial codes.
Basically, axial codes are categories
like Shrea said. So one is okay
capability limitations,
misrepresentation,
pro processing, protocol violations,
human handoff issues, communication
quality.
It created these categories. Now, do I
like all the categories? Not really. I
like some of them. It's a good first
like stab at it. I would probably rename
it a little bit because some of them are
a bit too generic. Like what is
capability limitations? That's a little
bit too broad. That's not actionable. I
want to get like a little bit more
actionable with it so that if I do
decide it's a problem, I know what to do
with it. But we'll discuss that in a
little bit. Um, so you can do this like
with anything. And this is the dumbest
way to do it, but dumb sometimes is a
good way to get started. So,
>> and and this is what LM are really good
at, taking a bunch of information and
synthesizing.
>> Absolutely. Synthesizing for us to make
sense of, right? Note that, you know,
it's not telling us, it's not
automatically proposing fixes or
anything. That's our job.
>> But, you know, now we can wade through
this mess of open codes a lot easier.
Another thing that's interesting here in
this prompt to generate the axial codes
is you can be very detailed if you want,
right? You can say I want each axial
code to actually be you know some
actionable failure mode and maybe the
LLM will understand that and propose it
or I want you to group these open codes
by you know what stage of the user story
that it's in. So this is where you can
you know be creative or do what's best
for you as a product manager or engineer
working on this and that will help you
do the improvement later.
>> Okay. So there's no definitive prompt of
here's the one way to do it. You're
saying there's you can iterate, see what
works for you.
>> Absolutely.
>> It's interesting the tools don't want to
do this or or do they try and they just
don't do a great job?
>> No, I don't think they do it. We've been
screaming from the rooftops, please,
please do this. I do think it's a little
bit hard, right? Like part of this whole
um experience with the EVOS course
Hamill and I are teaching are like a lot
of people don't actually know this.
>> So maybe it's that people don't know
this and they don't know how to build
tools for it. Um, and hopefully we can
demystify some of this magic.
>> And just to double click on this point,
like this is not a thing everyone does
or knows. This is something you two
developed based on your experience doing
data analysis and data science at at
other companies.
>> Well, I want to caveat. We didn't invent
error analysis. We don't actually want
to invent things. That's a bad that's
bad signal. If somebody is coming to you
with a way to do something that's like
entirely new and not grounded in
hundreds of years of theory and
literature then you should I don't know
be a little bit wary of that. But what
we tried to do was distill okay what are
the new tools and techniques that you
need to know make sense of the LLM error
analysis and then we created a
curriculum or structured way of doing
this. So this is all very tailored to
LLMs, but you know the terms open
coding, axial coding are grounded in um
social science.
>> Amazing. Okay. Like what's funny about
you guys doing this is I just want to go
do this somewhere. I don't have I don't
have an AI product to do this on, but
it's just like oh this would be so fun.
Just sit there and find all the problems
I'm running into and categorize them and
then try to fix them. Delightful.
>> I love that.
>> Haml pulled up a video. What do you got
going on here?
>> Yeah. So I pulled up a video just to
drive home Shrea's point like we are not
inventing anything. So what you see on
the screen here is Andrew Ang one of the
famous machine learning uh researchers
in the world who have taught a lot of
people frankly machine learning and um
you can see this is a 8-year-old video.
So and he's talking about error analysis
and so this is a technique that's been
used to analyze stochastic systems for
ages. Um, and it's some it's something
that you're just using the same machine
learning ideas and principles is
bringing them in into here because
again, these are stochcastic systems.
>> Awesome. Well, one thing we're working
on getting Andrew in the podcast. We're
chatting. So, that'll be really fun. Uh,
two, I love that my other my podcast
episode just came out today is in your
feed there, and it's standing out really
well in that feed. So, I'm really happy
about that thumbnail.
>> Very nice. Yeah, the recommendation
algorithm is
>> Yes. Here we go. I hope you click on
that. Don't don't screw my algorithm.
Okay, cool. So, we've done some
synthesis. What's I know we're not going
to go through the entire step. This is
like you have a whole course that takes
many days to learn this whole process.
What else do you want to share about how
to go about this process?
>> Okay, so you can you can do this through
anything and you know I've used the same
thing works just fine in chat GPT. The
same exact prompt. You can see it it
made axial codes. I really like using
Julius AI. Um it's one of my favorite
tools. Julius is a is a kind of his
third party tool but uses notebooks. I
personally like Jupyter notebooks a lot
and so it's more of a data science thing
but a lot of product managers are kind
of learning notebooks nowadays and it's
kind of cool it's like a fun playground
where you can like write code and look
at data but we don't have to go deeply
into that just wanted to mention you can
use a lot you know AI is really good at
this so let's go to the fun part here we
go so now we have all the a we have
these axial codes so the first thing I
like to do I have these open codes right
and I have the axial codes that let's
say
you know the like that we assigned from
the cloud project or the chat GPT and so
what I do is I collect them first and I
take a look like does these axial codes
make sense and I look at the
correspondence between the different
axial codes and the open codes and I and
I go through an exercise and I say hm do
I like these these codes like can I make
them better? Can I refine them? Can I
make them more specific? Um you know
instead of like being generic I make
them very specific in actionable. So you
see the ones that I came up with here
are tour scheduling rescheduling issues
human handoff or transfer issue
formatting error with an output
conversational flow. We saw the
conversational flow issue with the text
messages.
uh making follow-up promises not kept
and and so basically what I can do what
you can do now is like you have these
axial codes
and um so I just collect them into a
list. So this is an Excel formula just
collect these codes into a list. So now
we have a commaepparated list of these
codes. And then what you can simply do
is you could take your notes that you
have those open codes and you can tell
an AI and this is using Gemini and AI
just for simplicity. This is like the
you know again we're trying to keep it
simple categorize
uh the following note into one of the
following categories. That's way this
for folks watching there's like I like
all these different prompts and formulas
you're showing. This is like the Google
Sheets AI AI prompt.
>> Yeah.
>> And so basically what you can do is you
can then have you can categorize your
faces into one of the buckets
and that's what we have here. We have
categorized all those problems that we
encountered into one of these things.
>> And this is automatic which is very
exciting. I mean the AI is doing it. So
this also drives home the point that
your open codes have to be detailed,
right? You can't just say janky because
if the AI is reading janky, it's not
going to be able to categorize it. Even
a human wouldn't, right? It would have
to go and remember why you said janky.
>> So it's important to be, you know,
somewhat detailed in your open code.
>> Okay. So avoid the word janky is a good
rule of thumb
>> or other words.
>> Okay.
>> I was being funny.
>> Yeah. Okay. What are some of those other
words just to that come that people
often use that you think are not good?
>> I don't think it's specific words. I
think it's just people are not detailed
enough in the open code. So it's hard to
do the categorization.
>> Great. And by the way, the reason you
have to map them back is because say
clott JPT gave you suggestions and you
changed them and iterated on them. So it
doesn't you can't just go back and say
cool what are in each bucket.
>> Yeah. Yeah.
>> Great. That's a really good question
actually. It's good to iterate and think
on about it a little bit like do I like
these open codes? Do these actually make
sense to me? Just like anything that AI
does, it's really good to kind of put
yourself in the middle
>> just in the loop. Still space.
>> Yes. Great.
>> Yeah.
>> One of the things that I like to do in
this step if I'm trying to use AI to do
this labeling is also have a new
category called none of the above. So an
AI can actually say none of the above in
the axial code and that informs me,
okay, my axial codes are not complete.
Like let's go look at those open codes.
Let's figure out what some new
categories are or figure out how to
reword my other axial codes.
>> Awesome. And what's cool about this is
you don't need to do this many many
times. Like no,
>> for most products, you do this process
once and then you build on it, I
imagine, and you just tweak it over
time.
>> Absolutely. And it gets so fast. Like
people people do this like once a week
and you can do all of this in like 30
minutes and like suddenly your product
is like so much better than if you were
never aware of any of these problems.
>> Yeah. It's absurd to feel like you don't
you wouldn't know this is happening like
watching this happening. I'm like how
could you not do this to your product?
>> A lot of people have no idea.
>> Most people Yeah.
>> Yeah. We we'll talk about that. There's
a whole debate around this stuff that we
want to talk about. Uh okay cool. So you
have this you have the sheet. What comes
next?
>> Okay. So here's the big unveil.
>> This is the magic moment right now.
>> So we have all these codes we that you
know we applied the ones that we like on
our traces. Now you can do the tada. You
can count them. So here's a pivot table
and we just can do pivot table on those
and we can count how many times those
different things occurred. So what do we
find? found on this on these like traces
that we categorized, we found 17
conversational flow issues. And I really
like pivot tables because you can do
cool things. You can like double click
on these. You can say, "Oh, okay. Let me
let me take a look at those." But that's
going into an aside about pivot tables,
how cool they are. But um um you know
now we have just a nice rough cut of
what are our problems and now we have
gone from chaos to some kind of thinking
around oh you know what these are my
biggest problems I need to fix
conversational issues you know maybe
these human handoff issues it's not
necessarily the count is the most
important thing you know that might be
something that's just really bad and you
want to fix that. But okay, now you have
some way of looking at your problem and
now you can think about whether you need
evals
uh for for some of these. So you know
with the
you know there might be some of these
things that
might be just dumb engineering errors
that you don't need to write an eval for
because it's very obvious on how to fix
them.
um maybe the formatting error with
output. Maybe you just forgot to tell
the LLM how you want it to be formatted
and like you didn't even say that in the
prompt. So like just go ahead and fix
the prompt maybe, you know, and we can
decide like okay, do you want an uh do
you want to write an email for that? You
might be you might still want to write
an email for that because you might be
able to test that with just code. You
could just test the string. Does it have
the right formatting potentially without
running an LLM? So, there's a
costbenefit trade-off to eval. You don't
want to get carried away with it. Um,
but you want to start, you want to
usually ground yourself in your actual
errors. You don't want to skip this
step.
And so, the reason I'm kind of spending
so much time on this is like this is
where people get lost. they go straight
into eval like let me let me just write
some tests and that is where things go
off the rails. Um so let's let's okay so
let's say we want to tackle one of these
things.
So for example
uh let's say we want to
tackle this human handoff issue and
we're like hm I'm not really sure how to
fix this. like that's a kind of
subjective sort of judgment call on, you
know, should we be handing off to a
human and I don't know immediately how
to fix it. It's not super obvious, per
se. Yeah, I can like change my prompt,
but I'm not like sure. I'm not 100%
sure. Well, that might be sort of an
interesting um thing for an LLM as a
judge, for example. So, there's
different kinds of evals. One is
codebased
which you should try to do if you can
because they're cheaper. You don't have
to, you know, LM as a judge is something
it's like a meta eval. You have to eval
that eval to make sure the LM that's
judging is doing the right thing, which
we'll talk about in a second.
So, okay, LM as a judge, that's one
thing. Okay, how do you build an LM as a
judge? Before we get into that actually
just to make sure people know exactly
what you're describing there these two
types of evals. One is you said it's
code based one is LLM as judge. Maybe
Shrea just help us understand what that
what a codebased eval even is. It's just
like it's like essentially a unit test.
Is that a simple way to think about it?
>> Maybe eval is not the right term here
but think like automated evaluator. So
when we find these failure modes, one of
the things we want is like, okay, can we
now like go check the prevalence of that
failure mode in an automated way without
me manually labeling and doing all the
coding and the grouping and I want to
run it on thousands and thousands of
traces. I want to run it every week.
That is okay. You should probably build
an an automated evaluator to check for
that failure mode. Now when we're saying
codebased versus llm based, we're saying
okay so maybe I could write like a
python function or a piece of code to
check whether that failure mode is
present in a trace or not. And that's
possible to do for certain things like
you know checking the output is JSON um
or you know checking that it's markdown
or checking that it's short like these
are all things you can capture in code
or you can approximately capture in
code. uh when we're talking about LLM
judge here, we're saying that this is a
complex failure mode and we don't know
how to evaluate in an automated way. So
maybe we will try to use an LLM to
evaluate this very very narrow specific
failure mode of handoffs.
>> So just to try to mirror back how you're
describing, you want to test what your
say agent or AI product is doing. You
ask it a question, it gets back with
something. One way to test if it's
giving you the right answer is if it's
consistently doing the same thing that
you could write a code to te to tell you
this is true or false. For example, will
it ever say there's a virtual tour? So
you could ask it
>> is do you provide virtual tours?
>> It says yes or no and then you could
write code to tell you if it's correct
based on that specific answer. But if
you're asking about something more
complicated and it's not binary, you
almost need like in a in a one world you
need a human to tell you this is
correct. The solution to avoid humans
having to review all this every time
automatically is LLM's replacing human
judgment and you call it LLM as judge.
The LM is being the judge if this is
correct or not.
>> Absolutely. You nailed it. Um, so people always think, oh, this is at least as hard as my problem of creating the original agent, and it's not, because you're asking the judge to do one thing: evaluate one failure mode. So the scope of the problem is very small, and the output of this LLM judge is just pass or fail. It is a very, very tightly scoped thing that LLM judges are very capable of doing very reliably.

>> And the goal here is just to have a suite of tests that run before you ship to production that tell you things are going the way you want them to, the way your agent is interacting. The beautiful thing about an LLM judge is you can use it in unit tests or CI, sure, but you can also use it online for monitoring. I can sample, say, a thousand traces every day, run my LLM judge on real production traces, and see what the failure rate is there. That's not a unit test, but now we get an extremely specific measure of application quality.
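As a rough sketch of that online-monitoring idea (the `production_traces` list and the `run_llm_judge` callable here are placeholders you would supply, not anything from the episode):

```python
import random

def monitor_failure_rate(production_traces, run_llm_judge, sample_size=1000, seed=0):
    """Sample recent production traces, run the binary judge on each,
    and report the failure rate for this one failure mode."""
    random.seed(seed)
    sample = random.sample(production_traces, min(sample_size, len(production_traces)))
    failures = sum(1 for trace in sample if run_llm_judge(trace))  # judge returns True on failure
    return failures / len(sample)
```

Running something like this on a daily schedule is what turns the judge from a pre-ship test into an ongoing quality metric.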
>> Cool, that's a really great point, because a lot of people diss evals for being this not-real-life thing, something you test before it's actually in the real world, versus what's actually happening in the real world. You're saying you could, and should, do exactly that: test your real thing running in production, and it's a daily or hourly sort of thing you could be running.

>> Totally.

>> Awesome. Okay. Uh, Hamel's got an example of an actual LLM-as-judge eval here. So let's take a look.
>> I love how Shreya really teed it up for me, so thank you so much. So what we have is an LLM-as-a-judge prompt for this one specific failure. Like Shreya said, you want to do one specific failure, and you want to make it binary, because we want to simplify things. We don't want "hey, score this on a rating of one to five, how good is it?" In most cases that's a weasel way of not making a decision. No, you need to make a decision: is this good enough or not, yes or no? It can be painful to think about what that is, but you should absolutely do it. Otherwise, this thing becomes very intractable, and then when you report these metrics, no one knows what 3.2 versus 3.7 means.
>> Yeah, we see this all the time, even with expert-curated content on the internet, where it's like, oh, here's your LLM judge evaluator prompt, here's a one-to-seven scale. And I always text Hamel like, "Oh no, now we have to fight the misinformation again, because we know somebody's going to try it out and then come back to us and say, oh, I have a 4.2 average," and we're going to be like, "Okay..."

>> It's wild how much drama there is in the evals space. We're going to get to
that." Oh man. This episode is brought
to you by Mercury. I've been banking
with Mercury for years and honestly I
can't imagine banking any other way at
this point. I switched from Chase and
holy moly what a difference. Sending
wires, tracking spend, giving people on
my team access to move money around so
freaking easy. Where most traditional
banking websites and apps are clunky and
hard to use, Mercury is meticulously
designed to be an intuitive and simple
experience. And Mercury brings all the
ways that you use money into a single
product, including credit cards,
invoicing, bill pay, reimbursements for
your teammates, and capital. Whether
you're a funded tech startup looking for
ways to pay contractors and earn yield
on your idle cash, or an agency that
needs to invoice customers and keep them
current, or an e-commerce brand that
needs to stay on top of cash flow and
excess capital, Mercury can be tailored
to help your business perform at its
highest level. See what over 200,000
entrepreneurs love about Mercury. Visit
mercury.com to apply online in 10
minutes. Mercury is a fintech, not a
bank. Banking services provided through
Mercury's FDIC insured partner banks.
For more details, check out the show
notes.

Okay, so this is your judge prompt. There's no one way to do it. It's okay to use an LLM to help you create it, but again, put yourself in the loop; don't just blindly accept what the LLM does. In all of these cases, that's what we did. Like with the axial codes, we iterated on this. You can use an LLM to help you create this prompt, but make sure you read it, make sure you edit it. This is not necessarily the perfect prompt; this is keeping it very simple just to show you the idea. For this handoff failure, I said, okay, I want you to output true or false. It's a binary judge; that's what we recommend. And then I just go through and say, okay, when should you be doing a handoff? And I list them out: explicit human request ignored or looped, some policy-mandated transfer, sensitive resident issues, tool/data unavailability, same-day walk-in or tour requests (you need to talk to a human for that), so on and so forth. And so the idea is, now that I know from my data that this is a failure, I'm interested in iterating on it, because I know it's actually happening all the time. And like Shreya said, it would be nice to have a way to evaluate this not only on the data I have, but also on production data, just to get a sense of what scale it's happening at. Let me find more traces; let me have a way to iterate on this. And so we can take this prompt, and I'm going to use a spreadsheet again.
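For readers who want to see the shape of such a judge, here is a minimal sketch of a binary handoff judge in Python. It is not the actual prompt shown on screen; the criteria are paraphrased from what Hamel lists, and `call_llm` is a placeholder for however you call your model:

```python
HANDOFF_JUDGE_PROMPT = """\
You are evaluating one conversation trace from a property-management AI assistant.

Decide whether the assistant FAILED to hand off to a human when it should have.
A handoff is required when, for example:
- the user explicitly asks for a human, or the conversation loops without progress
- policy mandates a transfer
- the issue is a sensitive resident matter
- the tools/data needed to answer are unavailable
- the user requests a same-day walk-in or tour

Respond with exactly one word: "true" if a handoff failure occurred, "false" otherwise.

Trace:
{trace}
"""

def judge_handoff(trace: str, call_llm) -> bool:
    """Binary LLM-as-judge for one narrow failure mode (handoff errors).
    `call_llm` is whatever function you use to send a prompt to your model."""
    reply = call_llm(HANDOFF_JUDGE_PROMPT.format(trace=trace))
    return reply.strip().lower().startswith("true")
```

Note the single, narrow failure mode and the binary output; that is what keeps the judge reliable and the resulting metric interpretable.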
So, the first step: I'm doing this judge, I wrote the prompt. Now, a lot of people stop there. They say, okay, I have my judge prompt, we're done, let's just ship it, and if the judge says it's wrong, it's wrong. They just accept it as gospel: the LLM says it's wrong, so it must be wrong. Don't do that, because that's the fastest way to have evals that don't match what's going on. And when people lose trust in your evals, they'll lose trust in you. So it's really important that you don't do that. Before you release your LLM as a judge, you want to make sure it's aligned to the human. How do you do that? You have those axial codes, and you want to measure your judge against them and ask, does it agree with me? Just measure it. And so what we have here is, okay, I say assess this LLM trace (again, I'm using just spreadsheets here) according to these rules, and the rules are just the prompt that I showed you. And I ask it, okay, is there a handoff error, true or false?
So then, let me just zoom in a bit. In column H I have whether the error occurred according to the judge, and column G is whether I thought the error occurred or not.

>> You're going through it manually. You do that.

>> Yeah. And we already did; we already went through it manually, so it's not like we have to do it again, because we have that cheat code from the axial coding. We already did it. You might have to go through it again if you need more data, and there's a lot of detail on how to do this correctly; you want to split your data and do all these things so that you're not cheating. But I just want to show you the concept.
And basically, what you can do is measure the agreement. Now, one thing you should know as a product manager: a lot of people go straight to this agreement number. They say, "Okay, my judge agrees with the human some percentage of the time." That sounds appealing, but it's a very dangerous metric, because a lot of the time errors only happen in the long tail; they don't happen that frequently. So if the error only occurs 10% of the time, you can easily get 90% agreement by having a judge that just says "pass" all the time. Does that make sense? So 90% agreement might look good on paper, but it can be misleading.

>> Because the error is rare.

>> Yeah.
>> So, as a product manager, even if you're not doing this calculation yourself, if someone ever reports agreement to you, you should immediately ask, okay, tell me more; you need to look into it. To give you more intuition, here is a matrix for this specific judge in the Google Sheet. This is again a pivot table, keeping it dumb and simple: on the rows I have what the human thought (did it have an error, true or false), and on the columns whether my judge said it had an error, true or false.
>> The intuition here is exactly what Hamel said: you need to look at each type of error. So when the human said false but the judge said true, or vice versa, those are the non-green, off-diagonal cells here. If they're too large, go iterate on your prompt; make it clearer to the LLM judge so that you can reduce that misalignment. You're going to have some misalignment, and that's okay; we also talk in our course about how to correct for that misalignment. But at this stage, if you're a product manager and the person building the LLM judge eval has not done this, if they're saying "oh, it agrees 75% of the time, we're good" but they don't have this matrix and they haven't iterated to drive these two types of errors down toward zero, then it's a bad smell. Go and ask them to fix that.
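As a sketch of that check (assuming you have parallel lists of human labels from your axial coding and judge labels, ideally on a held-out set of traces the judge prompt was not tuned on):

```python
from collections import Counter

def judge_alignment(human_labels, judge_labels):
    """Compare human labels (ground truth) with judge labels.
    Both are lists of booleans: True = the failure mode is present."""
    cells = Counter(zip(human_labels, judge_labels))
    tp = cells[(True, True)]    # both say error
    tn = cells[(False, False)]  # both say no error
    fn = cells[(True, False)]   # human saw an error, judge missed it
    fp = cells[(False, True)]   # judge flagged an error the human didn't see
    total = tp + tn + fn + fp
    return {
        "raw_agreement": (tp + tn) / total,
        # Per-class rates are what matter when the error is rare:
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
    }
```

If only 10% of traces truly contain the error, a judge that always answers "no error" scores 90% raw agreement but a 0% true positive rate, which is exactly the trap described above.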
>> Awesome. That's a really good tip: what to look for when someone's doing this wrong.

>> Yeah.

>> Actually, could you take us back to the LLM-as-judge prompt? I just want to highlight something really interesting here. I've had some guests on the podcast recently who've been saying evals are the new PRDs, and if you look at this, that's exactly what this is. Product managers, product teams, write: here's what the product should be, here's all the requirements, here's how it should work. They build the thing, and then they test it manually, often. What's cool about this is that it's exactly that same thing, and it's running constantly. It's telling you how this agent should respond in very specific ways: if it's this, do that; if it's that, do this. It's exactly what I've been hearing again and again. You can see it right here. This is the purest sense of what a product requirements document should be: this eval judge that tells you exactly what the product should do, and it's automatic and running constantly.
>> Yeah, absolutely. And it's derived from our own data, so of course it reflects a product manager's expectations. What I find a lot of people miss is that they just put in their expectations before looking at their data. But as we look at our data, we uncover more expectations that we couldn't have dreamed up in the first place, and that ends up going into this prompt.

>> That is interesting. So your advice is not to skip straight to evals and LLM-as-judge prompts before you build the product. Still write traditional one-pagers and PRDs to tell your team what we're doing, why we're doing it, and what success looks like, but then at the end you can probably pull from this, and even improve that original PRD as you evolve the product using this process.
>> I would go even further: you're going to improve it. It's going to change. You're never going to know what the failure modes are up front, and you're always going to uncover new, you know, vibes that you think your product should have. You don't really know what you want until you see it with these LLMs. So you've got to be flexible and look at your data. PRDs are a great abstraction for thinking about this, but they're not the end-all-be-all. It's going to change.
>> I love that. And Hamel's pulling up some cool research report. What's this about?

>> Oh, this is one of the coolest research reports you can possibly read if you want to know about evals. So, it was authored by someone named Shreya Shankar.

>> Oh my god.

>> And her collaborators. It's called "Who Validates the Validators?"

>> That is the best name for a research paper I've ever heard. That's so good.

>> Thank you.

>> So I should let Shreya talk about this. I think one of the most important things to pay attention to in this paper is criteria drift.
>> Yeah.
>> And what she found.
>> So we did this super fun study when we were doing user studies with people who were trying to write LLM judges, or just validate their own LLM outputs. This was before evals were extremely popular on the internet; we started the project in late 2023. The thing that was really burning in my mind as a researcher was, why is this problem so hard? We've had machine learning and AI for so long; it's not new. But suddenly, this time around, everything is really difficult. So we did this user study with a bunch of developers, and we realized what's new here is that you can't figure out your rubrics up front. People's opinions of good and bad change as they review more outputs. They think of failure modes only after seeing ten outputs they would never have dreamed of in the first place. And these are experts, right? These are people who have built many LLM pipelines, and now agents, before, and you still can't ever dream up everything in the first place. I think that's so key in today's world of AI development.
>> Okay, that is a really good point. It's very much reinforcing what we were just talking about, and that's why Hamel pulled this up: there's research behind it. Okay, great. You still have to do product the same way, but now you have this really powerful tool that helps you make sure what you've built is correct. It's not going to replace the PRD process.
>> Cool.
>> How many of these do you end up with? How many LLM-as-judge prompts, usually? I know it obviously depends on the complexity of the product, but what's a typical number in your experience?

>> For me, between four and seven.

>> Oh, that's it?

>> It's not that many, because a lot of the failure modes, as Hamel said earlier, can be fixed by just fixing your prompt; you just didn't think to put it in your prompt, and now you have. You shouldn't build an eval like this for everything, just the pesky failure modes where you've described your ideal behavior in your agent prompt but it's still failing.
>> Got it. So, say you found a problem and you fixed it. In traditional software development, you'd write a unit test to make sure it doesn't happen again. Is your insight here: don't even bother writing an eval around that if it's just gone?

>> I think you can if you want to, but the whole game here is about prioritizing. You have finite resources and finite time. You can't write an eval for everything. So prioritize the ones that are the most pesky,

>> and probably the ones that are most risky to your business, if they say something like MechaHitler, Grok-style, and
>> Cool, okay, that's very relieving, because this prompt is a lot of work to really think through all these details.

>> But it's a lot of one-time cost. Now, forever, you can run this on your application.

>> Right.

>> And I want to say: data analysis is super powerful and is going to drive lots of improvements very quickly to your application. We showed the most basic kind of data analysis, which is counting, which is accessible to everyone. You can get more sophisticated with the data analysis; there are lots of different ways to sample and look at data. We kind of made it look easy, in a sense, but there's a lot of skill in doing it well: building an intuition and a nose for how to sort through this data. For example, say I find these conversational flow issues. If I were trying to chase down that problem further, I would think about ways to find other conversational flow issues that I didn't code; I would dig through the data in several ways. There are different ways to go about this. It's very similar, if not almost exactly the same, as the traditional analytics techniques you would do on any product.

>> Give us just a quick sense of what comes next, and then let's talk about the debate around evals.
>> So what comes next after you've built your LLM judge? Well, we find that people try to use it everywhere they can. So they will put the LLM judge in unit tests: they know, here are some example traces where we saw that failure, because we labeled it. Now we make those part of the unit tests, and every time we push a change to our code, those tests have to pass.
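A pre-ship gate like that can be sketched in a few lines (here `labeled_failure_traces`, `run_agent`, and `judge_handoff` are placeholders for your own data and functions; this is not code from the episode):

```python
def gate_release(labeled_failure_traces, run_agent, judge_handoff, max_failure_rate=0.0):
    """Before shipping a prompt or code change, re-run the agent on conversations
    that previously showed the handoff failure and check that the judge now passes them."""
    failures = 0
    for conversation in labeled_failure_traces:
        new_trace = run_agent(conversation)
        if judge_handoff(new_trace):       # judge returns True when the failure is present
            failures += 1
    rate = failures / len(labeled_failure_traces)
    assert rate <= max_failure_rate, f"handoff failure rate {rate:.0%} exceeds the gate"
    return rate
```

Wired into CI (for example as a pytest test), this is the "tests have to pass before we push" pattern described here.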
They also use it for online monitoring. People are building dashboards on this, and I think that's incredible. The products that are doing this have a very sharp sense of how well their application is performing. And people don't talk about it, because this is their moat; they're not going to share all of these things, which makes sense. If you're an email-writing assistant and you're doing this well, you don't want somebody else to build an email-writing assistant and put you out of business. So I really want to stress the point: try to use these artifacts you're building wherever possible, online, repeatedly. Use them to drive improvements to your product.
Oftentimes Hamel and I will tell people how to do this up to this very point, and it clicks for them, and then they never come back again. So either they've, I don't know, quit their jobs and aren't doing AI development anymore, or they know what to do from here on out. I think it's the latter. But I think it's very powerful.

>> Just watching you do this really opened my eyes to what this is and how systematic the process is. I always imagined you just sit at a computer going, okay, what are the things I need to make sure work correctly? And what you're showing us here is a very simple, step-by-step process based on real things that are happening in your product: how to catch them, identify them, prioritize them, and then

>> Absolutely.

>> catch them if they happen again, and fix them.

>> Yeah, it's not magic. Anyone can do this. You're going to have to practice the skill, like any new skill, but you can do it. And I think what's very empowering now is that product managers are doing this, can do this, and can really build very, very profitable products with this skill set.
>> Okay, great segue to a debate that we kind of got pulled into that was happening on X the other day. I did not realize how much controversy and drama there is around evals; there are a lot of people with very strong opinions. Shreya, give us just a sense of the two sides of the debate around the importance and value of evals, and then give us your perspective.
>> Yeah. All right, I'll be a little bit placating and say I think everyone is on the same side. I think the misconception is that people have very rigid definitions of what an eval is. For example, they might think that evals are just unit tests, or they might think that evals are just the data analysis part, with no online monitoring, or no monitoring of product-specific metrics like the actual number of chats engaged in, or whatnot. So everyone goes in with a different mindset of what evals are. The other thing I will say is that people have been burned by evals in the past; people have done evals badly. One concrete example: they've tried to build an LLM judge, but it wasn't aligned with their expectations, they only uncovered this later on, and then they didn't trust it anymore. And then they're like, "Oh, I'm anti-evals." And I 100% empathize with that because, you know, you should be anti Likert-scale LLM judge; I absolutely agree with you, we are anti that as well. So a lot of the misconception stems from two things: people having a narrow definition of evals, and people not doing it well, getting burned, and then wanting to keep other people from making that mistake. And then, unfortunately, X or Twitter is a medium where people misinterpret what everybody is saying all the time, and you just get all these strong opinions like, "Don't do evals, it's bad, we tried it, it doesn't work, we're Claude Code" or whatever other famous product, "and we don't do evals." And there's so much nuance behind all of it, because a lot of these applications are standing on the shoulders of evals. Coding agents are a great example of that. Claude Code is standing on the shoulders of the Claude models; not the base models, the fine-tuned Claude models, which have been evaluated on many coding benchmarks. You can't argue against that.
>> And just to make clear exactly what you're talking about: one of the leads, I think maybe the head engineer of Claude Code, went on a podcast and said, oh, we don't do evals, we just look at vibes; vibes meaning they just use it and feel whether it's right or wrong.

>> And I think that kind of works, but there are two things to that. One is that they're standing on the shoulders of the evals that their colleagues are doing for coding,

>> of the Claude foundational model.
>> Absolutely. We know they report those numbers, because we see the benchmarks; we know who's doing well on those. The other thing is that they're probably actually fairly systematic about the error analysis, to some extent. I bet you they are monitoring who is using Claude, how many people are using it, how many chats are being created, how long those chats are. They're also probably monitoring within their internal team; they're dogfooding. Anytime something is off, maybe they have a queue, or they send it to the person developing Claude Code, and that person is implicitly doing some form of the error analysis that Hamel talked about. All of this is evals. There's no world in which they're just saying, "I made Claude Code, I'm never looking at anything." And unfortunately, when you don't think about that or talk about that, I think it sends the wrong message, because most of the community is beginners, people who don't know about evals and want to learn about them. Now, I don't know what the Claude Code team is doing, obviously, but I would be willing to bet money that they're doing something in the form of evals.
>> We'll also say that coding agents are fundamentally very different from other AI products, because the developer is the domain expert. So you can short-circuit a lot of things, and the developer is also using it all day long. There's a type of dogfooding and a type of domain expertise there that lets you collapse the activities. You don't need as much data; you don't need as much feedback or exploration, so your eval process should look different.

>> Because you're seeing the code it's generating, you can tell this is great, this is terrible.

>> Yeah, and so I think a lot of people have generalized from coding agents, because coding agents were the first AI products released into the wild, and I think it's a mistake to try to generalize from that.
>> The other thing is, yeah, engineers have a dogfooding personality. There are plenty of applications where people are trying to build AI in certain domains and there isn't that kind of dogfooding; doctors, for example, are not out there trying to elicit the most incorrect advice from the AI and being tolerant and receptive to that. So it's very important to keep these nuances in mind.

>> So what I'm hearing from you, Shreya, interestingly, is that if humans on the team are doing very close data analysis, error analysis, dogfooding like crazy, then essentially they are the human evals, and you're describing that as within the umbrella of evals. So you could do it that way, if you have the time and motivation, or you could set these things up to be automatic.
>> Absolutely. And it's also about the skills, right? People who work at Anthropic are very, very highly skilled; they've been trained in data analysis or software engineering or AI and whatnot.

>> And you can get there. Anyone can get there, of course, by learning the concepts, but

>> most people don't have that skill right now.

>> Dogfooding is a dangerous one, only because a lot of people will say they're dogfooding. They're like, "Yeah, we dogfooded." But are they really? A lot of people aren't dogfooding at the visceral level you would need to close that feedback loop. So that's the only caveat I would add. There's also this argument, which kind of feels like a straw man, of evals versus A/B tests.
>> Talk about your thoughts there, because that feels like a big part of this debate people are having: do you need evals if you have A/B tests that are testing production-level metrics?

>> A/B tests are, again, another form of eval, I'd say. When you're doing an A/B test, you have two experimental conditions and a metric that quantifies the success of something, and you're comparing that metric. And again, an eval in our mind is systematic measurement of quality, some metric. You can't really do an A/B test without the eval to compare. So maybe we just have a different, weird take on it.

>> Yeah. Okay. So what I'm hearing is that you consider A/B tests part of the suite of evals that you do. I think when people think A/B test, it's: we're changing something in the product, and we're going to see if this improves some metric we care about. Is that enough? Why do we need to test every little feature if it's impacting a metric we care about as a business and we have a bunch of A/B tests that are just constantly running?
>> This is a great point. I think a lot of people do A/B tests prematurely, because they've never done any error analysis in the first place. They've hypothetically come up with their product requirements and they believe those are the things to test. But it turns out, when you get into the data, as Hamel showed, the errors you're seeing are not what you thought they would be; they were these weird handoff issues, or, I don't know, the text message thing was strange. So I would say: if you're going to do A/B tests and they are powered by actual error analysis, as we've shown today, that's great, go do it. But if you're just going to do them based on what you hypothetically think is important, which we find people try to do, then I would encourage you to rethink that and ground your hypotheses.
>> Do you have thoughts on what Statsig is going to do at OpenAI? Is there anything interesting there? That was a big deal, a huge acquisition of an A/B testing company, and people are saying A/B testing is the future. Thoughts?

>> Just to add to the previous question a little bit: why is there this debate, A/B testing versus evals? I think, fundamentally, with evals people are trying to wrap their heads around how to improve their applications, and fundamentally you need to do data science. Data science is useful in products: looking at data, doing data analytics. There's a whole suite of tools, and you don't need to invent anything new. Sure, you don't necessarily need the whole breadth of data science, and it looks slightly different with LLMs; your tactics might be different. But really what this is, is using analytic tools to understand your product. Now people saying the word "eval" are trying to carve out this new thing, eval versus A/B testing, but if you zoom out it's the same data science as before. I think that's what's causing the confusion: we need data-science thinking in AI products, just like it's helpful in any product. That's my take on it.

>> Yeah, that's a really good take. I think just the word "evals" triggers people now.
>> And if you just called it "we're doing error analysis, using data science to understand where our product breaks, and setting up tests so we know,"

>> that's boring. It sounds boring.

>> No, no, no. We need a mysterious term like "evals" to really get the momentum going. Your question about Statsig: I think it's very exciting, to be honest. I don't know much about it. I just imagine they're a company whose tool many people use, and it so happened that OpenAI acquired them. I'm sure OpenAI has been using them in the past, and I'm sure OpenAI's competitors

>> are using Statsig as well.

>> So maybe there is something strategic in that acquisition. I have no idea; I don't know anything there.
But I think those are really the bigger questions for me, more than whether this is fundamentally changing A/B testing or making evals more of a priority. I think they've always been a priority. I think OpenAI has always been doing some form of them, and OpenAI has historically gone so far as to look at all the Twitter sentiment, do some sort of retrospective on it, and tie that back to their products. They're certainly doing some amount of evals before they ship their new foundation models, but they're going so much beyond that: okay, let's find all the tweets that are complaining about it, all the Reddit threads that are complaining about it, and try to figure out what's going on. So it goes to show that evals are very, very important, no one has really figured it out yet, and people are using all the available sources of signal they can to improve their products. What I will say is I'm really hopeful that it might shift the creative focus within OpenAI. Up until now, a lot of the big labs have understandably focused on general benchmarks like MMLU and HumanEval, which are very important for foundation models but not very related to product-specific evals like the ones we talked about today, like handoffs and things like that. Those tend not to correlate.
>> Yeah, they don't correlate with math problem solving, sorry to say.

>> Exactly. And if you look at the eval products that some of the big labs have had until recently, they don't have error analysis. They have a suite of generic tools: cosine similarity, hallucination score, whatever, and that doesn't work. It's a good first stab; at least you're doing something, maybe it gets people to look at data. But eventually what we hope to see is a bit more data-science thinking in this eval process. Hopefully the tools will get there.

>> Hamel and I should not be the only two people on the planet promoting a structured way of thinking about application-specific evals.

>> It's mind-boggling to me. Why are we the only two people doing this? The whole world, what's wrong? So I hope that we're not the only people and that more people catch on.
>> Well, the fact that your course is the number one highest-grossing course on Maven shows clearly there's demand and interest, and I think there are more people on your side. Interestingly, here's an example you've been sharing on Twitter that I think is informative. Everyone's been saying how Claude Code doesn't care about evals, they're all about vibes, and everyone's like, what? And they're the best coding agent out there, so clearly this is right. More recently, there's all this talk about Codex, OpenAI Codex, being better, and everyone's switching, and they're so pro-evals.

>> I know.

>> What? Yeah. So,

>> gets me every time. The internet's so inconsistent.
My favorite thing: yesterday, I believe, a couple of labmates and I were out getting dessert or something, and somebody said, "Oh, do you like Codex or Claude better?" And the other person said, "Oh, I like Claude." And then someone else said, "But the new version of Codex is better." And then the first person said, "Oh, but the last time I checked was two days ago, so maybe I'm not up to date." And I was like, oh my god.

>> So true.

>> This is the world we live in. Oh my god.
>> Okay. So, I want to ask about the top misconceptions people have about evals, and top tips and tricks for being successful. Maybe just share one or two of each. Let me start with misconceptions, and I'll go to Hamel first: what are a couple of the most common misconceptions people still have about evals?

>> The top one is: hey, I can just buy a tool, plug it in, and it'll do the evals for me. Why do I have to worry about this? We live in the age of AI; can't the AI just eval it? That's the most common misconception. People want that so much that people do sell it, but it doesn't work.

>> So that's the first one: shoot, we still need humans. Great. I think that's great news.
>> The second one I see a lot is just not looking at the data. In my consulting, people come to me with problems all the time, and the first thing I'll say is, let's go look at your traces. You can see their eyes pop open, like, what do you mean? Yeah, let's look at it right now. They're surprised that I'm going to go look at individual traces. And 100% of the time we learn a lot and figure out what the problem is. So I think people just don't know how powerful looking at the data is, like we showed on this podcast.

>> I would agree with that.

>> Those are the top two. Okay. Is there anything else, or are those the ones, like, solve those problems?
>> Oh, those are definitely it. And then I guess the one I would add is that there's no one correct way to do evals. There are many incorrect ways of doing evals, but there are also many correct ways. You have to think about where you're at with your product and how many resources you have, and figure out the plan that works best for you. It'll always involve some form of error analysis, as we showed today, but how you operationalize those metrics is going to change based on where you're at.
>> Amazing. Okay. What are a couple of tips and tricks you want to leave people with as they start on their eval journey, or just try to get better at something they're already doing?

>> Tip number one is: don't be alarmed or scared of looking at your data. We try to make the process as structured as possible. There are inevitably questions that are going to come up; that's totally fine. You might feel like you're not doing it perfectly; that's also fine. The goal is not to do evals perfectly, it's to actionably improve your product. And we guarantee, no matter what you do, if you're doing parts of this process you're going to find ways to actionably improve, and then you're going to iterate on your own process from there. The other tip I would give is that we're very pro-AI: use LLMs to help you organize any thoughts you have throughout this entire process. That could be everything from your initial product requirements, figuring out how to organize them for yourself, to figuring out how to improve that product requirements doc based on the open codes you've created. Don't be afraid to use AI in ways that present information better for you.
>> Sweet. So don't be scared, and use LLMs as much as you can throughout the process,

>> but not to replace yourself.

>> Right. Okay. Great. Still jobs. Great. Hamel?

>> Yeah, let me actually share my screen so I can show something. To piggyback off what Shreya said: if you've heard any phrase on this podcast, you've probably heard "look at your data" more than anything else. And so it's really important, and we teach this, that you should create your own tools to make it as easy as possible. I showed you some tools when we were going through the live example of how to annotate data. Most of the people I work with realize how important this is and vibe code their own tools, or, we shouldn't say vibe code, they just make their own tools, and it's cheaper than ever before because you have AI that can help you, and AI is really good at creating simple web applications that can show you data and write to a database. It's very simple. And so for the Nurture Boss use case, we wanted to remove all the friction of looking at
data. And so what you see here is just some screenshots of what the application they created looks like. They have the different channels: voice, email, text. They have the different threads. They hid the system prompt by default, little quality-of-life improvements. And then they actually had this axial coding part here, where you can see, in red, the count of different errors; they automated that part in a nice way. And they created this within a few hours.
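You don't need a web app to start; even a tiny command-line loop removes most of the friction. This is a hypothetical sketch, not the Nurture Boss tool, and it assumes a `traces.csv` file with a `trace` column:

```python
import pandas as pd

def annotate_traces(csv_in: str = "traces.csv", csv_out: str = "traces_annotated.csv"):
    """Show each trace, collect a free-form open-code note, and save the notes
    alongside the original data for later axial coding."""
    df = pd.read_csv(csv_in)
    notes = []
    for i, row in df.iterrows():
        print(f"\n--- trace {i + 1} of {len(df)} ---")
        print(row["trace"])
        notes.append(input("open code (press Enter to skip): ").strip())
    df["open_code"] = notes
    df.to_csv(csv_out, index=False)

if __name__ == "__main__":
    annotate_traces()
```

The point is not this particular script; it's that the tool should fit your data so that reviewing traces costs you nothing.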
And it's really hard to have a one-size-fits-all thing for looking at your data. You don't have to go this far immediately, but something to think about is making it as easy as possible, because, again, it's the most powerful activity you can engage in. It's the highest-ROI activity you can engage in. So with AI, yeah, just remove all the friction.

>> That's amazing. And again, I think the ROI piece is so important; we haven't even touched on this enough. The goal here is to make your product better, which will make your business more successful. This isn't just a little exercise to catch bugs; this is the way to make AI products better, because the experience is how users interact with your AI.

>> Absolutely. We teach our students: when you're doing these evals, if you see something that's wrong, just go fix it. The whole point is not to have an eval suite you can point at and say, "oh, look at my evals"; no, just fix your application and make it better. If it's obvious, do it. So, totally agree with you.

>> Amazing. How long, a question I ask because I think it's something people are wondering about, how long do you spend on this? How long does it usually take to do the first time?
>> I can answer for myself. For applications I work with, I'll usually spend three to four days really working with whoever it is to do initial rounds of error analysis, a lot of labeling, until I feel like we're in a good place to create the spreadsheet that Hamel had, everyone's on board and convinced, and we even have a few LLM judge evaluators. But this is a one-time cost. Once I've figured out how to integrate that into unit tests, or I have a script that automatically runs it on samples, I'll create a cron job to just do this every week. After that, I find myself probably spending more time looking at data than I strictly need to, because I'm just data-hungry like that; I'm so curious, I've gained so much from this process, and it's put me above and beyond in my collaborations with folks. So I want to keep doing it, but I don't have to. I would say maybe 30 minutes a week after that.

>> So it's essentially a week up front, and then like 30 minutes a week to keep improving and adding to your suite.

>> Yeah, it's really not that much time. I think people just get overwhelmed by how much time they spend up front and then think they have to keep doing that all the time.
>> Amazing. Is there anything else that you wanted to share or leave listeners with? Anything you want to double down on as a point before we get to our very exciting lightning round?

>> I would say this process is a lot of fun, actually. It can sound like, okay, you're looking at data, you're annotating things. But, for example, I was just looking at a client's data yesterday, same exact process. It's an application that sends recruiting emails to try to get candidates to apply for a job, and we decided to start looking at traces, jump right into it. Hey, let's look at your traces. We looked at a trace, and the first thing I saw was this email worded like "given your background, blah blah blah." So I asked the person right away, and this is where putting your product hat on and just being critical is the fun part. I said, you know what, I hate this email. Do you like this email? When I receive a message that says "given your background," I just delete it. So I'm like, what is this "given your background with machine learning and blah blah"? This is generic; it sounds like generic recruiting. Can we do better than this? And they're like, oh yeah, maybe, because they were proud of it: the AI is doing the right thing, it's sending this email with the right information, the right link, the right name, everything. And so that's where the fun part is: put your product hat on and get into, is this really good?

>> Something I want to make sure we cover before we get to a very exciting lightning round: this is just scratching the surface of all the things you need to know to do this well. I think this is the best primer I've ever seen on how to do this well.

>> Nice. I think we did it.

>> But you guys teach a course that goes much, much deeper for people who really want to get good at this and take it seriously. Share what else you teach in the course that we didn't cover, and what else you get as a student in the course you teach on Maven.
>> Yeah, I can talk about the syllabus a little bit, and then Hamel can talk about all the perks. We go through a life cycle: error analysis, then automated evaluators, then how to improve your application, how you create that flywheel for yourself. We also have a few special topics that pretty much no one has ever heard of or taught before, which is exciting. One is how to build your own interfaces for error analysis; we go through actual interfaces we've built, and we also live-code them on the spot for new data, showing how we use Claude Code, Cursor, or whatever we're feeling that day to build these interfaces. And we also talk broadly about cost optimization. A couple of people I've worked with got to a point where their evals are very good and their product is very good, but it's all very expensive because they're using state-of-the-art models. So how can we replace certain uses of the most expensive GPT-5 models with, you know, 5 nano, 4 mini, whatnot, and save a lot of money while maintaining the same quality? We give some tips for that too. Hamel, do you want to... We also have many perks.
>> Yeah. Talk about the perks.

>> Okay, the perks. So, my favorite perk is there's a 60-page book, meticulously written, that we've created, which walks through the entire process of how to do evals in detail. So you don't have to sit there and take all these notes; we've done the hard work for you, documented it in detail, and organized everything. That is really useful. Another really interesting thing, and something I got the idea for from you, Lenny, is that this is an AI course; education shouldn't be a thing where you're only watching lectures and doing homework assignments. Students should have access to an AI that also helps them. So what we've done, just like there's the Lennybot that you have,

>> yeah, lennybot.com,

>> we have made the same thing with the same software that you're using, and we have put everything we've ever said about evals into it. Every single lesson, every office hours, every Discord chat, any blogs, papers, anything we've ever said publicly and within our course, we've put it in there, and we've tested it with a bunch of students and they've said it's helpful. So we're giving all students 10 months of free unlimited access to that alongside the course.
>> Amazing. And then you'll charge for that later down the road is the idea?

>> I just take one month at a time. I don't know what we're doing.

>> Eight months, and then we'll have to figure it out. I was thinking this whole interview should have just been our bots talking to each other.

>> That's amazing. I would watch that, but only for like 10 minutes; then I don't know what they're talking about.

>> Yeah, maybe 30 seconds. Did you guys train it on the voice mode, by the way? That's my favorite feature of Delphi's product. If not, you should do that.

>> Oh,

>> I can't remember; I should look at it.

>> Definitely. Now that we have this podcast episode, you could use this content to train it. It's ElevenLabs-powered; it's so good. Okay, so how do they get to it? I guess they get to that once they enter your course. So there's no

>> Sign up for the course and then you'll get a bunch of emails. Everything will be clear, hopefully.
>> Amazing. Okay.

>> We also have a Discord of all the students who have ever taken the class, and that Discord is so active

>> I can't go on vacation without getting notified on the plane.

>> Bittersweet. Bittersweet.

>> Incredible. Okay. With that, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?
>> Yes. Let's go.
>> Let's do it. Okay. So, I'm going to bounce between you two. Share something if you want; you can pass if you want. First question, Shreya: what are two or three books that you find yourself recommending most to other people?

>> So, I like to recommend a fiction book, because life is about more than evals. Recently I read Pachinko by Min Jin Lee, a really great book. And I'm also currently reading Apple in China; the name of the author is slipping my mind, but it's more of an exposition written by a journalist on how Apple built up its manufacturing processes in Asia over the last several decades. Very eye-opening.
>> Amazing. Hamel?

>> Yeah, I have them right here. So, I'm a nerd, okay? I'm not as cool as Shreya is, so I actually have textbooks, which are my favorite. This one is a very classic one: Machine Learning by Mitchell. Now, it's kind of theoretical, but the thing I like about it is that it really drives home the fact that Occam's razor is prevalent not only in science but also in machine learning and AI, and engineering: a lot of times the simpler approach generalizes better. That's the thing I internalized deeply from that book. And I also really like this one, another textbook; I told you I'm a nerd. This is also a very old one.

>> Wow.

>> And this is, you know, Norvig and Russell's, and I really like it because it's just human ingenuity, lots of clever, useful things.

>> They're down the street.

>> Computing.

>> I'm at Berkeley.

>> The people that did that research.

>> Yeah.

>> Yeah. Textbook authors.
>> Super cool. Oh, man. Nerds. I love it. Okay, next question: favorite recent movie or TV show? I'll jump to Hamel first.

>> Okay. So, I'm a dad of two parents. I have two parents... sorry, uh, two kids. Yeah, I'm a dad of two kids, and I don't really get the time to watch any TV or movies, so I watch whatever my kids are watching. I've watched Frozen like three times in the last week.

>> Only three? Okay, in the last week. That's great. I love it. Okay, Shreya?

>> Yeah. I don't have kids, so I can give all these amazing answers. My husband and I have been watching The Wire recently. We never actually saw it growing up, so we started watching it, and it's great.

>> I feel like everyone goes through that eventually in their life; they decide, I will watch The Wire.

>> I know. So we are in that

>> year of your life. It's great. It's such a great show. Oh man. But it's so many episodes, and every one is an hour long.

>> I know. We get through like two or three a week, so we're very slow.
>> Worth it. Okay, next question. Do you have a favorite product you've recently discovered that you really love? We'll start with Shreya.

>> Yeah, I really like using Cursor, and honestly, now Claude Code. I'll say why. I'm a researcher more than anything else: I write papers, I write code, I build systems, everything. And I'm so bullish on AI-assisted coding because I have to wear a lot of hats all the time, and now I can be more ambitious with the things that I build and write papers about. So I'm super excited about those. Cursor was my entry point into this, but I find myself always trying to keep up with all these AI-assisted coding tools.

>> Hamel?

>> Yeah, I really like Claude Code, and I like it because I feel like the UX is outstanding. There's a lot of love that went into it. It's just really impressive as a terminal application that is that nice.

>> Ironic that you two both love Claude Code when it's just built on vibes.

>> I think that's false. It's not just built on vibes.
>> There we go. Okay, two more questions. Hamel, do you have a favorite life motto that you find yourself using and coming back to in work or in life?

>> Keep learning and think like a beginner.

>> Beautiful. Shreya?

>> I like that. For me, it's to always try to think about the other side's argument. I find myself sometimes encountering arguments on the internet, like these recent evals debates, and I really try to put myself in their shoes. There's probably a generous take, a generous interpretation, and I think we're all much stronger together than if we start picking fights. My vision for evals is not that Hamel and I become billionaires; it's that everyone can build AI products and we're all on the same page.
>> Slash Everyone becomes billionaires.
>> Yes.
>> Yes.
>> Amazing. Final question. When I have two guests on, I always like to ask this, and I'll start with Hamel. What's something about Shreya that you like most? And I'm going to ask her the same question in reverse.

>> Yeah, Shreya is one of the wisest people that I know, especially for being so young relative to me. I feel like she's much wiser than I am, honestly. Seriously. She's very grounded and has a very even perspective on things, and I'm just really impressed by that all the time.

>> Shreya?

>> Yeah. My favorite thing about Hamel is his energy. I don't know anybody who consistently maintains momentum and energy like Hamel does. I often think that I would care much less about evals if not for Hamel. Everyone needs a Hamel in their life, for sure.

>> Oh, well, we all have a Hamel in our life now. This was incredible; this was everything I'd hoped it'd be. I feel like this is the most interesting, in-depth, consumable primer on evals that I've ever seen. I'm really thankful you two made time for this. Two final questions: where can folks find you, where can they find the course, and how can listeners be useful to you? I'll start with Shreya.
>> Yeah. Uh, you can reach me via email.
It's on my website. If you Google my
name, that is the easiest way to get to
my website. You can find the course. If
you Google AI evals for engineers and
product managers or just AI evals
course, you'll find it. Um, we'll send
some links hopefully after this so it's
easy. And how to be helpful. Two things
always for me. One is ask me questions
when you have them. I will try to get to them and respond as soon as I can. The other
one is tell us your successes. One of
the things that keeps us going is
somebody tells us like what they
implemented or what they did a real case
study and Haml and I get so excited from
these um and it really keeps us going.
So please share.
>> Yeah. It's pretty easy to find me. My website is hamel.dev; I'll give you the link. You can find me on social media, LinkedIn, Twitter. The thing that's most helpful, to echo what Shreya said: we would be delighted if we're not the only people teaching evals. We would love other people to teach evals. So any kind of blog posts or writing, especially as you go through this and learn it, that you want to share, we would be delighted to help reshare or amplify.
Amazing. Very generous. Thank you so much for being here. I really appreciate it, and you two have a lot going on, so thank you.
>> Thanks Lenny for having us and for all
the compliments.
>> My pleasure. Bye everyone.
>> Thank you so much for listening. If you
found this valuable, you can subscribe
to the show on Apple Podcasts, Spotify,
or your favorite podcast app. Also,
please consider giving us a rating or
leaving a review as that really helps
other listeners find the podcast. You
can find all past episodes or learn more
about the show at lennyspodcast.com.
See you in the next episode.