AI Evaluations Clearly Explained in 50 Minutes (Real Example) | Hamel Husain
By Peter Yang
Summary
## Key takeaways

- **Directly analyze traces for potent insights**: Instead of relying solely on automated metrics, directly analyzing around 100 AI conversation traces and annotating issues provides more actionable insights into specific failures and user pain points. [04:41], [08:46]
- **Spreadsheets simplify complex AI evaluation**: A simple spreadsheet can be used to organize annotated trace data, categorize issues, and analyze findings using pivot tables, making the evaluation process accessible and understandable. [09:46], [14:14]
- **Binary pass/fail beats 1-5 scores for AI judges**: When using an LLM as a judge, binary pass/fail outputs are more effective than 1-5 scores, as they reduce complexity and avoid the ambiguity of average scores that are difficult to act upon. [24:44]
- **Beware of 'agreement' metrics in AI evals**: High agreement between an AI judge and human labels can be misleading if the error rate is low; it's crucial to examine true positive and true negative rates instead. [28:52]
- **Continuous evals in production are key**: After developing reliable judges, implement them in production to continuously monitor AI performance on live traffic, allowing for ongoing debugging and improvement. [35:54]
- **Focus on real problems, not generic metrics**: Generic metrics like helpfulness or toxicity scores can be misleading; prioritize identifying and addressing the specific, real-world problems occurring in your application's data. [44:02]
Topics Covered
- Raw AI Eval Scores Are Meaningless and Unactionable
- Iterative Manual Review: Improving AI Evaluation Through Practice
- Manual Traces Provide Deeper AI Insights Than Automated Scores
- The most valuable AI eval process is annotation, not fancy tools
- Why Likert Scales (1-5) for AI Evals are Often a Bad Idea
Full Transcript
This looking at data and counting, you can get insane value by just doing that. And
that's the one part that everyone skips. I'm gonna use a spreadsheet to drive home
the fact that this process can be dead simple and you can get immense value
out of it. When you see an average score of 3.2 versus 3.7, no one
really knows what the hell that means. It's not really actionable, honestly. They're like, oh,
it's like getting better. Honestly, like nobody really knows whether it's getting better or not.
So as a product manager, if you ever see the word agreement, you need to
pause and be like, hmm, let me dig into this, please. If people don't trust
your evals, they won't even trust you. You're done. Okay, welcome everyone. My
guest today is Hamel. Hamel has trained over 2000 PMs and engineers from companies like
OpenAI, Anthropic, and Google on how to run AI evaluations. He teaches the most popular course on this topic on Maven. So really excited to dig into his best practices. And I feel like I made a lot of mistakes and have a lot of assumptions that Hamel can help dispel. So welcome. Really happy to be
here. Excited to talk about evals. All right. So there's been a lot of online
debate about are evals valuable or not, blah, blah, blah, on Twitter. And let's just
make this really practical. Let's talk about evals for a real product. Do you have
a product example that you want to talk about throughout? Yeah. I do, yes. So
here's a company that I've been working with. Let me share my screen. Sure. So
just to set the stage, Nurture Boss is an AI-powered property management assistant.
And so it's a really interesting use case because it's actually one of the best
teaching examples. And I love using it because it's messy and there's enough
complexity to where it's realistic. And so the question comes like, oh, okay, like how
do you do eval? So a lot of times... When we talk about evals, we
can show toy examples, but sometimes those are oversimplified and it's hard for people to
generalize that. Like, how am I going to do that for my app? My app
is complex. I have other things going on. Well, that's what we're going to show
today. Actually, we're going to look at Nurture Boss' data together and we're going to
do a minimal set of evals. We're going to do that very quickly. Yeah, that'd
be awesome. So I know on your podcast, you had Aman on already. You may
have talked about Arize. There's a lot of different solutions out there. The ones that I come across the most in practice are Arize, Braintrust, LangSmith. Those are the kind
of three popular ones. One of the observability platforms that NurtureBoss used in the beginning
was Braintrust. They actually created their own, but I'm going to show you their data
in Braintrust because that's where we have it anonymized. Trace is basically like the chat
conversations, right? With NurtureBoss? Yeah. Okay, I'll show you what a trace is. The best
way is just to look at it rather than me trying to define it. So
I'm just going to open one. So trace is a log of all of the
history of a particular interaction that your user might be having with an AI, including
all of the internal things that might be going on that the user doesn't see.
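For readers following along without the screen share, here is a rough sketch of what a trace might look like as data. This is a hypothetical, simplified structure (loosely modeled on OpenAI-style chat messages and on the example walked through next), not Braintrust's or Nurture Boss's actual schema:

```python
# Hypothetical trace: a list of messages, including the internal steps
# (tool calls and tool results) that the end user never sees.
trace = [
    {"role": "system", "content": "You are an AI assistant working as a leasing team member at the Aero..."},
    {"role": "user", "content": "Where is the building located?"},
    {"role": "assistant", "tool_call": {"name": "get_communities_information", "arguments": {}}},
    {"role": "tool", "name": "get_communities_information", "content": "{...community info...}"},
    {"role": "assistant", "content": "The Aero is located at ..."},
    {"role": "user", "content": "I'm interested in a two-bedroom, what's available?"},
    # The conversation dead-ends here: an internal error occurred and was never surfaced to the user.
]
```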
Okay, so this is an example of a trace. And so what we see here
is we have the system message. So, you are an AI assistant working as
a leasing team member at the Aero. And you have different instructions on how to
interact with the resident or the prospective resident. For example,
how to deal with maintenance requests or what to do with prospective
residents and tour scheduling and all kinds of things like applications. And
there's a lot of instructions here. We don't need to read all of them. You
can scroll down. And we see here, where is the building located is
one of the questions the user is asking. And so there's a tool call made,
get communities information. We can expand it and it's pulling back some information. It's making
a tool call. The tool call returns with some information, which you can see here.
And then the assistant is giving an answer. You know, the Aero is
located at this address and then it's, the user is saying, I'm interested in a
two-bedroom, what's available? And actually, then it stops. It just dead ends.
And there is a failure of some kind. It just wasn't surfaced to the
user. So this is not necessarily the most interesting error. This is just
something that you would see in the real world. Now, this is an example of
a trace. So when Jacob came to me, I said, okay, Jacob, this is what
we're going to do. We're going to look at traces. And he's like, what do you mean, we're gonna look at traces? This seems like we're going in the wrong direction. Like, this is gonna take forever, Hamel. What are you doing? I said, just trust me. We're gonna look at about 100 traces and we're just gonna write notes about what is going wrong. The first one may be painful, but we're gonna get really good at it. By the time we get to the 10th one, we're gonna be really fast. So we got to this one, we're like, oh, something happened here. The conversation got cut off, it failed. We had to do some investigation to figure out what happened, but we just wrote a note. And so in this case,
we can make an annotation. So I would write a note saying like, hey, there
was an error that was not surfaced
to the user. Okay, just write a note. And the conversation
dead ended. The point is not to do a full root cause analysis. Just observe
what's wrong. Okay. And that's all you got to do is do this. Do it
like a hundred times. Yeah. Do this a hundred times. Now, Jacob's like, what the
hell are we doing? Why are we doing this? Like, I hired you, Hamel, for evals and you are like a monkey looking at this data. And like, I don't understand what is going on. Like, what is the purpose of this, Hamel? It was
a very expensive consulting engagement. And we were looking at data. I said, no, trust
me. We are going to, this is going to lead somewhere. Well, I mean, you're
basically human labeling this data, right? I'm basically human labeling this data. So let's look
at a couple of more. I think it will become clear why this is valuable
really fast. Okay, so this is another trace. You are a leasing team member at
the Meadowbrick Gardens, and your name is Taylor, so it's giving it a name. Your
goals are to answer questions. Okay, so it's really the same thing. You have, you
know, you're fielding calls from residents, actually prospective residents,
because this looks like it's like a sales funnel type of mode where you're like
encouraging customers to schedule tours. So let's go to the first user message. So we
got the system message out of the way. The assistant says this call, so this
is over voice now. Okay. This call may be recorded, blah, blah, blah. And the
only thing the user says is preview program. I have no idea what that is.
Maybe it's something in the system prompt, but I personally don't recall necessarily what that
is. The assistant says, it looks like you might be interested in learning more about
community and possibly scheduling a tour. Would you like to schedule a tour? He just
says no. The assistant says, it seems like you might have a specific question or
need assistance or something related to our apartment. How can I assist you? Can I talk to a representative? Obviously, they're trying to talk to a person. Sure, I can
connect to the representative. Would you like me to do that now? Yes. Great. I'll
connect you to representative one moment. Would you like me to connect you? I'm already
frustrated. Like, yeah, the person to a human being already like so many questions like,
are you sure? Are you sure? Blah, blah. So, okay. Like ADU, the ones that
I do, it sounds like you might be interested. Yes. Could you clarify your query?
I just want to talk. We've all had this experience at one point or another in our lives. Yeah. Every time I call. Yeah. Yeah. This is not, you know,
I want to throw my phone out the window whenever I have this interaction with
an AI. So, okay, I understand. I'll connect to you. And finally, a tool call
is made where we're transferring the call. So we feel the pain of the user.
We do, honestly. And it's pretty clear, like, hey, so this
is a clear error. In this case, I'll write the note here just like I did on the other one. And you're not trying to, like, write solutions. You're just trying to write, like, a note about the problem. Yeah, I'm not trying to debug it. I'm not trying to say why it happened. I'm not trying to root cause a solution. I'm just journaling what happened. So you do this.
It took us, frankly, it took us like an hour to do 100 traces. It
didn't take us long at all. Okay. But by the end of it, we knew
a lot. Like we learned a lot and we learned way more than you could
possibly learn by trying to throw any kind of automated solution at this problem.
So if you tried to come at NurtureBoss and put a hallucination score, toxicity score,
coherent score, whatever you want to call it, it would not have given us anywhere
close to this insight that we have right now. We identified very specific failures and
things that are painful. By looking at the data, we immediately could see that. Did
you use some AI to summarize the trace data that you labeled? Very good
question. So... while you shouldn't use AI to do the looking for you, you should definitely look
at the data, put your hands on the data. You can use AI to help
you do the analysis of it. So what happens is I exported all of
these logs plus the notes into a spreadsheet. Okay. Okay. And you don't have to
use a spreadsheet. I'm going to use a spreadsheet to make it, to drive home
the fact that this process can be dead simple and you can get immense value
out of it. So what I did is I exported, like, all the notes. So
in this column, column A, you have all the notes that I took. So like,
you know, one note is user was probably asking about lease terms or maybe deposit,
not about specials. And the AI was talking about specials. Or another one was the
AI offered virtual tours, but there is no virtual tour. Okay. You know, so all
these are different. And there's the disparate messaging one that we just saw. And so
all these different notes that it took. So what you can do is you can
take these notes and you can do something really stupid, simple, is you
can dump them into like Claude or ChatGPT or whatever. And I say, okay, please
analyze the following CSV file. There's a metadata field, which is a message field called
ZNote that contains open codes. So this is like some terminology here. Open codes is
just a fancy word for those comments that I made. Okay, just the notes. For the analysis of LLM logs that we are conducting, please extract all of the different open codes from the ZNote field and propose five to six categories that we can create axial codes from. So the axial code is another piece of terminology. That just means we want to group them into categories, we want to group and classify these notes. Okay. This open code, axial code thing is actually some
really old technique from social sciences. And it's been also used in machine learning for
decades. And so that's why we're using this terminology as a shortcut to give to
the LLM. Because the LLM knows exactly what this means. They're like, oh, I know
exactly what you're doing. You're trying to do this technique. Got it. Okay, so it's
better than just saying categorize this stuff. It's just some shortcut technique. Yeah.
It gives it some specificity. By saying open code, axial code, it knows what my
goal is. Because there's a lot of... when I use that terminology in this technique.
You can also say categorize, though. I mean, there's no... You can start wherever you
want. So if you want to say categorize, start with categorize. Totally fine. The point
is make progress and get value as fast as you can. I don't want to
be too prescriptive. But the point is, like, you can sort of... So, like, okay,
it kind of iterated a lot on how to open the CSV, so we can
skip that. But it gave me, like, you know, some... categories.
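To make this step concrete, here is a minimal sketch of sending the annotation notes to an LLM for open/axial coding. It assumes an OpenAI-style client and a hypothetical notes.csv export with a ZNote column; the prompt wording is paraphrased from the one described above, not the exact prompt used:

```python
import csv
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat LLM would work

client = OpenAI()

# Load the exported notes (the open codes) from the annotation spreadsheet.
with open("notes.csv", newline="") as f:
    notes = [row["ZNote"] for row in csv.DictReader(f) if row.get("ZNote")]

prompt = (
    "We are analyzing LLM logs. The following lines are open codes "
    "(free-form notes about what went wrong in each trace). "
    "Extract the distinct open codes and propose 5-6 axial codes "
    "(categories) we can group them into.\n\n" + "\n".join(notes)
)

resp = client.chat.completions.create(
    model="gpt-4o",  # any capable model
    messages=[{"role": "user", "content": prompt}],
)
# Review the proposed categories by hand and refine them until they make sense to you.
print(resp.choices[0].message.content)
```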
And I didn't necessarily like all of them. And I could have been, you know,
I kind of went back and forth a bit to refine the categories that made
sense to me. And I just like took, I wrote down some categories actually. And
here's some categories here. So I said like, okay, tour scheduling, rescheduling issue,
human handoff or transfer issue, formatting error with the output that had some formatting errors.
Yeah. Like putting markdown inside text messages, for example. Conversational
flow issues. So that's like the text thing where it's just abrupt, you know, the
flow. Making up promises not kept, like rescheduling and things like that. Or
not rescheduling, but other kinds of promises. And then there's another, I usually have another
field called none of the above, but I didn't do that here, just out of
simplicity. And so what you can do is then you can kind of go back
and forth. And what I did is like in the spreadsheet, you know, it can
use AI. So there's AI in this Google spreadsheet. Google Sheets has AI. You can use the AI formula, it's very handy, let me show you something. Categorize the
following note into one of the categories. So you're just like using the LLM to
like categorize it. And you can see I'm just categorizing these notes that I took,
like the problems that I found. I'm just putting into categories. Right. I didn't know
that Google Sheets has this AI feature already. Okay. They're usually very slow at this
stuff. It's cool. It can be slightly janky, but it's okay. It's lightweight and you
don't have to use any tools and everyone can understand how to do this. It
demystifies the whole process of what I'm doing. Because if I open some code, you
might think, oh, you need to be a software engineer to do this or something.
And no, you don't. We can use English all the way. So now you have
categorized all of these things. And now we can use one of my favorite tools,
PivotTables. So pivot tables, if you haven't seen them before,
it's really handy in spreadsheets. So you can just count how many times each of
these categories occurred. And we can see just at a high level,
hey, oh, okay, this conversational flow issue is happening quite a bit. We
also have this human handoff transfer issue. And you can kind of get a sense
right away what the problem is. Now, it is likely that before you even get
to this count, you already know. You've looked at 100 traces, you know in your
gut. You're like, okay, you know what? I need to fix this human transfer thing
right now. You're like, I don't even need to do a data analysis, but it's
quick. This takes less than a few minutes, honestly. And it just gives
you some grounding and lets you see, you go from this massive, I don't know
what's going on, to okay, like I have some idea about the problems that I
have. Okay. And you have some starting point. Does that make sense? Yeah.
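The pivot-table step is just a count per category. If you prefer code to a spreadsheet, a rough equivalent in pandas (assuming a hypothetical annotations.csv with an axial_code column) would be:

```python
import pandas as pd

# Each row is one annotated trace; axial_code is the category assigned to the note.
df = pd.read_csv("annotations.csv")

# Count how often each failure category occurs: the spreadsheet pivot table in one line.
print(df["axial_code"].value_counts())
# e.g. (hypothetical counts)
# conversational_flow_issue     27
# human_handoff_transfer_issue  19
# ...
```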
Yeah, this is brilliant, dude. But let me ask you this question. Yeah. So in
this case, the agent was already live in production before he started doing all this
stuff, right? But that's not the ideal approach, right? Like ideally you want to do
some of this before or maybe dog food with your team first or like some
users try it. Absolutely. How do you? Yeah. Yeah. So the best case scenario is
you dog food it with like, you know, some friendly customers, you've dog food it
yourself. That's going to be really good. You can also synthetically generate inputs into your
system. So basically what you can do is like, think about plausible user
questions they may have, you know, and try to come up with some hypotheses
of where your system will break. Yeah. But there's a certain way to do that.
You don't want to just, ask an LLM, hey, come up with plausible questions for
a prospective tenant that might be looking to rent an apartment. The right way to
do it is to come up with some categories. So it's to come up with
some dimensions. And what I mean by that is, let's say, let's think about what
good dimensions might be for Nurture Boss. So for Nurture Boss, you might have, like, resident... you can have, okay, like, type of customer maybe? Apartment class maybe? So like luxury, standard, something else.
I don't know what that is. I'm not that creative to think about on the
fly, but what was the thing you said? Just type of customer, right? You can
get the tenant manager versus the actual resident, right? Depending on who you're talking to.
Yeah. Resident, manager. Yeah.
And you can think of, put your product hat on. So like, by the way,
this whole process is very product oriented. Like, so,
you know, when you read the trace, it's not so much about engineering. It's putting
your product kind of hat on and saying, is this the experience that you want
your user to have? Does this actually make sense? When it comes to like these
dimensions I'm talking about, you kind of putting your product hat on and saying, okay,
what are the different personas? What are different categories, different dimensions that you may want
to consider? Yeah. And then what you want to do is like, you know, you
would take the kind of combination of these, you know, so like luxury for resident, luxury for manager, standard for resident, standard for manager. And you
would feed those, we call it dimensions, into an LLM, say, okay, these are the
different dimensions. For this, for every one of these dimensions, generate plausible user queries.
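A minimal sketch of that dimension-based approach, using hypothetical dimensions for a property-management assistant (customer type and apartment class are the ones brainstormed above; the prompt and the dimension values are illustrative, not Nurture Boss's actual pipeline):

```python
from itertools import product

# Brainstormed dimensions: put your product hat on and extend these.
customer_types = ["prospective resident", "current resident", "property manager"]
apartment_classes = ["luxury", "standard"]

# Take the cross product so the inputs explore the space instead of clustering
# around whatever a single generic prompt would produce.
for customer, apt_class in product(customer_types, apartment_classes):
    prompt = (
        f"You are a {customer} contacting a leasing assistant for a "
        f"{apt_class} apartment community. Write 5 plausible, varied "
        "questions or requests you might send, including at least one "
        "awkward or edge-case message."
    )
    # Send `prompt` to an LLM and collect the generated queries as test inputs.
    print(prompt)
```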
Got it. That's like way better than just asking an LLM. Yeah. Just asking. Because
if you just ask an LLM, it'll be a lot more homogeneous. And you don't
want to have homogeneous inputs. You want to explore the space of inputs. And so
these brainstorming dimensions helps you to kind of make sure you're exploring
the space, being thoughtful about exploring the space. Does that make sense? Got it. Yeah,
you want to find all the edge cases. Yeah, got it. And that's just scratching
the surface of this. There's a lot more to generating synthetic data that we could
probably get into here, but there's like more advanced ways to generate synthetic data or
things to think about in terms of being more adversarial, how to come up with
hypotheses to help you break your system, so on and so forth. But the short
answer is: use your system, and you can use LLMs to basically
pretend as synthetic users. That makes sense. Okay, let's go back to the categories you
have on the left. So now you have these, I mean, they're not, they're basically
problems, right? Problems with the product. And now what we're going to do with this
stuff, is this how we come up with our criteria or should we just start
fixing the issues or? Very good question. Very good question. So
when we first did this, the top issue was date handling. It's like, and
it was very clear, you know, the user wanted to schedule an appointment.
And it was always getting the date wrong. And it was very clear, like, oh,
this is so dumb. Like, the LLM doesn't know what today's date is. We just, like, forgot. They, like, forgot to put that in the prompt. It's like, oh, do you really need an eval for that? Maybe not. Like, you know, you don't want to eval-max. You don't, like, necessarily want to do evals because, like, it feels good. The whole purpose of anything is to make your product better and to iterate
and move fast. And so for that one, we're like, well, let's see.
Let's just give it what today's date is. And that problem basically went away,
unsurprisingly. So we didn't really need an eval from that. Other things that are more
subjective are, so it's a cost-benefit trade-off. So there's two kinds of
evals. One is LLM as a judge, which we are going to build together. Another
one is code-based eval, where you don't really need an LLM as a judge. It's
some kind of assertion that you can make. And that's very cheap compared to LLM
as a judge. And so for the date one, we actually did
a code-based eval, which is like we had some test cases and we're able to
test, like, does the date that's coming out equal to the expected date? And that
was very cheap. We didn't have to do LLM as a judge. Got it. But
that was really easy to fix. Now, something like, hey, you should be handing off
to a human. Okay, that one, we don't know exactly. We
did have rules for that already, but the LLM is struggling, and we don't really
know how we're going to do it. That seems like a really good use case
for an LLM judge. And also, the eval is going to provide tons of value,
even though it's expensive, more expensive, because we're going to iterate against it a lot
to make products. Got it. And so we say, okay, like let's, you know, okay,
we need an LLM judge for the human handoff. Let's go ahead and do it.
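As an aside before the judge: the code-based eval mentioned for the date issue is just an assertion against a known answer. A minimal sketch, where the test cases and the pipeline function are hypothetical:

```python
from datetime import date

# Hypothetical test cases: (user message, "today" injected into the prompt, expected scheduled date).
TEST_CASES = [
    ("Can I tour tomorrow?",      date(2025, 3, 10), date(2025, 3, 11)),
    ("Can I tour in three days?", date(2025, 3, 10), date(2025, 3, 13)),
]

def run_date_eval(schedule_date_fn):
    """schedule_date_fn(message, today) -> the date your assistant pipeline actually schedules."""
    failures = []
    for message, today, expected in TEST_CASES:
        got = schedule_date_fn(message, today)
        if got != expected:  # reference-based check: there is one right answer, no judge needed
            failures.append({"message": message, "expected": expected, "got": got})
    return failures

# Usage: failures = run_date_eval(my_pipeline_schedule_date); assert not failures
```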
It's an important problem also. Yeah. Yeah. The main problem that people encounter when doing
LLM as a judge is they just prompt in another LLM to kind of judge
what your LLM did. And they say, is it good? Now that should be suspicious,
right? Like, why is that okay? Like, Why are you just going to tell another
LLM to tell me if it's okay? Like, I don't know. And you would be
right to be very suspicious of that. And there is an answer to that. And
the answer is you can measure the judge. So it's like, it's a meta evaluation
problem is like, you need to measure if the judge is good. It's very important.
You don't want to skip that step because if you have a bunch of judges,
LLM judges floating around and you're reporting them on the dashboard, your stakeholders are looking at them, and, you know, no one's going to understand that you're using an LLM judge anyway. They're just going to look at your metric. And when there's enough gap between
reality and your metrics, no one's going to trust you anymore. You want to avoid
that. You want, because like, if people don't trust your evals, they won't even trust
you. You're done. Yeah, exactly. Yeah, you can't have it show a perfect score for something that's actually totally wrong, right? Yeah, yeah. Yeah. Yeah. And so, okay, so
how do you go about this? Well, thankfully, when you're doing this axial coding stuff,
right, you actually have identified really good test cases or some reasonable test cases that
you can use that are labeled. You already labeled them as a human. So you
have some ground truth. And these are things that you can use to calibrate an
LLM judge to see if you can create a judge that is good enough. Okay?
Okay. So that's what- Good enough is just, like, how closely it matches the human labels, right? That's kind of what- How close does it match to the human labels? Yeah. So that's what we're going to do next is we're
going to think about, first, we're going to write the prompt. And this is like
the dumbest prompt. I'm not saying like, this is a good prompt. This is just
a prompt. And the point is not to like have a prompt recipe or some
like magic thing. It's just to iterate. Okay. And you want to just specify- kind
of the requirements of like, in this case, what is a good handoff? Or when
should you be doing a handoff and when should it happen? And so, you know,
you are scoring a leasing assistant to determine if there was a handoff failure. There
should be a handoff if any of these things occur... you know, or sorry, there is a handoff failure if any of these things occur, like if a human requested to be handed off but you just ignored them or looped through it too
many times, that's a failure. Got it. Yeah. And there's a list of these seven
failures. You don't have to read all of them, but you get the idea. And
we also say when there's not a failure, just out of completeness. And we say
we want to return exactly true or false binary. So
it's worth lingering on this for a moment. So it's very important for an LLM
judge that you output a binary score. 99% of the time, you don't want to
output like a Likert scale or a score of one to five or some kind of score, because that introduces tremendous complexity. Yeah. You know, LLMs are not good at continuous scores, number one. Number two is the
output is not going to be clear. When you see an average score of 3.2
versus 3.7, no one really knows what the hell that means. Yeah, yeah. And
it's not really actionable, honestly. They're like, oh, it's like getting better. Honestly, like nobody
really knows whether it's getting better or not. I found that when you try to
hide behind a score, you're not really making a decision. And like what you're trying
to, the frame here is, is this feature good enough to ship? Yes or
no? Make a decision. What is the line? There is a line somewhere inside. Like
there has to be. Right. And so we don't want to score wherever possible. You
want to simplify it. The score just makes it too complex. Yeah. It's like a
fake science, you know, it's like, you know, false precision, right? Like who knows? Yeah,
it can be. Yeah, it can be. There's some cases like there's some evals where
you want a score when you get, when you go very narrowly into certain aspects
of things like, you know, when you try to have evals for retrieval, search and
things like that, like different components, then the scores make sense. But for this, like
LLM-as-a-judge case, in the overall sense, like, no. And why no explanations, though? Like, why don't you want it to explain why it marked something? So, explanations are actually
usually good. So, you know, what we teach is you want explanations and
then a score. But this is like a spreadsheet. Okay, yeah. We just want it
to be tractable. If I try to have it give an explanation, then, you know, the model here in the spreadsheet isn't the most powerful one they give
you. So it was going all over the place. So I was just trying to
simplify it here, but yeah, explanation can be good. It can help you debug the
AI model and you want to give a structured output. You want like a few,
you want to usually output two fields. Like you want it to output like an
explanation and a binary score. And then you can use the explanation to kind of
help you debug what went wrong with the LLM's thinking. Oh, so you're actually going to do this LLM judging using the Google Sheets model. Yes. I'm going to stay
in the Google sheet because our goal is to demystify everything and to make it
very clear, like what is actually happening by using a spreadsheet all the way down.
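Here is a minimal sketch of that kind of LLM judge outside the spreadsheet, with the structured output described below (an explanation plus a binary pass/fail). The rules are paraphrased, and the OpenAI-style client and model name are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are scoring a leasing assistant conversation to determine
whether there was a HUMAN HANDOFF FAILURE. It is a failure if, for example,
the user asked for a human but was ignored or looped repeatedly before being
transferred. (Put your full list of failure rules here, plus when it is NOT a failure.)

Return JSON with exactly two fields:
  "explanation": a short reason for your decision
  "handoff_failure": true or false
"""

def judge_handoff(trace_json: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": trace_json},
        ],
    )
    result = json.loads(resp.choices[0].message.content)
    # Binary decision to act on; explanation to help debug the judge's reasoning.
    return {"handoff_failure": bool(result["handoff_failure"]),
            "explanation": result["explanation"]}
```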
So, uh, okay. So we have this LLM judge prompt, and now we can go back to those traces. So we have, like, this is the original trace in column A, it's in a JSON format. Um, and then we have sort of this LLM as
a judge. So this is like for one error. So you want them to be
scoped usually. So we have this LLM as judge just for the handoff error. And
we have the formula, assess this LLM trace according to these rules. Okay. And
then it's the LLM judge prompt that I just showed you. That's all it is.
And then it's giving us true or false. Okay. So we have true or false
here. And then this column H is, is the LLM judge
handoff, like what it said, the binary score, true or false, is there an error?
And then this is the human label, column G, there's an error. So we have
these two labels and we already did column G before. That kind of happened for free because of the axial coding and the open coding process. Oh, so you have like another AI? No, the human label is like the notes, right? Yeah, those are like the results of the notes. Okay.
You know, like basically I said, hey, if the axial code is human handoff or
transfer issue, I just said it's true. Got it. Got it. Okay. And so you
can then see how aligned the LLM is to the human. That's the main thing
that you want to test. Now, one thing you want to stay away from. So
a lot of people go to just calculate agreement. Intuitively, it makes sense,
right? Like let me calculate the agreement between the LLM. And the human. And the
human. It seems like a plausible metric. It seems like, oh, that sounds reasonable. Okay,
like agreement, sure. The problem with agreement is, hopefully if your system is not janky, errors are kind of rare, they're happening out at the tail. So this human handoff error is not happening every single time. It's maybe
happening 15% of the time or 10% of the time, right? Got it. And so
if something is happening 10% of the time, How can you agree with it? So
it's like if something, if a system is saying something is failing 10% of the
time, you can agree with it 90% of the time by just saying it never
fails. You'll be in 90% agreement. And 90% agreement seems
really good on paper. You go into like a stakeholder meeting. It's like, yeah, I
have a judge, you know, 90% agreement. Okay, that sounds good. No,
that sounds really bad, actually, potentially. You need to really dig into that. So as
a product manager, if you ever see the word agreement, you need to pause and
be like, hmm, let me dig into this, please. And so you
need to measure two quantities. One is, and there's different
terms, but true positive rate and true negative rate. Those are just, and there's different words for it, sensitivity, specificity, precision, recall, different words, but true positive rate, true negative rate. And so true positive rate is your ability to successfully identify the failures when the failures actually happen, and true negative rate is your ability to successfully identify when failures don't happen. And that's
a better, those two quantities are kind of better than agreement because you, they will
show you when something is off. Like, you know, and so to make this more
concrete, because it can be a lot in your head, like, oh, what am I
saying right now? Like, why isn't that? Right. And so let's go here to this
confusion. It's called a confusion matrix. It's funny that it's called a confusion matrix. Sometimes
it causes confusion, but hopefully today it won't cause confusion. What you have here is
like, okay, in this column, you have the human label. Okay. True or false,
false and true. And then in this, going across here, you have the LM judge
label where the green diagonal is where they both agree. Yeah. Okay, because this is
like 100 traces we have. So when the human says it's false, the
LLM judge agrees with it, okay, like, you know, 73 times. But then when the human says it's false, the LLM judge thinks there is an error 18 times. Interesting. There are different kinds of
errors. And this is what I'm talking about here. You don't want to just go
out in agreement. You want to know what the true positive rate, true negative rate
is. Now, how do you know what a good true positive, true negative rate is?
There is no magic bullet there. That's a business decision. Like, what level of judge quality is okay for you? In the most basic case, you just need to do a sanity check. Like, does it make sense? Okay. Like, you know, does it seem okay? Calculate the true
positive rate, calculate the true negative rate. Is one of them like really bad? Okay,
then maybe you don't want to use that. Is it really low? Or, you know,
just look at the confusion matrix and do whatever, you know, and you can use
a spreadsheet and say, hmm, is this okay? Like, am I okay with this kind
of error? You know, give yourself an intuition. Oftentimes, I would
say for most people who aren't used to true positive rate and true negative rate, it takes some time for it to click. Yeah. Even I have to think
about it sometimes, honestly, even I've been doing this for years just to like ground
myself. I mean, I think the confusion matrix is actually way more clear than the
percentages. I mean, yeah. I think there's 18 marked as true when it's false. Yeah. Yeah. So where the human label is false, out of these 91 times, you have 18 of these 91 times where the judge flags this specific error. Is that okay? So basically 18 times it actually did successfully hand off to the human support, but the LLM thinks it did not, or there were too many turns or something. Yeah, yeah. 18 times the LLM thinks there is an error when there's not. Is that
okay? And so different situations you might be... Like the false
positives are not as expensive as the false negatives. You know, so like you might
be okay with catching things, like catching more errors that don't actually exist. You
just want to make sure you do catch all of them. So then what do
you do with this 18? Like do you look back at the traces, see what
happened, and then you try to modify the prompt? Yeah, yeah. So what you do
is you can look at these like 18 and you can, you know, you can
say, okay, like what happened here? And you can iterate and you keep iterating. a
bit on the prompt. Yeah. And oftentimes it's quite straightforward.
Sometimes not as much. But one thing I did leave out here is a lot
of times in the LLM judge, you want examples. I didn't put examples here because
I just wanted to keep it simple. Once you start putting examples in the prompt,
you do have to split the data set a bit, because otherwise you can start overfitting to your data. So like if I put all of
these traces in my prompt, it would get a hundred percent because it would
know the answer exactly. Right. So like, you don't want to do that. And, and
so you want to hold aside some data to make sure you're not cheating yourself.
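On that hold-out point, a minimal sketch of splitting your labeled traces so the few-shot examples in the judge prompt don't leak into the set you measure against (the split ratio is arbitrary):

```python
import random

def split_labeled_traces(labeled, seed=42, holdout_frac=0.3):
    """labeled: list of (trace, human_label) pairs from the annotation pass."""
    rng = random.Random(seed)
    shuffled = labeled[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout = shuffled[:n_holdout]   # only for measuring judge quality
    dev = shuffled[n_holdout:]       # safe to mine for few-shot examples and prompt tweaks
    return dev, holdout
```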
And, you know, we don't have to get into all that. From a product manager perspective, the best thing you can do is just have a trigger in your mind about agreement. Just ask some clarifying questions like, okay, agreement is 90%, what is the baseline error rate? If they say 10%, you know that 90% agreement, you're like, this is potentially really bad.
Like something went wrong here. And this is like pretty common, right? For teams running
evals, they just have like an agreement score. They don't have the TPR or anything. Very, extremely common. Yeah, the reason I'm making a big deal out of it is because we just see it so much that it's worth calling out.
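To make the agreement trap concrete, here is a minimal sketch that computes agreement, true positive rate, and true negative rate from parallel human and judge labels. The example labels are hypothetical, chosen to mimic a roughly 10% base error rate:

```python
def judge_quality(human, judge):
    """human, judge: parallel lists of booleans, True = this trace has the failure."""
    tp = sum(h and j for h, j in zip(human, judge))          # failures the judge caught
    tn = sum(not h and not j for h, j in zip(human, judge))  # non-failures correctly passed
    fp = sum(not h and j for h, j in zip(human, judge))
    fn = sum(h and not j for h, j in zip(human, judge))
    return {
        "agreement": (tp + tn) / len(human),
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }

# Hypothetical: 100 traces, 10 real failures, and a lazy judge that never flags anything.
human = [True] * 10 + [False] * 90
judge = [False] * 100
print(judge_quality(human, judge))
# -> 90% agreement but a 0% true positive rate: useless despite the "high agreement".
```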
Okay, got it. All right, so now we have some judges live, and
what's the next step? You want to put these judges into production to run all the time?
So now you have this really... So let's say you have the judge, like this
human handoff score judge. That's right. And you like it enough. So now you have
this powerful tool that you can use. Number one, you can... you can set aside
some data, you can put it in CI, you can have a test. Anytime you
make a change to code or whatever, you can test how good you're doing on
this human handoff problem. But also you can run your judge in production. You can
run it on a sample or a large portion of your production traces. And you
can see where this handoff failure is happening. And you can debug it even more.
You can say, I want to find... all of the places where a handoff failure
is happening. I want to find a lot more situations where it's happening. And you
can put, you can do production monitoring of it, of problems. You can see, you
can use these judges to kind of run on a sample of traffic. You can
know like, Hey, our handoff problems happening. Yeah. You know, so on and so forth.
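A rough sketch of that production-monitoring loop, assuming hypothetical helpers fetch_recent_traces and judge_handoff (the judge sketched earlier) and a sampling rate you would tune to your traffic and judge cost:

```python
import random

SAMPLE_RATE = 0.05  # judge ~5% of production traces; tune to volume and judge cost

def monitor_handoff_failures(fetch_recent_traces, judge_handoff):
    """fetch_recent_traces() -> list of trace JSON strings; judge_handoff(trace) -> dict (see earlier sketch)."""
    traces = fetch_recent_traces()
    sampled = [t for t in traces if random.random() < SAMPLE_RATE]
    flagged = [t for t in sampled if judge_handoff(t)["handoff_failure"]]
    rate = len(flagged) / len(sampled) if sampled else 0.0
    # Report the rate on a dashboard and keep the flagged traces for debugging and error analysis.
    return {"sampled": len(sampled), "flagged": len(flagged), "failure_rate": rate}
```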
And you can build this suite of evals over time. Okay. Most of the time
when people ask me how many evals I have, it's usually under a dozen.
I don't really have that many because I'm pretty parsimonious about the evals that I
need. It depends, like, sometimes I have more than that. It depends how expensive they
are. It takes some work to maintain this stuff as well. You know, for the
LLM judges, yes; the code-based ones, not so much, because you don't have to do all this, like, human label stuff, because in the code-based stuff, there is a right answer. Yeah. And that's called a reference-based eval. And this is a
reference free eval. So depending on what kind of eval it is, it'll, you know,
there's a, there's like a total budget sort of roughly in my mind of like,
okay, how many you should have. So let's say you have like, you know, five
or six judge evals in production. And so basically in production just means that, like, this human handoff judge just randomly samples, like, out of a hundred conversations it looks at five or something. And it kind
of gives a pass fail. Yeah, it depends how many you have. Like, it depends
how many, what kind of scale you're at. You know, if you're serving, like, billions
of users a day or something, then you probably don't want to run an LLM
judge across, like, everything. You know, it really depends. Like, you can get a lot
of data from just sampling. But if you have, like, very low amount of data,
like, you're only serving, like, thousands of users a day, then just run the whole,
I don't know, just, like, score the whole thing. Yeah, I mean, it's probably not
that expensive. So it really depends. And then you have a dashboard that has basically
like TPR and TNR for each judge or something? Yeah. And so what you can
do is actually like you can bake this into a score. There are ways to
like combine these TPR, TNR, and there's like F1 score and stuff like that, that
weight them equally or whatever. You can get into this. Usually you report one score. That's probably beyond the scope here, I would have to go into a lot of data science to talk about how to do that. But usually there's one score to report. And actually, there are all these evaluators and I try to
have like one aggregate score. Yep. That is like aggregate across all of them just
to give me a sense. And then I can drill in and see, okay, what's
going on. And when do you do like human labels? Like, you know,
cause you did in the beginning with looking at a hundred traces, but like when
do you do human labels again? Oh, so always do them all the time. Do
it, like revisit it. It depends on the dynamics of the system. So it depends,
like anytime there's like big changes, I'll definitely do it again. And I'll also do
it on a regular cadence. Let's say like once a week, once a month. It
depends like how fast the systems are changing and what the scale of the system
is. But I'll do it like on both the cadence and also, and then you
get better every time, but also you can build tools that help you do this.
So one of the things that we talk about in our course is okay, we
use, like, Braintrust. We did this here, we did this in a spreadsheet or whatever. For Nurture Boss, what they ended up doing, and I took some screenshots and I
put it in my blog. So let me just share that with you. So they
actually built their own annotation tool because it's so valuable of a process. Probably the
most valuable part of the evals process is this annotation and counting. Even if that's all you do, you don't build any judge, you don't do any eval, you don't do whatever, you can get insane value by just doing that. And that's the one part that everyone skips. They try to go directly to, whatever, the fancy stuff. Yeah. And so this is a screenshot of the tool they built for themselves. And they built this in less than four
hours. Because this is the type of thing that AI is really good at. Helping
you build. So it's like, okay, you can see this is a trace viewer. You
have these selectors for different channels. You have... This is their interface. The
system prompt is hidden by default. They can just add notes here. And then they
had... They baked it into their tool where it just did the axial coding for them.
You see it just tried to do it for them and it gave them a
count. They loved it. There's a video in this blog post here. This
is Jacob. And he's... Like, super happy. Yeah, he looks happy now, yeah. He
talks about how, like, he did this and how, like, the impact that it had
on his... On NurtureBoss. So... Okay. Yeah, like, you can get really
fast. So, like, you know, with these, like, your own tools, it gets ridiculously fast.
And it's not painful at all. Yeah. But basically... Yeah. I mean, but basically you should still, like, before I make a major update to a prompt or something, I just manually look at, you know, the traces and just make sure everything makes sense, right, before I ship anything. Yeah. I mean, you don't have to do it on every single prompt update. You don't have to manually look at traces. You can just run the evals that you have and do it that way. Just
make sure you're looking at your traces every so often. Yeah, because it's like... It
might be mysterious, like, oh, like, how often and how many traces? So we tell
people, look at 100 traces minimum. And the reason we say that, that is not
a magic number. We always find that if we don't give a number, people don't
start. And then when we give a number, when people get into, like, let's say
20 or 30 traces, they keep going until they... So we tell people,
like, the term is called theoretical saturation. And that just means you keep doing the
activity until you... It's like a diminishing returns. Yeah, until you aren't learning
anything new. Got it. So we find that people, once they start this, they kind
of get addicted to it and they find it so valuable that they just do
it. So just keep in mind like 100 traces as a goal. Okay, this is
actually a great conversation. Maybe I have to take down the video of Aman because
this is actually a great conversation. Because, so, first of all, the TL;DR is that the traces, looking at the actual conversations of whatever AI product you have, is the most valuable thing. And kind of like counting and labeling that, right? Yes. And like,
okay, so let's just wrap up by dispelling some myths, okay? Yeah. So I'm going
to put a statement out there and maybe you can tell me why it's right
or wrong. Okay. So one thing that I... that I thought was right was like,
you know, you want to do an eval for a new product and then you
get your team together and then you're like, oh, you know, like, should we do
helpfulness or should we do toxicity or should we do, like, what should we do?
And what is the right criteria? Like, what is good toxicity and bad toxicity? But
that doesn't seem to be right based on what we just talked about. Yeah, that's
right. So a lot of people go straight to helpfulness, toxicity score. Yeah. It's a
very appealing idea. A lot of vendors, they sell that. They're like, hey, don't worry
about evals. You just plug and play this tool. We got you. You don't have
to worry about it. Just like push this button and then we'll give you a
dashboard. Don't worry. The fundamental problem is generic prompts and those generic things
usually don't match to the most important problems that are actually occurring in your
application. They're super generic and they lead you astray.
And they actually waste your time because you spend a lot of mental energy looking
at those metrics and looking at the dashboards and talking about the dashboards and having
meetings about the dashboard. And all of that could have been directed towards real problems
that are actually happening. Now, there is a right way to use generic metrics. There's
like an advanced Jedi trick that you can do. Once
you have learned error analysis, it will make sense automatically. And what you can do
is you can take your hallucination score. You can score this generic hallucination score on
all your traces and you can sort the traces by the hallucination score. Okay. And
you can see, you can do error analysis and see what the traces with the top hallucination scores look like. Okay, you can start doing smart sampling with these different generic scores. You can use all these generic scores as a sampling mechanism to
see like, is there anything interesting there? And what you'll find is sometimes there is
interesting stuff. Sometimes it's not quite like hallucination, but something else. And you can kind
of see if any of these scores are helpful, but you shouldn't just report the
scores. You should never report the scores as is. Probably shouldn't use the scores, but
you can use them as meta tools. Got it. Okay. But like it was way more important to identify the problems with your product. Yes. Okay. Then another thing is
like, and maybe this is more a question is, like how much of the stuff
that we just walked through should we do before we even launched a product? You
know, like, like, should we try to have like a bunch of judges set up
and like, you know, do a bunch of synthetic stuff? Like how, how much, because
once you launch, you actually get real signal, right? I mean, yeah. How much
of the stuff should we do? Yeah. I wouldn't get carried away with evals, especially
in the beginning. I would definitely look at lots of data, and looking at data, um, includes using it yourself. Okay, I mean, if you're building a tool for yourself, like, you are an n-equals-one user, so you don't need to... just use it yourself, and, you know, you're doing error analysis just by being alive and using your tool. Like, if you're actually using it, it's fine, you don't need to do all this stuff. It's like when it kind of gets beyond the scale of your comprehension, when it starts to, you know, there's like lots of users or lots of things going on. Different
use cases. Yeah. Then you might think about, okay, then you can see like where
that data might be helpful. Or maybe, you know, you can roll out to like
5% of users and, you know, like maybe they get a shitty experience, but then
you can start getting real data to improve. Yeah. Then nothing beats real data. Got
it. And like the Likert stuff, is it just completely useless? Or like, are you
dogmatic about it? The one to five stuff? Should you stay away from it? I
would say I haven't seen... You have to be extremely disciplined to use it
correctly. Okay. You have to have very clear rubrics. You have to make sure
everyone is calibrated on that rubric. And it usually doesn't go well. And
for most companies, it adds tons of complexity. I would say it's exponential complexity
relative to binary scores. And so I just haven't seen it done in most cases
correctly. There's some rare exceptions where it does work. Okay, but it's usually, like, you know, when I press teams to say, hey, can we just make this a binary score, like, is there a point where this is good enough versus not good enough, we're able to do it. But dude, then where does this stuff come from? Like, why do teams keep doing this stuff? Yeah, because it's kind of an appealing idea, right? Like, we've all been graded on, like, we have a grading system from school, A through F. You know, nothing is black or white. Like, we want to have this high-fidelity sort of assessment.
But the problem is like, what do you even do with this high fidelity assessment?
Yeah, it just makes you feel like false precision, right? It's like, you know, I
got a three versus a four. Like what does that even mean? Like humans can't
even tell a difference between a three and a four. Yeah, yeah, yeah. Most people
can't. And it gets lost in the sauce. And, you know, it's just like it's
already complicated enough. You need to really reduce complexity in this whole thing and be
pragmatic about it. All right, dude, this is a super awesome conversation. So I guess
let's go back to the Twitter debate. Do evals matter? It does matter if you do them properly, if you're actually solving real problems. Yeah, that's a really good question. Do evals matter? I would say evals don't exist in a silo. If you just try to eval-max and get carried away with
evals, it will probably hurt you. What you want to do is definitely ground yourself
in the data analysis, in the looking at the data part. And like, you know,
everyone says, look at your data. I think it's hard to know what that means.
And what we went through today actually shows you what it means and hopefully demystify
what it is to look at data. But it should be, it's like a very
tightly coupled with evals. And I'd say that it is evals, this looking at data and counting, because you can't do evals without it. Yeah,
it's not like the super sexy part of it, but yeah, it's the most important
part of it. Yes. Yeah, got it. Makes sense. All right, dude. All right, man.
I think you convinced me to take your course now. So can you talk about your course, when you're going to teach it, and if you have a discount
for folks. We are teaching a course on evals where we walk you through the
end-to-end detailed process on how to do evals correctly. We go into subjects like,
okay, how do you evaluate your rag systems? How do you evaluate retrieval? How do
you evaluate agents? How do you deal with all kinds of edge cases that you
might encounter? How do you do this effectively? How do you actually read a trace
and save yourself from all the complexities that might happen? How do you get through
this effort? And we've taught over 2000 students, including lots of people
from Google, OpenAI, things like that. You know, the big labs are really interested in
this because, you know, they focused on foundation model benchmarks, but we're talking about application
specific evals. Like if you're building an application, what is that? And that eval is
very different. Yeah. And so I teach the course with Shreya Shankar. Shreya Shankar has
been writing about evals as well for years now, and has been doing a lot
of research in the space. So we both have a machine learning and data science
background. as well as software engineering background. You know, and the course is four weeks
long. We give students lots of resources. So we have over nine hours of office
hours. Yeah. We give students an AI evals
assistant. So it's like everything that we've ever said about evals, publicly, in the course,
blog posts, talks, papers, you name it. We've put that in an AI and we
give that to you as like an assistant as well. So, you know, it's a,
it's a modern course. And you got to have evals on top of that too.
Yeah. And we're doing evals. Yeah. We are doing evals on top of that. This
is the first time, this is the first cohort we're doing it for. So the
next one coming up in October. And so we also give people a 160-page book on evals that they can take with them. So there's a lot of resources. There
is a, it's a good community and we're offering Peter's community 35% off. Awesome, dude.
So please use the link in the description. Awesome. Yeah, dude, I personally
learned a lot from this. I need to reevaluate how I do evals. So yeah,
I hope to see folks there. Definitely want to take the course in October. Thanks
so much, Hamel, for sharing your knowledge. And keep dropping knowledge, man, like on social
media. Yeah, thank you so much. Thanks for having me.