
AI Evaluations Clearly Explained in 50 Minutes (Real Example) | Hamel Husain

By Peter Yang

Summary

Key takeaways

  • **Directly analyze traces for potent insights**: Instead of relying solely on automated metrics, directly analyzing around 100 AI conversation traces and annotating issues provides more actionable insights into specific failures and user pain points. [04:41], [08:46]
  • **Spreadsheets simplify complex AI evaluation**: A simple spreadsheet can be used to organize annotated trace data, categorize issues, and analyze findings using pivot tables, making the evaluation process accessible and understandable. [09:46], [14:14]
  • **Binary pass/fail beats 1-5 scores for AI judges**: When using an LLM as a judge, binary pass/fail outputs are more effective than 1-5 scores, as they reduce complexity and avoid the ambiguity of average scores that are difficult to act upon. [24:44]
  • **Beware of 'agreement' metrics in AI evals**: High agreement between an AI judge and human labels can be misleading if the error rate is low; it's crucial to examine true positive and true negative rates instead. [28:52]
  • **Continuous evals in production are key**: After developing reliable judges, implement them in production to continuously monitor AI performance on live traffic, allowing for ongoing debugging and improvement. [35:54]
  • **Focus on real problems, not generic metrics**: Generic metrics like helpfulness or toxicity scores can be misleading; prioritize identifying and addressing the specific, real-world problems occurring in your application's data. [44:02]

Topics Covered

  • Raw AI Eval Scores Are Meaningless and Unactionable
  • Iterative Manual Review: Improving AI Evaluation Through Practice
  • Manual Traces Provide Deeper AI Insights Than Automated Scores
  • The most valuable AI eval process is annotation, not fancy tools
  • Why Likert Scales (1-5) for AI Evals are Often a Bad Idea

Full Transcript

this looking at data and counting, you can get insane value by just doing that. And

that's the one part that everyone skips. I'm gonna use a spreadsheet to drive home

the fact that this process can be dead simple and you can get immense value

out of it. When you see an average score of 3.2 versus 3.7, no one

really knows what the hell that means. It's not really actionable, honestly. They're like, oh,

it's like getting better. Honestly, like nobody really knows whether it's getting better or not.

So as a product manager, if you ever see the word agreement, you need to

pause and be like, hmm, let me dig into this, please. If people don't trust

your evals, they won't even trust you. You're done. Okay, welcome everyone. My

guest today is Hamel. Hamel has trained over 2,000 PMs and engineers from companies like OpenAI, Anthropic, and Google on how to run AI evaluations. He teaches the most popular course on this topic on Maven. So really excited to dig into his best practices. And I feel like I made a lot of mistakes and have a lot of assumptions that Hamel can help dispel. So welcome. Really happy to be

here. Excited to talk about evals. All right. So there's been a lot of online

debate about are evals valuable or not, blah, blah, blah, on Twitter. And let's just

make this really practical. Let's talk about evals for a real product. Do you have

a product example that you want to talk about throughout? Yeah. I do, yes. So

here's a company that I've been working with. Let me share my screen. Sure. So

just to set the stage, Nurture Boss is an AI-powered property management assistant.

And so it's a really interesting use case because it's actually one of the best

teaching examples. And I love using it because it's messy and there's enough

complexity to where it's realistic. And so the question comes like, oh, okay, like how

do you do eval? So a lot of times... When we talk about evals, we

can show toy examples, but sometimes those are oversimplified and it's hard for people to

generalize that. Like, how am I going to do that for my app? My app

is complex. I have other things going on. Well, that's what we're going to show

today. Actually, we're going to look at Nurture Boss' data together and we're going to

do a minimal set of evals. We're going to do that very quickly. Yeah, that'd

be awesome. So I know on your podcast, you had Aman on already. You may

have talked about Arize. There's a lot of different solutions out there. The ones that I come across the most in practice are Arize, Braintrust, LangSmith. Those are the kind

of three popular ones. One of the observability platforms that NurtureBoss used in the beginning

was Braintrust. They actually created their own, but I'm going to show you their data

in Braintrust because that's where we have it anonymized. Trace is basically like the chat

conversations, right? With NurtureBoss? Yeah. Okay, I'll show you what a trace is. The best

way is just to look at it rather than me trying to define it. So

I'm just going to open one. So trace is a log of all of the

history of a particular interaction that your user might be having with an AI, including

all of the internal things that might be going on that the user doesn't see.

Okay, so this is an example of a trace. And so what we see here

is we have the system message. So: you are an AI assistant working as a leasing team member at the Aero. And you have different instructions on how to

interact with the resident or the prospective resident. For example,

how to deal with maintenance requests or what to do with prospective

residents and tour scheduling and all kinds of things like applications. And

there's a lot of instructions here. We don't need to read all of them. You

can scroll down. And we see here, where is the building located is

one of the questions the user is asking. And so there's a tool call made,

get communities information. We can expand it and it's pulling back some information. It's making

a tool call. The tool call returns with some information, which you can see here.

And then the assistant is giving an answer: you know, the Aero is located at this address. And then the user is saying, I'm interested in a

two-bedroom, what's available? And actually, then it stops. It just dead ends.

And there is a failure of some kind. It just wasn't surfaced to the

user. So this is not necessarily the most interesting error. This is just

something that you would see in the real world. Now, this is an example of

a trace. So when Jacob came to me, I said, okay, Jacob, this is what we're going to do. We're going to look at traces. And he's like, what do you mean, we're gonna look at traces? This seems like we're going in the wrong direction. Like, this is gonna take forever, Hamel. What are you doing? I said, just trust me. We're gonna look at about 100 traces and we're just gonna write notes about what is going wrong. The first one may be painful, but we're gonna get really good at it. By the time we get to the 10th one, we're gonna be really fast. So we got to this one and we're like, oh, something happened here. The conversation got cut off, it failed. We had to do some investigation to figure out what happened, but we just wrote a note. And so in this case,

we can make an annotation. So I would write a note saying like, hey, there

was an error that was not surfaced

to the user. Okay, just write a note. And the conversation

dead ended. The point is not to do a full root cause analysis. Just observe

what's wrong. Okay. And that's all you got to do is do this. Do it

like a hundred times. Yeah. Do this a hundred times. Now, Jacob's like, what the

hell are we doing? Why are we doing this? Like, I hired you, Hamel, for evals, and you are like a monkey looking at this data. And like, I don't understand what is going on. Like, what is the purpose of this, Hamel? It was

a very expensive consulting engagement. And we were looking at data. I said, no, trust

me. We are going to, this is going to lead somewhere. Well, I mean, you're

basically human labeling this data, right? I'm basically human labeling this data. So let's look

at a couple of more. I think it will become clear why this is valuable

really fast. Okay, so this is another trace. You are a leasing team member at

the Meadowbrick Gardens, and your name is Taylor, so it's giving it a name. Your

goals are to answer questions. Okay, so it's really the same thing. You have, you

know, you're fielding calls from residents, actually prospective residents,

because this looks like it's like a sales funnel type of mode where you're like

encouraging customers to schedule tours. So let's go to the first user message. So we

got the system message out of the way. The assistant says this call, so this

is over voice now. Okay. This call may be recorded, blah, blah, blah. And the

only thing the user says is preview program. I have no idea what that is.

Maybe it's something in the system prompt, but I personally don't recall necessarily what that

is. The assistant says, it looks like you might be interested in learning more about

community and possibly scheduling a tour. Would you like to schedule a tour? He just

says no. The assistant says, it seems like you might have a specific question or

need assistance or something related to our apartment. How can I assist you? Can I talk to a representative? Obviously, they're trying to talk to a person. Sure, I can

connect to the representative. Would you like me to do that now? Yes. Great. I'll

connect you to a representative, one moment. Would you like me to connect you? I'm already frustrated. Like, yeah, the person wants to talk to a human being and already, like, so many questions, like,

are you sure? Are you sure? Blah, blah. So, okay. Like ADU, the ones that

I do, it sounds like you might be interested. Yes. Could you clarify your query?

I just want to talk. We've all had this experience at one point or another in our lives. Yeah. Every time I call. Yeah. Yeah. This is not, you know,

I want to throw my phone out the window whenever I have this interaction with

an AI. So, okay, I understand. I'll connect to you. And finally, a tool call

is made where we're transferring the call. So we feel the pain of the user.

We do, honestly. And it's pretty clear, like, hey, so this

is a clear error. In this case, I'll write the note here just like I did for the other one. And you're not trying to, like, write solutions.

You're just trying to write, like, the note of the problem. Yeah, I'm not trying

to debug it. I'm not trying to say like why it happened. I'm not trying

to root cause a solution. I'm just journaling why it happens. So you do this.

It took us, frankly, it took us like an hour to do 100 traces. It

didn't take us long at all. Okay. But by the end of it, we knew

a lot. Like we learned a lot and we learned way more than you could

possibly learn by trying to throw any kind of automated solution at this problem.

So if you tried to come at NurtureBoss and put a hallucination score, toxicity score,

coherent score, whatever you want to call it, it would not have given us anywhere

close to this insight that we have right now. We identified very specific failures and

things that are painful. By looking at the data, we immediately could see that. Did

you use some AI to summarize the trace data that you labeled? Very good

question. So... while you shouldn't use AI to, like, do the looking for you, you should definitely look at the data, put your hands on the data. You can use AI to help you do the analysis of it. So what happens is I exported all of

these logs plus the notes into a spreadsheet. Okay. Okay. And you don't have to

use a spreadsheet. I'm going to use a spreadsheet to make it, to drive home

the fact that this process can be dead simple and you can get immense value

out of it. So, What I did is I exported like all the notes. So

in this column, column A, you have all the notes that I took. So like,

you know, one note is user was probably asking about lease terms or maybe deposit,

not about specials. And the AI was talking about specials. Or another one was the

AI offered virtual tours, but there is no virtual tour. Okay. You know, so all

these are different. And there's the disparate messaging one that we just saw. And so

all these different notes that it took. So what you can do is you can

take these notes and you can do something really stupid, simple, is you

can dump them into like Claude or ChatGPT or whatever. And I say, okay, please

analyze the following CSV file. There's a metadata field, which is a message field called

ZNote that contains open codes. So this is like some terminology here. Open codes is

just a fancy word for those comments that I made. Okay. Just the notes for

analysis of LLM logs that we are conducting, please extract all of the different open

codes from the Z note field proposed five to six categories that we can create

axial codes from. So the axial code is another piece of terminology. That just means we want to group them into categories. We want to group these notes, to classify them. Okay. This like open code, axial code thing is actually some

really old technique from social sciences. And it's been also used in machine learning for

decades. And so that's why we're using this terminology as a shortcut to give to

the LLM. Because the LLM knows exactly what this means. They're like, oh, I know

exactly what you're doing. You're trying to do this technique. Got it. Okay, so it's

better than just saying categorize this stuff. It's just some shortcut technique. Yeah.
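As a rough sketch of what that open-code-to-axial-code prompt might look like outside a chat UI (the `call_llm` helper, file name, and `z_note` column below are hypothetical stand-ins for whatever model client and export you actually use):

```python
import csv

def build_axial_coding_prompt(csv_path: str) -> str:
    """Collect the open codes (free-form notes) and ask an LLM to propose axial codes."""
    with open(csv_path, newline="") as f:
        # "z_note" is a hypothetical column name for the annotation field.
        notes = [row["z_note"] for row in csv.DictReader(f) if row.get("z_note")]

    notes_block = "\n".join(f"- {note}" for note in notes)
    return (
        "We are analyzing LLM logs. The notes below are open codes written while "
        "reviewing traces. Please extract the distinct issues and propose 5-6 "
        "categories (axial codes) we can group them into.\n\n"
        f"Open codes:\n{notes_block}"
    )

# Usage sketch: send the prompt to whatever LLM client you use.
# prompt = build_axial_coding_prompt("annotated_traces.csv")
# categories = call_llm(prompt)  # call_llm is a stand-in for your model client
```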

It gives it some specificity. By saying open code, axial code, it knows what my

goal is. Because there's a lot of... when I use that terminology in this technique.

You can also say categorize, though. I mean, there's no... You can start wherever you

want. So if you want to say categorize, start with categorize. Totally fine. The point

is make progress and get value as fast as you can. I don't want to

be too prescriptive. But the point is, like, you can sort of... So, like, okay,

it kind of iterated a lot on how to open the CSV, so we can

skip that. But it gave me, like, you know, some... categories.

And I didn't necessarily like all of them. And I could have been, you know,

I kind of went back and forth a bit to refine the categories that made

sense to me. And I just like took, I wrote down some categories actually. And

here's some categories here. So I said like, okay, tour scheduling, rescheduling issue,

human handoff or transfer issue, formatting error with the output that had some formatting errors.

Yeah. Like putting markdown inside text messages, for example. Conversational

flow issues. So that's like the text thing where it's just abrupt, you know, the

flow. Making up promises not kept, like rescheduling and things like that. Or

not rescheduling, but other kinds of promises. And then there's another, I usually have another

field called none of the above, but I didn't do that here, just out of

simplicity. And so what you can do is then you can kind of go back

and forth. And what I did is, like, in the spreadsheet, you know, you can use AI. So there's AI in this Google spreadsheet. Google Sheets has AI. You can use the AI formula. It's very handy. Let me show you something: categorize the following note into one of the categories. So you're just like using the LLM to

like categorize it. And you can see I'm just categorizing these notes that I took,

like the problems that I found. I'm just putting into categories. Right. I didn't know

that Google Sheets has this AI feature already. Okay. They're usually very slow at this

stuff. It's cool. It can be slightly janky, but it's okay. It's lightweight and you

don't have to use any tools and everyone can understand how to do this. It

demystifies the whole process of what I'm doing. Because if I open some code, you

might think, oh, you need to be a software engineer to do this or something.

And no, you don't. We can use English all the way. So now you have

categorized all of these things. And now we can use one of my favorite tools,

PivotTables. So pivot tables, if you haven't seen them before,

it's really handy in spreadsheets. So you can just count how many times each of

these categories occurred. And we can see just at a high level,

hey, oh, okay, this conversational flow issue is happening quite a bit. We

also have this human handoff transfer issue. And you can kind of get a sense

right away what the problem is. Now, it is likely that before you even get

to this count, you already know. You've looked at 100 traces, you know in your

gut. You're like, okay, you know what? I need to fix this human transfer thing

right now. You're like, I don't even need to do a data analysis, but it's

quick. This takes less than a few minutes, honestly. And it just gives

you some grounding and lets you see, you go from this massive, I don't know

what's going on, to okay, like I have some idea about the problems that I

have. Okay. And you have some starting point. Does that make sense? Yeah.
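For anyone reproducing the categorize-and-count step outside Google Sheets, here is a minimal sketch in Python; the category list paraphrases the ones discussed above, and `call_llm` plus the file and column names are hypothetical stand-ins:

```python
import pandas as pd

CATEGORIES = [
    "tour scheduling / rescheduling issue",
    "human handoff or transfer issue",
    "formatting error",
    "conversational flow issue",
    "promises not kept",
    "none of the above",
]

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model client you use (or the AI() formula in Sheets)."""
    raise NotImplementedError

def categorize_note(note: str) -> str:
    """Map one open-coded note to a single category, like the spreadsheet formula does."""
    prompt = (
        "Categorize the following note into exactly one of these categories: "
        + "; ".join(CATEGORIES)
        + f"\n\nNote: {note}\n\nReturn only the category name."
    )
    return call_llm(prompt).strip()

# Usage sketch (column names are hypothetical):
# df = pd.read_csv("annotated_traces.csv")
# df["category"] = df["z_note"].apply(categorize_note)
# print(df["category"].value_counts())  # the pivot-table step: count each failure category
```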

Yeah, this is brilliant, dude. But let me ask you this question. Yeah. So in

this case, the agent was already live in production before he started doing all this

stuff, right? But that's not the ideal approach, right? Like ideally you want to do

some of this before or maybe dog food with your team first or like some

users try it. Absolutely. How do you? Yeah. Yeah. So the best case scenario is

you dogfood it with, like, you know, some friendly customers, you dogfood it

yourself. That's going to be really good. You can also synthetically generate inputs into your

system. So basically what you can do is like, think about plausible user

questions they may have in, you know, and try to come up with some hypothesis

of where your system will break. Yeah. But there's a certain way to do that.

You don't want to just, ask an LLM, hey, come up with plausible questions for

a prospective tenant that might be looking to rent an apartment. The right way to

do it is to come up with some categories. So it's to come up with

some dimensions. And what I mean by that is, let's say, let's think about what

good dimensions might be for Nurture Boss. So for Nurture Boss, you might have like resident, you can have, okay... Like, type of customer maybe? Apartment class maybe? So like luxury, standard, something else.

I don't know what that is. I'm not that creative to think about on the

fly, but what was the thing you said? Just type of customer, right? You can

get the tenant manager versus the actual resident, right? Depending on who you're talking to.

Yeah. Resident, manager. Yeah.

And you can think of, put your product hat on. So like, by the way,

this whole process is very product oriented. Like, so,

you know, when you read the trace, it's not so much about engineering. It's putting

your product kind of hat on and saying, is this the experience that you want

your user to have? Does this actually make sense? When it comes to like these

dimensions I'm talking about, you kind of putting your product hat on and saying, okay,

what are the different personas? What are different categories, different dimensions that you may want

to consider? Yeah. And then what you want to do is like, you know, you

would take the kind of combination of these, you know, so like luxury

resident, luxury for resident, luxury for manager, standard for resident, standard for manager. And you

would feed those, we call it dimensions, into an LLM, say, okay, these are the

different dimensions. For this, for every one of these dimensions, generate plausible user queries.
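A minimal sketch of that dimension-based generation, under the assumption of made-up dimension values and a hypothetical `call_llm` client:

```python
from itertools import product

# Hypothetical dimensions for a leasing assistant; adjust to your product.
DIMENSIONS = {
    "persona": ["prospective resident", "current resident", "property manager"],
    "apartment_class": ["luxury", "standard"],
    "intent": ["schedule a tour", "maintenance request", "lease question"],
}

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model client you use."""
    raise NotImplementedError

def generate_synthetic_queries(n_per_combo: int = 3) -> dict[tuple, str]:
    """For every combination of dimension values, ask the LLM for plausible user queries."""
    queries = {}
    for combo in product(*DIMENSIONS.values()):
        labels = dict(zip(DIMENSIONS.keys(), combo))
        prompt = (
            "You are simulating users of an AI leasing assistant. "
            f"User profile: {labels}. "
            f"Write {n_per_combo} plausible, distinct messages this user might send."
        )
        queries[combo] = call_llm(prompt)
    return queries
```

Iterating over the cross-product is what keeps the inputs from being homogeneous: each combination forces the generator into a different corner of the input space.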

Got it. That's like way better than just asking an LLM. Yeah. Just asking. Because

if you just ask an LLM, it'll be a lot more homogeneous. And you don't

want to have homogeneous inputs. You want to explore the space of inputs. And so

these brainstorming dimensions helps you to kind of make sure you're exploring

the space, being thoughtful about exploring the space. Does that make sense? Got it. Yeah,

you want to find all the edge cases. Yeah, got it. And that's just scratching

the surface of this. There's a lot more to generating synthetic data that we could

probably get into here, but there's like more advanced ways to generate synthetic data or

things to think about in terms of being more adversarial, how to come up with

hypotheses to help you break your system, so on and so forth. But the short answer is: use your system, and you can use LLMs to basically pretend to be synthetic users. That makes sense. Okay, let's go back to the categories you

have on the left. So now you have these, I mean, they're not, they're basically

problems, right? Problems with the product. And now what we're going to do with this

stuff, is this how we come up with our criteria or should we just start

fixing the issues or? Very good question. Very good question. So

when we first did this, what the top issue was, date handling. It's like, and

it was very clear, you know, the user wanted to schedule an appointment.

And it was always getting the date wrong. And it was very clear, like, oh,

this is so dumb. Like, the LLM doesn't know what today's date is. We just, like, forgot. They, like, forgot to put that in the prompt. It's like, oh, do you really need an eval for that? Maybe not. Like, you know, you don't want to eval-max. You don't, like, necessarily want to do evals because, like, it feels

good. The whole purpose of anything is to make your product better. and to iterate

and move fast. And so for that one, we're like, well, let's see.

Let's just give it what today's date is. And that problem basically went away,

unsurprisingly. So we didn't really need an eval from that. Other things that are more

subjective are, so it's a cost-benefit trade-off. So there's two kinds of

evals. One is LLM as a judge, which we are going to build together. Another

one is code-based eval, where you don't really need an LLM as a judge. It's

some kind of assertion that you can make. And that's very cheap compared to LLM

as a judge. And so for the date one, we actually did

a code-based eval, which is like we had some test cases and we're able to

test, like, does the date that's coming out equal to the expected date? And that

was very cheap. We didn't have to do LLM as a judge. Got it. But

that was really easy to fix. Now, something like, hey, you should be handing off

to a human. Okay, that one, we don't know exactly. We

did have rules for that already, but the LLM is struggling, and we don't really

know how we're going to do it. That seems like a really good use case

for an LLM judge. And also, the eval is going to provide tons of value,

even though it's expensive, more expensive, because we're going to iterate against it a lot

to make the product better. Got it. And so we say, okay, like let's, you know, okay,

we need an LLM judge for the human handoff. Let's go ahead and do it.
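As an aside on the code-based (reference-based) eval just described for the date issue, here is a minimal sketch of what such an assertion-style test might look like; the function name and test cases are hypothetical, not Nurture Boss's actual tests:

```python
from datetime import date, timedelta

def resolve_requested_date(utterance: str, today: date) -> date:
    """Stand-in for however your system turns 'tomorrow at 2pm' into a concrete date."""
    raise NotImplementedError

# Each case pins an input utterance to the date we expect, relative to a fixed 'today'.
TODAY = date(2024, 6, 1)
CASES = [
    ("Can I tour tomorrow afternoon?", TODAY + timedelta(days=1)),
    ("Do you have anything available today?", TODAY),
]

def test_date_handling():
    for utterance, expected in CASES:
        got = resolve_requested_date(utterance, today=TODAY)
        # A plain assertion is the whole eval: no LLM judge needed.
        assert got == expected, f"{utterance!r}: expected {expected}, got {got}"
```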

It's an important problem also. Yeah. Yeah. The main problem that people encounter when doing

LLM as a judge is they just prompt in another LLM to kind of judge

what your LLM did. And they say, is it good? Now that should be suspicious,

right? Like, why is that okay? Like, Why are you just going to tell another

LLM to tell me if it's okay? Like, I don't know. And you would be

right to be very suspicious of that. And there is an answer to that. And

the answer is you can measure the judge. So it's like, it's a meta evaluation

problem is like, you need to measure if the judge is good. It's very important.

You don't want to skip that step, because if you have a bunch of judges, LLM judges, floating around and you're reporting them on the dashboard, your stakeholders are looking at them and everyone's like, oh, the judge, you know... no one's going to understand what you're using in the LLM judge anyways. They're just going

to look at your metric. And when there becomes a significant, there's enough gap between

reality and your metrics, no one's going to trust you anymore. You want to avoid

that. You want, because like, if people don't trust your evals, they won't even trust

you. You're done. Yeah, exactly. Yeah, you can't make it have like perfect score for

something, but it's actually totally wrong, right? Yeah, yeah. Yeah. Yeah. And so, okay, so

how do you go about this? Well, thankfully, when you're doing this axial coding stuff,

right, you actually have identified really good test cases or some reasonable test cases that

you can use that are labeled. You already labeled them as a human. So you

have some ground truth. And these are things that you can use to calibrate an

LLM judge to see if you can create a judge that is good enough. Okay?

Okay. So that's what- Good enough is just, like, how closely it matches the human labels, right? That's kind of what- How closely does it match the human labels? Yeah. So that's what we're going to do next is we're

going to think about, first, we're going to write the prompt. And this is like

the dumbest prompt. I'm not saying like, this is a good prompt. This is just

a prompt. And the point is not to like have a prompt recipe or some

like magic thing. It's just to iterate. Okay. And you want to just specify- kind

of the requirements of like, in this case, what is a good handoff? Or when

should you be doing a handoff and when should it happen? And so, you know,

you are scoring a leasing assistant to determine if there was a handoff failure. There

should be handoff if any of these things occur, you know, or sorry,

there is a handoff failure if any of these things occur, like if a human requested to be handed off but you just ignored them or looped through it too many times, that's a failure. Got it. Yeah. And there's a list of these seven

failures. You don't have to read all of them, but you get the idea. And

we also say when there's not a failure, just out of completeness. And we say

we want to return exactly true or false binary. So

it's worth lingering on this for a moment. So it's very important for an LLM

judge that you output a binary score. 99% of the time, you don't want to output like a Likert scale or a score of one to five or some kind of score, because that introduces tremendous complexity. Yeah. You know, LLMs are not good at continuous scores, number one. Number two is the

output is not going to be clear. When you see an average score of 3.2

versus 3.7, no one really knows what the hell that means. Yeah, yeah. And

it's not really actionable, honestly. They're like, oh, it's like getting better. Honestly, like nobody

really knows whether it's getting better or not. I found that when you try to

hide behind a score, you're not really making a decision. And like what you're trying

to, the frame here is, is this feature good enough to ship? Yes or

no? Make a decision. What is the line? There is a line somewhere inside. Like

there has to be. Right. And so we don't want a score, wherever possible. You want to simplify it. The score just makes it too complex. Yeah. It's like a

fake science, you know, it's like, you know, false precision, right? Like who knows? Yeah,

it can be. Yeah, it can be. There's some cases like there's some evals where

you want a score when you get, when you go very narrowly into certain aspects

of things like, you know, when you try to have evals for retrieval, search and

things like that, like different components, then the scores make sense. But for this, like

LLM as a judge case, in the overall sense, like, no. And why no explanations, though? Like, why don't you want it to explain why it marked it that way? So, explanations are actually

usually good. So, you know, what we teach is you want explanations and

then a score. But this is like a spreadsheet. Okay, yeah. We just want it

to be tractable. If I try to give an explanation, then it would like, and

you know, the model here in the spreadsheet isn't the most powerful one they give

you. So it was going all over the place. So I was just trying to

simplify it here, but yeah, explanation can be good. It can help you debug the

AI model and you want to give a structured output. You want like a few,

you want to usually output two fields. Like you want it to output like an

explanation and a binary score. And then you can use the explanation to kind of help you debug what went wrong with the LLM's thinking. Oh, so you're actually going to do this LLM judging using the Google Sheets model. Yes. I'm going to stay

in the Google sheet because our goal is to demystify everything and to make it

very clear, like what is actually happening by using a spreadsheet all the way down.
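Outside the spreadsheet, the same judge can be a small function; a minimal sketch, with an abbreviated rule list and a hypothetical `call_llm` stand-in for whatever model client you use:

```python
import json

JUDGE_PROMPT = """You are scoring an AI leasing assistant for HANDOFF FAILURES.
A handoff failure occurs if, for example, the user asks for a human and the
assistant ignores them, loops, or keeps asking instead of transferring.
(Abbreviated; the real prompt lists every failure condition and the non-failure cases.)

Return JSON with exactly two fields:
  "explanation": a short reason,
  "handoff_failure": true or false.

Trace:
{trace}
"""

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model client you use."""
    raise NotImplementedError

def judge_handoff(trace_json: str) -> dict:
    """Run the scoped judge on one trace and parse the structured, binary verdict."""
    raw = call_llm(JUDGE_PROMPT.format(trace=trace_json))
    verdict = json.loads(raw)
    assert isinstance(verdict["handoff_failure"], bool)
    return verdict  # e.g. {"explanation": "...", "handoff_failure": True}
```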

So, uh, okay. So we have this LLM judge prompt, and now we can go back to those traces. So we have, like, this is the original trace in column A, this is in JSON format. Um, and then we have sort of this LLM as

a judge. So this is like for one error. So you want them to be

scoped usually. So we have this LLM as judge just for the handoff error. And

we have the formula, assess this LLM trace according to these rules. Okay. And

then it's the LLM judge prompt that I just showed you. That's all it is.

And then it's giving us true or false. Okay. So we have true or false

here. And then this column H is, is the LLM judge

handoff, like what it said, the binary score, true or false, is there an error?

And then this is the human label, column G, there's an error. So we have

these two labels, and we already did column G before. That kind of happened for free because of the axial coding and open coding process. Oh, so

you have like another AI? No, the human label is like the notes, right? So

you have like an AI. Yeah, those are like the results of the notes. Okay.

You know, like basically I said, hey, if the axial code is human handoff or

transfer issue, I just said it's true. Got it. Got it. Okay. And so you

can then see how aligned the LLM is to the human. That's the main thing

that you want to test. Now, one thing you want to stay away from. So

a lot of people go to just calculate agreement. Intuitively, it makes sense,

right? Like let me calculate the agreement between the LLM. And the human. And the

human. It seems like a plausible metric. It seems like, oh, that sounds reasonable. Okay,

like agreement, sure. The problem with agreement is most errors, hopefully, if your system is not janky, are kind of happening at the tail. So this human handoff error is not happening every single time. It's maybe

happening 15% of the time or 10% of the time, right? Got it. And so

if something is happening 10% of the time, How can you agree with it? So

it's like if something, if a system is saying something is failing 10% of the

time, you can agree with it 90% of the time by just saying it never

fails. You'll be in 90% agreement. And 90% agreement seems

really good on paper. You go into like a stakeholder meeting. It's like, yeah, I

have a judge, you know, 90% agreement. Okay, that sounds good. No,

that sounds really bad, actually, potentially. You need to really dig into that. So as

a product manager, if you ever see the word agreement, you need to pause and

be like, hmm, let me dig into this, please. And so you

need to measure two quantities. One is, and there's different

terms, but true positive rate and true negative rate. So those are just, and there's

different words for it. Sensitivity, specificity, precision, recall, different words, but true

positive rate, true negative rate. And so true positive rate is what is your ability

to successfully identify the failures? Like when the failures

actually happen and what's your ability to successfully identify when failures don't happen? And that's

a better, those two quantities are kind of better than agreement because you, they will

show you when something is off. Like, you know, and so to make this more

concrete, because it can be a lot in your head, like, oh, what am I

saying right now? Like, why isn't that? Right. And so let's go here to this

confusion. It's called a confusion matrix. It's funny that it's called a confusion matrix. Sometimes

it causes confusion, but hopefully today it won't cause confusion. What you have here is

like, okay, in this column, you have the human label. Okay. True or false,

false and true. And then in this, going across here, you have the LM judge

label where the green diagonal is where they both agree. Yeah. Okay, because this is

like 100 traces we have. So when the human says it's false, the LLM judge agrees with it, okay, like, you know, 73 times. But then when the human says it's false, the LLM judge thinks there is an error 18 times. Interesting. There are different kinds of

errors. And this is what I'm talking about here. You don't want to just go

out in agreement. You want to know what the true positive rate, true negative rate

is. Now, how do you know what a good true positive, true negative rate is?

There is no magic bullet there. That's a business decision. Like, what level of judge is okay for you. In the most basic case, you just need to do a sanity check. Like, does it make sense? Okay. Like, you know, does it seem okay? Calculate the true

positive rate, calculate the true negative rate. Is one of them like really bad? Okay,

then maybe you don't want to use that. Is it really low? Or, you know,

just look at the confusion matrix and do whatever, you know, and you can use

a spreadsheet and say, hmm, is this okay? Like, am I okay with this kind

of error? You know, give yourself an intuition. Oftentimes, I would

say, for most people who aren't used to true positive rate and true negative rate, it takes some time for it to click. Yeah. Even I have to think

about it sometimes, honestly, even I've been doing this for years just to like ground

myself. I mean, I think the confusion matrix is actually way more clear than the

percentages. I mean, yeah. I think there's 18 that were marked as true when it's false. Yeah. Yeah. So where it's false, you have, like, you know, um, out of these 91 times, you have 18 of these 91 times where it has this specific error. Is that okay? So basically 18 times it actually did successfully hand off to the human support, but the LLM thinks it did not, or there were too many turns or something. Yeah, yeah. 18 times the LLM thinks there is an error when there's not. Is that

okay? And so different situations you might be... Like the false

positives are not as expensive as the false negatives. You know, so like you might

be okay with catching things, like catching more errors that don't actually exist. You

just want to make sure you do catch all of them. So then what do

you do with this 18? Like do you look back at the traces, see what

happened, and then you try to modify the prompt? Yeah, yeah. So what you do

is you can look at these like 18 and you can, you know, you can

say, okay, like what happened here? And you can iterate and you keep iterating a bit on the prompt. Yeah. And oftentimes it's quite straightforward.

Sometimes not as much. But one thing I did leave out here is a lot

of times in the LMS judge, you want examples. I didn't put examples here because

I just wanted to keep it simple. Once you start putting examples in the prompt,

you do have to split the data set a bit. And you can't

just, you can start overfitting to your data. So like if I put all of

these traces in my prompt, it would get a hundred percent because it would

know the answer exactly. Right. So like, you don't want to do that. And, and

so you want to hold aside some data to make sure you're not cheating yourself.
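A minimal sketch of that hold-aside step, assuming hypothetical field names: the labeled traces are split so the few-shot examples placed in the judge prompt never overlap with the traces used to grade the judge.

```python
import random

def split_labeled_traces(labeled: list[dict], holdout_frac: float = 0.4, seed: int = 0):
    """labeled: [{"trace": ..., "human_label": bool}, ...] from the annotation pass.
    Returns (prompt_examples, holdout) so the judge is never graded on traces it saw."""
    rng = random.Random(seed)
    shuffled = labeled[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout, prompt_examples = shuffled[:n_holdout], shuffled[n_holdout:]
    return prompt_examples, holdout

# prompt_examples go into the judge prompt as few-shot examples;
# TPR/TNR are measured only on the holdout set.
```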

And, you know, so that's, so we don't have to get into all that from

a product manager perspective. The best thing you can do is, like, just have a trigger in your mind about agreement, and just ask some clarifying questions like, okay, agreement is 90%. What is the baseline error rate? If they say 10%, then with 90% agreement you're like, this is really bad.

Like something went wrong here. And this is like pretty common, right? For teams running

evals, they just have like an agreement score. They don't have the TPR or anything.

Very extremely common. Yeah, the reason I'm making a big deal out of it is because we just see it so much that it's worth calling out.
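To make the agreement trap concrete, here is a minimal, self-contained sketch that computes agreement, true positive rate, and true negative rate from paired labels; the example numbers mirror the 10% failure-rate scenario above, not Nurture Boss's actual data:

```python
def judge_quality(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge labels to human labels. True means 'there is a failure'."""
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)          # real failure, and the judge caught it
    tn = sum(not h and not j for h, j in pairs)  # no failure, and the judge agreed
    fp = sum(not h and j for h, j in pairs)      # judge flagged a failure that isn't there
    fn = sum(h and not j for h, j in pairs)      # judge missed a real failure
    return {
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }

# The trap: failures happen 10% of the time, and the judge just says "no failure" always.
human = [True] * 10 + [False] * 90
judge = [False] * 100
print(judge_quality(human, judge))
# agreement = 0.90 looks great, but true_positive_rate = 0.0: the judge catches nothing.
```

Folding the two rates into a single reported number (F1 and similar) is what comes up a bit later in the conversation.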

Okay, got it. All right, so now we have some judges live, and what's the next step? You want to put these judges into production to run all the time?

So now you have this really... So let's say you have the judge, like this

human handoff score judge. That's right. And you like it enough. So now you have

this powerful tool that you can use. Number one, you can... you can set aside

some data, you can put it in CI, you can have a test. Anytime you

make a change to code or whatever, you can test how good you're doing on

this human handoff problem. But also you can run your judge in production. You can

run it on a sample or a large portion of your production traces. And you

can see where this handoff failure is happening. And you can debug it even more.

You can say, I want to find... all of the places where a handoff failure

is happening. I want to find a lot more situations where it's happening. And you

can put, you can do production monitoring of it, of problems. You can see, you

can use these judges to kind of run on a sample of traffic. You can know, like, hey, our handoff problem is happening. Yeah. You know, so on and so forth.
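A minimal sketch of that production-monitoring loop; `fetch_recent_traces` and `judge_handoff` are hypothetical stand-ins for your own logging store and the judge built earlier:

```python
import random

def fetch_recent_traces(limit: int = 1000) -> list[str]:
    """Stand-in for pulling recent traces from your observability/logging store."""
    raise NotImplementedError

def judge_handoff(trace_json: str) -> dict:
    """Stand-in for the scoped LLM judge (returns {"handoff_failure": bool, ...})."""
    raise NotImplementedError

def monitor_handoff_failures(sample_rate: float = 0.05) -> float:
    """Run the judge on a random sample of production traffic and report the failure rate."""
    traces = fetch_recent_traces()
    sample = [t for t in traces if random.random() < sample_rate]
    verdicts = [judge_handoff(t) for t in sample]
    flagged = [t for t, v in zip(sample, verdicts) if v["handoff_failure"]]
    # The flagged traces are the ones to go read; the rate is what goes on the dashboard.
    return len(flagged) / max(len(sample), 1)
```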

And you can build this suite of evals over time. Okay. Most of the time people ask me how many evals I have. I usually have under a dozen.

I don't really have that many because I'm pretty parsimonious about the evals that I

need. It depends, like, sometimes I have more than that. It depends how expensive they

are. It takes some work to maintain this stuff as well. You know, for the

LLM judge, the code-based ones, not so much because you don't have to do all

this, like, you know, because I don't have to do too much of this human labeling stuff, because, like, in the code-based stuff, there is a right answer. Yeah. And that's, like, that's called a reference-based eval. And this is a

reference free eval. So depending on what kind of eval it is, it'll, you know,

there's a, there's like a total budget sort of roughly in my mind of like,

okay, how many you should have. So let's say you have like, you know, five

or six judge evals in production. And, and you just, like, so basically in production just means that, like, this, um, human handoff judge, it just randomly samples, like, out of a hundred conversations it looks at five or something. And it kind

of gives a pass fail. Yeah, it depends how many you have. Like, it depends

how many, what kind of scale you're at. You know, if you're serving, like, billions

of users a day or something, then you probably don't want to run an LM

judge across, like, everything. You know, it really depends. Like, you can get a lot

of data from just sampling. But if you have, like, very low amount of data,

like, you're only serving, like, thousands of users a day, then just run the whole,

I don't know, just, like, score the whole thing. Yeah, I mean, it's probably not

that expensive. So it really depends. And then you have a dashboard that has basically

like TPR and TNR for each judge or something? Yeah. And so what you can

do is actually like you can bake this into a score. There are ways to

like combine these TPR, TNR, and there's like F1 score and stuff like that, that

weight them equally or whatever. You can get into this. Usually you report one score. That's probably beyond the scope of this; I would have to go into a lot of, like, data science to talk about how to do that. But usually there's one score that I report. And actually there's, like, all these evaluators and I try to

have like one aggregate score. Yep. That is like aggregate across all of them just

to give me a sense. And then I can drill in and see, okay, what's

going on. And when do you do like human labels? Like, you know,

cause you did in the beginning with looking at a hundred traces, but like when

do you do human labels again? Oh, so always do them all the time. Do

it, like revisit it. It depends on the dynamics of the system. So it depends,

like anytime there's like big changes, I'll definitely do it again. And I'll also do

it on a regular cadence. Let's say like once a week, once a month. It

depends like how fast the systems are changing and what the scale of the system

is. But I'll do it like on both the cadence and also, and then you

get better every time, but also you can build tools that help you do this.

So one of the things that we talk about in our course is, okay, we use, like, Braintrust. We did this here. We did this in a spreadsheet or whatever. For Nurture Boss, what they ended up doing, and I took some screenshots and I

put it in my blog. So let me just share that with you. So they

actually built their own annotation tool because it's so valuable of a process. Probably the

most valuable process of evals is this, like, the annotation and counting. Even if that's all you do, you don't build any judge, you don't do any eval, you don't do whatever, you can get insane value by just doing that. And that's the one part that everyone skips. They try to go directly to, whatever, the fancy stuff. Yeah. And so this is a screenshot of the tool they built for themselves. And they built this in less than four

hours. Because this is the type of thing that AI is really good at. Helping

you build. So it's like, okay, you can see this is a trace viewer. You

have these selectors for different channels. You have... This is their interface. The

system prompt is hidden by default. They can just add notes here. And then they

had... They baked it into their tool where it just did the axial coding for them.

You see it just tried to do it for them and it gave them a

count. They loved it. There's a video in this blog post here. This

is Jacob. And he's... Like, super happy. Yeah, he looks happy now, yeah. He

talks about how, like, he did this and how, like, the impact that it had

on his... On NurtureBoss. So... Okay. Yeah, like, you can get really

fast. So, like, you know, with these, like, your own tools, it gets ridiculously fast.

And it's not painful at all. Yeah. But basically... Yeah. I mean, but basically you should still, like, before I make a major update to a prompt or something, I just manually look at, you know, the traces and, like, just make sure everything makes sense. Right. Before I ship anything. Yeah. I mean, you don't have to do it on every single prompt update. You don't have to manually look at traces. You can just run the evals that you have and do it that way. Just make sure you're looking at your traces every so often. Yeah, because it's like... It

might be mysterious, like, oh, like, how often and how many traces? So we tell

people, look at 100 traces minimum. And the reason we say that, that is not

a magic number. We always find that if we don't give a number, people don't

start. And then when we give a number, when people get into, like, let's say

20 or 30 traces, they keep going until they... So we tell people,

like, the term is called theoretical saturation. And that just means you keep doing the

activity until you... It's like a diminishing returns. Yeah, until you aren't learning

anything new. Got it. So we find that people, once they start this, they kind

of get addicted to it and they find it so valuable that they just do

it. So just keep in mind, like, 100 traces as a goal. Okay, this is

actually a great conversation. Maybe I have to take down the video of Aman because

this is actually a great conversation. Because so, first of all, so... the TLDR is that the traces, looking at the actual conversations of whatever AI product you have, is the most valuable thing. And kind of like counting and labeling that, right? Yes. And like,

okay, so let's just wrap up by dispelling some myths, okay? Yeah. So I'm going

to put a statement out there and maybe you can tell me why it's right

or wrong. Okay. So one thing that I... that I thought was right was like,

you know, you want to do an eval for a new product and then you

get your team together and then you're like, oh, you know, like, should we do

helpfulness or should we do toxicity or should we do, like, what should we do?

And what is the right criteria? Like, what is good toxicity and bad toxicity? But

that doesn't seem to be right based on what we just talked about. Yeah, that's

right. So a lot of people go straight to helpfulness, toxicity score. Yeah. It's a

very appealing idea. A lot of vendors, they sell that. They're like, hey, don't worry

about evals. You just plug and play this tool. We got you. You don't have

to worry about it. Just like push this button and then we'll give you a

dashboard. Don't worry. The fundamental problem is generic prompts and those generic things

usually don't match to the most important problems that are actually occurring in your

application. Like super generic and they lead you astray.

And they actually waste your time because you spend a lot of mental energy looking

at those metrics and looking at the dashboards and talking about the dashboards and having meetings about the dashboards. And all of that could have been directed towards real problems

that are actually happening. Now, there is a right way to use generic metrics. There's

like an advanced Jedi trick that you can do. Once

you have learned error analysis, it will make sense automatically. And what you can do

is you can take your hallucination score. You can score this generic hallucination score on

all your traces and you can sort the traces by the hallucination score. Okay. And

you can see, you can do error analysis and see, does the top hallucination score

like, okay, you can start doing like smart sampling with these different generic scores. You

can use the generic, these like all these generic scores as like sampling mechanism to

see like, is there anything interesting there? And what you'll find is sometimes there is

interesting stuff. Sometimes it's not quite like hallucination, but something else. And you can kind

of see if any of these scores are helpful, but you shouldn't just report the

scores. You should never report the scores as is. Probably shouldn't use the scores, but

you can use them as meta tools. Got it. Okay. But, like, it was way more important to identify the real problems with your product. Yes. Okay. Then another thing is

like, and maybe this is more a question is, like how much of the stuff

that we just walked through should we do before we even launched a product? You

know, like, like, should we try to have like a bunch of judges set up

and like, you know, do a bunch of synthetic stuff? Like how, how much, because

once you launch, you actually get real data, right? I mean, yeah. How much

of the stuff should we do? Yeah. I wouldn't get carried away with evals, especially

in the beginning. I would, I would definitely look at lots of data and looking

at data, um, includes using it yourself. Okay, I mean, if you're building a tool for yourself, like, you are an n-equals-one user, so you don't need to, like... just use it yourself, and you know, you're doing error analysis just by being alive and using your tool. If you're actually using it, it's fine, you don't need to do all this stuff. It's like, when it kind of gets beyond the scale of your comprehension, when it starts to, you know... There's like lots of users or lots of things going on. Different

use cases. Yeah. Then you might think about, okay, then you can see like where

that data might be helpful. Or maybe, you know, you can roll out to like

5% of users and, you know, like maybe they get a shitty experience, but then

you can start getting real data to improve. Yeah. Then nothing beats real data. Got

it. And like the Likert stuff, is it just completely useless? Or like, are you

dogmatic about it? The one to five stuff? Should you stay away from it? I

would say I haven't seen... You have to be extremely disciplined to use it

correctly. Okay. You have to have very clear rubrics. You have to make sure

everyone is calibrated on that rubric. And it usually doesn't go well. And

for most companies, it adds tons of complexity. I would say it's exponential complexity

relative to binary scores. And so I just haven't seen it done in most cases

correctly. There are some rare exceptions where it does work. Okay, but it's usually like, you know, when I press teams to say, hey, can we just make this a binary score, like, is there a point where this is good enough versus not good enough, we're able to do it. But dude, then where does this stuff come from? Like, why do teams keep doing this stuff? Yeah, because it's kind of an appealing idea, right? Like, we've all been graded, we have a grading system from school, A through F. You know, nothing is black or white. Like, we want to have this, like, high-fidelity sort of assessment.

But the problem is like, what do you even do with this high fidelity assessment?

Yeah, it just makes you feel like false precision, right? It's like, you know, I

got a three versus a four. Like what does that even mean? Like humans can't

even tell a difference between a three and a four. Yeah, yeah, yeah. Most people

can't. And it gets lost in the sauce. And, you know, it's just like it's

already complicated enough. You need to really reduce complexity in this whole thing and be

pragmatic about it. All right, dude, this is a super awesome conversation. So I guess let's go back to the Twitter debate. Do evals matter? It does matter if you do them properly, if you're actually solving real problems. Yeah, that's a really good question.

Do evals matter? I would say so evals don't exist in a silo. If you

just try to do, if you try to eval-max and get carried away with

evals, it will probably hurt you. What you want to do is definitely ground yourself

in the data analysis, in the looking at the data part. And like, you know,

everyone says, look at your data. I think it's hard to know what that means.

And what we went through today actually shows you what it means and hopefully demystify

what it is to look at data. But it should be, it's like a very

tightly coupled with evals. And I say that it is evals, like this data, this looking at data and counting. I just say that because you can't do evals without it. Yeah,

it's not like the super sexy part of it, but yeah, it's the most important

part of it. Yes. Yeah, got it. Makes sense. All right, dude. All right, man.

I think you convinced me to take a course now. So can you talk about your course and when you're going to teach it, and if you have a discount

for folks. We are teaching a course on evals where we walk you through the

end-to-end detailed process on how to do evals correctly. We go into subjects like,

okay, how do you evaluate your rag systems? How do you evaluate retrieval? How do

you evaluate agents? How do you deal with all kinds of edge cases that you

might encounter? How do you do this effectively? How do you actually read a trace

and save yourself from all the complexities that might happen? How do you get through

this effort? And we've taught over 2000 students, including lots of people

from Google, OpenAI, things like that. You know, the big labs are really interested in

this because, you know, they focused on foundation model benchmarks, but we're talking about application

specific evals. Like if you're building an application, what is that? And that eval is

very different. Yeah. And so I teach the course with Shreya Shankar. Shreya Shankar has

been writing about evals as well for years now, and she's been doing a lot

of research in the space. So we both have a machine learning and data science

background. as well as software engineering background. You know, and the course is four weeks

long. We give students lots of resources. So we have over nine hours of office

hours. Yeah. We give students an AI evals

assistant. So it's like everything that we've ever said about evals, publicly, in the course,

blog posts, talks, papers, you name it. We've put that in an AI and we

give that to you as like an assistant as well. So, you know, it's a,

it's a modern course. And you got to have evals on top of that too.

Yeah. And we're doing evals. Yeah. We are doing evals on top of that. This

is the first time, this is the first cohort we're doing it for. So the

next one coming up in October. And so we also give people 160 page book

on evals that you can take with them. So there's a lot of resources. There

is a, it's a good community and we're offering Peter's community 35% off. Awesome, dude.

So please use the link in the, in the description. Awesome. Yeah, dude, I personally

learned a lot from this. I need to reevaluate how I do evals. So yeah,

I hope to see folks there. Definitely want to take the course in October. Thanks

so much, Hamel, for sharing your knowledge. And keep dropping knowledge, man, like on social

media. Yeah, thank you so much. Thanks for having me.
