
Understanding LLM Settings

By Elvis Saravia

Summary

Topics Covered

  • Temperature Controls Response Confidence
  • Top P Samples Diverse Alternatives
  • Max Length Curbs Cost Explosion
  • Stop Sequences Halt Code Generation

Full Transcript

Hi everyone, in this video I want to talk about LLM settings. The idea of this section in our prompting guide is to tell you a little bit about how to use these LLM settings. When you're exploring, experimenting, and prompting these models, there are a couple of settings that you can tune to get the desirable results that you want.

Now, if you are coming from the world of ChatGPT, the conversational chatbot from OpenAI, you may not know that these models are actually using specific fixed settings. You don't see them, you cannot really tweak them, you cannot configure them. But if you come from the world of APIs, you do have access to certain settings that you can configure and adjust to get the results that you want, so this is very popular among developers. This only applies to you if you're using some type of LLM API; it could be any provider, it could be OpenAI or any of these other LLM providers.

So what I want to do in this video is go through a few of these settings and explain, with some examples, how you can leverage them. There are a couple of settings that do stand out when using large language models via APIs, and if you go to the playground you pretty much get an idea of what these important settings are. For instance, in the OpenAI playground you have what's called temperature, maximum length, stop sequences, top P, frequency penalty, and presence penalty, and what we have done in our guide is basically provide some explanations as to what these are. In this video I want to quickly go over these ideas and try to explain how you can leverage them when you're developing with these models. I must say that we often don't really talk about temperature or top P or most of these settings, but they're actually quite important and useful; it really depends on what you're aiming to achieve.

So let's go through some of these, and I'll start with temperature. Temperature is a value, and you can see here in the playground that it ranges from zero all the way to two. The default is one; this is the default that the OpenAI playground has set for you, and sometimes when we are doing examples in the playground we don't even look at it, but it's there for you. You can see the definition here: it controls randomness. What does that mean? The way I understand temperature is that you can increase it or decrease it, and this in turn decreases or increases the confidence the model has in its most likely response. If you look at our definition for it, you can see that you're essentially increasing the weights of the other possible tokens when you increase the temperature value. Why is this useful? Because it really depends on the task. Let's say we were dealing with some kind of fact-based question-answering task or application: we want to encourage the model to be more factual and less random in its responses, or less diverse in what it is outputting. At the end of the day it's outputting a sequence of tokens, and we want those tokens to be the ones the model is confident in generating. If we want that, we decrease the temperature; the closer it is to zero, the less random those outputs are going to be. So you can imagine that for fact-based question answering it's pretty useful to use temperature values that are lower, closer to zero. Now, if you're doing something like email generation or poem generation, or you're generating lyrics or something else on the creative side, it is beneficial to increase the temperature value and experiment with increasing it. However, do note that as you increase the temperature value, and you can increase it all the way to two, something we have seen in our experiments is that responses become so random that the model is basically producing gibberish, a nonsensical sequence of tokens that doesn't make any sense. So be very careful when you're setting these temperature values really high. When you set it low this is less of a problem, because the output is less random, but when you go above one, to 1.5 or so, be very careful, and you have to do a lot of experimentation to see what the model is outputting for your application. Hopefully that makes sense.
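
To make that concrete, here is a minimal sketch of how you might set temperature through the OpenAI Chat Completions API in Python; the model name and prompts are just placeholders, not anything prescribed in the guide:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Low temperature: stay close to the model's most likely tokens,
# which is usually what you want for fact-based question answering.
factual = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What year was the transistor invented?"}],
    temperature=0.1,
)

# Higher temperature: allow more diverse, creative output (emails, poems, lyrics).
# Values close to 2 tend to drift into nonsensical text, so experiment carefully.
creative = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write two lines of a song about the ocean."}],
    temperature=1.3,
)

print(factual.choices[0].message.content)
print(creative.choices[0].message.content)
```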

Now, I think temperature is one of the more important LLM settings, but there are other configurations as well, like top P, and I see this with all the language model providers, so it's really good to be familiar with these concepts. Top P is something you could consider a sampling technique, and it's almost like an alternative, in a way. The reason I say that is because it is a very similar concept to temperature, and actually, if you look at the OpenAI documentation, you can see that they recommend using top P or temperature but not both: if you're using top P, don't use temperature, and if you're using temperature, don't use top P. Don't use both at the same time; just set one and that should be fine. That tells you that it's basically an alternative sampling technique to temperature. The way I understood top P is that a high top P value enables the model to look at more possible words, including the ones that are less likely, which leads to more diverse output. So it has a very similar effect to temperature, although you may obviously get different results when you use temperature compared to when you use top P. If you're experimenting with temperature and you're not getting the desired results, then maybe you can leave temperature at its default value and go experiment with top P instead. That's how I generally use it, and I never use both at the same time. In fact, these days I focus a lot on prompt engineering, optimizing the prompt, as opposed to messing around with the temperature or these top P values, so that's just something to note here. You can read the full definition here; there's a lot of good content that goes into the technical details of these configurations, but I think what I've explained is good enough, just the intuition of it and when you may or may not want to use it. You can see here that the general recommendation is to alter temperature or top P but not both, and I think this applies to most of the LLM providers. So if you're using something like Fireworks, Cohere, Claude, Gemini, whatever that may be, you might consider this recommendation when you're doing that. Now, I've read in some forums that some developers actually combine both of them and are getting good quality responses from these models, but that's an exception; I really rarely see this to be the case, and we rarely use it this way.
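
As a rough sketch of that recommendation, here is how you might tune top P while leaving temperature at its default; again, the model name and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Tune top_p instead of temperature: the model samples only from the smallest
# set of tokens whose cumulative probability reaches top_p (nucleus sampling).
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
    top_p=0.5,  # lower value -> only the most likely tokens are considered
    # temperature is left at its default, following the "alter one, not both" advice
)

print(response.choices[0].message.content)
```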

Now, there are other settings like maximum length, stop sequences, frequency penalty, presence penalty, and so on. I'll just go briefly through each one of these; we use them less, and it really depends on the circumstances or our use cases. Take maximum length: let's say we are trying to prevent long, irrelevant responses, which I would say is less of a problem now with these models. However, there is the problem of cost. Models are getting cheaper to use, so you can make an argument that this is less important, but when we started with these language models they were really expensive, and it was really nice to be able to control how many tokens the model can generate so that you can control cost. Otherwise the model can go on and on generating text without finishing, and the next thing you know you have a really high bill. So try to use this setting, and again, it really depends on the use case and your needs.
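
Here is a small sketch of capping the output length to keep cost under control; the 150-token limit is just an illustrative choice:

```python
from openai import OpenAI

client = OpenAI()

# max_tokens caps how many tokens the model may generate for this response,
# which puts an upper bound on the output cost of the call.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the plot of Moby-Dick."}],
    max_tokens=150,
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "length" if the cap cut the answer off
```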

Now, a stop sequence is another interesting one. Basically, you define a string that stops the model from generating tokens. For instance, in the OpenAI playground there is a stop sequences field, and they even explain to you what it is: you provide whatever sequence you are using, or whatever sequence you are expecting the model to output as the final token. Again, we rarely use this one; I think it's very niche and it really applies only to some types of tasks. We have used it, for instance, when we are generating code, and it's really interesting to use it in that setting, because we want the model to not explain the code, just output the code, and we know what the stop sequences are going to be.
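
For that code-generation case, here is one possible sketch: the prompt asks the model to finish with an end marker, and the stop sequence halts generation there so no explanation follows. The marker convention and prompt wording are assumptions for illustration, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

# Ask for code followed by an agreed-upon end marker, then stop at that marker
# so the model doesn't append a prose explanation after the code.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Write a Python function that reverses a string. "
            "Output only the code, then the line ###END### on its own line."
        ),
    }],
    stop=["###END###"],  # generation halts before this sequence is emitted
)

print(response.choices[0].message.content)
```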

Now we have the frequency penalty and the presence penalty. If you are familiar with language models and you go back a few years, you will know that these models used to generate a lot of repeated text; that was a very common issue, and today it's less of a problem, I would say. But if you are still facing that problem with some of these language models, it could be the case that the model is repeating certain tokens or using certain words in its response a lot. If you want to control for that, you can use the frequency penalty, and it's available right on the playground: the more you increase it, the more it penalizes the model and keeps it from repeating certain words. So that's the idea of the frequency penalty. The presence penalty is very similar; basically, this one prevents the model from repeating phrases too often in its response. Unlike the frequency penalty, though, the penalty is the same for all repeated tokens, which makes it a good way to keep the model from repeating certain sequences or certain phrases too often.
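
If you do run into repetitive output, here is a minimal sketch of nudging both penalties up; the specific values are just illustrative starting points:

```python
from openai import OpenAI

client = OpenAI()

# frequency_penalty: grows with how often a token has already appeared.
# presence_penalty: a flat penalty applied once a token has appeared at all.
# Both range from -2.0 to 2.0 in the OpenAI API; positive values discourage repetition.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a short product description for a backpack."}],
    frequency_penalty=0.5,
    presence_penalty=0.3,
)

print(response.choices[0].message.content)
```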

So yes, that would be it for the explanation here. Hopefully it was a bit clearer and the intuition is there for you, because it's important to be aware of these settings when you are developing with language models. Today, in my experience, we use them less: we still use temperature, sometimes we experiment with top P, and sometimes we use maximum length to control cost. Stop sequences are more specific to certain use cases like code generation, and the frequency and presence penalties we use less, because these models have fewer issues with generating repeated tokens or repeated words. So hopefully that was useful. If you have any questions, please leave a comment on the YouTube page; I'll be looking at those and I'll try to provide more guidance if there's a need, or point you to a link where you can get a more technical explanation if you're interested in that. Just let me know, and I'll see you in the next one.
