Recsys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon
By AI Engineer
Summary
## Key takeaways
- **Semantic IDs Beat Hash IDs for Cold Start**: Hash-based item IDs in recommendation systems struggle with cold start and sparsity. Kuaishou implemented trainable multimodal semantic IDs, which improved cold-start coverage by 3.6% and also improved cold-start velocity. [01:38], [04:46]
- **LLMs for Synthetic Data: A Journey**: Indeed faced challenges using LLMs to filter bad job recommendations. Initial attempts with Mistral and Llama 2 were poor, GPT-4 was effective but costly, and GPT-3.5 had low precision. Fine-tuning GPT-3.5 improved precision but was still too slow, leading to distillation into a lightweight classifier. [06:30], [08:18]
- **Spotify Expands Beyond Music with LLMs**: To introduce podcasts and audiobooks, Spotify developed a query recommendation system. They used LLMs to generate natural-language queries, augmenting conventional techniques to drive a 9% increase in exploratory queries and grow new product categories. [10:44], [12:51]
- **Netflix's UNICORN Unifies Search and RecSys**: Netflix created UNICORN, a unified contextual ranker, to replace separate systems for search and recommendations. The model matches or exceeds the metrics of specialized models, simplifying the system and enabling faster iteration. [14:51], [16:25]
- **Etsy Boosts Conversion with Unified Embeddings**: Etsy developed unified embeddings that combine product and query information using T5 models. This approach, including a quality vector, resulted in a 2.6% increase in site-wide conversion and over 5% increase in search purchases. [16:46], [19:30]
Topics Covered
- Hash IDs are dead: Embrace semantic IDs for better recommendations.
- LLMs can generate synthetic data, but fine-tuning is key for quality.
- Quality over quantity: Better recommendations boost engagement.
- Unified models simplify systems and improve all use cases.
- Etsy uses unified embeddings to boost conversion rates by 2.6%.
Full Transcript
Hi everyone. Thank you for joining us for the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we try to merge recommendation systems and language models. My wife looked at my slides and said they were so plain, so I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere; there will be a lot of doggos throughout the slides. I hope you enjoy it.

First, language modeling techniques are not new in recommendation systems. It started around 2013, when we began learning item embeddings from co-occurrences in user interaction sequences. After that we started using GRUs -- who here remembers recurrent neural networks and gated recurrent units? Those were very short-term: we predicted the next item from a short sequence. Then transformers and attention came about, and we got much better at attending over long-range dependencies. That's when we started asking: can we just process everything in a user sequence, hundreds to 2,000 item IDs long, and try to learn from that?

Today in this track I want to share three ideas that I think are worth thinking about: semantic IDs, data augmentation, and unified models.

The first challenge is hash-based item IDs. Who here works on recommendation systems? Then you probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again. There's also sparsity, whereby you have a long tail of items with maybe one or two interactions, or even up to 10, which is just not enough to learn from. So recommendation systems have this issue of being very popularity-biased, and they struggle with cold start and sparsity.
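To make that concrete, here is a minimal sketch, not from the talk, of the hashing trick that hash-based item IDs typically rely on: the ID is hashed into a fixed-size embedding table, so a brand-new item lands in a row that says nothing about its content, and long-tail items never accumulate enough interactions to train their rows.

```python
import hashlib

import numpy as np

NUM_BUCKETS = 100_000  # fixed-size embedding table
EMBED_DIM = 64

# Rows start out random and only become meaningful after an item
# accumulates enough interactions to train them.
embedding_table = np.random.normal(0.0, 0.01, size=(NUM_BUCKETS, EMBED_DIM))

def hash_item_id(item_id: str) -> int:
    """Map an arbitrary item ID to a bucket; collisions are possible."""
    digest = hashlib.md5(item_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def item_embedding(item_id: str) -> np.ndarray:
    # A brand-new item gets whatever row its hash points to --
    # nothing about the item's actual content is encoded here.
    return embedding_table[hash_item_id(item_id)]

print(item_embedding("video_123").shape)  # (64,)
```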
The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is a short-video platform in China, kind of like TikTok or Xiaohongshu; I think it's the number two short-video platform there. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from a brand-new short video. So how can we combine static content embeddings with dynamic user behavior?

Here's how they did it with trainable multimodal semantic IDs. I'll go through each step. The Kuaishou model is a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of item IDs plus the user ID, and on the right is the embedding layer for the item IDs. These are fairly standard. What's new here is that they now take in content input. (All of these slides will be available online right after this, so don't worry about copying them down.) To encode the visual content they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish.

Now, here's the trick. When you have these encoder models, it's very hard to backpropagate through them and update the encoder embeddings. So what did they do? First, they took all these content embeddings and simply concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs: I think they shared in the paper that they had about 100 million short videos and learned a thousand cluster IDs via k-means clustering. That's what you see in the model encoder, in the boxes at the bottom: above are the non-trainable content embeddings, and below are the trainable cluster IDs, each mapped to its own embedding table. The trick is this: as you train the model, the encoder learns to map the content space, via the cluster IDs and their embedding table, to the behavioral space.
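As a rough sketch of that pipeline -- an illustration with assumed shapes and library choices, not Kuaishou's code -- frozen content encoders produce a concatenated embedding per video, k-means assigns each video a cluster ID, and only the cluster-ID embedding table is trained jointly with the item tower, which is what maps the content space into the behavioral space.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_VIDEOS = 5_000   # stand-in; the paper describes ~100M videos
NUM_CLUSTERS = 100   # stand-in; the talk mentions ~1,000 cluster IDs
EMBED_DIM = 64

# 1) Frozen content embeddings per video (stand-ins for ResNet / BERT / VGGish
#    outputs), simply concatenated together.
visual = np.random.randn(NUM_VIDEOS, 128)
text = np.random.randn(NUM_VIDEOS, 96)
audio = np.random.randn(NUM_VIDEOS, 32)
content = np.concatenate([visual, text, audio], axis=1)

# 2) Non-trainable step: cluster the concatenated content embeddings into cluster IDs.
cluster_ids = KMeans(n_clusters=NUM_CLUSTERS, n_init=4).fit_predict(content)
cluster_ids = torch.as_tensor(cluster_ids, dtype=torch.long)

# 3) Trainable step: cluster IDs index a learned embedding table inside the item
#    tower, so gradients from behavioral signals (clicks, likes) shape the
#    semantic-ID embeddings while the content encoders themselves stay frozen.
class ItemTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.id_emb = nn.Embedding(NUM_VIDEOS, EMBED_DIM)          # regular item ID embedding
        self.semantic_emb = nn.Embedding(NUM_CLUSTERS, EMBED_DIM)  # trainable semantic ID
        self.proj = nn.Linear(2 * EMBED_DIM, EMBED_DIM)

    def forward(self, item_idx: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_emb(cluster_ids[item_idx])
        return self.proj(torch.cat([self.id_emb(item_idx), sem], dim=-1))

tower = ItemTower()
print(tower(torch.tensor([0, 1, 2])).shape)  # torch.Size([3, 64])
```

The point of the sketch is that behavioral gradients flow into the cluster-ID embedding table, never into the frozen content encoders.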
The output is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is pretty standard, but they also increased cold-start coverage -- of 100 videos shown, how many are new -- by 3.6%, and increased cold-start velocity -- how many new videos hit some threshold of views (they didn't share the threshold). Being able to improve cold-start coverage and cold-start velocity by these amounts is pretty outstanding.

Long story short, the benefits of semantic IDs: you address cold start with the semantic ID itself, and now your recommendations understand content. Later in the track we'll see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs: because the model understands the semantic ID, it can explain why you might like an item in human-readable terms, and vice versa.
Now the next challenge, and I'm sure everyone here has it: the lifeblood of machine learning is data -- good-quality data at scale. This is essential for search and of course for recommendation systems, but for search it's even more important. You need a lot of metadata: query expansion, synonyms, spell-checking, all sorts of metadata attached to a search index. This is very costly and high-effort to get. In the past we did it with human annotation, or maybe tried to do it automatically, but LLMs have turned out to be outstanding at this, and I'm sure everyone here is doing it to some extent, using LLMs for synthetic data and labels. I want to share two examples, from Spotify and Indeed.
First, the Indeed paper, which I really like a lot. The problem they faced: they were sending job recommendations to users via email, but some of those recommendations were bad -- just not a good fit for the user. That led to a poor user experience, and users lost trust in the job recommendations. How would they indicate they lost trust? "These job recommendations are not a good fit for me; I'm just going to unsubscribe." And the moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back -- almost impossible. They had explicit negative feedback, thumbs up and thumbs down, but it was very sparse; how often do you actually give thumbs-down feedback? And implicit feedback is often imprecise. What do I mean? If you get a recommendation but don't act on it, is it because you didn't like it, or because it's not the right time, or maybe your wife works there and you don't want to work at the same company as your wife?

The solution they built was a lightweight classifier to filter bad recommendations. I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges. The first thing that made me like it so much was that they started with evals. They had their experts label job-recommendation and user pairs -- from the user you have their resume data and their activity data -- and assess whether each recommendation was a good fit.

Then they prompted open LLMs, Mistral and Llama 2. Unfortunately, the performance was very poor. These models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was just very generic.
To get it to work, they prompted GPT-4, and GPT-4 worked really well -- around 90% precision and recall. However, it was very costly (they didn't share the actual cost) and far too slow, at about 32 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision: of the recommendations it said were bad, only 63% were actually bad. That means roughly a third of the recommendations it would throw out were actually good. For a company that runs on recommendations, where people find jobs through your recommendations, throwing away a third of the good ones was not acceptable; precision was their key metric here.

So what they did next was fine-tune GPT-3.5. You can see the entire journey: open models, GPT-4, GPT-3.5, and now fine-tuned GPT-3.5. The fine-tuned model got the precision they wanted, at roughly a quarter of GPT-4's cost and latency. Unfortunately, it was still too slow -- about 6.7 seconds -- which would not work in an online filtering system. So they distilled a lightweight classifier on the fine-tuned GPT-3.5 labels, and this lightweight classifier achieved very high performance, specifically 0.86 AUC-ROC. The numbers may not mean much out of context, but suffice to say that in an industrial setting this is pretty good. They didn't mention the latency, but it was good enough for real-time filtering -- I think less than 200 milliseconds or so.
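A rough sketch of that distillation step -- an illustration, not Indeed's code; the features, labeler, and model here are stand-ins: the fine-tuned LLM labels (resume, job) pairs offline, and a small classifier trained on those labels handles the online filtering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def llm_label_is_bad_match(features: np.ndarray) -> int:
    """Stand-in for the fine-tuned LLM judge: given features of a (resume, job)
    pair, return 1 if the recommendation is a bad fit. In practice this is a
    slow, expensive call made offline."""
    return int(features.sum() + rng.normal(0, 0.5) > 0)

# 1) Offline: have the fine-tuned LLM label a large pool of pairs.
X = rng.normal(size=(20_000, 16))                      # stand-in pair features
y = np.array([llm_label_is_bad_match(x) for x in X])   # LLM-provided labels

# 2) Distill: train a lightweight classifier on the LLM labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3) Online: the student scores each recommendation in milliseconds, and
#    anything flagged as a bad match is filtered out before sending.
auc = roc_auc_score(y_test, student.predict_proba(X_test)[:, 1])
print(f"student AUC-ROC against LLM labels: {auc:.2f}")
```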
The outcome was that they were able to cut bad recommendations by about 20%. Initially they had hypothesized that cutting down recommendations, even bad ones, would lower the application rate -- it's like sending out links: even clickbait links get clicks. But that was not the case. Because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot. Essentially, what this means is that in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%.
The next example I want to share is Spotify. Who here knows that Spotify has podcasts and audiobooks? Oh, okay -- I guess you're not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how would you help users learn that these new items exist? There's a huge cold-start problem -- and it's not just cold start on an item, it's cold start on an entire category. How do you start growing a new category within your service? Exploratory search was essential to the business of expanding beyond music: Spotify doesn't want to do just music; now they're doing all of audio. Their solution was a query recommendation system.

So first, how did they generate new queries? They had a bunch of ideas using existing data: extract queries from catalog titles and playlist titles, mine them from the search logs, or just take an artist and append "cover" to it. Now you might be wondering, where's the LLM in this? The LLM is used to generate natural-language queries. This might not be sexy, but it works really well: take whatever you have with conventional techniques that already work, and use the LLM to augment it where you need it. Don't use the LLM for everything right at the start.
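A toy sketch of that recipe -- assumed function names and sources, not Spotify's implementation: most candidate queries come from cheap conventional sources, and the LLM only adds natural-language exploratory queries on top.

```python
from typing import List

def queries_from_catalog(titles: List[str]) -> List[str]:
    # Conventional source 1: catalog and playlist titles as-is.
    return titles

def queries_from_artists(artists: List[str]) -> List[str]:
    # Conventional source 2: simple templates, e.g. artist + "cover".
    return [f"{artist} cover" for artist in artists]

def queries_from_search_logs(logs: List[str]) -> List[str]:
    # Conventional source 3: mine past queries from the search logs.
    return sorted(set(logs))

def queries_from_llm(user_context: str) -> List[str]:
    # LLM-generated natural-language queries; a real system would call a
    # hosted model here -- this stub only shows where it plugs in.
    return [f"podcasts about {user_context}", f"audiobooks like {user_context}"]

candidate_queries = (
    queries_from_catalog(["Sleep Sounds", "Morning Motivation"])
    + queries_from_artists(["Radiohead"])
    + queries_from_search_logs(["lofi beats", "lofi beats", "true crime"])
    + queries_from_llm("productivity")   # augmentation, not a replacement
)
print(candidate_queries)
```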
So now they have these exploratory queries. When you search for something, you still get the immediate result hits; they take those, add the exploratory queries, and rank everything together. That's why, when you do a search, this is the UX you'll probably get -- I took this from the paper, so it may have changed recently. You still see the item results at the bottom, but at the top are the query recommendations. This is how Spotify informs users, without a banner, that they now have audiobooks and podcasts: you search for something, and it shows you these new categories exist.

The benefit: +9% exploratory queries. That's essentially one-tenth of their users now exploring the new products. Imagine one-tenth of your users exploring your new products every day -- how quickly could you grow a new product category? It compounds like 1.1 to the power of n; you'll grow pretty fast.
Long story short, I don't have to tell you the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later we also have a talk from Instacart on how they use LLMs to improve their search system.
The last thing I want to share is this challenge: right now, in a typical company, the systems for ads, for recommendations, and for search are all separate. Even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for cart recommendations, and the model for thank-you-page recommendations may all be different models. You can imagine you're going to have many, many models -- but leadership expects you to keep the same headcount. So how do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another.

The solution is unified models. It works for vision, it works for language -- so why not recommendation systems? And we've been doing this for a while; this is not new. As an aside (the text may be too small), this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, from a sequence of payments you can build a foundation model that is transformer-based.
I want to share an example of a unified ranker for search and recommendations at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, for similar-item (similar-video) recommendations, and for pre-query recommendations -- what you see on the search page before you even enter a query. High operational cost, and missed opportunities to learn across tasks. Their solution is a unified ranker they call UNICORN, a unified contextual ranker. At the bottom you can see the user foundation model, which takes in the user's watch history, and the context and relevance model, which takes in the context of the videos and what they've watched.

The thing about this unified model is that it takes unified input. If you can find a data schema that all your use cases and all your features can share, you can adopt an approach like this, similar to multi-task learning. The input is the user ID, the item ID (the video, drama, or series), the search query if one exists, the country, and the task. They have many different tasks; in the paper there are three: search, pre-query, and more-like-this. What they did next was a very smart imputation of missing items. For example, if you're doing item-to-item recommendation -- you just finished watching a video and want to recommend the next one -- you have no search query. How do you impute it? You simply use the title of the current item and try to find similar items.
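A minimal sketch of what such a unified input with query imputation could look like -- assumed field names, not Netflix's actual schema:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class UnifiedExample:
    user_id: str
    item_id: str
    country: str
    task: str                    # e.g. "search", "pre_query", "more_like_this"
    search_query: Optional[str]  # only naturally present for the search task

def impute_query(example: UnifiedExample, item_titles: Dict[str, str]) -> UnifiedExample:
    """For non-search tasks with no query, fall back to the current item's title
    so the ranker still receives a meaningful text signal."""
    if example.search_query is None:
        example.search_query = item_titles.get(example.item_id, "")
    return example

item_titles = {"show_42": "Slow Horses"}

search_row = UnifiedExample("u1", "show_42", "US", "search", "spy thriller")
similar_row = impute_query(
    UnifiedExample("u1", "show_42", "US", "more_like_this", None), item_titles
)
print(similar_row.search_query)  # "Slow Horses"
```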
The outcome is that this unified model was able to match or exceed the metrics of their specialized models on multiple tasks. Think about it -- it may not seem very impressive: match or exceed? It might feel like we did all this work just to match. But imagine unifying all of it: removing the tech debt and building a better foundation for future iterations. It's going to make you iterate faster.
The last example I want to share is unified embeddings at Etsy. You might think embeddings are not very sexy, but this paper from Etsy is really outstanding in what they share, both the model architecture and the system. The problem they had: how can we help users get better results from very specific queries as well as very broad ones? Etsy's inventory is constantly changing -- they don't carry the same products all the time; it's very homegrown. So you might be querying for something like "mother's day gift", which would match very few items: very few listings have "mother's day gift" in their title or description. The other problem is that lexical embedding retrieval doesn't account for user preferences. So how do you address this?

The answer is unified embedding and retrieval. If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: there's the user tower and the item tower. We see the same pattern again here. You see the product tower, which is the product encoder. They encode the product with T5 models for text embeddings (item titles and descriptions), plus a query-product log for query embeddings: what query was made, and what product was eventually clicked or purchased. On the other side is the query encoder, the search query encoder. Both towers share encoders for the text tokens, for the product category (which is a token of its own), and for user location, which means the embedding can now match users to the location of the product itself. And to personalize it, they encode user preferences via the query-user scalar features at the bottom -- essentially, what queries the user searched for, what they bought previously, all their preferences.

They also shared their system architecture: here is the product encoder from the previous slide and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. What they did is simply concatenate this quality vector to the product embedding vector. But when you do that, you also have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So they just slapped a constant vector onto the query embedding, and it just works.
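A small sketch of that dimension-matching trick -- assumed dimensions and quality signals, not Etsy's code: concatenate the quality vector to the product embedding, and pad the query embedding with a constant vector of the same length so the dot product still lines up.

```python
import numpy as np

EMBED_DIM = 256
QUALITY_DIM = 3  # e.g. ratings, freshness, conversion rate (assumed signals)

product_embedding = np.random.randn(EMBED_DIM)
query_embedding = np.random.randn(EMBED_DIM)
quality_vector = np.array([0.9, 0.7, 0.4])  # per-item quality signals

# Product side: append the quality vector to the learned product embedding.
product_vec = np.concatenate([product_embedding, quality_vector])

# Query side: pad with a constant vector of the same length so the dimensions
# match; the constant controls how much quality contributes to the score.
query_vec = np.concatenate([query_embedding, np.full(QUALITY_DIM, 1.0)])

# The retrieval score now has item quality baked into the dot product.
score = float(np.dot(query_vec, product_vec))
print(score)
```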
The result: a 2.6% increase in conversion across the entire site -- that's quite crazy -- and more than a 5% increase in search purchases. These are very, very good results for e-commerce.

So, the benefits of unified models: you simplify the system, and whatever you build to improve one tower, or to improve the unified model itself, also improves the other use cases that share it. That said, there may also be an alignment tax. When you try to compress all 12 use cases into a single unified model, you may find you need to split it into two or three separate unified models, because of that alignment tax: getting better at one task actually makes another task worse. We have a talk from LinkedIn this afternoon, the last talk of the block, and also a talk from Netflix, who will share their unified model at the start of the next block.

All right, the three takeaways I have for you -- think about them, consider them: semantic IDs, data augmentation, and unified models. And of course, do stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.