Recsys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon
By AI Engineer
Summary
## Key takeaways
- **Semantic IDs Beat Hash IDs for Cold Start**: Hash-based item IDs in recommendation systems struggle with cold start and sparsity. Kuaishou implemented trainable multimodal semantic IDs, which improved cold-start coverage by 3.6% and also improved cold-start velocity. [01:38], [04:46]
- **LLMs for Synthetic Data: A Journey**: Indeed faced challenges using LLMs to filter bad job recommendations. Initial attempts with Mistral and Llama 2 were poor, GPT-4 was effective but costly, and GPT-3.5 had low precision. Fine-tuning GPT-3.5 improved precision but was still too slow, leading to distillation into a lightweight classifier. [06:30], [08:18]
- **Spotify Expands Beyond Music with LLMs**: To introduce podcasts and audiobooks, Spotify developed a query recommendation system. They used LLMs to generate natural-language queries, augmenting conventional techniques to drive a 9% increase in exploratory queries and grow new product categories. [10:44], [12:51]
- **Netflix's UNICORN Unifies Search and RecSys**: Netflix created UNICORN, a unified contextual ranker, to replace separate systems for search and recommendations. The model matches or exceeds the metrics of specialized models, simplifying the system and enabling faster iteration. [14:51], [16:25]
- **Etsy Boosts Conversion with Unified Embeddings**: Etsy developed unified embeddings that combine product and query information using T5 models. This approach, including a quality vector, resulted in a 2.6% increase in site-wide conversion and over 5% increase in search purchases. [16:46], [19:30]
Topics Covered
- Hash IDs are dead: Embrace semantic IDs for better recommendations.
- LLMs can generate synthetic data, but fine-tuning is key for quality.
- Quality over quantity: Better recommendations boost engagement.
- Unified models simplify systems and improve all use cases.
- Etsy uses unified embeddings to boost conversion rates by 2.6%.
Full Transcript
Hi everyone. Thank you for joining us for the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we try to merge recommendation systems and language models. My wife looked at my slides and said they were so plain, so I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere; there will be a lot of doggos throughout the slides. I hope you enjoy it.

First, language modeling techniques are not new in recommendation systems. It started around 2013, when we began learning item embeddings from co-occurrences in user interaction sequences. After that we started using GRUs -- who here remembers recurrent neural networks and gated recurrent units? Those were very short-term: we predicted the next item from a short sequence. Then transformers and attention came about, and we got much better at attending over long-range dependencies. That's when we started asking: can we just process everything in a user sequence, hundreds to 2,000 item IDs long, and try to learn from that?

Today in this track I want to share three ideas that I think are worth thinking about: semantic IDs, data augmentation, and unified models.

The first challenge is hash-based item IDs. Who here works on recommendation systems? Then you probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again. There's also sparsity, whereby you have a long tail of items with maybe one or two interactions, or even up to 10, which is just not enough to learn from. So recommendation systems have this issue of being very popularity-biased, and they struggle with cold start and sparsity.
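To make that concrete, here is a minimal sketch, not from the talk, of the hashing trick that hash-based item IDs typically rely on: the ID is hashed into a fixed-size embedding table, so a brand-new item lands in a row that says nothing about its content, and long-tail items never accumulate enough interactions to train their rows.

```python
import hashlib

import numpy as np

NUM_BUCKETS = 100_000  # fixed-size embedding table
EMBED_DIM = 64

# Rows start out random and only become meaningful after an item
# accumulates enough interactions to train them.
embedding_table = np.random.normal(0.0, 0.01, size=(NUM_BUCKETS, EMBED_DIM))

def hash_item_id(item_id: str) -> int:
    """Map an arbitrary item ID to a bucket; collisions are possible."""
    digest = hashlib.md5(item_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def item_embedding(item_id: str) -> np.ndarray:
    # A brand-new item gets whatever row its hash points to --
    # nothing about the item's actual content is encoded here.
    return embedding_table[hash_item_id(item_id)]

print(item_embedding("video_123").shape)  # (64,)
```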
The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is a short-video platform in China, kind of like TikTok or Xiaohongshu; I think it's the number two short-video platform there. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from a brand-new short video. So how can we combine static content embeddings with dynamic user behavior?

Here's how they did it with trainable multimodal semantic IDs. I'll go through each step. The Kuaishou model is a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of item IDs plus the user ID, and on the right is the embedding layer for the item IDs. These are fairly standard. What's new here is that they now take in content input. (All of these slides will be available online right after this, so don't worry about copying them down.) To encode the visual content they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish.

Now, here's the trick. When you have these encoder models, it's very hard to backpropagate through them and update the encoder embeddings. So what did they do? First, they took all these content embeddings and simply concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs: I think they shared in the paper that they had about 100 million short videos and learned a thousand cluster IDs via k-means clustering. That's what you see in the model encoder, in the boxes at the bottom: above are the non-trainable content embeddings, and below are the trainable cluster IDs, each mapped to its own embedding table. The trick is this: as you train the model, the encoder learns to map the content space, via the cluster IDs and their embedding table, to the behavioral space.
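As a rough sketch of that pipeline -- an illustration with assumed shapes and library choices, not Kuaishou's code -- frozen content encoders produce a concatenated embedding per video, k-means assigns each video a cluster ID, and only the cluster-ID embedding table is trained jointly with the item tower, which is what maps the content space into the behavioral space.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_VIDEOS = 5_000   # stand-in; the paper describes ~100M videos
NUM_CLUSTERS = 100   # stand-in; the talk mentions ~1,000 cluster IDs
EMBED_DIM = 64

# 1) Frozen content embeddings per video (stand-ins for ResNet / BERT / VGGish
#    outputs), simply concatenated together.
visual = np.random.randn(NUM_VIDEOS, 128)
text = np.random.randn(NUM_VIDEOS, 96)
audio = np.random.randn(NUM_VIDEOS, 32)
content = np.concatenate([visual, text, audio], axis=1)

# 2) Non-trainable step: cluster the concatenated content embeddings into cluster IDs.
cluster_ids = KMeans(n_clusters=NUM_CLUSTERS, n_init=4).fit_predict(content)
cluster_ids = torch.as_tensor(cluster_ids, dtype=torch.long)

# 3) Trainable step: cluster IDs index a learned embedding table inside the item
#    tower, so gradients from behavioral signals (clicks, likes) shape the
#    semantic-ID embeddings while the content encoders themselves stay frozen.
class ItemTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.id_emb = nn.Embedding(NUM_VIDEOS, EMBED_DIM)          # regular item ID embedding
        self.semantic_emb = nn.Embedding(NUM_CLUSTERS, EMBED_DIM)  # trainable semantic ID
        self.proj = nn.Linear(2 * EMBED_DIM, EMBED_DIM)

    def forward(self, item_idx: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_emb(cluster_ids[item_idx])
        return self.proj(torch.cat([self.id_emb(item_idx), sem], dim=-1))

tower = ItemTower()
print(tower(torch.tensor([0, 1, 2])).shape)  # torch.Size([3, 64])
```

The point of the sketch is that behavioral gradients flow into the cluster-ID embedding table, never into the frozen content encoders.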
The output is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is pretty standard, but they also increased cold-start coverage -- of 100 videos shown, how many are new -- by 3.6%, and increased cold-start velocity -- how many new videos hit some threshold of views (they didn't share the threshold). Being able to improve cold-start coverage and cold-start velocity by these amounts is pretty outstanding.

Long story short, the benefits of semantic IDs: you address cold start with the semantic ID itself, and now your recommendations understand content. Later in the track we'll see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs: because the model understands the semantic ID, it can explain why you might like an item in human-readable terms, and vice versa.
Now the next challenge, and I'm sure everyone here has it: the lifeblood of machine learning is data -- good-quality data at scale. This is essential for search and of course for recommendation systems, but for search it's even more important. You need a lot of metadata: query expansion, synonyms, spell-checking, all sorts of metadata attached to a search index. This is very costly and high-effort to get. In the past we did it with human annotation, or maybe tried to do it automatically, but LLMs have turned out to be outstanding at this, and I'm sure everyone here is doing it to some extent, using LLMs for synthetic data and labels. I want to share two examples, from Spotify and Indeed.
First, the Indeed paper, which I really like a lot. The problem they faced: they were sending job recommendations to users via email, but some of those recommendations were bad -- just not a good fit for the user. That led to a poor user experience, and users lost trust in the job recommendations. How would they indicate they lost trust? "These job recommendations are not a good fit for me; I'm just going to unsubscribe." And the moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back -- almost impossible. They had explicit negative feedback, thumbs up and thumbs down, but it was very sparse; how often do you actually give thumbs-down feedback? And implicit feedback is often imprecise. What do I mean? If you get a recommendation but don't act on it, is it because you didn't like it, or because it's not the right time, or maybe your wife works there and you don't want to work at the same company as your wife?

The solution they built was a lightweight classifier to filter bad recommendations. I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges. The first thing that made me like it so much was that they started with evals. They had their experts label job-recommendation and user pairs -- from the user you have their resume data and their activity data -- and assess whether each recommendation was a good fit.

Then they prompted open LLMs, Mistral and Llama 2. Unfortunately, the performance was very poor. These models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was just very generic.
To get it to work, they prompted GPT-4, and GPT-4 worked really well -- around 90% precision and recall. However, it was very costly (they didn't share the actual cost) and far too slow, at about 32 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision: of the recommendations it said were bad, only 63% were actually bad. That means roughly a third of the recommendations it would throw out were actually good. For a company that runs on recommendations, where people find jobs through your recommendations, throwing away a third of the good ones was not acceptable; precision was their key metric here.

So what they did next was fine-tune GPT-3.5. You can see the entire journey: open models, GPT-4, GPT-3.5, and now fine-tuned GPT-3.5. The fine-tuned model got the precision they wanted, at roughly a quarter of GPT-4's cost and latency. Unfortunately, it was still too slow -- about 6.7 seconds -- which would not work in an online filtering system. So they distilled a lightweight classifier on the fine-tuned GPT-3.5 labels, and this lightweight classifier achieved very high performance, specifically 0.86 AUC-ROC. The numbers may not mean much out of context, but suffice to say that in an industrial setting this is pretty good. They didn't mention the latency, but it was good enough for real-time filtering -- I think less than 200 milliseconds or so.
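A rough sketch of that distillation step -- an illustration, not Indeed's code; the features, labeler, and model here are stand-ins: the fine-tuned LLM labels (resume, job) pairs offline, and a small classifier trained on those labels handles the online filtering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def llm_label_is_bad_match(features: np.ndarray) -> int:
    """Stand-in for the fine-tuned LLM judge: given features of a (resume, job)
    pair, return 1 if the recommendation is a bad fit. In practice this is a
    slow, expensive call made offline."""
    return int(features.sum() + rng.normal(0, 0.5) > 0)

# 1) Offline: have the fine-tuned LLM label a large pool of pairs.
X = rng.normal(size=(20_000, 16))                      # stand-in pair features
y = np.array([llm_label_is_bad_match(x) for x in X])   # LLM-provided labels

# 2) Distill: train a lightweight classifier on the LLM labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3) Online: the student scores each recommendation in milliseconds, and
#    anything flagged as a bad match is filtered out before sending.
auc = roc_auc_score(y_test, student.predict_proba(X_test)[:, 1])
print(f"student AUC-ROC against LLM labels: {auc:.2f}")
```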
The outcome was that they were able to cut bad recommendations by about 20%. Initially they had hypothesized that cutting down recommendations, even bad ones, would lower the application rate -- it's like sending out links: even clickbait links get clicks. But that was not the case. Because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot. Essentially, what this means is that in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%.
The next example I want to share is Spotify. Who here knows that Spotify has podcasts and audiobooks? Oh, okay -- I guess you're not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how would you help users learn that these new items exist? There's a huge cold-start problem -- and it's not just cold start on an item, it's cold start on an entire category. How do you start growing a new category within your service? Exploratory search was essential to the business of expanding beyond music: Spotify doesn't want to do just music; now they're doing all of audio. Their solution was a query recommendation system.

So first, how did they generate new queries? They had a bunch of ideas using existing data: extract queries from catalog titles and playlist titles, mine them from the search logs, or just take an artist and append "cover" to it. Now you might be wondering, where's the LLM in this? The LLM is used to generate natural-language queries. This might not be sexy, but it works really well: take whatever you have with conventional techniques that already work, and use the LLM to augment it where you need it. Don't use the LLM for everything right at the start.
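A toy sketch of that recipe -- assumed function names and sources, not Spotify's implementation: most candidate queries come from cheap conventional sources, and the LLM only adds natural-language exploratory queries on top.

```python
from typing import List

def queries_from_catalog(titles: List[str]) -> List[str]:
    # Conventional source 1: catalog and playlist titles as-is.
    return titles

def queries_from_artists(artists: List[str]) -> List[str]:
    # Conventional source 2: simple templates, e.g. artist + "cover".
    return [f"{artist} cover" for artist in artists]

def queries_from_search_logs(logs: List[str]) -> List[str]:
    # Conventional source 3: mine past queries from the search logs.
    return sorted(set(logs))

def queries_from_llm(user_context: str) -> List[str]:
    # LLM-generated natural-language queries; a real system would call a
    # hosted model here -- this stub only shows where it plugs in.
    return [f"podcasts about {user_context}", f"audiobooks like {user_context}"]

candidate_queries = (
    queries_from_catalog(["Sleep Sounds", "Morning Motivation"])
    + queries_from_artists(["Radiohead"])
    + queries_from_search_logs(["lofi beats", "lofi beats", "true crime"])
    + queries_from_llm("productivity")   # augmentation, not a replacement
)
print(candidate_queries)
```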
So now they have these exploratory queries. When you search for something, you still get the immediate result hits; they take those, add the exploratory queries, and rank everything together. That's why, when you do a search, this is the UX you'll probably get -- I took this from the paper, so it may have changed recently. You still see the item results at the bottom, but at the top are the query recommendations. This is how Spotify informs users, without a banner, that they now have audiobooks and podcasts: you search for something, and it shows you these new categories exist.

The benefit: +9% exploratory queries. That's essentially one-tenth of their users now exploring the new products. Imagine one-tenth of your users exploring your new products every day -- how quickly could you grow a new product category? It compounds like 1.1 to the power of n; you'll grow pretty fast.
Long story short, I don't have to tell you the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later we also have a talk from Instacart on how they use LLMs to improve their search system.
The last thing I want to share is this challenge: right now, in a typical company, the systems for ads, for recommendations, and for search are all separate. Even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for cart recommendations, and the model for thank-you-page recommendations may all be different models. You can imagine you're going to have many, many models -- but leadership expects you to keep the same headcount. So how do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another.

The solution is unified models. It works for vision, it works for language -- so why not recommendation systems? And we've been doing this for a while; this is not new. As an aside (the text may be too small), this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, from a sequence of payments you can build a foundation model that is transformer-based.
I want to share an example of a unified ranker for search and recommendations at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, for similar-item (similar-video) recommendations, and for pre-query recommendations -- what you see on the search page before you even enter a query. High operational cost, and missed opportunities to learn across tasks. Their solution is a unified ranker they call UNICORN, a unified contextual ranker. At the bottom you can see the user foundation model, which takes in the user's watch history, and the context and relevance model, which takes in the context of the videos and what they've watched.

The thing about this unified model is that it takes unified input. If you can find a data schema that all your use cases and all your features can share, you can adopt an approach like this, similar to multi-task learning. The input is the user ID, the item ID (the video, drama, or series), the search query if one exists, the country, and the task. They have many different tasks; in the paper there are three: search, pre-query, and more-like-this. What they did next was a very smart imputation of missing items. For example, if you're doing item-to-item recommendation -- you just finished watching a video and want to recommend the next one -- you have no search query. How do you impute it? You simply use the title of the current item and try to find similar items.
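A minimal sketch of what such a unified input with query imputation could look like -- assumed field names, not Netflix's actual schema:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class UnifiedExample:
    user_id: str
    item_id: str
    country: str
    task: str                    # e.g. "search", "pre_query", "more_like_this"
    search_query: Optional[str]  # only naturally present for the search task

def impute_query(example: UnifiedExample, item_titles: Dict[str, str]) -> UnifiedExample:
    """For non-search tasks with no query, fall back to the current item's title
    so the ranker still receives a meaningful text signal."""
    if example.search_query is None:
        example.search_query = item_titles.get(example.item_id, "")
    return example

item_titles = {"show_42": "Slow Horses"}

search_row = UnifiedExample("u1", "show_42", "US", "search", "spy thriller")
similar_row = impute_query(
    UnifiedExample("u1", "show_42", "US", "more_like_this", None), item_titles
)
print(similar_row.search_query)  # "Slow Horses"
```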
The outcome is that this unified model was able to match or exceed the metrics of their specialized models on multiple tasks. Think about it -- it may not seem very impressive: match or exceed? It might feel like we did all this work just to match. But imagine unifying all of it: removing the tech debt and building a better foundation for future iterations. It's going to make you iterate faster.
The last example I want to share is unified embeddings at Etsy. You might think embeddings are not very sexy, but this paper from Etsy is really outstanding in what they share, both the model architecture and the system. The problem they had: how can we help users get better results from very specific queries as well as very broad ones? Etsy's inventory is constantly changing -- they don't carry the same products all the time; it's very homegrown. So you might be querying for something like "mother's day gift", which would match very few items: very few listings have "mother's day gift" in their title or description. The other problem is that lexical embedding retrieval doesn't account for user preferences. So how do you address this?

The answer is unified embedding and retrieval. If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: there's the user tower and the item tower. We see the same pattern again here. You see the product tower, which is the product encoder. They encode the product with T5 models for text embeddings (item titles and descriptions), plus a query-product log for query embeddings: what query was made, and what product was eventually clicked or purchased. On the other side is the query encoder, the search query encoder. Both towers share encoders for the text tokens, for the product category (which is a token of its own), and for user location, which means the embedding can now match users to the location of the product itself. And to personalize it, they encode user preferences via the query-user scalar features at the bottom -- essentially, what queries the user searched for, what they bought previously, all their preferences.

They also shared their system architecture: here is the product encoder from the previous slide and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. What they did is simply concatenate this quality vector to the product embedding vector. But when you do that, you also have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So they just slapped a constant vector onto the query embedding, and it just works.
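A small sketch of that dimension-matching trick -- assumed dimensions and quality signals, not Etsy's code: concatenate the quality vector to the product embedding, and pad the query embedding with a constant vector of the same length so the dot product still lines up.

```python
import numpy as np

EMBED_DIM = 256
QUALITY_DIM = 3  # e.g. ratings, freshness, conversion rate (assumed signals)

product_embedding = np.random.randn(EMBED_DIM)
query_embedding = np.random.randn(EMBED_DIM)
quality_vector = np.array([0.9, 0.7, 0.4])  # per-item quality signals

# Product side: append the quality vector to the learned product embedding.
product_vec = np.concatenate([product_embedding, quality_vector])

# Query side: pad with a constant vector of the same length so the dimensions
# match; the constant controls how much quality contributes to the score.
query_vec = np.concatenate([query_embedding, np.full(QUALITY_DIM, 1.0)])

# The retrieval score now has item quality baked into the dot product.
score = float(np.dot(query_vec, product_vec))
print(score)
```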
The result: a 2.6% increase in conversion across the entire site -- that's quite crazy -- and more than a 5% increase in search purchases. These are very, very good results for e-commerce.

So, the benefits of unified models: you simplify the system, and whatever you build to improve one tower, or to improve the unified model itself, also improves the other use cases that share it. That said, there may also be an alignment tax. When you try to compress all 12 use cases into a single unified model, you may find you need to split it into two or three separate unified models, because of that alignment tax: getting better at one task actually makes another task worse. We have a talk from LinkedIn this afternoon, the last talk of the block, and also a talk from Netflix, who will share their unified model at the start of the next block.

All right, the three takeaways I have for you -- think about them, consider them: semantic IDs, data augmentation, and unified models. And of course, do stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.