
Day 2 Livestream with Paige Bailey – 5-Day Gen AI Intensive Course | Kaggle

By Kaggle

Summary

## Key takeaways

- **Embeddings Capture Semantic Similarity**: Embeddings are compact, fixed-size numerical representations that transform complex data like text, images, and audio into vectors capturing semantic meaning and the relationships between them. This allows efficient comparison, such as recognizing that 'king' is more similar to 'queen' than to 'car' in a contextual sense. [03:06], [03:54]
- **RAG Overcomes LLM Knowledge Limits**: Retrieval Augmented Generation addresses LLMs' limitations by embedding and indexing documents in a vector database, retrieving the most semantically similar ones to a query, and using them as context in prompts to improve performance and reduce hallucinations. In the code lab, a simple RAG system using ChromaDB and text-embedding-004 embeds three toy documents and generates answers based on retrieved context. [05:24], [06:05]
- **Approximate Nearest Neighbors Enable Scalable Search**: Vector databases use approximate nearest neighbor algorithms like HNSW and ScaNN to efficiently search through millions of embeddings and find the ones most semantically similar to a query in a time- and cost-efficient manner. Google's ScaNN, based on over 12 years of research, is natively integrated into products like AlloyDB and offers excellent speed-accuracy tradeoffs for high-dimensional data. [03:49], [20:32]
- **Embeddings Bootstrap Data-Efficient Classification**: Using pre-trained embeddings as rich representations for downstream tasks like classifying newsgroup posts allows building simple Keras models with just a few dense layers that achieve high accuracy quickly, even with limited data, without training from raw text from scratch. In the code lab, the model reaches good validation accuracy in seven epochs by leveraging embeddings instead of complex architectures. [11:56], [15:20]
- **Open-Source vs Proprietary Vector DB Tradeoffs**: Open-source vector databases offer advantages in cost, flexibility, customizability, and avoiding vendor lock-in, but come with higher maintenance, management complexity, and limited support. Proprietary ones like Google Cloud's provide ease of use, advanced features, stability, and managed services but may involve higher costs, potential lock-in, and less customization, with general-purpose databases like AlloyDB and BigQuery increasingly adding native vector capabilities to avoid data duplication. [22:09], [23:06]
- **Vector DBs Complement Long Context Windows**: While long context windows like Gemini's 2 million tokens are exciting, they are computationally expensive for the billions of tokens in real corpora and degrade reasoning with irrelevant content, making vector DBs essential for efficiently retrieving relevant documents first. This retrieval-then-reason pattern allows focusing on recall over precision, supplying pertinent private or proprietary data that public search grounding cannot access. [26:13], [27:13]

Topics Covered

  • Embeddings capture semantic relationships across data types
  • RAG overcomes LLM knowledge limitations
  • Embeddings enable semantic similarity detection
  • Pre-trained embeddings bootstrap data-efficient classification
  • Vector databases complement long context windows

Full Transcript

Greetings everyone and welcome back.

It is great to have you here for the second day of our generative AI intensive course, a collaboration between Kaggle and the Gemini team.

Today we're going to talk all about embeddings and vector databases.

Hopefully you were excited to see all of the great discussion that happened yesterday around foundational models and prompt engineering techniques. We have a great set of sessions here today where we'll be walking through some code labs and doing a Q&A with some of the people who are building these tools and features for Google.

And then also a pop quiz at the end.

So make sure to pay attention.

This 5-day generative AI intensive course, which is sponsored by Kaggle and by the Gemini team, also includes a Discord channel to discuss our live streams and to engage with the guest speakers, the community, and Q&A. So make sure to log in to ask your questions, and they might just be selected for a future day as you're reading your assignments and learning more about all of the great content. Be sure to check out all of our code labs, which you can run online as Kaggle kernels.

As well as engaging with some of the white papers that we've been sharing.

And as mentioned, this is day number two of our generative AI intensive course.

Hopefully you've read the embeddings and vector databases paper.

You've explored some of the code labs and notebooks; we'll be showing you some of these coming up very soon.

And you'll also be learning about the conceptual underpinnings of embeddings and vector databases and how they can be used to bring specialist data and really exciting production techniques into your large language model apps.

So with that, I am going to go ahead and turn it over to Anant, who is going to be walking through some of the amazing curriculum review and code labs that we have planned for day two. So Anant, take it away.

Thanks, Paige. So welcome everyone.

Before I get started with the code lab, I just wanted to walk you through some of the content that you have read through in the white papers, as well as maybe the podcast if you had the time to go through it. You might have seen that the white paper starts off with exploring what embeddings are and the different kinds of embeddings.

In the realm of machine learning we often grapple with diverse data types: text, images, audio, and more. Many applications can benefit from representing these objects of different data types in a compact, fixed-size numerical representation which captures the semantic meaning and relationships between them, and that's essentially what embeddings are in a nutshell. So we are basically transforming complex data into semantic numerical vectors that can be used for downstream tasks.

We also looked at the different types of embeddings in the white paper: text, image, and multimodal embeddings, as well as structured data and graph embeddings.

And then we also dived into the various algorithms and approaches that can be used to create, train, and use these embeddings, as well as their corresponding trade-offs.

Now, once we have the embeddings, in order to use them for a downstream application it is important to have an efficient way to search through, say, hundreds of thousands or millions of embeddings given a query and find the most semantically similar ones, but do so in a very time- and cost-efficient manner.

And that's where approximate nearest neighbor algorithms come into play.

We also looked at the various metrics that can be used to compare embeddings and judge which ones are similar to one another. For example, if 'king' is more similar to 'queen' in the contextual sense than 'king' is to 'car', then that's something we should be able to compare and find out. And to do this at scale, approximate nearest neighbor algorithms such as hierarchical navigable small world (HNSW) as well as ScaNN come into play, which can help you with this.
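To make that comparison concrete, here is a minimal sketch of the kind of similarity computation involved, using made-up 4-dimensional vectors purely for illustration (real embedding models produce hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means same direction, near 0 means unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, invented for illustration only.
king  = np.array([0.9, 0.8, 0.1, 0.3])
queen = np.array([0.8, 0.9, 0.1, 0.2])
car   = np.array([0.1, 0.2, 0.9, 0.7])

print(cosine_similarity(king, queen))  # high: semantically related
print(cosine_similarity(king, car))    # low: semantically unrelated
```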

Then we looked at the various vector stores and vector databases that implement these approximate nearest neighbor algorithms at scale, the various operational considerations, and how to manage these in production.

We concluded the white paper by looking at some of the applications of embeddings, for example by using a pre-trained embedding representation of various objects and adding a small layer on top, to benefit from the rich representations instead of training from scratch, and use it for classification tasks.

We also looked at how embeddings can be used for recommendation problems as well as ranking problems and many others.

So having said that, that's a great segue to our Colab, in which I'll be walking you through some of these. Let us get started.

Your first Colab for day two was one where you built a document Q&A system using RAG and ChromaDB. Now, what is RAG? Retrieval augmented generation has three parts, and it addresses a very important limitation of LLMs: LLMs are often limited to the knowledge they're trained on and often do not have access to the latest dynamic information unless you provide it using systems like these. And even if they do have access to certain knowledge, providing the most relevant and latest information in the prompt itself can help improve performance and reduce hallucinations.

So a RAG system has three stages. The first one is embedding and indexing the various documents in the database, as well as embedding the query. Then the retrieval stage comes afterwards, where you take the query embedding, look at all the documents in your database, and ask which ones are the most similar, most relevant documents that my LLM should have access to in order to give me the most appropriate answer.

And then the generation part: provide the most relevant context and documents as part of the prompt and generate the answer.

So in this particular Colab we built a simple toy RAG system.

Let's look at how we did that.

So we installed the necessary libraries, as we have done in the previous Colabs as well, retrieved the API key from the Kaggle secrets to call the Gemini API, and then explored the various available models. Here we look at the embedding models available, the text embedding models, which we will be using for the RAG system.

Then we create a small toy data set consisting of three documents, which we will be using as our database for this Colab.

After that we create a function to take a document and embed it using the embedding API.

In this case we are using the text-embedding-004 model, and we make this function so that it can be called to embed a given document.

Then, once we have the data set defined and the embedding function defined, we initialize a Chroma database and reference the embedding function we defined above when adding the documents, so it automatically embeds them. So it's adding the documents and indexing them in memory, using the embedding function to make sure it indexes the embedded version of the documents, not the raw documents.
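Here is a rough sketch of that indexing step, assuming the google-generativeai and chromadb packages; the sample documents, collection name, and helper class name are placeholders rather than the notebook's exact code:

```python
import google.generativeai as genai
import chromadb

class GeminiEmbeddingFunction(chromadb.EmbeddingFunction):
    """Embeds documents with text-embedding-004 so Chroma can index them."""
    def __call__(self, input):
        return [
            genai.embed_content(
                model="models/text-embedding-004",
                content=doc,
                task_type="retrieval_document",
            )["embedding"]
            for doc in input
        ]

# Three toy documents standing in for the notebook's sample data.
documents = [
    "To shift gears in the car, move the lever from P to D on the touchscreen.",
    "The climate control is adjusted with the temperature dial on the console.",
    "The touchscreen display shows navigation, media, and vehicle settings.",
]

client = chromadb.Client()
db = client.get_or_create_collection(
    name="toy_docs", embedding_function=GeminiEmbeddingFunction()
)
db.add(documents=documents, ids=[str(i) for i in range(len(documents))])
```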

Then we confirm that the documents were indeed added to the vector DB, and here we check whether semantic search with the vector DB is actually working.

So we give a query, we query the database, and we find the most similar result, which seems to be correct given the context of the three sample documents we provided. Then we use the retrieved result as context for an LLM: in this section we expand the prompt using that retrieved document and use it to generate the answer.
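Continuing the sketch above, the retrieval and generation step looks roughly like this; the query, prompt wording, and model name are illustrative:

```python
query = "How do I shift gears?"

# Retrieve the single most semantically similar document to the query.
result = db.query(query_texts=[query], n_results=1)
passage = result["documents"][0][0]

# Expand the prompt with the retrieved passage as context.
prompt = (
    "Answer the question using only the reference passage below.\n"
    f"QUESTION: {query}\n"
    f"PASSAGE: {passage}"
)

model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content(prompt).text)
```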

So congratulations: if you've reached here and you've understood it, you've built a retrieval augmented generation app.

Now moving on to the second Colab for the day.

Here we see how embeddings can be used to understand similarity between various documents.

In this case, we'll be working particularly with text embeddings, but you can also use multimodal embeddings.

So, a similar start: we use the Kaggle secret to retrieve the API key and initialize Gemini.

We also then explore the models which are available to us for embeddings.

And then here is the important bit, which will help you understand what's happening.

We define various sample text documents which we want to compare with one another, and for each document find how similar it is, from a meaning perspective, to the other documents in the list. So we compute the semantic similarity from each document to every other document and plot it as a heat map. And in this heat map you see something very interesting.
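As a rough sketch of what that cell computes (the sample sentences below are placeholders, and the embed helper, task type, and plotting choices are assumptions rather than the notebook's exact code):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import google.generativeai as genai

def embed(text):
    # Semantic-similarity task type suits pairwise comparison (assumed here).
    return genai.embed_content(
        model="models/text-embedding-004",
        content=text,
        task_type="semantic_similarity",
    )["embedding"]

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over teh lazy dog.",   # small typo
    "A speedy auburn fox leaps over a sleepy canine.",
    "Lorem ipsum dolor sit amet.",                    # unrelated filler
]
embeddings = np.array([embed(t) for t in texts])

# If the vectors are unit-normalised, a dot product equals cosine similarity;
# otherwise divide by the vector norms first.
similarity = embeddings @ embeddings.T
sns.heatmap(pd.DataFrame(similarity, index=texts, columns=texts), annot=True)
```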

The lighter shades refer to very high semantic similarity, while darker shades refer to lower semantic similarity.

And you can see that the sentence "the quick brown fox jumps over the lazy dog" is most similar, well, to itself.

You can see here it is the most similar to itself, but it's least similar to documents which are completely random, like the lorem ipsum text. It kind of makes sense that this is the case, right?

And if you look at the exact similarity scores here, the document "the quick brown fox jumps over the lazy dog" is most similar of course to itself, because it's the exact same sentence, but it's second most similar to a slightly misspelled version of the same document, because the meaning still remains the same given a small typo.

And then you can see that its third most similar document also implies roughly the same meaning, and so on and so forth.

And this is kind of how a human would also rank it.

The semantic similarity is captured very well by embedding these documents.

You can use this for various purposes such as clustering, ranking, retrieval systems, etc. Now let's go on to the third code lab.

Now here's a slightly different use case of embeddings. The first two talked about retrieval using semantic similarity of documents. This one talks about how you can use embeddings as a rich representation to bootstrap further downstream models, utilizing this rich representation as a very informative source so that you do not need a lot of data or very complex models and can already leverage the information within the embedding.

So in this Colab we start off with using pre-trained embeddings that can help classify newsgroup posts, with a small Keras model added on top.

The Keras model in question is not trained on raw text data from scratch; it leverages these embeddings as its input and adds a couple more layers on top, which makes it very data efficient while it learns how to classify the newsgroup posts into various categories.

So we start off as usual by retrieving the API key and initializing Gemini with it, and then we look at the newsgroup text data set, where we have a lot of newsgroup posts across several topics. Our aim is to make a model which can take the text content of a newsgroup post and classify it into one of the topics.

Some of them uh are mentioned here.

So I'll be skipping this one.

In this cell we pre-process the newsgroup posts to clean them a bit.

And then we make a train and test split for the posts. We would like to train on the newsgroup posts with their existing class labels, or category labels, and then learn how to predict class or target labels for new newsgroup posts where the class label is unknown, and see how the model performs on the test set.

So this is all pre-processing, and here's the most important bit: creating the embeddings and using them for a downstream model. Because we'll be performing a classification task, we use the classification task type for the embeddings; you can use various different task types depending on the task you're doing. We then define functions to create embeddings, create the embeddings from the training and test data sets, and in this section add a couple of dense Keras layers.

So, dense layers using the Keras framework, where we take the embedded inputs and pass them through a couple of layers, and then the model learns how to classify them into the respective newsgroup target categories.
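As a minimal sketch of that classifier head, assuming x_train/x_test are arrays of precomputed embeddings and y_train/y_test are integer class labels (the layer sizes and class count here are illustrative, not the notebook's exact values):

```python
import keras
from keras import layers

embedding_dim = 768   # dimensionality of the precomputed embeddings (assumed)
num_classes = 4       # number of newsgroup categories (illustrative)

model = keras.Sequential([
    layers.Input(shape=(embedding_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With rich embedded inputs, a small head and a few epochs go a long way.
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=7)
```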

So keep in mind that here we are not learning from scratch.

We're taking the embeddings as the input.

Hence the model does not need as much data, and it does not need as much depth and complexity.

And by doing so, we train the model here.

We can see that the validation accuracy improves quite quickly, in just seven epochs, even with limited data. It manages to achieve quite a good accuracy with very little data because we bootstrapped it with embedded representations and not raw text.

So this is something which can really help you, especially if you have limited data. Hopefully it was clear. Thank you, and over to you, Paige.

Awesome. Thank you so much for walking through those great notebooks.

And thank you to Mark McDonald as well for creating these notebooks, for making them available, and for giving students the ability to get in and start understanding how embeddings and vector databases work. Thank you also to the open source teams behind Chroma and behind Keras. Really excited to see some of those tools used with the curriculum as well.

Um, so let me go back to our slide deck.

And it looks like it's time to get started with our Q&A. The question and answer session is brought to you by you: all of the great questions that you've been asking in Discord will be used to create questions for our Q&A session. I also want to make sure to thank our generative AI course moderators Pong, Ca, Mark, Eric, Irwin, and Miles, who have been answering your questions in real time on Discord and also helping people get unstuck from some of the things that might have been a little bit challenging with some of the assignments.

So, please, a virtual round of applause for all of our great course moderators. And a great and warm welcome to all of our expert hosts here for day two.

We have many folks from DeepMind as well as from our Google Cloud AI teams: Omid, Alan, Shiaoi, Jenhuk, Yan, Eftakar, and Anant. Wonderful to have you here today, and thank you for taking the time to answer questions for our students.

Um and with that I'm going to go ahead and get started with our first question.

As a reminder, if you would like for your questions to be answered, make sure to add them in the Discord as you're going through the course assignments each day. So, first of all, um for Alan and for Yan, tell me a little bit about vector databases and embeddings.

What are they and why are they useful?

Sure. I will start with an introduction of embeddings and then I'll hand over to Alan to talk about vector databases.

So embeddings are numerical vectors, and to create them we use embedding models.

Embedding models are machine learned models that convert real world objects such as text, image and video to embeddings.

And these models are trained so that the geometric distance, or the similarity, of the embeddings reflects the real-world similarity of the objects that they represent.

Embedding models can be used in recommendation systems, semantic search, classification, RAG, and many other applications.

Google's Vertex AI platform offers two classes of pre-trained embedding models: text embedding models and multimodal embedding models.

Text embedding models allow you to convert plain text to embeddings, while multimodal ones work with image, audio, and video input in addition to textual input.

To use these models you simply enable Vertex AI in your GCP project and send requests to the Vertex endpoint.

We also support customizing these models using your own data set.

Vector databases are storage solutions that specialize in managing these embedding vectors.

Now I'll hand over to Alan to talk more about those vector databases.

Thanks Yan. Uh my name is Alan.

I'm part of Google Cloud databases and the lead for AlloyDB semantic search.

Now that we know vector embeddings capture the semantic meaning of unstructured data, vector search is the operation that, given a set of stored embeddings and a query embedding, finds the nearest neighbors to that query, that is, the most similar items to that query.

Vector search can be done by computing the distance between the query and every stored embedding.

And this is called exact nearest neighbor search.

But imagine how slow this operation would be if there are a billion stored vectors that we need to compute the distance for.

So vector databases speed up vector search through approximate nearest neighbor search which gives highly accurate but approximate answers while doing orders of magnitude less work.
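As a rough sketch of the difference, here is what the exact, brute-force version looks like in plain NumPy: every stored vector is scored against the query, which is exactly the linear cost that approximate indexes like HNSW and ScaNN avoid. The corpus here is random data purely for illustration.

```python
import numpy as np

stored = np.random.rand(100_000, 768)   # pretend corpus of embeddings
query = np.random.rand(768)

# Cosine similarity of the query against every stored vector.
scores = (stored @ query) / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

# Exact nearest neighbours: score everything, then keep the top 5.
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```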

And there are a number of approximate nearest neighbor algorithms in the industry and they typically fall into two broad categories.

Graph-based, the most popular being HNSW, and tree-based.

And one state-of-the-art tree-based algorithm is Google's ScaNN algorithm.

This is based on 12-plus years of Google research, and it's used internally at Google scale in products like Google Search, YouTube ads, and Photos. We've incorporated the ScaNN algorithm natively in multiple Google Cloud products including AlloyDB, Cloud Spanner, managed MySQL, BigQuery, and Vertex AI Vector Search.

Awesome. So it sounds like uh embeddings and vector databases are very quick ways for you to understand how similar certain kinds of data might be to one another.

And we have many different managed service options for embeddings that are offered through Google Cloud and through the Gemini developer APIs, and you can also use open-source tools like Chroma and other vector databases to do the work and store all of the embeddings that you create for your projects. Awesome.

Thank you. Thank you so much.

And then, next question. What are the trade-offs?

Speaking of open-source vector DBs, what are the trade-offs between using an open-source vector database versus a proprietary one, like some of the managed services that were just mentioned on Google Cloud? Omid, do you want to go ahead and take that question?

Yes, thank you. Uh, so, hi everyone.

My name is Omid and I'm part of Google Cloud's BigQuery, where I lead the AI/ML and search areas. So this is a great question.

I'd say the trade-offs are similar to the trade-offs in other areas of technology. To be more specific, the open-source vector databases may be more favorable for aspects such as cost, flexibility, customizability, community support, and also potentially avoiding vendor lock-in.

However, oftentimes that needs to be balanced with things such as higher maintenance and management costs and complexity, the potential for fragmentation down the road, and somewhat limited support options.

Now, with the proprietary vector databases it's essentially the flip side of that, right? So they often have an edge in terms of ease of use as a more managed service, support, more advanced features, and stability.

But the downsides can be cost, potential for vendor lock-in in some cases, limited customization flexibility, and sometimes transparency.

Now, this being said, I would also like to draw your attention to the increasingly powerful vector search and indexing capabilities that are being added across various general-purpose databases and data analytics and warehouse platforms, and these are being added as first-class citizens, as Alan mentioned.

By the way, this spans both the open-source and proprietary worlds. For example, you can think of the pgvector extension for Postgres, and in the case of Google Cloud, for databases and data warehouses we have AlloyDB, Spanner, and BigQuery all adding very strong capabilities as first-class citizens.

Now, the advantage of looking into this option is that it helps avoid data duplication across two databases.

So you'll have just one database as the source of truth, and you'll continue to benefit from the decades of work that have gone into building these highly advanced capabilities into these general-purpose databases and data warehouses.

Essentially you use them as a single source of truth, and you use the added AI and vector capabilities, which keep growing very fast. So that's another option I wanted to bring up as part of the answer to this question.

Yeah. And that's amazing to hear: instead of users having to figure out whether they should use a vector database or a more traditional database, these kinds of features are being added and cross-pollinated across both.

So you don't have to have data duplication, and you don't have to figure out which kind of tool or technique you should be using.

You can just adopt those features within things like Postgres, which most folks already have experience using.

Very very cool.

Thank you. That's great to hear.

And then, next question.

This one is for Eftakar. Do you think new features and capabilities like Gemini's longer context windows, which are now up to 2 million tokens of context, and Google's recently released search grounding will reduce the need for vector databases, or are these kinds of features and tools complementary to each other?

Thank you, Paige. Hi, I'm from Google DeepMind, and thank you for this great question.

uh I will start by expanding the question and explaining a little bit.

So there are two new technologies that are mentioned.

One is long context windows, as Gemini and other language models can support more and more tokens as part of the input, and the other technology is search grounding, where we recently released the capability to issue a web search, ground any answer, and validate whether it is correct or incorrect. So first let's talk about the longer context window.

Of course it's really exciting that we can scale to much longer context, but we are still not ready to actually do that at a very large scale, for several reasons. First, we currently support a few million tokens, but most realistic databases or corpora are much larger.

We often need billions or trillions of tokens.

Also, it's really computationally expensive to run the full LLM over a massive amount of tokens.

Whereas, as uh Yan and Alan explained, vector DBs are extremely efficient.

So they can retrieve from billions of items very quickly.

And finally, we saw that as we put more and more uh irrelevant content in the context, then the LLM's reasoning capability starts to degrade.

So it's often more useful for us to first use the vector DB to retrieve the most relevant documents or content, and then perform reasoning on things which are actually relevant to the user's question.

So in that way, vector DBs are actually a really exciting and complementary technique to the new long-context language model capabilities, because now the initial retrieval stage doesn't have to be extremely precise. Instead we can focus on recall: retrieve lots of relevant documents, put them in the context, and then the LLM can reason over that large amount of context.

Absolutely. I just wanted to say really quickly that I love that pattern you just described, of being able to use vector databases and retrieval to figure out what is the most effective information that should go into the context window in order to help support your outputs.

I think that's something that's been underexplored to date, and I really love that you called it out.

Yeah.

Yeah. I'm super excited, and it's also helping us to think a bit differently.

In the past we would have very limited context.

So we would focus too much on the precision of the retrieval, whereas now we can be more recall oriented, because we can support a larger number of tokens in the context: throw a wider net, get all the relevant content, and then reason over it.

Uh there is another part of the question about search grounding.

I would just very briefly mention that search grounding allows us to ground answers in the web, that is, in public information, but often we want to do search or grounding over our private information, like our personal data or a company's proprietary private corpus, and for that we still need vector databases, because public search will not have access to that private data.

Yeah, absolutely. And I know that Google Cloud has implemented a feature that is not just search grounding but also grounding on arbitrary data that might be hosted in GCS.

So I think that you're right.

There will always be a place for both: retrieval to get the most pertinent information, and also the ability to search over the web to find the most recent information.

Cool. Um so next question. What are the fundamental challenges and opportunities for vector databases? Alan, would you like to take this one?

Sure. Thanks, Paige.

So, speaking from the perspective of building a vector search index natively within an operational database,

from an operational database customer's perspective,

this is a system of record for these users, where they expect transactional semantics, strong consistency, fast transactional operations, and data that spans memory and disk.

One interesting challenge that we've been working on has been how to achieve the same performance as specialized vector database systems that don't have these same constraints.

As one example of an innovation, the AlloyDB team has implemented a custom underlying buffer page format in order to leverage ScaNN's low-level optimizations.

Other opportunities that I see, which apply to vector databases in general, are in the areas of improving usability and performance when vector search is only one part of the application logic.

For example, vector search with filtering is a well-known challenge, where it may be more efficient to do a post-filtering or a pre-filtering operation depending on the filtering condition, and the user's intent is clear: return the nearest neighbors to me that pass the filtering condition.

Right now a lot of systems need the user to specify, okay, I want to apply the filter after the vector search or vice versa. But vector databases should be able to figure out automatically which filtering strategy will give the best performance, without the user needing to specify it. And this extends to combining vector search with other application logic like joins, aggregation, and text search.
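To make the pre- versus post-filtering distinction concrete, here is a deliberately naive sketch using brute-force search over NumPy arrays; a real vector database would use its index and choose the plan itself, and the helper names here are illustrative:

```python
import numpy as np

def post_filter_search(stored, metadata, query, condition, k):
    # Search first, then drop neighbours that fail the filter.
    # Cheap when the filter is permissive, but may return fewer than k results.
    candidates = np.argsort(-(stored @ query))[: k * 10]
    return [i for i in candidates if condition(metadata[i])][:k]

def pre_filter_search(stored, metadata, query, condition, k):
    # Filter first, then search only the survivors.
    # Better when the filter is highly selective.
    keep = [i for i in range(len(metadata)) if condition(metadata[i])]
    order = np.argsort(-(stored[keep] @ query))[:k]
    return [keep[i] for i in order]
```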

I love what you just called out about introducing these AI capabilities into traditional databases and also improving the developer experience around them.

I think that will bring a lot more capabilities to folks without necessarily making them learn how to use new tools or techniques.

And I'm really excited to see what everybody builds, as well as the new improvements that are coming into products in GCP and into open source databases.

So this is very very cool.

Thank you for that great answer.

Yeah.

And next question. So we've been talking a lot about what happens when everything goes right, when we're able to find the most relevant documents or data assets and then use them as part of our projects, either by taking in the information and using it with the extended context windows or by using it directly within the databases.

What happens if retrieval augmented generation doesn't retrieve your relevant documents or your relevant data assets? What can you do?

Shiaoi, do you want to start?

Oh yes. Thanks, Paige.

So it's not uncommon, when RAG systems retrieve documents for a specific query, that if the relevant document was never embedded into the corpus, or the corpus doesn't have the relevant information, some irrelevant documents may be retrieved as the top neighbors for the query. In that case it relies very heavily on the backend large model's factuality capability to understand whether the retrieved documents can or cannot answer the question, and there are other methodologies we can use to make sure the RAG system can still function well when non-relevant documents are retrieved. I will pass it on now to talk about the agent approach.

Yes, thank you, Zaki. So as mentioned, you can tune embeddings and make sure the model can understand them, and the RAG system works well because, when the search part of RAG, the retrieval, works well, then you can improve the generation as well, since the relevant documents are improved.

But what you can also do, instead of just building a simple RAG system where you use embeddings and vector databases to find the most relevant set of documents, say five or ten of them, and then use those documents in your prompt for generation,

is use an agentic system, where an AI agent figures out, based on the query and the prompt, which steps to take. It can call various vector databases, different ones, or it can take different steps such as using grounding services and various different tools and techniques in order to retrieve the most relevant documents and then supply them.

So instead of a static process of "go ahead, find me the top five most relevant documents and feed them in," it's a dynamic process where the agent uses a bunch of tools and technology and figures out how to provide the most relevant content for the prompt. By the way, on the next day, day three, we'll be covering the topic of agents in more detail.

So I'm super excited about this.

Excellent. So it sounds like there are many different things that people can do in order to improve the likelihood of getting the right answer, the most relevant answers, for their RAG approaches.

So this is great, and I'm sure that a lot of people are excited about day number three's content and getting to be hands-on with creating their own agents and using them to solve problems. So, excited to talk about that tomorrow.

And then the next question: the mainstream of LLMs are decoder models.

How can we train an embedding model from a decoder-only model backbone?

Eftakar, do you want to take this one as well?

Sure. Yeah, this question is a bit technical, but it's also very exciting and interesting.

Typically the traditional large language models are decoder-only, and what we mean by decoder-only is that they are trained so that the task is always to predict the next word or the next token, and the model can only look at the words or tokens that came before it.

So it cannot look ahead of the current token, at the words that come after it.

So this is like a unidirectional left to right causal attention and this is very efficient.

We can do lots of optimization during training assuming that this attention is just one-directional and we don't have to look ahead, and as a result this is being adopted more and more for LLM training.

However, for encoder-style tasks, for example if we want to create an embedding for a piece of text, it actually helps if we can have bidirectional attention.

That means every word can look at all the words that came before it and all the words that came after it.

And if you look at the most recent state-of-the-art embedding models, they all start by initializing from decoder-only large language models.

But then they are further adapted, or trained more, to learn bidirectional attention, that is, to pay attention both to the words before a token and the words after it, and that improves the embedding model quality. So in short, the state-of-the-art embedding models do use bidirectional attention, like encoder-style models, but they are usually initialized from decoder-only backbones, and then some additional training happens to make them really strong encoders.
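As a toy illustration of the difference being described: a causal (decoder-style) mask lets token i attend only to tokens at positions up to i, while a bidirectional (encoder-style) mask lets every token attend to every other token. Adapting a decoder-only backbone into an embedding model typically swaps the former for the latter and continues training.

```python
import numpy as np

seq_len = 5

# Decoder-style: lower-triangular mask, each token sees only earlier tokens.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# Encoder-style: all-ones mask, each token sees the full sequence.
bidirectional_mask = np.ones((seq_len, seq_len))

print(causal_mask)
print(bidirectional_mask)
```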

Excellent. Thank you for that great description.

And if folks want to learn more about this, where would be the best places for them to go look?

Do you have any favorite papers or textbooks that they could go to, to learn more?

I think, of course, there is the classic paper about transformers that explains attention really well, and the BERT paper, which shows the advantage of bidirectional attention in transformer-based encoders.

But there are also a few recent papers. One I can mention is from Nvidia, where they explain in depth how they are adapting decoder-only models into encoders, but there is much other similar work both inside Google and externally.

Excellent. And we can make sure to add those paper recommendations to the YouTube video for folks to be able to go and take a look at.

Thank you for that great discussion.

And then our final question before we head into the pop quiz.

So hopefully everybody has been paying attention.

Junjuk, after all of this discussion, does this mean that conventional methods for creating embeddings will become obsolete?

You know, I think we were discussing a little bit around these multimodal techniques.

Do you have any perspective on whether conventional methods might need to change or evolve based on all of the questions and all of the insights that we've been discussing here today?

Yeah. So my name is Gini.

There's a small typo there; it's not "Jun" but J-I-N. But anyway, I think the conventional method, if you're talking about the OCR-related technique where you extract the text first from, say, a PDF and then convert it with a text embedding model, I think there's a good chance that future embedding models that recognize text from an image might be able to capture that into advanced vector representations.

But I think the current multimodal embeddings are not accurate enough to capture all of the exact text in an image. So I would say for the next year or so we might need to combine both methods to take the advantages from both worlds.

Excellent. And I apologize for the typo.

If folks want to learn more about his research, we will make sure to add a link to Google Scholar as well in the YouTube notes.

Awesome. And now, without further ado, we are going into pop quiz territory.

The pop quiz is just five or six questions at the end of every day to test your comprehension of the podcasts, the papers that you've been reading, as well as the Q&A discussion that we've had with all of our subject matter experts here today.

So get out your pens, your pencils, just something to write down your answers, and we'll be walking through each one of the questions in real time.

So make sure to get these questions answered.

First off: what modalities can be converted into embeddings?

We discussed this a little bit in the Q&A session, and you should have read a little bit about it in the white papers as well.

The modalities that can be converted into embeddings: A, image; B, video; C, text; or D, all of the above.

All of the above could be turned into embeddings.

I'm going to give you a few moments to pick your answer, and then count down.

Five 4 3 2 1.

Hopefully everybody picked D.

All of the above. As we were discussing before, we can have text-only embeddings, but we can also have multimodal embeddings that encompass image, video, text, audio, kind of all of the modalities that you can think of.

So definitely explore embeddings not just for text, but also for other use cases.

Question number two: which of the following is a major advantage of ScaNN over other approximate nearest neighbor search algorithms?

We discussed this a little bit in the Q&A, especially during the questions to Alan and Yan about vector DBs and databases.

So get out your pens and your pencils and think about the best answer. A, it is open source and widely available.

B, it is designed for high-dimensional data and has excellent speed-accuracy tradeoffs. C, ScaNN only returns exact matches. Or D, it is based on a simple hashing technique and has low computational overhead. Which is the best answer?

Um, and again, I'm going to do a countdown backwards from five.

Five 4 3 2 1. Hopefully everybody picked B.

ScaNN is designed for high-dimensional data and has excellent speed-accuracy trade-offs.

That's part of the reason why we care so deeply about it at Google: because we need things to be fast and scalable.

So hopefully everybody got question number two correct.

Number three: what are some of the major weaknesses of bag-of-words models for generating document embeddings?

This is hopefully something you got to see a little bit in the white paper, as well as maybe in some of the code labs.

A, they ignore word ordering and semantic meaning.

B they are computationally expensive and require large amounts of data.

C they can't be used for semantic search or topic discovery.

Or D they're only effective for short documents and fail to capture long range dependencies.

What are some of the major weaknesses of bag-of-words models? I feel like I should have Jeopardy music, but five, four, three, two, one.

Hopefully everybody picked A: they ignore word ordering and semantic meaning.

Then, next question. Which of the following is a common challenge when using embeddings for search, and how can it be addressed? Is it that embeddings cannot handle large data sets, and you should use a smaller data set to address it? Is it that embeddings are always superior to traditional search, and there's no need to address any challenge? Is it that embeddings might not capture literal information well, in which case you need to augment and combine them with full-text search? Or is it that embeddings change too frequently, and so you should just try to prevent updates?

Which of the following is the correct answer?

Think about it in five, four, three, two, and one.

It's that embeddings might not capture literal information very well.

And so if you combine them with full-text search, like some of the techniques that we were discussing today, then you will be able to get much more accurate results and hopefully build stronger and more robust systems. And then the final question: what is the primary advantage of using locality-sensitive hashing for vector search?

Is it that, A, it guarantees finding the exact nearest neighbors?

B, that it reduces the search space by grouping similar items into hash buckets?

C, that it is the only method that works for high-dimensional vectors?

Or D, that it always provides the best trade-off between speed and accuracy? What is the primary advantage of using locality-sensitive hashing?

Um hopefully everybody was paying attention in the white papers.

And five, four, three, two, one. It is that it reduces the search space by grouping similar items into hash buckets.
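As a toy sketch of the idea, assuming random-hyperplane hashing (one common LSH scheme): vectors whose projections have the same sign pattern land in the same bucket, so a query only needs to be compared against its own bucket rather than the whole corpus.

```python
import numpy as np
from collections import defaultdict

dim, n_planes = 768, 8
planes = np.random.randn(n_planes, dim)   # random hyperplanes

def lsh_bucket(vec):
    # 8 sign bits -> one of 256 possible buckets.
    bits = (planes @ vec) > 0
    return bits.tobytes()

# Hash a random "corpus" of embeddings into buckets.
corpus = np.random.rand(10_000, dim)
buckets = defaultdict(list)
for i, vec in enumerate(corpus):
    buckets[lsh_bucket(vec)].append(i)

# At query time, only the query's bucket needs to be searched exhaustively.
query = np.random.rand(dim)
candidates = buckets[lsh_bucket(query)]
```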

So as I mentioned, everybody, make sure that you write down your pop quiz answers.

If you missed a question, make sure to review and refer back to those items in the code labs and in the white papers.

And also make sure, for tomorrow, to add all of your great questions about agents in the Discord, so that perhaps our subject matter experts can answer them in real time. So thank you again so much for attending the second day of the generative AI intensive course.

I want to give a big round of applause to all of our subject matter experts for answering the questions, for coming today, and for sharing their knowledge.

And we'll see you back bright and early, or whatever your time zone might be, for day number three, where we'll be discussing agents.

So all the best, and thank you for participating.
