How DoorDash Uses AI in Its Architecture to Handle Millions of Data Points
By Kiki's Bytes
Summary
## Key takeaways

- **SKU Enrichment Outsourced Manually**: When DoorDash onboarded a new merchant, it faced product data in various formats with missing or incorrect information, so it outsourced SKU enrichment to contract operators who performed it manually, which was costly and error-prone. [01:01], [01:22]
- **Token Classifier for Attribute Extraction**: DoorDash wanted an extraction model such as a token classifier, which breaks text into tokens and assigns each a label (e.g., 'Dove' as brand and '500 ml' as size), but building one in-house required extensive training data. [01:42], [02:23]
- **Multi-Step LLM Brand Extraction Pipeline**: The LLM pipeline first applies an in-house brand classifier; uncertain SKUs go to an LLM for extraction, then a second LLM checks an internal knowledge graph to avoid duplicates, adding new brands and retraining the classifier. [02:44], [03:04]
- **LLM Agent Searches for Organic Labels**: For organic product labeling, after exact string matching, LLMs judge whether a product is organic from merchant info or OCR'd packaging photos; low-confidence cases go to an LLM agent that conducts online searches and reasons over the findings. [03:23], [04:05]
- **RAG Accelerates Entity Resolution Labeling**: To match identical SKUs, they use OpenAI embeddings and approximate nearest neighbor search to retrieve similar products from golden annotation sets, then feed those to GPT-4 as context to label unannotated SKUs, reducing hallucinations. [04:26], [05:31]
Topics Covered
- Manual SKU Enrichment Fails at Scale
- LLMs Bypass Cold Start Data Problem
- Hybrid LLM Pipeline Extracts Brands
- RAG Accelerates Entity Resolution
- Multimodal LLMs Unlock Image Extraction
Full Transcript
Have you ever dealt with unstructured data? Imagine having to process millions of retail product descriptions in real time and making sense of them. This is a video on how DoorDash used large language models to transform its retail catalog management, automatically extracting and categorizing product information at an unprecedented scale. Let's dive right in.

When DoorDash decided to expand beyond restaurant deliveries, they encountered a new challenge: managing a wide variety of product information. Each item is represented by a list of product attributes known as the stock keeping unit, or SKU. For instance, these Coca-Cola cans and these Pepsi cans would have different SKUs. To provide the best experience, DoorDash needed high-quality attributes for each product. Think of it as the difference between simply saying "Coke" versus "Coca-Cola Original Flavor, 12 fl oz can, pack of 12." These detailed attributes are crucial for three reasons: they help customers find exactly what they want, improving their experience; they enable delivery drivers to locate the right items more efficiently; and they allow for more personalized recommendations and promotions.
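To make the attribute idea concrete, here is a minimal sketch of what a structured SKU record might look like. The field names and types are illustrative assumptions, not DoorDash's actual schema.

```python
# Hypothetical sketch of the structured attributes a single SKU might carry.
# Field names are illustrative, not DoorDash's actual schema.
from dataclasses import dataclass

@dataclass
class SKUAttributes:
    brand: str          # e.g. "Coca-Cola"
    product_name: str   # e.g. "Original Flavor"
    size: float         # e.g. 12.0
    size_unit: str      # e.g. "fl oz"
    pack_count: int     # e.g. 12

coke = SKUAttributes(
    brand="Coca-Cola",
    product_name="Original Flavor",
    size=12.0,
    size_unit="fl oz",
    pack_count=12,
)
```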
But with millions of products to manage, manually handling this data was not scalable. When DoorDash onboarded a new merchant, they faced a significant challenge: integrating the merchant's product data into their retail catalog. The data often came in various formats with missing or incorrect information. To address this, DoorDash implemented a process called SKU enrichment, which involves standardizing and enhancing the raw merchant data to ensure accurate product attributes for each SKU. However, this process was outsourced to contract operators who had to perform the task manually. Not only was it costly, it was also susceptible to errors. DoorDash soon realized that this manual approach was not sustainable for the long term; they needed an automated solution that could handle the scale and complexity of their expanding catalog.
This is where machine learning came into play. What DoorDash wanted was an extraction model like a token classifier. A token classifier works by breaking text like "Dove Silk Glow Body Wash 500 ml" into individual words known as tokens, then assigning a label or category to each token. In this example, "Dove" would be the brand, "500" the size, and "ml" the size unit. However, building an in-house attribute extraction model requires an extensive amount of training data to reach good accuracy. This is referred to as the cold start problem in natural language processing, where a significant amount of labeled training data is required to build an accurate model from scratch.
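As an illustration of the input/output contract a token classifier satisfies, here is a toy rule-based sketch; a real classifier would learn these labels from annotated data rather than rely on hand-written rules, and the label names here are assumptions.

```python
# A toy illustration of a token classifier's input/output contract, not a
# trained model. Real systems learn these labels from annotated examples.
import re

# Hypothetical label vocabulary for product titles.
KNOWN_BRANDS = {"dove"}
UNIT_PATTERN = re.compile(r"^(ml|oz|l|g|kg)$", re.IGNORECASE)

def classify_tokens(title: str) -> list[tuple[str, str]]:
    """Split a product title into tokens and assign each a label."""
    labels = []
    for token in title.split():
        if token.lower() in KNOWN_BRANDS:
            label = "BRAND"
        elif token.isdigit():
            label = "SIZE"
        elif UNIT_PATTERN.match(token):
            label = "SIZE_UNIT"
        else:
            label = "PRODUCT"
        labels.append((token, label))
    return labels

print(classify_tokens("Dove Silk Glow Body Wash 500 ml"))
# [('Dove', 'BRAND'), ('Silk', 'PRODUCT'), ('Glow', 'PRODUCT'),
#  ('Body', 'PRODUCT'), ('Wash', 'PRODUCT'), ('500', 'SIZE'), ('ml', 'SIZE_UNIT')]
```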
And I thought Lambda's cold start was cumbersome! To sidestep this, they decided to utilize large language models, which are deep learning models trained on vast amounts of data. These models can perform natural language processing tasks with reasonable accuracy without requiring many, if any, labeled examples. DoorDash implemented this in various projects, including brand extraction and organic product labeling.
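Here is a minimal sketch of what zero-shot attribute extraction with an LLM might look like, assuming the OpenAI Python SDK; the prompt wording and model name are illustrative, not DoorDash's actual setup. The key point is that no labeled training examples are needed, which is what sidesteps the cold start problem.

```python
# A minimal sketch of zero-shot attribute extraction via an LLM, assuming the
# OpenAI chat API. Prompt and model choice are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the brand, size, and size unit from this product title. "
    "Reply with JSON only, e.g. "
    '{"brand": "...", "size": "...", "size_unit": "..."}.\n\n'
    "Title: Dove Silk Glow Body Wash 500 ml"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
)
attributes = json.loads(response.choices[0].message.content)
print(attributes)  # e.g. {"brand": "Dove", "size": "500", "size_unit": "ml"}
```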
Brands are crucial for distinguishing products and enhancing features like sponsored ads and product affinity. The new large language model pipeline uses a multi-step approach. First, given SKU data, an in-house brand classifier attempts to identify the brand. SKUs that cannot be confidently tagged are passed to a large language model for brand extraction. A second large language model then checks for similar brands in an internal knowledge graph to avoid duplicates. If no match exists, the new brand is added to the knowledge graph and the in-house classifier is retrained. This LLM-powered system allows DoorDash to proactively identify new brands at scale, significantly improving both the efficiency and accuracy of brand ingestion.
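Below is a hedged sketch of that multi-step flow. Every component is a toy stand-in for DoorDash's internal systems, and the confidence threshold is an assumed value for illustration.

```python
# A sketch of the multi-step brand ingestion pipeline described above.
# All components are hypothetical stand-ins for DoorDash's internal systems.

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff, not a published value
KNOWLEDGE_GRAPH = {"coca-cola", "pepsi", "dove"}  # toy stand-in for the graph

def inhouse_brand_classifier(title: str) -> tuple[str, float]:
    """Stub for the in-house classifier: returns (brand, confidence)."""
    first_word = title.split()[0].lower()
    confidence = 0.95 if first_word in KNOWLEDGE_GRAPH else 0.3
    return first_word, confidence

def llm_extract_brand(title: str) -> str:
    """Stub for the LLM extraction step; a real system would call a model."""
    return title.split()[0].lower()

def llm_match_in_graph(candidate: str) -> str | None:
    """Stub for the second LLM that checks the knowledge graph for
    near-duplicates (e.g. 'Coca Cola' vs 'Coca-Cola')."""
    return candidate if candidate in KNOWLEDGE_GRAPH else None

def ingest_brand(title: str) -> str:
    brand, confidence = inhouse_brand_classifier(title)
    if confidence >= CONFIDENCE_THRESHOLD:
        return brand                      # confidently tagged in-house
    candidate = llm_extract_brand(title)  # fall back to the LLM
    match = llm_match_in_graph(candidate)
    if match is not None:
        return match                      # existing brand, no duplicate added
    KNOWLEDGE_GRAPH.add(candidate)        # new brand enters the graph;
    return candidate                      # the classifier is later retrained

print(ingest_brand("Dove Silk Glow Body Wash 500 ml"))  # 'dove' (step 1)
print(ingest_brand("Liquid Death Mountain Water 500 ml"))  # new brand added
```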
They also developed a model to label organic grocery products, aiming to enhance personalized discovery experiences for customers. The process starts with simple string matching, looking for exact mentions of the word "organic" in product titles. While this method offers high precision, it misses cases where misspellings or alternative presentations exist. To address these issues, DoorDash employed large language models to determine whether a product is organic based on available product information. This information can be provided by the merchant directly or extracted using optical character recognition from packaging photos. For products where the large language model is not confident, they use a large language model agent that can conduct online searches to retrieve product information and pass the findings back to another large language model for reasoning.
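Here is a sketch of that three-step cascade, with hypothetical stubs standing in for the OCR step, the LLM call, and the search agent; the confidence threshold is an assumption.

```python
# A sketch of the organic-labeling cascade: exact string match, then an LLM
# over available product text, then a search agent for low-confidence cases.
# The helpers are hypothetical stubs; 0.8 is an assumed threshold.

def ocr_packaging_text(sku: dict) -> str:
    """Stub: a real system would run OCR over the packaging photo."""
    return sku.get("ocr_text", "")

def llm_is_organic(text: str) -> tuple[bool, float]:
    """Stub: a real system would prompt an LLM and parse its verdict."""
    if "usda organic" in text.lower():
        return True, 0.95
    return False, 0.4   # low confidence: not enough signal in the text

def agent_decides_organic(title: str) -> bool:
    """Stub: a real agent would run web searches and have a second LLM
    reason over the findings."""
    return False

def label_organic(sku: dict) -> bool:
    # Step 1: high-precision exact string match on the title.
    if "organic" in sku["title"].lower():
        return True
    # Step 2: LLM judgment over merchant info or OCR'd packaging text.
    text = sku.get("merchant_info") or ocr_packaging_text(sku)
    verdict, confidence = llm_is_organic(text)
    if confidence >= 0.8:
        return verdict
    # Step 3: low-confidence cases go to the search agent.
    return agent_decides_organic(sku["title"])

print(label_organic({"title": "Happy Farms Whole Milk",
                     "ocr_text": "USDA Organic, Grade A"}))  # True (step 2)
```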
There was still another issue to tackle: determining whether two SKUs refer to the same underlying product, a process called entity resolution. This is a hard problem because it requires accurate extraction of all applicable attributes for each product, which vary across product categories. (Yes, we use bags for our milk in Canada.) To overcome the limitation of having few human-generated annotations, they leveraged large language models and retrieval-augmented generation to accelerate the labeling process.
Retrieval-augmented generation (RAG) is a technique that combines the strengths of information retrieval and generative language models: it first retrieves the relevant documents or data from a knowledge base based on a query, then uses a generative model to produce a coherent and contextually accurate response. The system uses OpenAI embeddings to represent SKUs as vectors of floating-point numbers in order to find related products. This is done using approximate nearest neighbor search: when a query comes in, the algorithm only searches within the most relevant group from their golden annotation sets, trading a little accuracy for a significant speed improvement. The retrieved golden annotations are then fed to GPT-4 as context to generate labels for unannotated SKUs. Selecting similar examples this way reduces the likelihood of hallucination in the large language model's outputs.
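Here is a minimal sketch of that RAG labeling loop, assuming the OpenAI Python SDK. For clarity it uses brute-force cosine similarity over a tiny in-memory list where production would use an approximate nearest neighbor index; the golden annotations, model names, and prompt wording are all illustrative.

```python
# A minimal sketch of RAG-based entity-resolution labeling. Exact nearest
# neighbor is used here for clarity; at scale an ANN index would replace the
# brute-force search. Data, models, and prompt are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Golden annotation set: human-labeled (sku_title, canonical_product) pairs.
GOLDEN = [
    ("Coca-Cola Original 12 fl oz can, 12 pack", "coca-cola-12oz-12pk"),
    ("Coke Classic 12oz cans (12 ct)",           "coca-cola-12oz-12pk"),
    ("Pepsi Cola 12 fl oz, 12 pack",             "pepsi-12oz-12pk"),
]

def embed(texts: list[str]) -> np.ndarray:
    """Represent SKU titles as vectors of floating-point numbers."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

golden_vecs = embed([title for title, _ in GOLDEN])

def label_sku(title: str, k: int = 2) -> str:
    # Retrieve the k most similar golden examples (brute-force cosine here;
    # approximate nearest neighbor in production).
    query = embed([title])[0]
    sims = golden_vecs @ query / (
        np.linalg.norm(golden_vecs, axis=1) * np.linalg.norm(query))
    nearest = [GOLDEN[i] for i in np.argsort(sims)[::-1][:k]]

    # Feed the retrieved examples to GPT-4 as context, so it labels by
    # analogy with similar annotated SKUs, which curbs hallucination.
    examples = "\n".join(f"{t} -> {label}" for t, label in nearest)
    prompt = (f"Match the SKU to a canonical product ID, following these "
              f"annotated examples:\n{examples}\n\n{title} ->")
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip()

print(label_sku("Coca Cola original flavour, 12-pack of 12 fl oz cans"))
```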
In the future, DoorDash is exploring multimodal large language models to process both text and images for attribute extraction, addressing the limitations of text-only approaches. They're experimenting with visual QA and chat-plus-OCR approaches, and building infrastructure for Dashers to take product pictures for direct attribute extraction.
What do you think of their approach? I think it's really cool that large language models are so integrated into their system. I hope you got to learn something new today, and as always, thank you so much for watching. See you next time!