Authors: Yuanzheng Zhu, Guanghua Shu, Raochuan Fan, Vinesh Gudla, Tejaswi Tenneti
Introduction
When people search for items on Instacart, they don’t always type perfectly worded phrases. They might write “bread no gluten” or “x large zip lock” — and that’s okay. Our job is to understand what they mean, not just what they type. This process, called Query Understanding (QU), is the intent engine that helps millions of customers find what they need on Instacart. Getting QU right is essential.
For years, we relied on traditional machine learning models. They worked well for many searches, but we wanted to deliver a truly intelligent experience for the endless variety of uncommon, highly-specific, or creatively phrased queries — what we call long-tail searches.
This pursuit led us to a new paradigm. Instead of building another bespoke model from the ground up, we opted to “stand on the shoulders of giants.” We turned to Large Language Models (LLMs) for their vast pre-trained knowledge. We saw the opportunity not just to use these models, but to steer them into becoming deep domain experts for our vertical. This post details that journey. Our strategy was layered, moving from context-engineering with guardrails to our ultimate goal: fine-tuning to distill proprietary knowledge directly into an LLM. This approach transforms a generalist model into a true specialist. It has shifted our core challenge from feature engineering to productionizing these powerful backbones while managing latency and cost.
Challenges in Traditional Query Understanding
Our journey to LLMs began with examining where traditional QU falls short. While essential for search at Instacart, accurately interpreting user intent is notoriously difficult for several reasons:
- Broad Queries: Queries like “healthy food” or “frozen snacks” are common but difficult to act on. Their lack of specificity makes it challenging to narrow down relevant results, as they can span dozens of categories.
- Lack of Labeled Data: QU operates upstream and doesn’t benefit from direct feedback like clicks or conversions. The pseudo-labels we derive from user behaviors are inherently noisy — a user might search for “bread” but ultimately purchase bananas. Generating clean labels requires costly and time-consuming human evaluation.
- Tail Queries: Highly specific or rare searches like “red hot chili pepper spice” or “2% reduced-fat ultra-pasteurized chocolate milk” suffer from data sparsity. Models trained on engagement data struggle due to limited historical clicks or conversions, leading to poor generalization.
- System Complexity: To solve these problems, we historically trained and maintained multiple independent models for individual QU tasks. For instance, query classification and query rewrites were handled by entirely separate systems, each with its own logic (Figure 1). Each of these bespoke solutions demanded its own data pipeline, training and serving architecture. This heterogeneity introduced inconsistencies, slowed down development cycles, and made the overall QU system difficult to scale and evolve.
Fig 1. Our previous QU involved multiple independent models for individual QU tasks. For instance, query classification relied on a FastText model for multi-label classification, while query rewrites were generated by a separate system that mined user session behavior.
The Advantages of LLMs
To solve these problems, we turned to LLMs to consolidate and enhance our QU models. They offer several key advantages that improve the accuracy and efficiency of Instacart Search:
- World Knowledge and Inference Capabilities: Trained on diverse textual data, LLMs possess world knowledge that enables them to make logical inferences from user queries. For example, an LLM already understands that “Italian parsley” is a synonym for “flat parsley”, while “curly parsley” is a common substitute. This capability dramatically reduces the manual engineering and specialized data required by conventional models, giving us a powerful head start.
- Simplified System: Because LLMs possess broad linguistic abilities, they enable us to consolidate numerous bespoke models. By replacing specialized models with a single LLM that can handle multiple NLP tasks, we eliminate the complexity of maintaining separate models and their inconsistencies.
LLM as QU: Our Strategy in Action
We integrated LLMs by adding Instacart’s domain context in three ways:
- Context-Engineering: Our primary method is Retrieval-Augmented Generation (RAG). We build data pipelines that retrieve and inject Instacart-specific context, such as conversion history and catalog data, directly into the prompt. This grounds the model in our business reality.
- Post-Processing Guardrails: We refine LLM outputs through validation layers. These guardrails filter out hallucinations and enforce alignment with Instacart’s product taxonomy.
- Fine-Tuning for Deep Expertise: For the most advanced use cases, we fine-tune models on proprietary data. This embeds deep domain expertise directly into the model’s weights and represents a key part of our long-term strategy for handling complex, long-tail queries.
The following examples illustrate how we leverage some of these techniques to transform critical QU components.
1. Query Category Classification
Instacart’s catalog is organized into a vast, hierarchical product taxonomy that structures billions of items, from broad departments like “Meat” down to specific sub-categories like “Beef Ribs > Short Ribs”. Accurately classifying queries into our product taxonomy is essential. It directly powers recall and ranking, helping us retrieve items from the right categories and intelligently expand the search when a query is broad or ambiguous.
Our legacy approach treated this as a massive multi-class classification problem. For a given query, the model would predict the top-K most likely categories from a flat list. For example, for “butter milk”, it might predict (“Dairy”, 0.95) and (“Milk”, 0.92) as distinct, non-hierarchical outputs.
This legacy approach suffered from two primary pitfalls. First, being trained on noisy conversion data (e.g., a user searches “bread” but buys bananas) means it can produce irrelevant suggestions. Second, it lacked deeper contextual understanding, preventing it from using world knowledge to classify new or nuanced queries like “vegan roast” correctly, as shown in Table 1.
Our new LLM-powered approach greatly improves precision and recall through a three-step process: first, we retrieve the top-K converted categories for each query as initial candidates; second, we use an LLM to re-rank them with injected Instacart context; and finally, we apply a post-processing guardrail. This filter computes a semantic similarity score between the embeddings of the original query and the LLM’s predicted category path, discarding any pair that falls below our relevance threshold.
Table 1: Comparison of category classification between the legacy model and the new LLM-based approach.
Fig 2: Overview of the LLM for Query Category Classification system
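To make the final guardrail step concrete, here is a minimal sketch of the embedding-similarity filter, assuming a generic sentence-embedding model; the model name, threshold value, and helper function are illustrative, not our production implementation.

```python
# Minimal sketch of the post-processing guardrail: keep only LLM-predicted
# category paths that are semantically close to the original query.
# The embedding model and the 0.6 threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def filter_category_predictions(query: str, category_paths: list[str],
                                threshold: float = 0.6) -> list[str]:
    """Drop any predicted category path whose similarity to the query
    falls below the relevance threshold."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    path_embs = encoder.encode(category_paths, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, path_embs)[0]  # cosine similarity per path
    return [path for path, score in zip(category_paths, scores)
            if float(score) >= threshold]

# Example: plausible paths survive; an off-topic hallucination is discarded.
print(filter_category_predictions(
    "butter milk",
    ["Dairy > Milk > Buttermilk", "Baking Ingredients > Butter", "Household > Cleaning"],
))
```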
2. Query Rewrites
Query rewrites are critical for improving recall, especially when the original query does not return sufficient results. Our legacy system mined candidate rewrites from user session data, but this approach was limited, covering only 50% of search traffic and often failing to generate useful alternatives for product discovery.
To address this, we turned to LLMs. Our initial attempt involved a simple prompt asking a single model to generate rewrites for recall enhancement. This proved too ambiguous. For example, for “1% milk”, the model might return “one percent milk” — a valid synonym but not a useful rewrite for discovering alternative products.
This led us to design specialized prompts for three distinct rewrite types: Substitutes, Broader queries, and Synonyms. Each type is handled by a dedicated prompt with advanced prompt engineering — incorporating specific instructions, chain-of-thought (CoT) reasoning, and few-shot examples. To ensure the results are logical and useful, we apply post-processing guardrails, including filters for semantic relevance. This structured approach increased our query rewrite coverage to over 95% with 90%+ precision across all three types.
Building on this success, we are now adopting context engineering to make rewrites more likely to convert, more personalized, and session-aware. We achieve this by injecting user engagement signals, such as the top-converting product categories from a user’s subsequent searches in the same session.
Table 2: Examples of structured query rewrites generated by specialized LLMs
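To illustrate the idea of a dedicated prompt per rewrite type, here is a hedged sketch of how a substitute-rewrite prompt might be assembled with instructions, chain-of-thought guidance, and a few-shot example; the wording and helper function are ours for illustration, not the production prompts.

```python
# Illustrative sketch of a per-type rewrite prompt (not the production prompt).
# Each rewrite type (substitute / broader / synonym) gets its own instructions,
# chain-of-thought guidance, and few-shot examples.
SUBSTITUTE_PROMPT = """You are a grocery search assistant.
Task: given a search query, suggest product substitutes a shopper could buy instead.
Think step by step about what the shopper wants and what could replace it, then answer.

Example:
Query: "oat milk"
Reasoning: the shopper wants a non-dairy milk; other plant milks are close substitutes.
Rewrites: ["almond milk", "soy milk"]

Query: "{query}"
Reasoning:"""

def build_rewrite_prompt(rewrite_type: str, query: str) -> str:
    # Broader-query and synonym templates are omitted for brevity.
    templates = {"substitute": SUBSTITUTE_PROMPT}
    return templates[rewrite_type].format(query=query)

print(build_rewrite_prompt("substitute", "1% milk"))
```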
3. Semantic Role Labeling (SRL)
Semantic Role Labeling (SRL) is the task of extracting structured concepts from a user query, such as product, brand, and attributes. These tags are critical for everything from search retrieval and ranking to ad targeting and filters.
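For concreteness, the structured tags for a single query might look like the sketch below; the field names and tag set are illustrative assumptions, not Instacart’s exact schema.

```python
# Illustrative SRL output for one query; field names and the tag set are
# assumptions for this sketch, not Instacart's production schema.
from dataclasses import dataclass, field

@dataclass
class QueryTags:
    query: str
    product: str | None = None
    brand: str | None = None
    attributes: list[str] = field(default_factory=list)

# "2% reduced-fat ultra-pasteurized chocolate milk" -> structured concepts
example = QueryTags(
    query="2% reduced-fat ultra-pasteurized chocolate milk",
    product="chocolate milk",
    attributes=["2% reduced-fat", "ultra-pasteurized"],
)
print(example)
```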
Our goal was to leverage the power of LLMs to generate high-quality tags. However, the power-law nature of search traffic presents a challenge: we can’t pre-compute results for every possible query because the “long-tail” of new and unique searches is effectively infinite, and offline LLM processing is expensive.
To solve this, we designed a hybrid system. A powerful offline process generates high-quality data that serves two purposes: populating a cache for our most common “head” queries and creating the training data for a fast, real-time model that handles the “long-tail.” The system’s flow, shown in the diagram below, is determined simply by whether a query hits the cache.
Fig 3. Architecture of the hybrid SRL system. Live traffic is routed based on a cache hit. High-frequency “head” queries are served instantly from the cache, while “tail” queries are handled by a real-time, fine-tuned model. The entire system is powered by an offline pipeline that generates data to both populate the cache and train the real-time model.
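A minimal sketch of that routing logic is below, assuming a key-value cache of precomputed tags and a callable real-time model; both names are placeholders.

```python
# Minimal sketch of the hybrid routing: serve head queries from the offline
# cache, and fall back to the fine-tuned real-time model on a cache miss.
# `tag_cache` and `realtime_srl_model` are placeholder names for illustration.
def get_query_tags(query: str, tag_cache: dict, realtime_srl_model) -> dict:
    normalized = query.strip().lower()
    cached = tag_cache.get(normalized)
    if cached is not None:                 # head query: precomputed offline
        return cached
    return realtime_srl_model(normalized)  # tail query: real-time "student" model
```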
The Offline System (“Teacher”): Generating High Quality Data at Scale
For our high-frequency “head” queries, we run an offline Retrieval-Augmented Generation (RAG) and caching pipeline. Because latency is not a concern here, we can use complex techniques to ensure the highest possible quality. The core of this is context-engineering: enriching the prompt with deep Instacart-specific knowledge.
Fig 4. Overview of RAG pipeline for query tagging. Context-engineering injects Instacart domain knowledge to ground the LLM’s inference and generate far more accurate intent signals. (Note: Brand examples used for illustration are fictitious.)
Consider the query “verdant machine”. Without context, an LLM might assume it’s for machinery. Our offline pipeline, however, automatically enriches the prompt with crucial context from our internal data systems, including:
- Historical Conversion Data: The top converted brand (MuchPure) and categories (Smoothie Juices).
- Product Catalog Information: Product brand names with high semantic similarity, ranked by embedding scores.
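A hedged sketch of how this retrieved context might be folded into the prompt is shown below; the field names and wording are illustrative, and the brand values are fictitious, as in Fig 4.

```python
# Illustrative context-engineering for the offline "teacher" pipeline: retrieved
# signals are injected into the prompt so the LLM grounds its tags in Instacart
# data rather than generic world knowledge. Example values are fictitious.
def build_srl_prompt(query: str, top_brands: list[str], top_categories: list[str],
                     similar_catalog_brands: list[str]) -> str:
    return (
        "Extract product, brand, and attribute tags from the search query.\n"
        f"Query: {query}\n"
        f"Top converted brands for this query: {', '.join(top_brands)}\n"
        f"Top converted categories: {', '.join(top_categories)}\n"
        f"Catalog brands with similar names: {', '.join(similar_catalog_brands)}\n"
        "Return JSON with keys: product, brand, attributes."
    )

print(build_srl_prompt(
    "verdant machine",
    top_brands=["MuchPure"],                       # fictitious, per Fig 4
    top_categories=["Smoothie Juices"],
    similar_catalog_brands=["MuchPure Verdant Machine"],
))
```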
Armed with this context, the model correctly infers the user’s intent: they are looking for a smoothie brand. After generation, a post-processing guardrail validates the tags against our catalog. This rigorous process has two critical outputs:
- A low-latency cache containing the validated, high-quality tags for our most common queries.
- A high-quality training dataset, which is used to teach a lightweight real-time model.
The Real-Time System (“Student”): A Fine-Tuned Model for the Long-Tail
When a user’s query results in a cache miss (indicating a long-tail query), it is routed to our real-time model. This is a language model with a much smaller backbone (like Llama-3-8B) that is fast and cost-effective for live inference.
Crucially, this model was fine-tuned on the high-quality “curriculum” dataset produced by our offline “teacher” pipeline. By doing this, the smaller model learns to replicate the accuracy of its much larger counterpart, along with the domain context we injected. This allows us to deliver a consistent, high-quality experience for virtually any query a user types. This hybrid approach gives us the best of both worlds: the raw power of massive LLMs, and the speed and efficiency of a lightweight, fine-tuned model.
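The “curriculum” the student learns from can be pictured as instruction-tuning records like the sketch below; the JSONL layout is an assumption on our part, not the exact production format.

```python
# Illustrative instruction-tuning record for distillation: the offline teacher's
# validated output becomes the target the smaller student model is trained to
# reproduce. The JSONL layout is an assumption, not the exact production format.
import json

record = {
    "instruction": "Extract product, brand, and attribute tags from the search query.",
    "input": "verdant machine",
    "output": json.dumps({"product": "smoothie", "brand": "MuchPure", "attributes": []}),
}

with open("srl_distillation.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```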
Building a New Foundation: Fine-Tuning for Real-Time Inference
The success of the real-time “student” model in our SRL system was more than just a win for one project; it proved the viability of a new foundational capability for Instacart: fine-tuning smaller, open-source models to serve our specific needs at scale.
While the SRL system was the first production application, the process of building and deploying this model established a blueprint for future innovation across our platform. Here’s a closer look at how we did it.
Distilling Knowledge via Fine-Tuning
For the real-time SRL model, we fine-tuned an open-source Llama-3-8B model using LoRA (Low-Rank Adaptation). The model was trained on the dataset from the offline “teacher” pipeline. This process effectively distilled the knowledge and nuanced context from the larger model into the smaller, more efficient one.
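For readers less familiar with LoRA, here is a minimal sketch of attaching a LoRA adapter to a Llama-style base model with Hugging Face PEFT; the rank, alpha, and target modules are common defaults rather than our exact production hyperparameters.

```python
# Minimal LoRA setup sketch using Hugging Face PEFT; the hyperparameters shown
# (rank, alpha, target modules) are common defaults, not production values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only a small fraction of weights train
```

Because the base weights stay frozen and only the low-rank adapter is trained, fine-tuning is cheap and the adapter can later be merged back into the base model for serving.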
The results were remarkable. Our fine-tuned 8B model performs on par with the much larger frontier model it learned from, achieving a similar F1-score with higher precision.
Fig 5. Our fine-tuned 8B model achieves performance on par with a much larger foundation model. Compared with the baseline (dark blue), our production model (orange) has higher precision (96.4% vs 95.4%), lower recall (95% vs 96.2%), and an on-par F1 score (95.7% vs 95.8%).
The Path to Production: Taming Real-Time Latency
Having a great model is only half the battle; serving it in production with a latency target in the low hundreds of milliseconds was a significant engineering challenge. The out-of-the-box latency was nearly 700ms on an A100 GPU. We reduced latency through a series of crucial optimizations:
- Adapter Merging & Hardware Upgrade: Merging the LoRA adapter weights directly into the base model and upgrading to H100 GPUs got us to our 300ms target (a sketch of the merge step follows this list).
- Quantization Trade-Offs: We explored quantization (FP8), which cut latency by another 10% but with a slight drop in recall. We deployed the unquantized model to prioritize quality.
- Cost Management: We enabled GPU autoscaling to run on fewer GPUs during off-peak hours, reducing costs without compromising performance.
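For the adapter-merging step called out in the first bullet above, a minimal sketch using PEFT’s merge_and_unload is shown below; the model and adapter paths are placeholders.

```python
# Sketch of merging LoRA adapter weights into the base model for serving.
# After merge_and_unload(), inference runs on a single dense model with no
# adapter indirection. Model and adapter paths are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "path/to/srl-lora-adapter")  # placeholder path
merged = model.merge_and_unload()          # fold the LoRA deltas into the base weights
merged.save_pretrained("path/to/merged-srl-model")
```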
A/B testing confirmed the success: the real-time LLM meaningfully improved search quality for the bottom 2% of queries. With the new SRL tagging for tail queries, we reduced “average scroll depth” by 6% (users find items faster), with only a marginal latency increase. The system is now live, serving millions of cold-start queries weekly and reducing user complaints related to poor search results for tail queries by 50%.
Key Takeaways
Here’s what we learned from putting LLMs into our production search system:
- Context is the Defensible Moat: A generic LLM is a commodity; your business context is what makes your application defensible. Domain knowledge is your most valuable asset, but it is vast, noisy, and dynamic. It includes everything from user engagement signals (what products are actually purchased after a search?) to real-world constraints (what’s on the shelf at a specific store right now?). In the past, injecting this data into traditional ML models was difficult and brittle. The central challenge today is how to effectively encode this knowledge into an LLM. Through our work, we found a clear hierarchy of effectiveness, each approach with its own engineering trade-offs: Fine-tuning > Context-Engineering (RAG) > Prompting. Each method progressively transforms a generalist model into a true domain expert.
- Start Offline, Go Real-Time Strategically: To manage costs and prove value, we began with an offline LLM pipeline on high-frequency “head” queries. This cost-effective approach handled the bulk of traffic and generated the data needed to later train a “student” model for the long tail.
- Consolidate, Don’t Complicate: We simplified our stack by replacing numerous legacy models with a single LLM backbone, reducing maintenance and accelerating development.
- The Model is Only Half the Battle: A great model is useless if it can’t serve traffic at scale. We turned potential into impact through crucial production engineering: adapter merging cut latency by 30%, smart caching meant only 2% of queries needed real-time inference, and GPU autoscaling managed costs effectively.
Ultimately, this journey has armed us with more than just a more intelligent QU system; it has laid a new foundation for the future of eCommerce search. Looking ahead, we are expanding beyond single-query search to build a smarter, context-aware system. This means building a system that can understand a user’s entire journey and distinguish between complex intents — differentiating a search for “lasagna ingredients” (item search) from a query for a “quick lasagna recipe” (content discovery) or a request for “lasagna delivery near me” (restaurant search). By understanding this context, we can guide users to the perfect experience, creating a seamless journey across all of Instacart’s offerings.
Acknowledgments
This project would not have been realized without the collaboration of multiple teams across the company, including ML, backend, and infra teams. Special thanks to Tina He, Akshay Nair, Xiao Xiao, Mostafa Rashed, Kevin Lei, Sudha Rani Kolavali and Jonathan Bender who also contributed to this work and made this vision a reality. We’d also like to thank Naval Shal and Eric Hacke for their thoughtful and thorough review of the blog post.