Source: Image by Amazon. Screenshot by author. Find me on LinkedIn: Onyekachukwu Ojumah
Launching an e-commerce site without historical search data can reduce search to bare keyword matching, which makes no allowance for the complexity of human communication. The result is frustration: shoppers cannot find products you stock, or they are shown products irrelevant to their search, and ultimately attrition rises.
What is needed is a method to generate plausible, diverse, and attribute‑aware search queries from product information, then organise and annotate those queries to train downstream systems and validate search behaviour before real‑world traffic arrives.
Large language models (LLMs) are powerful, as many use cases have shown, but they are not magicians. The naive approach of asking ChatGPT to “generate search query examples” might yield a few dozen decent results, but scaling that to thousands quickly devolves into repetition, vagueness, and off-catalogue noise while costs climb. Trust me, I have been there.
That does not mean LLMs cannot solve the cold-start problem; it means an unconstrained LLM won’t.
In this article, I describe an AI-driven architecture that successfully produced 8,000+ realistic, catalogue-grounded search queries. The approach integrates internal and external product data, LLMs, human feedback, and prompt fine-tuning to generate, categorise, and understand search queries at scale.
The architecture can be broken down into four core segments: 1) Data Sourcing and Query Generation, 2) Query Enrichment, 3) Structured Data Creation, and 4) Model Training and Evaluation.
Source: Image by the author.
Step 1: Data Sourcing and Realistic Query Generation
The foundation of any good search system is data. This pipeline aggregates information from multiple streams to ensure comprehensive coverage, moving far beyond a company’s internal jargon.
- Internal Product Data: The process begins with the structured attributes of soon-to-launch products (Title, Description, Brand, Colour, Size, Finish, Material). This is the ground truth for what exists in the catalogue.
- External Open-Source Data: To capture the language customers use outside the company’s ecosystem, the system scrapes product data from the web that relates to the internal catalogue. This helps bridge the gap between internal product names and the descriptive, long-tail terms real people use to search. For example, here is a product listing from Amazon: “ROYALJOBO Ergonomic Mesh Executive Office Chair-Adjustable High Back Desk Chair with Lumbar & Neck Support, Flip-Up Arms for Home Office, Gaming, Computer Work (Black)”.
The core of this stage is a Large Language Model (LLM). The LLM is prompted to generate thousands of realistic, variable-length search queries for each product. For the Amazon product mentioned above, it might generate:
- “Black office chair”
- “Home office chairs with neck support”
- “adjustable gaming chairs”
- “office chairs”
It is best to generate at most seven search queries of different word lengths for each product. These form the first set of search queries in the pool.
Synthetic data generation is the key to overcoming the cold-start problem for new products that have no historical search data.
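As a minimal sketch of this generation step, assuming the OpenAI Python client (the model name, prompt wording, and record fields below are illustrative, not the production setup):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical catalogue record; the field names are illustrative.
product = {
    "title": "ROYALJOBO Ergonomic Mesh Executive Office Chair",
    "brand": "ROYALJOBO",
    "colour": "Black",
    "material": "Mesh",
    "attributes": "Adjustable high back, lumbar & neck support, flip-up arms",
}

PROMPT = """You generate realistic e-commerce search queries.
Given the product below, write at most 7 queries a real shopper might type.
Vary the length (1-6 words), mix attributes (colour, material, use case),
and stay grounded in the product; do not invent features.
Return a JSON array of strings only.

Product: {product}"""

def generate_queries(product: dict) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(product=json.dumps(product))}],
        temperature=0.8,  # some diversity, while the prompt keeps queries on-catalogue
    )
    # In practice you would guard against non-JSON output before parsing.
    return json.loads(response.choices[0].message.content)

print(generate_queries(product))
# e.g. ["black office chair", "home office chair with neck support", ...]
```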
Step 2: Enrichment and Quality Control
Raw generated data is often noisy. This stage ensures quality, linguistic variety, and, most importantly, structures the search queries for two critical downstream tasks: Search Classification and Entity Recognition.
- Quality Filtering with LLM: An LLM-based filter is employed to remove vague or unrealistic queries (e.g., “that thing for kitchen that is cold” can be filtered out or flagged for review). This ensures only high-quality, plausible search terms are used for training.
- Paraphrasing & Re-ordering: This creates variants like “Black office chair” vs. “Office chair black”, which are added to the pool of search queries.
- Pluralisation: Systematically generating plural and singular forms of queries (e.g., “Black office chair” and “Black office chairs”) ensures the search engine can match products regardless of the grammatical number used in the query or the product catalogue. This dramatically improves recall. These variants are also added to the pool (a sketch of the re-ordering and pluralisation steps follows this list).
- Building a Knowledge Base: Synonyms for all products and specific attributes (like colour) are compiled. This creates a rich vocabulary that understands that a customer searching for a “burgundy” sofa might also look for a “maroon” or “oxblood” one.
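A rough sketch of the re-ordering and pluralisation steps, using the inflect library for grammatical number (the function names are mine; a production pipeline would also run the LLM quality filter and synonym lookup described above):

```python
import inflect  # pip install inflect

_inflect = inflect.engine()

def pluralisations(query: str) -> set[str]:
    """Return the query with its head noun (assumed to be the last token)
    in both singular and plural form."""
    tokens = query.lower().split()
    head = tokens[-1]
    singular = _inflect.singular_noun(head) or head  # False when already singular
    plural = _inflect.plural_noun(singular)
    return {" ".join(tokens[:-1] + [singular]), " ".join(tokens[:-1] + [plural])}

def reorderings(query: str) -> set[str]:
    """Simple re-ordering variant: move the leading attribute to the end,
    e.g. 'black office chairs' -> 'office chairs black'."""
    tokens = query.split()
    variants = {query}
    if len(tokens) > 1:
        variants.add(" ".join(tokens[1:] + tokens[:1]))
    return variants

def enrich(query: str) -> set[str]:
    """Pluralise first, then re-order, so the head noun stays intact."""
    pool: set[str] = set()
    for variant in pluralisations(query):
        pool |= reorderings(variant)
    return pool

print(enrich("black office chair"))
# {'black office chair', 'black office chairs',
#  'office chair black', 'office chairs black'}
```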
Step 3: Structured Data Creation
The enriched queries are prepared for machine learning in two parallel streams. Crucially, this process is not a one-off but an iterative, human-guided feedback loop designed for rapid improvement.
- For the Search Classification API (Taxonomy Mapping): Business categories (e.g., “Furniture → Chairs”) are converted into a flattened, machine-readable search category hierarchy (e.g., |furniture|chairs).
- Iterative LLM-Assisted Labelling: An LLM is used in a batch process to map each generated search query to the most relevant search category. This process is refined over multiple batches with human feedback:
- Batch 1: The initial LLM output is reviewed by humans. For instance, the query “blue sofa with metal legs” might be correctly labelled with the category |furniture|livingroom|. However, the model might misclassify a “Persian rug” under |home|decor| instead of the more accurate |home|furniture|rugs|. These mistakes are identified and cleaned.
- Prompt Adjustment: The prompts for the LLM are systematically adjusted based on these mistakes. The revised prompt would now include positive examples (e.g., “Query: blue sofa → Category: |furniture|livingroom|”) and negative examples with corrections (e.g., “Query: Persian rug → Incorrect Category: |home|decor| → Correct Category: |home|furniture|rugs|”). A sketch of such a revised prompt appears after this list.
- Batch 2 & 3: The refined LLM, now with clearer guidelines and examples, processes new data. Its output is significantly improved but may still be reviewed for subtle edge cases, leading to further prompt tuning.
- Batch 4+: After several iterations, the LLM’s categorisation becomes highly accurate. It can now reliably assign the category |furniture|livingroom| to various related queries like “comfortable couch,” “three-seater sofa,” or “leather loveseat,” requiring minimal human review. This creates the final, high-quality labelled training dataset (search_query, search_category).
- For the Search Entity Recognition API: An LLM acts as an “Automated Annotator” to perform detailed entity extraction on each query. For the query “blue sofa with metal legs,” the goal is to generate structured JSON: { “colour”: “blue”, “product_type”: “sofa”, “component”: “legs”, “material”: “metal” }. Humans review the generated JSON, correcting inconsistent labels (e.g., “color” vs. “colour”) or missed entities. The entity extraction prompts are updated with these corrected examples and stricter formatting rules. After several batches, the LLM produces consistently accurate and well-structured entity JSON, creating a reliable training dataset of (search_query, entity_json) pairs. A sketch of this annotator also follows this list.
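To make the prompt-refinement loop concrete, here is a hedged sketch of what a revised classification prompt might look like once the Batch 1 corrections are folded in (the wording and taxonomy entries are illustrative, not the production prompt):

```python
# Illustrative few-shot classification prompt after one round of human review.
CLASSIFY_PROMPT = """You map e-commerce search queries to exactly one category
from the flattened taxonomy below. Answer with the category path only.

Taxonomy: |furniture|livingroom|, |home|furniture|rugs|, |home|decor|, ...

Positive examples:
Query: blue sofa -> Category: |furniture|livingroom|

Corrected mistakes from earlier batches:
Query: Persian rug -> Incorrect: |home|decor| -> Correct: |home|furniture|rugs|

Query: {query} -> Category:"""
```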
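And a minimal sketch of the automated annotator itself, again assuming the OpenAI client (the schema keys follow the article’s example; the model name and prompt wording are assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()

ENTITY_PROMPT = """Extract entities from the e-commerce search query.
Use only these keys when present: colour, product_type, component, material,
brand, size. Use British spelling for keys (colour, not color).
Return strict JSON with double quotes and no extra text.

Query: {query}"""

def extract_entities(query: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ENTITY_PROMPT.format(query=query)}],
        temperature=0,  # well-formed, consistent output matters more than variety
    )
    return json.loads(response.choices[0].message.content)

print(extract_entities("blue sofa with metal legs"))
# {"colour": "blue", "product_type": "sofa", "component": "legs", "material": "metal"}
```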
Step 4: Model Training and Evaluation
The refined pool of search queries from step 3 is now used to power live production systems.
- Search Classification model: A machine learning model (e.g., a fine-tuned transformer) is trained on the high-quality (search_query, search_category) data. Its job is to take a live user query and instantly predict the most relevant product category hierarchy, routing the search to the correct department (a baseline training sketch follows this list).
- Search Entity Recognition model: A separate model is trained on the (search_query, entity_json) data. This model operates alongside the classifier, parsing the user’s query to extract precise attributes, enabling powerful faceted search and results ranking based on colour, size, brand, etc. The realism of the generated queries, and hence the quality of the data, can be evaluated via downstream performance. With the constrained, taxonomy-anchored generation, the classification model attained 96% macro-precision, 95% macro-recall, and 97% accuracy; the entity model achieved 93% on these metrics. Under a naïve baseline (free-form ChatGPT query generation at scale), both models scored below 50% across precision, recall, and accuracy, reflecting degraded signal from repetitive and off-catalogue queries.
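As a lightweight stand-in for the classification stage, here is a TF-IDF plus logistic-regression baseline with the macro-averaged metrics quoted above, via scikit-learn; the production system used a fine-tuned transformer, and the file path and column names here are hypothetical:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical training file of (search_query, search_category) pairs from Step 3.
data = pd.read_csv("labelled_queries.csv")  # columns: search_query, search_category

X_train, X_test, y_train, y_test = train_test_split(
    data["search_query"], data["search_category"],
    test_size=0.2, stratify=data["search_category"], random_state=42,
)

# Character n-grams cope well with short queries and spelling variants.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Macro-averaged precision/recall, matching the metrics quoted above.
print(classification_report(y_test, model.predict(X_test), digits=3))
print(model.predict(["comfortable couch"]))  # e.g. ['|furniture|livingroom|']
```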
Conclusion
The critical role of search in e-commerce is undeniable, directly impacting user satisfaction and a company’s bottom line. Tackling the challenge of launching search for new products demands a multi-faceted solution. The most effective strategy integrates a company’s own product data with insights scraped from the web, uses LLMs to generate and refine potential search queries, and relies on human experts to validate and guide the process. This collaborative, AI-augmented approach is the most reliable way to build accurate and powerful search models that deliver a tangible competitive advantage.