Ever searched for something specific, only to be met with results that are close, but not quite? On Etsy’s Search Relevance team, that frustration is exactly what we are tackling. Our goal is simple yet ambitious: to help buyers find exactly what they’re looking for, and to help sellers reach the people seeking their special products. Search plays a central role in that mission.

Historically, Etsy’s search models have relied heavily on engagement signals – such as clicks, add-to-carts, and purchases – as proxies for relevance. These signals are objective, but they can also be biased: popular listings get more clicks, even when they’re not the best match for a specific query. To address this, we introduce semantic relevance as a complementary perspective to engagement, capturing how well a listing aligns with a buyer’s intent as expressed in their query.

We developed a Semantic Relevance Evaluation and Enhancement Framework, powered by large language models (LLMs). It provides a comprehensive approach to measuring and improving relevance through three key components:

- High-quality data: we first establish human-curated “golden” labels of relevance categories (we’ll come back to this) for precise evaluation of the relevance prediction models, complemented by data from a human-aligned LLM that scales training across millions of query-listing pairs
- Semantic relevance models: we use a family of ML models with different trade-offs in accuracy, latency, and cost, tuned for both offline evaluation and real-time search
- Model-driven applications: we integrate relevance signals directly into Etsy’s search systems, enabling both large-scale offline evaluation and real-time enhancement in production

Together, this framework brings a more intent-aware search experience that better serves both buyers and sellers across our marketplace.

Figure 1. Overview of the Semantic Relevance Evaluation and Enhancement Framework

Capturing Shades of Relevance

Let’s return to the idea of relevance categories. Based on user research, we define three categories for the semantic relevance of query-listing pairs:

- Relevant: the listing matches all parts of the query, accounting for meaning and proper nouns
- Partially relevant: the listing matches part of the query or is thematically related, but is not a full match
- Irrelevant: the listing has no meaningful connection to the query; its presence in top results would make the search feel broken

Figure 2. Examples for the three relevance categories. Text highlighted in green shows how the product aligns with the search query, whereas red highlights indicate mismatches.*

In an ideal world, we’d rely on human judgments for all query-listing pairs. But large-scale human annotation is time-consuming and expensive, rendering it infeasible. Instead, language models unlock the ability to generate these judgments at scale, transforming our ability to make every search on Etsy produce more relevant results.

Data: Anchored by Humans, Scaled by LLMs

With recent advances in LLMs, a promising approach to evaluating search relevance is LLM-as-a-judge: directly using LLMs to judge the relevance of our search results without looping in humans.
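As a toy illustration of the idea, a bare-bones LLM-as-a-judge call might look like the sketch below. It assumes the OpenAI Python client; the prompt, the model name, and the label parsing are simplified placeholders rather than our production setup, which we describe next.

```python
# Minimal LLM-as-a-judge sketch (illustrative only).
# Assumes the OpenAI Python client; the prompt and model name are placeholders.
from openai import OpenAI

client = OpenAI()

LABELS = ["relevant", "partially_relevant", "irrelevant"]

def judge_relevance(query: str, listing_title: str, listing_description: str) -> str:
    """Ask an LLM to classify a query-listing pair into one of three relevance categories."""
    prompt = (
        "You are judging search relevance for an e-commerce marketplace.\n"
        f"Query: {query}\n"
        f"Listing title: {listing_title}\n"
        f"Listing description: {listing_description}\n"
        "Answer with exactly one label: relevant, partially_relevant, or irrelevant."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not necessarily what we use
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    # Fall back to the weakest label if the model answers off-format.
    return answer if answer in LABELS else "irrelevant"
```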
However, this approach faces two main challenges:

- Domain shift: off-the-shelf LLMs may not capture the unique preferences and vocabulary of Etsy users
- Performance-cost tradeoff: larger LLMs offer stronger reasoning but are expensive for large-scale inference, while smaller LLMs are faster and cheaper, but less accurate

To address these challenges, we start with human-curated golden labels, use them to evaluate and align a powerful LLM, and then use a full dataset scaled up by that LLM to train our relevance judge. In other words, humans define what good looks like, and LLMs help us scale it. LLMs do not replace human judgment; instead they align with and amplify it.

We maintain a detailed, evolving relevance labeling guideline, continuously refined through user research and annotation feedback. What relevance means in our marketplace shifts over time and social context. For example, people searching for “face masks” pre-2020 were primarily looking for masks for costumes or fashion, a completely different intent from protective masks post-2020. These guidelines ensure our definitions of relevance accurately reflect Etsy users’ intent and capture cultural trends over time.

Query-listing pairs are sampled from search logs using a mix of approaches: random and stratified sampling for broad coverage, and targeted sampling for challenging cases. Each query-listing pair is labeled by two Etsy admins, with an ongoing review process to both break ties and adjust the labeling guidelines accordingly. For quality control, we continuously track metrics such as row-level disagreement rates, which measure how often annotators disagree with each other on the same query-listing pair.

To scale beyond manual annotation, we introduced a few-shot, chain-of-thought (CoT) prompting strategy using the o3 model, implemented in LangGraph. The prompt instruction is inspired by the annotation guidelines described above, and includes comprehensive query and listing features, such as the title, images, text description, attributes, variations, and extracted entities (read more about listing extracted entities in another one of our posts). We also applied self-consistency sampling to improve reliability. This model, known as the LLM annotator (as seen in Figure 1), is first validated against the human-labeled golden data to ensure its judgments align with humans. Once validated, we use it to generate large-scale training data to develop the production models. The LLM annotator thus serves as the foundation for our teacher-student modeling pipeline, bridging the gap between expensive manual labeling and scalable automated annotation.

Models: Balancing Accuracy, Latency and Cost

Our modeling pipeline uses a three-tier cascaded distillation design, where each model balances accuracy and efficiency for a specific purpose. The stack includes:

- The LLM annotator: our most accurate and cost-intensive model, aligned closely with human-labeled golden data
- The teacher model: a fine-tuned smaller LLM (Qwen 3 VL 4B) that delivers high-throughput annotation at scale
- The student model: a lightweight, BERT-based two-tower model optimized for real-time inference

The LLM annotator aligns best with the golden labels, but is too costly for recurrent, large-scale inference. To reduce cost while maintaining quality, we performed supervised fine-tuning (SFT) on a smaller LLM, Qwen 3 VL 4B, using the training data generated by the LLM annotator.
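We won’t reproduce the training setup here, but to give a feel for what SFT on annotator-generated data involves, the sketch below converts LLM-annotator outputs into chat-style training examples, the kind of JSONL input that common fine-tuning frameworks accept. The field names, prompt template, and output path are illustrative assumptions, not our exact pipeline.

```python
# Illustrative sketch: turning LLM-annotator outputs into chat-style SFT examples.
# Field names, the prompt template, and the output path are placeholders.
import json

def to_sft_example(row: dict) -> dict:
    """Convert one annotated query-listing pair into a supervised fine-tuning example."""
    user_prompt = (
        "Judge the relevance of this listing to the search query.\n"
        f"Query: {row['query']}\n"
        f"Listing title: {row['title']}\n"
        f"Listing description: {row['description']}\n"
        "Answer with one label: relevant, partially_relevant, or irrelevant."
    )
    return {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": row["llm_annotator_label"]},
        ]
    }

def write_sft_dataset(annotated_rows: list[dict], path: str = "sft_train.jsonl") -> None:
    """Write one JSON object per line, a common input format for SFT frameworks."""
    with open(path, "w") as f:
        for row in annotated_rows:
            f.write(json.dumps(to_sft_example(row)) + "\n")
```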
This teacher model preserves human alignment while enabling us to label millions of query-listing pairs daily, which is ideal for recurring evaluation and monitoring. The teacher, however, is too slow to run in real time when serving search results, and surfacing relevant results quickly is critical for helping our sellers reach potential buyers. As such, we further distilled the teacher into a student model with a two-tower architecture. The distillation process aligns the student’s output with that of the teacher, so that the student predicts relevance labels nearly as accurately as the teacher while being lightweight and fast. The resulting model lets us deliver search results almost as fast as before.

All three models – the LLM annotator, teacher, and student – are evaluated against the same golden dataset to ensure traceable performance and consistent alignment with human judgment. Figure 3 shows their accuracy measured using multi-class Macro F1 and individual class F1 scores.

Figure 3. Performance of semantic relevance models against human golden labels

Applications: From Evaluation to Action

With these models in place, we can both measure and enhance search relevance across Etsy.

Search relevance evaluation

We use the teacher model to measure how well our search system surfaces relevant listings. Each day, we sample search requests and perform offline inference using the teacher model, then aggregate the predicted relevance labels into summary metrics. These metrics are reviewed regularly by our team, and if we observe unexpected trends, like a sudden decline in relevance, we work to quickly diagnose and address the problem.

Similarly, we monitor relevance metrics in A/B tests. The computed relevance metrics are discussed when we decide whether to roll out a change to our search system, to ensure those changes affect the semantic relevance of search results in a neutral-to-positive way. We sample a sufficient number of requests separately from the control and treatment variants to ensure statistical power. Using vLLM for high-throughput inference, we process millions of query-listing pairs daily at very low cost, maintaining both statistical power and operational efficiency.

Improving search in production

The lightweight student model is embedded directly into Etsy’s real-time search stack. It improves relevance through several integration points (a simplified sketch of the first two follows the list):

- Filtering: removes retrieved listings predicted as irrelevant before downstream ranking
- Feature enrichment: contributes model-predicted relevance scores as features for the downstream ranking model
- Loss weighting: adjusts training weights of the ranking model based on predicted relevance
- Relevance boosting: promotes listings deemed highly relevant using heuristic rules among the final returned search results
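As a rough illustration of filtering and feature enrichment, the sketch below drops listings the student model scores as irrelevant and attaches the relevance score as a downstream ranking feature. The class names, threshold, and score semantics are hypothetical stand-ins, not Etsy’s production interfaces.

```python
# Illustrative sketch of filtering and feature enrichment with student-model scores.
# RelevanceStudent, the threshold, and the feature name are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class Listing:
    listing_id: int
    features: dict = field(default_factory=dict)

class RelevanceStudent:
    """Stand-in for the two-tower student model; returns an estimated P(relevant)."""
    def score(self, query: str, listing: Listing) -> float:
        # Placeholder: a real implementation would encode the query and listing
        # with the two towers and compare the resulting embeddings.
        return 0.5

def apply_relevance(query: str, candidates: list[Listing],
                    model: RelevanceStudent,
                    irrelevant_threshold: float = 0.2) -> list[Listing]:
    """Drop likely-irrelevant candidates and expose the score as a ranking feature."""
    kept = []
    for listing in candidates:
        score = model.score(query, listing)
        if score < irrelevant_threshold:
            continue  # filtering: remove listings predicted as irrelevant
        listing.features["predicted_relevance"] = score  # feature enrichment
        kept.append(listing)
    return kept
```

Loss weighting and relevance boosting draw on the same scores, but at ranking-model training time and at final result assembly, respectively.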
How Semantic Relevance is Changing Etsy Search

The Semantic Relevance Evaluation and Enhancement Framework is fully deployed in Etsy’s search stack, and it continues to evolve. We’ve observed a measurable uplift in semantic relevance: the percentage of fully relevant listings (as defined by the relevance categories described earlier) increased from 58% to 62% between August and October 2025.

Figure 4. Improvement of semantic relevance metrics over time

This improvement reflects Etsy’s growing ability to align search results with buyer intent. For instance, in searches like “fall decor,” the enhanced search engine now focuses on surfacing seasonal decor items, while deprioritizing loosely related listings, such as clothing, that appeared before the relevance enhancements.

Figure 5. Before and after comparison when searching for “fall decor” *

Beyond these immediate gains, semantic relevance has shifted how we evaluate and improve search at Etsy toward a more user-centered approach. By grounding our evaluation in semantic intent in addition to behavioral signals, we move closer to our goal of connecting buyers with the most relevant products, not just the most popular ones. On the seller side, while search results are influenced by many factors and outcomes may vary, improving semantic relevance can also help surface items from small or new sellers who may not yet have the visibility of more established shops.

What’s Next

In ongoing and future efforts, we hope to explore the following directions:

- Better understanding of relevance-engagement dynamics. In online experiments, we often observe engagement metrics decline even as semantic relevance improves (a pattern also noted by other e-commerce platforms). We suspect this results from applying uniform relevance treatments despite contextual variation. Next, we plan to explore adaptive strategies that tailor adjustments by query type.
- Refining partial relevance. Inspired by Amazon’s ESCI framework, we’re exploring finer-grained labels, for example, introducing new subcategories for complements and substitutes. This could improve evaluation precision and power new search experiences for users.
- Reducing annotation effort through LLM facilitation. When LLM judgments are self-consistent, they align better with human labels, which may indicate easier query-listing pairs. We are exploring using LLMs for these easy cases, focusing human effort on more complex ones.
- Simplifying the multi-stage model stack. Our current three-tier distillation pipeline provides flexibility but adds operational complexity. We plan to simplify this setup by exploring better performance-efficiency tradeoffs and potentially merging model tiers.
- Improving relevance in retrieval. So far, post-retrieval filtering is the first stage where our semantic relevance model applies. We see strong potential to push both inference and measurement further upstream into the retrieval layer.

Conclusion

Key takeaways:

- LLMs can meaningfully evaluate search relevance when grounded in human judgment. Aligning LLM assessments with human-labeled data ensures we measure, and continually improve, the search experience that is so essential to connecting buyers and sellers on Etsy.
- Semantic relevance redefines how Etsy optimizes search. By complementing engagement metrics with semantic relevance, we address real customer pain points and deliver more satisfying search experiences.
- Teacher-student distillation offers a flexible and efficient way to apply relevance modeling across diverse performance, latency, and cost requirements.

Ultimately, improving semantic relevance strengthens the human connections that define Etsy. By understanding what shoppers truly mean, we can help them find the right items. And by emphasizing relevant listings over merely popular ones, we can help create fairer opportunities on the relevance factor of search visibility for our sellers – 89% of whom are businesses of one.
Acknowledgments

This work is brought to you in a collaborative effort by the Search Relevance team, enabled by ML Enablement and the Merchandising teams. Thanks to the following contributors:

Data: Susan Liu, Jugal Gala, David Blincoe, Yuqing Zhang, Taylor Hunt, Liz Mikolaj
Models: David Blincoe, Oriane Cavrois, Orson Adams, Yuqing Zhang
Application: Grant Sherrick, Kaushik Bekal, Haoming Chen, Patrick Callier, Davis Kim, Marcus Daly
Product leadership: Julia Zhou, Willy Huang, Argie Angeleas
Engineering leadership: Yinlin Fu, Congzhe Su, Xiaoting Zhao
ML Enablement partners: Ari Carter, Stan Schwertly, Shreya Agarwal, K Ogilvie, Marvin Wang, etc.
Other cross-team partners: Will Beckman, Karl Yokono, Audrey Chen, Heather Campbell, David Le, Khadeeja Din, etc.
Early contributors: Ethan Benjamin, Cung Tran, Maggie Matsui, Jack Gammack, Yogeeta Chatoredussy, Austin Clapp, Benjamin Russell, Khaled Jabr

Special thanks to Oriane Cavrois & David Blincoe for helping this piece come to life.

* Images are provided for illustrative purposes. Item availability on Etsy may vary.