Han Wang | Machine Learning Engineer; Alex Whitworth | Staff Data Scientist; Pak Ming Cheung | Sr. Staff Machine Learning Engineer; Zhenjie Zhang | Sr. Staff Machine Learning Engineer
Introduction
Search relevance measures how well search results align with a user’s search query. For personalized search systems, it’s important to ensure that displayed content is pertinent to the user’s information needs, rather than over-relying on the user’s past engagement. At Pinterest Search, we track whole-page relevance in online A/B experiments to evaluate new ranking models and ensure a high-quality user experience.
Relevance measurement typically relies on human annotations, but is limited by the low availability of human labels and the high marginal cost of generating them. This led to measurement designs and sample sizes that could only detect significant topline metric movements, but were insufficient to measure heterogeneous treatment effects or small topline effects.
In this blog, we present our methodology at Pinterest Search to scale the labeling capabilities with LLMs and address these bottlenecks. We fine-tune open-source LLMs on relevance prediction tasks using human-annotated labels, then utilize the fine-tuned LLMs to evaluate the ranking results across experimental groups in online A/B experiments. This approach not only significantly reduces labeling costs and improves evaluation efficiency, but also unlocks opportunities to further improve metric quality by scaling up the query sets and refining the sampling design.
Methodology
At Pinterest, we measure the semantic relevance between queries and Pins using a 5-level guideline: Highly Relevant (L5), Relevant (L4), Marginally Relevant (L3), Irrelevant (L2), and Highly Irrelevant (L1). We use this guideline to measure the whole-page relevance for our search system.
Fine-tuned LLMs as Relevance Model
We use a cross-encoder model architecture to predict a Pin's relevance to a given query, as illustrated in Figure 1. We fine-tune open-source LLMs on human-annotated data to optimize their performance on the relevance prediction task. To support search queries and Pins across multiple languages, we leverage multilingual LLMs and take advantage of their cross-lingual transfer capabilities. We formalize relevance prediction as a multiclass classification problem based on the 5-level relevance guideline, minimizing the point-wise cross-entropy loss during training.
Figure 1: The cross-encoder architecture for the LLM-based search relevance model. An encoder language model (e.g., a BERT-based model) is shown for illustration.
To effectively represent each Pin for relevance prediction, we leverage a comprehensive set of textual features, including Pin titles and descriptions, BLIP image captions, linked page titles and descriptions, user-curated board titles where the Pin has been saved, and highly-engaged query tokens associated with the Pin. These features together form a robust text representation crucial for accurate relevance assessment.
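For illustration, here is a minimal sketch of how such textual features might be concatenated into a single Pin text representation for the cross-encoder. The field names and separator are hypothetical placeholders, not Pinterest's actual schema.

```python
def build_pin_text(pin: dict, max_chars: int = 2048) -> str:
    """Concatenate a Pin's textual features into one string for the cross-encoder.

    The field names below are hypothetical stand-ins for the features described
    above (title, description, BLIP caption, linked page text, board titles,
    highly-engaged query tokens).
    """
    parts = [
        pin.get("title", ""),
        pin.get("description", ""),
        pin.get("blip_caption", ""),
        pin.get("link_title", ""),
        pin.get("link_description", ""),
        " ".join(pin.get("board_titles", [])),
        " ".join(pin.get("engaged_query_tokens", [])),
    ]
    # Drop empty fields, join with a separator, and truncate so the input
    # stays within the model's context length.
    return " [SEP] ".join(p for p in parts if p)[:max_chars]
```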
We experiment with various language models, including multilingual BERT-base, T5-base, mDeBERTa-V3-base, XLM-RoBERTa-large, and Llama-3-8B. The comparative performance of these LLMs and ablation studies on Pin text features can be found in a previous blog. We then use the fine-tuned search relevance model to generate 5-dimensional relevance scores and take the label with the highest score (argmax) for relevance assessment.
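As a concrete illustration, the sketch below scores a query-Pin pair with a fine-tuned cross-encoder via Hugging Face Transformers and takes the argmax over the 5 relevance classes. The checkpoint path and label ordering are assumptions; this is not the production serving code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical path to a fine-tuned 5-class relevance checkpoint.
MODEL_PATH = "path/to/finetuned-xlm-roberta-large-relevance"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=5)
model.eval()

LABELS = ["L1", "L2", "L3", "L4", "L5"]  # Highly Irrelevant ... Highly Relevant

def predict_relevance(query: str, pin_text: str) -> str:
    # Cross-encoder: the query and Pin text are encoded jointly as one input.
    inputs = tokenizer(query, pin_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 5)
    return LABELS[int(logits.argmax(dim=-1))]
```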
Stratified Sampling Design
LLM labeling significantly reduces relevance labeling costs as well as labeling time, which enables much larger sampling designs. Therefore, we propose a stratified query sampling design that enables measurement of heterogeneous treatment effects and reduces minimum detectable effects (MDEs) by an order of magnitude.
Stratification plays an important role in sampling-based measurement. First, it ensures the sample is representative of the whole query population. In addition, if the strata are chosen so that each stratum is relatively homogeneous, stratification also reduces variance. We determine strata using an in-house DistilBERT-based query-to-interest model combined with a popularity segment, a measure of how many users issue each query. Prior to LLM labeling, stratified query sampling with human annotations was impractical, as it required a large number of labeled queries to adequately represent each fine-grained stratum.
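A minimal sketch of this design is shown below: each query carries an interest label and a popularity segment, which together form its stratum key, and queries are then drawn per stratum according to a precomputed allocation. The field names are hypothetical; in practice the labels would come from the in-house query-to-interest model and popularity segmentation.

```python
import random
from collections import defaultdict

def stratified_sample(queries, allocation, seed=42):
    """Draw a stratified query sample.

    queries: list of dicts with precomputed 'interest' and 'popularity' fields
             (hypothetical schema).
    allocation: dict mapping stratum key -> number of queries to draw.
    """
    random.seed(seed)
    by_stratum = defaultdict(list)
    for q in queries:
        # Stratum key = (interest category, popularity segment), e.g. ("beauty", "head").
        by_stratum[(q["interest"], q["popularity"])].append(q)
    sample = []
    for stratum, n_h in allocation.items():
        pool = by_stratum.get(stratum, [])
        sample.extend(random.sample(pool, min(n_h, len(pool))))
    return sample
```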
We evaluate the impact of these changes on experiment sensitivity via the MDE of our experimentation system. The MDE is the smallest change in a metric that an experiment can reliably detect given the sample size, statistical power (1 − β = 0.8), and significance level (α = 0.05) chosen for the test. It can be derived from a standard power calculation.
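As a sketch, one standard two-sample form of the MDE, assuming equal group sizes $n$ per arm and a common per-group variance $\sigma^2$ (the exact expression used internally may differ), is:

$$\text{MDE} = \left(z_{1-\alpha/2} + z_{1-\beta}\right)\sqrt{\frac{2\sigma^2}{n}}$$

where $z_{1-\alpha/2}$ and $z_{1-\beta}$ are standard normal quantiles corresponding to the significance level and power. This form makes explicit that the MDE shrinks with both larger samples (larger $n$) and lower variance (smaller $\sigma^2$), which is exactly what larger query sets and stratification deliver.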
Since typical experiments on most online platforms produce small effects, achieving small MDEs is critical for team velocity and shipping new features to our users. Before the introduction of LLM labeling, relevance measurement had large MDEs (e.g., 1.3%–1.5%), primarily because the high cost and long turnaround of human labeling constrained our sampling designs. The introduction of LLM labeling enabled us to redesign our sampling approach: we increased our sample sizes, moved from simple random sampling (SRS) to stratified sampling, now use a stratified sampling estimator, and allocate sample units to strata via optimal allocation. These changes reduced our MDEs to ≤ 0.25%.
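A minimal sketch of optimal (Neyman) allocation, which assigns more sample to strata that are larger or more variable; the stratum sizes and standard deviations would come from historical data, and the example inputs below are made up.

```python
def neyman_allocation(strata, total_n):
    """strata: dict mapping stratum key -> (population size N_h, metric std S_h).
    Returns dict mapping stratum key -> sample size n_h, with sum(n_h) ~= total_n."""
    weights = {k: n_pop * std for k, (n_pop, std) in strata.items()}
    total_weight = sum(weights.values())
    return {k: max(1, round(total_n * w / total_weight)) for k, w in weights.items()}

# Example: larger or more variable strata receive more of the query budget.
allocation = neyman_allocation(
    {("beauty", "head"): (5000, 0.08),
     ("beauty", "tail"): (20000, 0.15),
     ("art", "head"): (3000, 0.05)},
    total_n=2000,
)
```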
The MDE reduction can be expressed in terms of reduction in variance and increased sample size. We present these results in Table 1. The vast majority of reduction comes from the variance reduction due to stratification. This is consistent with prior findings at Pinterest that most variance in relevance occurs across queries. Previous work has found substantial variation in relevance due to query interest and query popularity.
Table 1: Improvement in metric sensitivity (MDE).
Relevance Measurement with LLMs
To measure the relevance impact of an A/B experiment on search ranking, we take a stratified sample of paired search queries from the control and treatment experiment groups, ensuring that the sample is representative of overall usage. Pairing the samples blocks out between-query differences, which are an important source of variation in experiment measurement.
For each query in our paired sample, we retain the top K search results and generate LLM-based relevance labels. We then compute sDCG@K for each query and aggregate the query-level metrics to derive topline experiment metrics. The sDCG@K metric is a variant of the standard nDCG@K in which the ideal ranking is assumed to have an infinite supply of Highly Relevant (L5) documents (see Equation 2). We use K = 25 throughout our evaluation.
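As a sketch of Equation 2, assuming the exponential gain $2^{\text{rel}} - 1$ commonly used in nDCG (the exact gain mapping is an assumption of this sketch), sDCG@K normalizes the DCG of the top K results by an ideal list in which every position holds a Highly Relevant (L5) Pin:

$$\text{sDCG@K} = \frac{\sum_{i=1}^{K} \dfrac{2^{\text{rel}_i} - 1}{\log_2(i+1)}}{\sum_{i=1}^{K} \dfrac{2^{5} - 1}{\log_2(i+1)}}$$

where $\text{rel}_i \in \{1,\dots,5\}$ is the predicted relevance level of the result at position $i$. Because the denominator does not depend on the retrieved results, sDCG@K also penalizes pages that simply lack enough highly relevant content, unlike standard nDCG@K.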
Lastly, we calculate heterogeneous effects by query popularity and query interest (e.g., beauty, women's fashion, art), using the Benjamini-Hochberg procedure to control the false discovery rate. The LLM-based relevance measurement procedure at Pinterest Search is illustrated in Figure 2.
Figure 2: Components of LLM-based relevance measurement at Pinterest Search.
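To illustrate the multiple-testing step, here is a minimal sketch of applying the Benjamini-Hochberg correction to segment-level p-values with statsmodels; the segment names and p-values below are made up.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from per-segment treatment-vs-control comparisons.
segments = ["beauty", "womens_fashion", "art", "head", "torso", "tail", "single"]
p_values = [0.003, 0.041, 0.20, 0.0008, 0.07, 0.55, 0.012]

# Benjamini-Hochberg controls the false discovery rate across segments.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for seg, p, p_adj, sig in zip(segments, p_values, p_adjusted, reject):
    print(f"{seg:16s} p={p:.4f}  p_adj={p_adj:.4f}  significant={sig}")
```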
Results
We use XLM-RoBERTa-large as the LLM backbone for our relevance model. The model is lightweight yet delivers high-quality predictions; inference runs on a single A10G GPU, allowing us to label 150,000 rows within 30 minutes. While the Llama-3-8B model offers a slight improvement in accuracy, its inference time and cost are roughly 6 times higher. We therefore select XLM-RoBERTa-large, which offers a good balance between prediction quality and inference efficiency. The validation results are presented below.
Alignment with Human Labels
We conducted a rigorous validation of the metrics derived from LLM labeling. At the Pin level, LLM-generated labels and human labels have an exact match rate of 73.7%, and 91.7% of ratings deviate by at most 1 point, underscoring the strong alignment between LLM and human relevance labels. At the query level, we compute the rank-based correlations Kendall's τ and Spearman's ρ between the sDCG@K metrics derived from LLM labels and from human labels. To understand performance on queries with different popularity, we also categorize queries into 4 popularity segments based on search volume: head, torso, tail, and single. The results are summarized in Table 2. We achieve Kendall's τ > 0.5 and Spearman's ρ > 0.65 for all query popularity segments, indicating strong alignment across all segments.
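A minimal sketch of this query-level alignment check using SciPy; the per-query sDCG@K arrays below are hypothetical stand-ins for values computed from LLM and human labels on the same queries.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-query sDCG@K values from LLM labels and human labels.
sdcg_llm = np.array([0.82, 0.74, 0.91, 0.60, 0.88])
sdcg_human = np.array([0.80, 0.70, 0.93, 0.65, 0.86])

tau, _ = kendalltau(sdcg_llm, sdcg_human)
rho, _ = spearmanr(sdcg_llm, sdcg_human)
print(f"Kendall's tau = {tau:.3f}, Spearman's rho = {rho:.3f}")
```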
In addition to Kendall’s τ and Spearman’s ρ, we validate the query-level sDCG@K error distribution, where the error is the difference between the sDCG@K metric derived from LLM labels and the one derived from human labels. According to Table 2, the overall error is below 0.01, with the 10th and 90th percentiles falling within [-0.1, 0.1]. We also visualize the error distribution in Figure 3. The error is tightly centered around 0, indicating that its magnitude is negligible and that the average bias approaches 0 as the query set grows.
For experimental evaluation, we need to calculate the metric difference between the control and treatment groups. Therefore, we also validate how well these metric differences align in paired comparison. As shown on the right-hand side of Figure 3, the errors in paired differences are even more centered around 0 with lighter tails, indicating that LLM-based labeling provides highly reliable estimates of paired differences for A/B experiment assessment.
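As a sketch of this error analysis, the single-group error and the paired-difference error can be computed as below; the per-query values here are synthetic and only illustrate the bookkeeping, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of sampled queries (hypothetical)

# Synthetic per-query sDCG@K from human labels for control and treatment.
human_ctrl = rng.uniform(0.5, 1.0, n)
human_trt = np.clip(human_ctrl + rng.normal(0.002, 0.03, n), 0, 1)

# In practice the LLM's error on a query is largely shared across control and
# treatment (same query, similar content), so paired differences cancel much of it.
shared_noise = rng.normal(0, 0.05, n)
llm_ctrl = np.clip(human_ctrl + shared_noise + rng.normal(0, 0.01, n), 0, 1)
llm_trt = np.clip(human_trt + shared_noise + rng.normal(0, 0.01, n), 0, 1)

# Single-group error: LLM-based sDCG@K minus human-based sDCG@K per query.
single_group_error = llm_ctrl - human_ctrl
print("single-group 10th/50th/90th pct:", np.percentile(single_group_error, [10, 50, 90]))

# Paired-difference error: change in the treatment-vs-control delta when
# LLM labels replace human labels.
paired_diff_error = (llm_trt - llm_ctrl) - (human_trt - human_ctrl)
print("paired-diff 10th/50th/90th pct:", np.percentile(paired_diff_error, [10, 50, 90]))
```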
Table 2: Query-level LLM vs human labels alignment for different query segments in US market relevance evaluation.
Figure 3: Query-level bias distribution for single group (left) and paired differences (right) in US market relevance evaluation.
Performance on Non-English Queries
We fine-tuned multilingual LLMs on human-annotated data in which the majority of query-Pin pairs are in English. As a result, careful validation is required before extending LLM-based relevance assessment to non-English queries. For this analysis, we focus on the France (FR) and Germany (DE) markets.
The query-level metric alignment is summarized in Table 3. The overall Kendall’s τ and Spearman’s ρ are approximately 0.47 and 0.61, respectively. While these rank-based correlations are lower than those observed for English queries, they are still considered strong in the existing literature. The distribution of query-level metric errors is shown in Figure 4. Similar to the results for the US market, the errors are tightly concentrated around 0 for both countries, indicating a low average bias, with an even smaller bias for paired differences. These results give us confidence that LLM-based relevance assessment is also suitable for non-English queries. Expanding relevance evaluation to countries beyond the US leads to further reductions in labeling costs and improvements in evaluation efficiency.
Table 3: Query-level LLM vs human labels alignment for different query segments in France (FR) and Germany (DE) markets relevance evaluation.
Figure 4: Query-level bias distribution for single group (left) and paired differences (right) in France (top) and Germany (bottom) markets relevance evaluation.
Summary
In this work, we explore the use of LLM-based relevance labeling to generate query-level relevance metrics for online A/B experiment evaluation. We demonstrate that fine-tuned LLMs achieve low bias on query-level sDCG@K metrics and paired differences. The transition to LLM-based relevance assessment enables us to scale up the evaluation query set and redesign the sampling strategy, improving the quality of relevance metrics for online experiment evaluation. We have successfully deployed LLM-based relevance assessment at Pinterest Search, significantly reducing manual annotation costs and turnaround time while achieving an order-of-magnitude reduction in MDEs for improved detection of relevance shifts. For more details, please refer to our full paper.
Future Work
We will explore using vision-language models (VLMs) to better leverage raw images for relevance prediction. Additionally, the observed performance gap on non-English queries highlights opportunities to further improve the multilingual capabilities of our LLM-based relevance model. We leave these directions to future work.
Acknowledgement
- Search: Maggie Yang, Mukuntha Narayanan, Jinfeng Rao, Krishna Kamath, Kurchi Subhra Hazra
- Relevance Measurements Tooling: Maria Alejandra Morales Gutierrez (former), Miguel Madera, Pedro Sanchez, Jorge Amigon, Francisco Navarrete