Rudraksh Kapil | Machine Learning Engineer I; Michal Giemza | Senior Machine Learning Engineer; Devan Srinivasan | Machine Learning Engineering Intern; Leif Sigerson | Senior Data Scientist; Stephanie Chen | Staff Quantitative Product Researcher; Wendy Matheny | Senior Lead Public Policy Manager; Jianjin Dong | Engineering Manager II; Qinglong Zeng | Senior Engineering Manager
Introductory Summary
In 2023 Pinterest became the Founding Signatory of the Inspired Internet Pledge — publicly stating our vision to adhere to three principles: (1) tuning for wellbeing, (2) listening to and acting on what we hear from users, and (3) sharing what we learn about making the internet a safer and healthier place for all, especially teens.
Now, at the end of the second full year of our work with the Inspired Internet Pledge, we’re pleased to share more about what we’re learning from Pinner (a.k.a. user) surveys and how we incorporate their feedback to improve content quality on our platform.
While much of this blog gets into the weeds of how we design surveys, how we interpret the data they produce, and the technical details of training a machine learning model to understand the average Pinner's perception of quality, at the heart of all of this is Pinterest's commitment to "Put Pinners First." Our work demonstrates a win/win for both Pinners and the business, and allows us to do good while doing well.
Background
At Pinterest, we want people to discover high-quality content — content that makes them feel good when they see it, inspires them to keep exploring, and ultimately drives fulfilling, long-term engagement. But it's challenging to construct guidelines on what exactly "high quality" content is without understanding the average Pinner's notion of quality. We know that when we optimize only for engagement, we tend to promote low-quality "clickbait" that yields limited long-term engagement.
Unsurprisingly, an effective way to understand Pinners' perception of quality is to ask them directly. Surveys are a way for Pinners to tell us exactly what they think and for us to build that intentionality directly into our platform. If we use Pinner surveys to teach our recommendation systems to promote highly-rated content, we expect people will react positively — with both good feelings and, as a result, with actions like repins and saves. Experts in the industry have also called for using surveys more when training recommendation systems [Stray et al., 2022; Cunningham et al., 2024], rather than optimizing purely for engagement. Not all engagement is good, so the latter approach can mislead the system into promoting low-quality or even harmful content. Surveys provide an excellent avenue for de-biasing the system, allowing us to understand content quality and ensure that the engagement we reward comes from high-quality content. In this blog post we'll discuss the success we've found with incorporating a machine learning model trained on survey data into all three of our major surfaces: Homefeed, Related Pins, and Search. This aligns with one of our company's core values, Put Pinners First, by optimizing our recommendations according to Pinner feedback.
Survey Data Collection
We launched an in-app survey campaign where we asked Pinners to rate images on a scale of 1–5 for visual appeal. The exact question wording was, “How visually pleasing or displeasing is this Pin?” [Figure 1]. We intentionally left the wording somewhat vague to encourage Pinners to respond based on their own notion of quality.
Although it would be great to get survey responses for each of the billions of images in our corpus, we have a policy of not bombarding Pinners with constant survey requests. For this survey, we limited ourselves to collecting responses for just 5k Pins. We sampled 1k Pins (weighted by impressions) from each of our top five L1 interest verticals: *Art*, *Beauty*, *DIY & Crafts*, *Home Decor*, and *Women's Fashion*. "L1" here refers to Pinterest's top-level taxonomy of interests. We only asked Pinners to rate mid-to-high quality images rather than exposing them to low quality images just for the purposes of this survey.
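To make the sampling step concrete, here's a minimal sketch of what impression-weighted sampling within each L1 vertical could look like. The column names, the tiny toy corpus, and the per-vertical sample size are all illustrative assumptions, not our production pipeline.

```python
import pandas as pd

# Illustrative corpus snapshot: one row per Pin with its L1 vertical and
# impression count (hypothetical column names).
pins = pd.DataFrame({
    "pin_id": range(12),
    "l1_vertical": ["Art", "Beauty", "DIY & Crafts", "Home Decor", "Women's Fashion"] * 2
                   + ["Art", "Beauty"],
    "impressions": [1200, 40, 300, 980, 15, 700, 55, 210, 5000, 90, 330, 610],
})

PINS_PER_VERTICAL = 2  # 1,000 in the real survey

# Within each vertical, sample Pins with probability proportional to impressions.
sampled = (
    pins.groupby("l1_vertical", group_keys=False)
        .apply(lambda g: g.sample(n=min(PINS_PER_VERTICAL, len(g)),
                                  weights=g["impressions"],
                                  random_state=42))
)
print(sampled[["pin_id", "l1_vertical", "impressions"]])
```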
It's important to remember that the question of visual quality is in the "sweet spot" for subjectivity. For things that are relatively objective (e.g., policy-violating content), a company should pay reliable human reviewers rather than burden users or degrade their experience. For things that are extremely subjective (e.g., personal relevance), we wouldn't be able to get a reliable Pin-level signal because the score would be a function of the Pin, the context, and the individual rater. Since our goal is to understand visual quality for the average Pinner, a survey is well suited for data collection.
**Figure 1.** In-app survey UI. "Very visually displeasing" is assigned a score of 1, while "Very visually pleasing" is assigned a score of 5.
Importantly, we collected at least 10 responses for each image to reduce subjectivity and alleviate noise. By computing each image’s average rating, we can get an idea of what the average Pinner would rate the image, which acts as a proxy for the “objective” rating of the image. Asking multiple Pinners to rate each image also leaves a buffer for misclicks; if each image was rated just once, we would have run the risk of ending up with a noisy and unreliable dataset.
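As a concrete illustration, aggregating the raw responses into per-image labels can be as simple as a groupby; the column names and ratings below are hypothetical, and the per-image standard deviation computed here is the kind of uncertainty estimate we can reuse later when training the model.

```python
import pandas as pd

# Hypothetical raw survey responses: one row per (pin, rater) with a 1-5 rating.
responses = pd.DataFrame({
    "pin_id": [1] * 10 + [2] * 10,
    "rating": [4, 5, 4, 4, 3, 5, 4, 4, 5, 4,
               2, 1, 3, 2, 2, 1, 2, 3, 2, 2],
})

# Per-image aggregates: the mean is our proxy for the "average Pinner" rating,
# and the standard deviation captures how much raters disagreed.
labels = (
    responses.groupby("pin_id")["rating"]
             .agg(mean_rating="mean", rating_std="std", num_responses="count")
             .reset_index()
)

# Keep only images with enough responses to smooth out misclicks and noise.
labels = labels[labels["num_responses"] >= 10]
print(labels)
```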
The data from this survey allowed us to get thoughtful reflections directly from Pinners on what they consider to be quality content, which enables us to build more "good" into the product and weed out the "bad." For example, the highest average ratings were received by generally appealing images across wide-ranging topics like makeup, grooming styles, maximalist home decor, landscapes, sunsets, and baby animals. There were also subtle differences in ratings between the different L1 interests in our survey. *Home Decor* Pins were in general rated higher than the rest [Figure 2]. The highest variance in responses was for *Art* Pins, which isn't too surprising considering the subjective nature of art, and they make up the majority of both the top and bottom 100 images [Figure 3].
**Figure 2.** Distribution of Pinner ratings in each interest vertical.
**Figure 3.** Proportion of images from each L1 interest vertical in the top and bottom 100 images. Art and Home Decor dominate the top 100, whereas the bottom 100 is more evenly distributed, with Art images having the highest representation once again.
Machine Learning Modelling
We leveraged the survey data to train a machine learning model to learn the average Pinner's perception of visual quality. Given embedding features for an image, the model's task is to map them to a single score between 0 and 1, with higher scores indicating higher visual quality as perceived by the average Pinner. These in-house embedding features each represent some aspect of the Pin image, such as the relationships between the image and the boards it's saved to, along with its visual and textual information. We opted for a simple fully-connected neural network with 92k parameters to learn this mapping. Not only does this help prevent the model from overfitting to the relatively small dataset of 5k Pins, but it also makes inference at scale quicker and cheaper. The model architecture is depicted in Figure 4.
**Figure 4.** Machine learning model architecture. We utilized a fully-connected network to map content embedding features to a single score representing visual quality.
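A minimal sketch of this kind of scorer in PyTorch is shown below. The input embedding dimension and layer widths are illustrative assumptions chosen to land near the 92k-parameter budget, not our exact configuration.

```python
import torch
import torch.nn as nn

class VisualQualityScorer(nn.Module):
    """Small fully-connected network mapping a Pin's content embedding to a
    single visual-quality score in [0, 1]. Layer widths and the input
    embedding dimension are illustrative assumptions."""

    def __init__(self, embedding_dim: int = 256, hidden_dims=(256, 96)):
        super().__init__()
        layers, in_dim = [], embedding_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]  # squash output to (0, 1)
        self.net = nn.Sequential(*layers)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embedding_dim) -> scores: (batch,)
        return self.net(embeddings).squeeze(-1)

model = VisualQualityScorer()
print(sum(p.numel() for p in model.parameters()))  # ~91k with these illustrative sizes
scores = model(torch.randn(8, 256))                # scores for eight example Pins
```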
The problem was formulated as pairwise ranking rather than classification or regression — a well-established machine learning technique for finding the relative ranking of items instead of an absolute score for each item in isolation. The idea can be simplified as: given two images, we ask the model to predict which one the average Pinner would agree is "better," rather than trying to predict the actual mean response score for a single image. To define what's "better" quality, we take the mean of the 10 responses per image as its ground-truth quality score. So the task becomes a two-item ranking problem, where the model needs to determine which image has the higher mean response.
We only compare images belonging to the same L1 when training the model. This is done to force the model to focus on visual quality differences between images, rather than semantic ones. During training, the model outputs a score for each image. Then, to account for differences between L1s, we separate the images into their L1 interest verticals. To introduce stochasticity for more effective training, we further randomly split the images into smaller groups, or "sub-groups." We compute the pairwise margin ranking loss within each sub-group and optimize the model weights to reduce this loss. As the name indicates, the loss function is computed between all pairs within a sub-group and summed. The training process is summarized in Figure 5 below.
**Figure 5.** Grouped pairwise ranking approach. Within each group, we compute the pairwise margin ranking loss.
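Here is a simplified sketch of the grouping step: partition images by L1 vertical, then randomly chunk each vertical into sub-groups so that pairs are only ever formed within a single L1. The field names and sub-group size are assumptions for illustration.

```python
import random
from collections import defaultdict

def make_subgroups(examples, subgroup_size=16, seed=0):
    """Partition training examples into random sub-groups within each L1 vertical.

    `examples` is a list of dicts with at least 'pin_id' and 'l1_vertical'
    keys (hypothetical field names). Because sub-groups never mix verticals,
    every pairwise comparison stays within one L1."""
    rng = random.Random(seed)
    by_l1 = defaultdict(list)
    for ex in examples:
        by_l1[ex["l1_vertical"]].append(ex)

    subgroups = []
    for l1_examples in by_l1.values():
        rng.shuffle(l1_examples)  # reshuffling each epoch adds stochasticity
        for start in range(0, len(l1_examples), subgroup_size):
            subgroups.append(l1_examples[start:start + subgroup_size])
    return subgroups
```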
Although we collected multiple responses for each image, our dataset may still be somewhat noisy. We account for the variance in responses by using a variable margin in the loss function. Briefly, the margin is the minimum difference we want the model to enforce between the scores of the two images: if the better image's score isn't higher by at least this margin, the model is penalized. The margin in this loss function is typically fixed. We instead vary it, using higher values for images whose response mean is more certain, and lower values otherwise.
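The sketch below shows one way such a variable-margin pairwise ranking loss could be computed within a sub-group. The exact formula for shrinking the margin as response variance grows is an illustrative assumption, not our production implementation.

```python
import itertools
import torch

def grouped_pairwise_margin_loss(scores, mean_ratings, rating_stds, base_margin=0.1):
    """Pairwise margin ranking loss over all pairs in one sub-group.

    scores:       model outputs for the images in the group, shape (n,)
    mean_ratings: mean survey rating per image (ground-truth quality), shape (n,)
    rating_stds:  std of survey ratings per image, shape (n,)
    """
    losses = []
    for i, j in itertools.combinations(range(len(scores)), 2):
        # y = +1 if image i should outrank image j, else -1.
        y = torch.sign(mean_ratings[i] - mean_ratings[j])
        if y == 0:
            continue  # skip ties
        # Less-certain labels (larger std) get a smaller required margin
        # (illustrative scaling).
        certainty = 1.0 / (1.0 + 0.5 * (rating_stds[i] + rating_stds[j]))
        margin = base_margin * certainty
        losses.append(torch.clamp(margin - y * (scores[i] - scores[j]), min=0.0))
    return torch.stack(losses).mean()

# Toy usage: four images from one sub-group of the same L1 vertical.
scores = torch.tensor([0.8, 0.3, 0.6, 0.4], requires_grad=True)
means  = torch.tensor([4.5, 2.0, 3.8, 3.1])
stds   = torch.tensor([0.5, 0.4, 1.2, 0.9])
loss = grouped_pairwise_margin_loss(scores, means, stds)
loss.backward()
```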
Our ranking-based approach also helps tackle the data sparsity issue. Although we only collected responses for 5k images due to limited survey bandwidth, the pairwise ranking formulation effectively expands the dataset to 5 × C(1000, 2) = 5 × 499,500 ≈ 2.5M pairs, and the model is able to learn nuances between images to understand why one may be perceived as higher quality than another.
Moreover, the ranking formulation makes the signal more suitable for adoption in Pinterest's downstream surface recommendation systems, where the absolute score matters much less than the relative differences between Pins.
Offline Results
Offline evaluation on a holdout test set showed that our model can correctly distinguish between higher and lower content quality, hinting early on that it could serve as an informative feature if incorporated into the surface recommenders. As expected, images that were assigned high scores by our model looked similar to those that received high ratings in our survey. From Figure 6, we can see that the model-predicted scores align well with the Pinner ratings for our holdout test set, with over 90% of predictions falling within one standard deviation of the mean responses. The kernel density estimate plots in Figure 7 further show that our model can distinguish between high and low quality images.
Quantitatively, we looked at two key metrics. The first is pairwise ranking accuracy, which measures whether the model can correctly predict which of two given images is higher quality. The second is NDCG@20, or Normalized Discounted Cumulative Gain, which measures how correctly a list of 20 images is ranked by the model's scores, with more importance placed on the top of the list. The results are summarized in Table 1. Overall the model performs well, with better performance on some verticals (Art) than others (Women's Fashion).
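For reference, both metrics can be computed from model scores and mean survey ratings along these lines; this is a simplified sketch with toy data that uses the mean ratings directly as NDCG gains.

```python
import itertools
import numpy as np

def pairwise_accuracy(pred_scores, true_ratings):
    """Fraction of image pairs (with distinct true ratings) ordered correctly by the model."""
    correct, total = 0, 0
    for i, j in itertools.combinations(range(len(pred_scores)), 2):
        if true_ratings[i] == true_ratings[j]:
            continue
        total += 1
        correct += (pred_scores[i] > pred_scores[j]) == (true_ratings[i] > true_ratings[j])
    return correct / total

def ndcg_at_k(pred_scores, true_ratings, k=20):
    """NDCG@k: rank images by model score and compare against the ideal ordering by true rating."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    k = min(k, len(true_ratings))
    order = np.argsort(pred_scores)[::-1][:k]      # indices of the model's top-k
    ideal = np.sort(true_ratings)[::-1][:k]        # best achievable gains
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(true_ratings[order] * discounts)
    idcg = np.sum(ideal * discounts)
    return dcg / idcg

# Toy example with five images (k trimmed to the list length).
preds = np.array([0.9, 0.2, 0.7, 0.4, 0.6])
truth = np.array([4.8, 1.5, 4.1, 3.0, 3.6])
print(pairwise_accuracy(preds, truth), ndcg_at_k(preds, truth, k=5))
```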
**Table 1.** Offline evaluation results on the test set. The model performs better for some verticals, like Art and Beauty, than others.
**Figure 6.** Distribution of model-predicted scores (blue) and the "true" Pinner mean ratings (green). Outliers whose predictions fall outside one standard deviation of the mean responses are marked in red.
**Figure 7.** Kernel density estimate (KDE) plot of model-predicted scores for the top and bottom 100 Pins by true Pinner ratings. The peaks of the true and predicted distributions for both sets are roughly aligned, with our model-predicted scores being more spread out due to the varying nature of responses.
Impact
The recommender systems for major surfaces at Pinterest such as Homefeed determine the content that makes it to a Pinner’s feed, using a combination of objective content features (e.g., the L1 interest, the Pin title and description, etc.) and subjective user features (e.g., what other Pins the Pinner has interacted with, search queries, etc.).
Online A/B experiment results showed that the visual quality signal we built is a win/win for both Pinners and the business across all three major user-facing surfaces at Pinterest: Homefeed, Search, and Related Pins. We saw significant reductions in "low quality" sessions (i.e., sessions where Pinners encounter low quality content) and increases in "successful" sessions (i.e., sessions where Pinners are able to find what they're looking for), showing that the overall Pinner experience improved. We also saw increases in individual engagement metrics, such as repins of organic content and long click-throughs on shopping content. Taken together, these suggest we're delivering better content that Pinners want to interact with.
**Table 2.** Examples of some of our numerous online metric wins across all three major ranking surfaces. Although we split the table into two sets of wins for clarity, we strongly feel that any win for Pinners is also a win for our business, and vice versa.
Building on this Success
At Pinterest, we strive to incorporate Pinner feedback directly into our product to keep improving and Putting Pinners First. Surveys are an excellent way to give Pinners a voice and learn what they think is high quality, and for them to tell us what kind of content they’d like to see more of on their feeds. Our work has shown that incorporating survey feedback into our ranking systems is a win-win for both Pinners and our business!
At the end of the day, Pinterest is about personalization. Pinners choose what ultimately makes it to their boards. Improving the quality of recommendations for the general audience helps us refine the “best” images that will eventually make it to their boards.
We plan to continue building on the success we’ve found in this work. We have conducted additional similar surveys and will iterate on our model. We’ve explored leveraging state-of-the-art Visual Language Models (VLMs) that can learn more intrinsic information from the survey responses, and we are keen to productionize these in the next iteration of this signal in 2026. Moreover, we also monitor Pinner perception of content quality via ongoing tracking surveys, and we are working on expanding our survey-based signals to include other types of content besides images.
Acknowledgements
This was a cross-functional effort that would not have been possible without excellent support from our partner teams. The Content Quality Team would sincerely like to thank:
- Survey Support Team
- Anket Team
- Experience Framework Team
- Related Pins Ranking Team
- Search Ranking Team
- Homefeed Team