Dmitry Kislyuk | Director, Machine Learning; Ryan Galgon | Director, Product Management; Chuck Rosenberg | Vice President, Engineering; Matt Madrigal | Chief Technology Officer
Foreword from Bill Ready, CEO
The AI landscape is undergoing a fundamental shift, and it’s not the one you think. The competitive frontier isn’t only about building the largest proprietary models. There are two other major trends emerging that haven’t had enough discussion:
- Open-source models have made tremendous strides, especially on cost relative to performance.
- Compact, fit-for-purpose models can meaningfully outperform general-purpose LLMs on specific tasks, and do so at dramatically lower cost.
Our Chief Technology Officer and AI team share how we are using open-source AI models at Pinterest to achieve performance comparable to leading proprietary AI models at less than 10% of the cost. They also share how Pinterest has built in-house, fit-for-purpose models that significantly outperform leading general-purpose proprietary models.
The race to build the largest, most powerful models is profound and meaningful. If you want to see a thriving ecosystem of innovation in an AI-driven world, you should also want to see a thriving open-source AI community that creates democratization and transparency. It’s a good thing for us all that open source is in the race.
For our part, we’ll continue to share our findings in leveraging open-source AI so that more companies and builders can benefit from the democratizing effect of open-source AI.
Pinterest helps users worldwide search, save, and shop for the best ideas, powered by our visual AI capabilities. These capabilities are built on a mix of models operating across different modalities; a recent development is that for applications requiring LLMs and VLMs¹, we have found significant advantages in adapting open-source models with Pinterest’s unique data and existing technologies. As a result, Pinterest has been shifting more of our AI investments toward fine-tuned open-source models, achieving similar quality at a fraction of the cost, particularly for visual and multimodal tasks. This shift reflects a broader industry trend: core LLM architectures are commoditizing, while differentiation increasingly comes from domain-specific data, personalization, and product integration.
It is worth taking a closer look at the technical strategy behind foundation models at Pinterest. Just because a capability can be built in-house does not mean it should or must be. The build-versus-buy-versus-adapt tradeoff is a well-understood concept in the industry, and AI models are no different. At Pinterest, we structure our thinking about this question by looking at the primary modality the foundation model is optimized for:
- Users (recommendation systems). User modeling systems are typically deeply coupled with the specific product that they are optimized for and are thus nearly always built internally for large search and social platforms. Given its scale, Pinterest has published extensive work on utilizing long-term sequences of user actions and universally compatible user representations for recommendation systems, relying on an image-board-user graph consisting of hundreds of billions of nodes to build these capabilities. These systems include both representation learning approaches (e.g. PinFM) and generative recommendation models (e.g. PinRec).
- Visual (encoders and diffusion models)². As a visual platform, Pinterest has consistently invested in building frontier image understanding models. Although strong open-source models are available, we have found that the rich visual datasets curated through our visual search product and visual board collections enable the large-scale weakly-supervised pretraining that modern visual AI systems require. Thus, we largely default to training these models internally from scratch as well.
- Text (LLMs)³. The remarkable progress on text modeling and more abstract capabilities (e.g. reasoning) has been empirically tied to training with enormous amounts of compute and internet-scale text data, and Pinterest has largely relied on both open-source and third-party proprietary LLMs to build the best possible products for our users in recent years.
What we are observing now is that open-source multimodal LLM architectures have begun to level the playing field of model capabilities. Critically, across many product categories at Pinterest, the core differentiation is shifting to the ability to fine-tune models with domain-specific data and to invest in end-to-end optimization and integration.
The trend toward domain-specific data and deep product integration as the core differentiator can be seen as a return to a familiar pattern in the ML industry. In the first decade of the AlexNet era, core architectures were routinely commoditized, and either fine-tuning open-source models or training models on specific web-scale datasets was the most common form of development. We saw this first-hand with our development of various visual encoders (e.g. UVE, PinCLIP), where training embedding models from scratch on Pinterest image and visual search data has yielded meaningful retrieval gains over off-the-shelf embedding models⁴. Recently, we’ve also seen this with Pinterest Canvas, our image generation foundation model, where tuning an internally-trained diffusion model for specific image editing and enhancement use-cases with Pinterest data has thus far yielded better results than using larger but more general-purpose visual generative models.
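For readers less familiar with this training recipe, below is a minimal sketch of the CLIP-style contrastive objective behind encoders like the ones referenced above (see also footnote ²). The encoders, batch, and dimensions are generic placeholders rather than Pinterest internals; the point is simply that the model learns from paired image/text data, which is why domain-specific pairs matter so much.

```python
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, text_encoder, images, texts, temperature=0.07):
    """One CLIP-style training step: pull paired image/text embeddings together
    and push apart in-batch negatives. Encoders and inputs are placeholders."""
    img_emb = F.normalize(image_encoder(images), dim=-1)  # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)    # (B, D)

    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: each image should match its own text, and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

The objective itself is standard; gains like the ones noted in footnote ⁴ come largely from what the training pairs contain, not from a novel loss.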
Our most recent data point in this trend comes from the beta launch of Pinterest Assistant in October of this year. We can think of Pinterest Assistant as comprising two sets of ML technologies. First, there is an underlying set of multimodal retrieval systems, recommendation services, and specialized generative models (including other LLMs) that serve as tools for an agentic LLM to invoke. These tools are predominantly Pinterest-native and rely on our user and visual foundation models.
Second, there is the core multimodal LLM itself, which oversees the agentic loop and is responsible for query understanding, query planning, and effective tool calling. The key factor with this LLM is that it acts as an intelligent router that recursively delegates much of the recommendation and agentic work to the aforementioned Pinterest-native tools. In this design, the biggest levers we have for product improvements are scaling the quality of the tools and scaling test-time compute (e.g. breaking the call down into more advanced steps), as opposed to focusing solely on using the largest core LLM possible (a simplified sketch of this loop follows the list below). Indeed, as comparisons of LLMs start showing small or negligible differences, we have observed that open-source solutions meet our product needs; we’re getting more value by focusing on building out more domain-specific tools, fine-tuning for product-specific use-cases, optimizing for latency, and so on. There are some benefits we are particularly excited about as we adopt more open-source multimodal LLMs at Pinterest:
- Cost. Pinterest’s visual-first AI systems have heavy image understanding requirements, often dealing with tens of images in each conversation turn. In these situations, we are currently observing that self-hosted, fine-tuned open-source models allow for an order-of-magnitude reduction in inference costs. With further engineering investment in inference optimizations like disaggregated serving and smarter cache routing, we expect the cost and throughput of internal model hosting to become even more favorable.
- Personalization. Personalization with a proprietary LLM typically happens by passing in relevant context about a user as text, and there are many situations where this is sufficient. However, the visual nature of Pinterest means that user representations encode many stylistic and visual preferences that do not translate well to text, and the ability to fine-tune an LLM to natively interoperate with internal embeddings, session context, and other user signals is highly desirable for producing relevant results.
- Capabilities. Beyond personalization, leveraging precomputed internal content embeddings (e.g. PinCLIP) projected into internal LLMs allows for more efficient computation of long visual contexts (see the projection sketch after this list). For Pinterest product development, this is an important consideration, since board and collage objects are a natural target for conversations, and they may consist of dozens or hundreds of Pins. On the output side, another capability we have already seen progress on is the ability to fine-tune for specific tool calling (e.g. refining multimodal queries), especially in ways that enable future extensibility to novel tools without needing end-to-end fine-tuning of all capabilities jointly.
- Brand Values. There are significant advantages in being able to tune and update a model directly to align with Pinterest’s Inclusive AI commitments and our Community Guidelines, as opposed to layering this on as an additional module on top of a third-party API call.
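To make the router-and-tools design described above more concrete, below is a deliberately simplified sketch of an agentic loop around a self-hosted model exposed through an OpenAI-compatible API, a common way to serve open-source LLMs. The endpoint, model name, and tool names such as visual_search and board_recommendations are hypothetical placeholders rather than Pinterest’s actual services; what the sketch illustrates is the loop itself: the LLM either answers or requests tool calls, and tool results are fed back until it can respond.

```python
import json
from openai import OpenAI  # self-hosted open-source models often expose an OpenAI-compatible API

# Hypothetical stand-ins for Pinterest-native retrieval and recommendation services.
def visual_search(query: str) -> list[dict]:
    return []  # placeholder: would call a multimodal retrieval system

def board_recommendations(user_id: str) -> list[dict]:
    return []  # placeholder: would call a recommendation service

TOOL_IMPLS = {"visual_search": visual_search, "board_recommendations": board_recommendations}
TOOLS = [
    {"type": "function", "function": {
        "name": "visual_search",
        "description": "Retrieve Pins matching a multimodal query.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "board_recommendations",
        "description": "Recommend boards for the current user.",
        "parameters": {"type": "object",
                       "properties": {"user_id": {"type": "string"}},
                       "required": ["user_id"]}}},
]

client = OpenAI(base_url="http://internal-llm-serving/v1", api_key="unused")  # hypothetical endpoint

def run_assistant_turn(messages: list[dict], max_steps: int = 4) -> str:
    """Minimal agentic loop: the router LLM either answers directly or requests
    tool calls, whose results are appended until it produces a final response."""
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="fine-tuned-open-source-vlm",  # placeholder model name
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the router answered without needing more tools
        messages.append(msg)  # keep the assistant's tool-call request in context
        for call in msg.tool_calls:
            result = TOOL_IMPLS[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "Sorry, I couldn't complete that request."
```

Scaling test-time compute in this framing means allowing more loop steps and richer tool plans, which is largely independent of how big the core LLM is.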
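Similarly, the projected-embedding idea from the Capabilities item above can be sketched in a few lines: rather than feeding dozens of Pins into the model as raw images, precomputed content embeddings are mapped into the LLM’s hidden dimension through a small learned projection, so each Pin costs roughly one position of context. The module structure and dimensions below are illustrative assumptions, not Pinterest’s actual architecture.

```python
import torch
import torch.nn as nn

class PinEmbeddingProjector(nn.Module):
    """Maps precomputed content embeddings (e.g. CLIP-style Pin embeddings) into
    the LLM hidden size, so each Pin becomes one 'soft token' instead of hundreds
    of image-patch tokens. Dimensions are illustrative."""

    def __init__(self, content_dim: int = 768, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(content_dim, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, pin_embeddings: torch.Tensor) -> torch.Tensor:
        # (num_pins, content_dim) -> (num_pins, llm_hidden)
        return self.proj(pin_embeddings)

# During fine-tuning, the projected vectors are concatenated with the text token
# embeddings before the transformer layers, so a board of 100 Pins adds 100
# positions of context rather than 100 full images.
projector = PinEmbeddingProjector()
soft_tokens = projector(torch.randn(100, 768))  # e.g. one board's worth of Pins
```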
Looking ahead, ML and AI capabilities at Pinterest will continue to be powered by a mix of internally-developed foundation models, fine-tuned open-source models, and licensed third-party models. In addition, third-party AI platforms are widely used by Pinterest engineering teams for coding tools, internal productivity, and rapid prototyping. However, the scalability advantages and capability gains from all forms of internally hosted models, whether trained from scratch or fine-tuned, are leading to a change in technology defaults at Pinterest. Furthermore, the development of model families that provide generative capabilities across a variety of latency and throughput requirements has allowed for a development pattern where product teams prototype and iterate with third-party models, while the ML teams develop more scalable and personalized internal models for the relevant capability.
How long will this open-source trend hold? We can only make an educated guess. The large-scale buildout of AI data centers may result in more step-function jumps in quality and emergent capabilities for proprietary models. In parallel, the supply-side growth in chip production may drive down fine-tuned open-source inference costs even further. Either way, our strategy at Pinterest to bring inspiration to all of our users will remain the same: leverage our visual, graph, and recommendation data to build the best and most efficient models we can, and address any capability gaps by partnering with third-party providers, alongside regular research & development from Pinterest Labs.
¹We use “LLM” to refer both to text-only models and to multimodal visual LLMs that contain an image encoder, sometimes referred to as Visual Language Models (VLMs). Most applications of generative models at Pinterest require visual inputs, so internally we assume multimodal capabilities as a default.
²Visual models are commonly trained with text supervision via contrastive learning or other forms of conditioning. But the dominant training signal for the model remains the raw visual input.
³Most LLMs benefit from a mix of modalities, with VLMs designed explicitly for this purpose. However, pre-training remains focused on an autoregressive text token prediction task, which is why we characterize them as text-dominant models.
⁴For example, we have seen our PinCLIP system outperform state of the art open-source multimodal embeddings by more than 30% on core retrieval tasks.