Don't use cosine similarity carelessly

Midas turned everything he touched into gold. Data scientists turn everything into vectors. We do it for a reason — as gold is the language of merchants, vectors are the language of AI.

Just as Midas discovered that turning everything to gold wasn’t always helpful, we’ll see that blindly applying cosine similarity to vectors can lead us astray. While embeddings do capture similarities, they often reflect the wrong kind - matching questions to questions rather than questions to answers, or getting distracted by superficial patterns like writing style and typos rather than meaning. This post shows you how to be more intentional about similarity and get better results.

Embeddings

Embeddings are so captivating that my most popular blog post remains “king - man + woman = queen; but why?”. We have word2vec, node2vec, food2vec, game2vec, and if you can name it, someone has probably turned it into a vec. If not yet, it’s your turn!

When we work with raw IDs, we’re blind to relationships. Take the words “brother” and “sister”: to a computer, they might as well be “xkcd42” and “banana”. But with vectors, we can discover relationships between them, both to provide structured input to machine learning models and, on their own, to find similar items.

Let’s focus on sentence embeddings from Large Language Models (LLMs), as they are one of the most popular use cases for embeddings. Modern LLMs are so powerful at this that they can capture the essence of text without any fine-tuning. In fact, recent research shows these embeddings are almost as revealing as the original text - see Morris et al., Text Embeddings Reveal (Almost) As Much As Text, (2023). Yet, with great power comes great responsibility - both in terms of how we use these powerful models and how we protect the privacy of the data we store and process.

Example

Let’s look at three sentences:

A: “Python can make you rich.”
B: “Python can make you itch.”
C: “Mastering Python can fill your pockets.”

If you treated them as raw IDs, they are just different strings, with no notion of similarity. Using string similarity (Levenshtein distance), A and B differ by 2 characters, while A and C are 21 characters apart. Yet semantically (unless you’re allergic to money), A is closer to C than to B.
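To make the string-similarity comparison concrete, here is a minimal sketch of Levenshtein distance in plain Python. The function name `levenshtein` and the variable names are my own choices for illustration, and I am assuming each sentence is compared with its trailing period and without the surrounding quotes; a dedicated library would be the better choice for real workloads.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


# The three example sentences (assumed formatting: no quotes, trailing period kept).
A = "Python can make you rich."
B = "Python can make you itch."
C = "Mastering Python can fill your pockets."

dist_ab = levenshtein(A, B)
dist_ac = levenshtein(A, C)
print("A vs B:", dist_ab)
print("A vs C:", dist_ac)
```

By this measure A sits right next to B and far from C, which is exactly the opposite of what the sentences mean.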

We can use OpenAI’s text-embedding-3-large to get the following vectors:
