Let’s say you want to find the top 5 similar words for a given word. Why would we do that? Similarity search is fundamental to recommendation systems, search engines, and content discovery. Let’s build one using Spotify’s production-grade library.
I just discovered Voyager, the open-source Approximate Nearest Neighbors (ANN) library by Spotify. This library is used for key personalization features on Spotify like Discover Weekly.
What is ANN?
Approximate nearest neighbor (ANN) algorithms are techniques used to find data points in a dataset that are close to a given query point, but not necessarily the exact closest ones. They are designed to speed up the search process by sacrificing some accuracy for improved efficiency, making them useful in applications like recommendation systems and image retrieval.
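To see what “approximate” buys you, it helps to look at the exact alternative first. A brute-force search computes the distance from the query to every point in the dataset, which is precise but scales linearly with dataset size; graph-based indexes like Voyager’s (it is built on HNSW) avoid that full scan. Here is a minimal NumPy sketch of the exact baseline, for contrast (illustrative only, not Voyager code):
import numpy as np

def exact_top_k(query, points, k=5):
    # Brute force: compute the distance from the query to every point,
    # then keep the k smallest. Exact, but the cost grows linearly with
    # the number of points; this full scan is what ANN indexes avoid.
    distances = np.linalg.norm(points - query, axis=1)
    return np.argsort(distances)[:k]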
Meet our dataset: Global Vectors for Word Representation
We will work with GloVe (Global Vectors for Word Representation), a very popular set of pre-trained word embeddings used in Natural Language Processing (NLP).
Each line consists of a word followed by 50 real numbers (floats), separated by spaces. It looks like this:
word vec_0 vec_1 vec_2 ... vec_49
For the record, the “6B” in the filename refers to the roughly 6 billion tokens of the training corpus, a combination of Wikipedia 2014 and Gigaword 5.
Let’s prepare our vectors for Voyager
Now that we have the embedding file, let’s load it into structures Voyager can consume.
import numpy as np

def load_data_for_voyager(filepath):
    """
    Reads GloVe vectors and prepares them for Voyager.

    Returns:
        words (list): List of words.
        vectors (np.array): Array of vectors.
        word_to_id (dict): Mapping from word to its index.
    """
    words = []
    vectors_list = []
    word_to_id = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            parts = line.strip().split(' ')
            word = parts[0]
            vector = np.array([float(x) for x in parts[1:]], dtype=np.float32)
            words.append(word)
            vectors_list.append(vector)
            word_to_id[word] = i
    return words, np.array(vectors_list), word_to_id
What we are doing here is splitting each line word vec_0 vec_1 vec_2 ... vec_49 into three parts:
- The first element, i.e. the word itself.
- All elements after the first one, converted to floating-point numbers and stored as a NumPy array.
- The id of the word, used to quickly look up the vector for the sample_word before making the query.
We can now load the file:
import os

glove_filepath = '/path/to/glove.6B.50d.txt'
if not os.path.exists(glove_filepath):
    # A bare `return` is only valid inside a function; at module level,
    # exit the script instead.
    raise SystemExit(f"Error: GloVe file not found at {glove_filepath}")

print(f"Loading and preparing GloVe vectors from {glove_filepath}...")
words, vectors, word_to_id = load_data_for_voyager(glove_filepath)
print(f"Loaded {len(words)} word vectors.")
This should print an output like:
Loading and preparing GloVe vectors from /path/to/glove.6B.50d.txt...
Loaded 400001 word vectors.
And if we look up a specific word, we can find its vector:
sample_word = "dog"
if sample_word not in word_to_id:
    # Exit rather than using a bare `return` outside a function.
    raise SystemExit(f"Word '{sample_word}' not found in GloVe vectors.")

query_vector = vectors[word_to_id[sample_word]]
print(f"Vector for '{sample_word}': {query_vector[:5]}...")
This will print:
Vector for 'dog': [ 0.11008 -0.38781 -0.57615 -0.27714 0.70521]...
Voyager, create the index! 🚀
Everything is now set to create our index and run our query. Let’s first build the index and add the vectors we just loaded:
from voyager import Index, Space

# Create and build the Voyager index
print("Building Voyager index...")
index = Index(Space.Cosine, num_dimensions=vectors.shape[1])
index.add_items(vectors)
print("Index built.")
We use Space.Cosine because it measures angular similarity between vectors, which works well for word embeddings where direction matters more than magnitude.
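A quick plain-NumPy check (not Voyager-specific) makes the point: scaling a vector changes its Euclidean distance to the original but leaves the cosine distance at zero, because the direction is unchanged.
import numpy as np

def cosine_distance(a, b):
    # 1 minus cosine similarity: 0 for parallel vectors, 2 for opposite ones.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([0.1, 0.3, -0.2], dtype=np.float32)
print(cosine_distance(v, 10 * v))   # ~0.0: scaling preserves the direction
print(np.linalg.norm(v - 10 * v))   # ~3.37: magnitude dominates Euclidean distance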
Look for the neighbors
Everything is now set to find the 5 words most similar to our word.
# Query the index for the 6 nearest neighbors
neighbors, distances = index.query(query_vector, k=6, num_threads=1)

print(f"\n5 words most similar to '{sample_word}':")
for i, neighbor_id in enumerate(neighbors):
    # Skip the first result as it's the query word itself
    if i == 0:
        continue
    print(f"- {words[neighbor_id]} is {distances[i]:.4f} away from '{sample_word}'")
Note that we ask for 6 neighbors because the first one will be the word itself.
You don’t know what is close to your dog? Well, Voyager does! Here’s the output:
5 words most similar to 'dog':
- cat is 0.0782 away from 'dog'
- dogs is 0.1487 away from 'dog'
- horse is 0.2092 away from 'dog'
- puppy is 0.2245 away from 'dog'
- pet is 0.2275 away from 'dog'
What’s next?
This is a simple Python example of how to use Spotify’s Voyager. Many features remain to be discovered:
- The created index can be saved by calling index.save and reloaded later for your queries (see the sketch after this list)
- Voyager can be used in Python or Java/Scala
- It is fully multithreaded for index creation and querying
- Dependency-free install: only NumPy (any version) in Python, and no Java dependencies
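Here is a minimal sketch of that save/reload cycle, using Voyager’s save and Index.load methods (the .voy filename is just an example):
from voyager import Index

# Persist the built index so later runs can skip the build step.
index.save("glove_50d.voy")

# ...later, possibly in another process:
index = Index.load("glove_50d.voy")
neighbors, distances = index.query(query_vector, k=6)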
What happens if we use Space.Euclidean instead of Space.Cosine? What performance can we expect on large datasets? How can we build more awesome features into our products? Stay tuned for more!
The complete code can be found on GitHub. Read Spotify’s blog post about Voyager, and find Voyager itself on GitHub.