Building an Image-Based Recommendation System and Search Engine with Deep Learning and Elasticsearch

A Practical Guide Using Keras, ResNet, and Elasticsearch in a Dockerized Environment

Image Generated with ChatGPT.

This story provides a practical guide on how to build an image-based recommendation system using the ResNet50 deep learning model (with TensorFlow) and Elasticsearch as both a data store and a vector database, with Python and Docker.

Prerequisites
- Elasticsearch
- TensorFlow
- Docker

What is an Image-Based Recommendation System?

The idea behind an image-based recommendation system (IBRS) is that when a user selects a product, their primary interest is its visual features; therefore, the recommendation algorithm should suggest visually similar products. The figure below illustrates this idea.

Image Generated with ChatGPT

An image-based recommendation system consists of three steps: image embedding extraction, similarity search, and recommendation.

- Image Embedding Extraction: compute embedding vectors from images and store them in a vector space. Deep neural networks (CNNs or Transformer-based vision models) are well suited for this step.
- Similarity Search: perform vector search over the vector space. HNSW (Hierarchical Navigable Small World) graphs are a common index structure for this kind of approximate vector search.
- Recommendation: show the most similar items to the user.

Similarity search can be performed in two modes: offline and online. In offline mode, similarity scores are computed and updated in batches (step 2), stored, and later used for recommendations (step 3). In online mode, similarity search and recommendations (steps 2 and 3) are performed in real time. Each mode has its pros and cons, and you should choose the one that best suits your needs and resources.

In this story, I'm using convolutional neural networks for embedding extraction and Elasticsearch as a vector database for similarity search in online mode.

In the remainder of this story, I will explain how to (a) set up a single Elasticsearch node to store the source data, (b) perform embedding extraction using deep neural networks and update the Elasticsearch vector database, (c) make online recommendations with Elasticsearch kNN (k-nearest neighbors) search, and (d) conduct image-based searches with Elasticsearch. The code snippets provided in this story can be found in this GitHub repository.

1. Set up Elasticsearch and index the source data

1.1. Overview of the source dataset

The dataset used in this tutorial is provided with the associated GitHub repository. It's a zip file containing sample metadata, metadata.json, of fashion products together with their images. The metadata format is:

```json
[
  {
    "ID": 7541,
    "title": "Christina Gavioli",
    "slug": "christina-gavioli-3",
    "category": [
      "Fashion Women",
      "Women Blouse and Dress"
    ],
    "imPath": "images/Fashion Women/Women Blouse and Dress/CHRISTINA_GAVIOLI.jpg"
  },
  ...
]
```

imPath is the relative path of the product's image.

1.2. Run Elasticsearch using Docker

Use the following command to run a single-node Elasticsearch instance using Docker.

```bash
docker run -d --name elasticsearch -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:9.1.8
```

The xpack.security.enabled=false setting disables security for local development. It allows connecting and interacting with Elasticsearch at localhost:9200 without authentication. Please note that this is intended for local development purposes only.
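To make sure the container is up before going further, you can run a quick connectivity check from Python with the official elasticsearch client (assuming it is installed, e.g. with pip install elasticsearch). This small check is not part of the original repository, just a convenience sketch.

```python
from elasticsearch import Elasticsearch

# Connect to the local single-node instance started with Docker.
es = Elasticsearch("http://localhost:9200")

# ping() returns True when the node answers; info() returns cluster metadata.
if es.ping():
    print("Connected to Elasticsearch", es.info()["version"]["number"])
else:
    raise ConnectionError("Elasticsearch is not reachable at http://localhost:9200")
```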
1.3. Create the Elasticsearch Mapping

An Elasticsearch mapping specifies the data type associated with each field in an index. In this schema, the metadata consists of five fields: ID, title, slug, category, and imPath. Additionally, an image_features field of type dense_vector is defined to store the image embeddings.

```python
from elasticsearch import Elasticsearch

mapping = {
    "mappings": {
        "properties": {
            "ID": {"type": "integer"},
            "title": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}}
            },
            "slug": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}}
            },
            "category": {"type": "keyword"},
            "imPath": {"type": "keyword"},
            "image_features": {
                "type": "dense_vector",
                "dims": 2048,
                "index": True,
                "similarity": "cosine"
            }
        }
    }
}

es = Elasticsearch("http://localhost:9200")
index_name = "items"

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=mapping)
```

For the image_features field, it is important to understand the following properties:

- dims: the actual dimension of the dense vectors. In our case, the dimension is 2048, as we will see in section 2.
- index: when true (the default value), Elasticsearch builds an HNSW index for the field and can perform vector search on it. When set to false, Elasticsearch only stores the vectors and does not perform vector search.
- similarity: the similarity metric Elasticsearch uses for vector search. The default value is cosine; other metrics are dot_product and l2_norm.

Note that vector search functionality is available starting with Elasticsearch 8.

1.4. Index the data in the Elasticsearch index

```python
import json
from elasticsearch.helpers import bulk

METADATA_JSON_PATH = "dataset/metadata.json"

with open(METADATA_JSON_PATH, "r") as file:
    metadata = json.load(file)

actions = [
    {
        "_index": index_name,  # index_name = "items", see the previous code snippet.
        "_id": item["ID"],
        "_source": item,
    }
    for item in metadata
]

bulk(es, actions)
```

Using Postman or your web browser, you can check that the data was successfully indexed. Go to localhost:9200/items/_search. You should see a total of 1655 hits.

```json
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1655,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [ ... ]
  }
}
```

2. Image Embedding Extraction

Image embedding extraction involves transforming images into vectors of real values, known as embedding vectors. There are several options for extracting image embeddings: (1) CNNs, (2) Vision Transformers, and (3) embedding APIs.

2.1. Convolutional neural networks (CNNs)

CNNs were specifically designed for image understanding. They emerged from a series of studies of the visual cortex [1, 2] and were introduced by LeCun et al. (1998) [3] with the famous LeNet-5 architecture. However, it was only from 2012, thanks to the ImageNet challenges and the associated datasets [4], that CNNs started having great success in computer vision tasks, especially image classification. The AlexNet architecture by A. Krizhevsky et al. (2012) [5] won the 2012 edition of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge). From there, several successful models emerged in later editions, such as the VGG [6], Inception [7], and ResNet [8] families. Most deep learning frameworks implement these model families. In this post, I'm using TensorFlow/Keras.
2.2. Vision Transformer models

Transformers began evolving with the "Attention Is All You Need" [9] work from Google in 2017, which introduced the Transformer architecture. Several models were derived from it, such as the BERT [10] model and its successors, and the GPT [11] models (the models behind ChatGPT). Transformer models were applied exclusively to NLP-related tasks until October 2020, when Google Research published the first Vision Transformer (ViT) [12], showing that "pure transformers (without CNNs) applied directly to sequences of image patches can perform very well on image classification tasks" and that ViT "attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train". Find how to use the ViT model for image classification on Hugging Face.

2.3. Embedding APIs

Instead of running CNN or Transformer models locally for embedding extraction, you can offload the computation to an embedding API. There are several options for image embeddings: the Microsoft Image Embedding API, the Google Cloud Multimodal Embedding API, and the Amazon Nova Multimodal Embeddings, to name a few. One thing to consider is the cost of an embedding API when handling millions of images.

CNNs require less computation during inference than Vision Transformers. If you have sufficient compute resources, such as GPUs or TPUs, consider using Vision Transformers. Conversely, if your budget allows it, you can opt for an embedding API. Choose the strategy that best suits your situation. For this story, I selected CNNs because they are the most cost-effective option in terms of both computation and expense, and they remain effective today. I am specifically using the ResNet [8] model, trained on the ImageNet dataset.

2.4. Image Embedding Extraction with the ResNet model

The Residual Neural Network (ResNet) model was developed by a research team at Microsoft [8]. Their main contribution to the CNN architecture is the residual block (see the next figure): an identity mapping (skip connection) added around a small stack of convolutional layers (usually two or three).

A residual building block — Image from He et al. (2016).

By stacking different numbers of residual blocks, He et al. came up with the following architectures for the ImageNet challenge.

ResNet architectures for ImageNet — Table from He et al. (2016).

In this story, I'm using ResNet50 (50 layers) for embedding extraction. Since the model was trained for image classification on the ImageNet dataset, I use the last average pooling layer (the layer just before the classification layer) as the embedding layer. The following code snippet shows how to load the ResNet50 model for embedding extraction using TensorFlow.

```python
import numpy as np
from tensorflow.keras import Model
from tensorflow.keras.applications import resnet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image

# Initialize the ResNet50 model with the ImageNet weights.
resnet = resnet50.ResNet50(weights="imagenet")

# Create a submodel for embedding extraction: its output is the last
# average pooling layer, just before the classification layer.
resnet50_embedding_model = Model(
    inputs=resnet.inputs,
    outputs=resnet.get_layer("avg_pool").output,
)

def load_image(path):
    img = image.load_img(path, target_size=(224, 224))
    return image.img_to_array(img)  # Shape (224, 224, 3).

def preprocess_batch(img_arrays):
    # Shape (n, 224, 224, 3), where n is the number of images in the batch.
    batch = np.stack(img_arrays, axis=0)
    return preprocess_input(batch)

def extract_embeddings_batch(paths):
    imgs = [load_image(path) for path in paths]
    batch = preprocess_batch(imgs)
    return resnet50_embedding_model.predict(batch, verbose=0)

# Extract the embedding of a sample image.
img_paths = ["sample.png"]  # list of image paths
embeddings = extract_embeddings_batch(img_paths)  # the embeddings' shape is (1, 2048)
```

2.5. Extract embeddings and update the Elasticsearch index

The following code snippet finds documents where the image_features field does not exist, computes image embeddings for the associated images, and persists the results back to Elasticsearch. This pattern is particularly effective in orchestrated batch pipelines, where embedding extraction is limited to newly added documents.

```python
import os
from elasticsearch.helpers import bulk

DATASET_DIR = "dataset"  # adjust to the directory where the dataset (images + metadata) was unzipped

# Function to update the Elasticsearch index.
def bulk_update_embeddings(es_client, items, embeddings):
    """Create bulk actions to update the 'image_features' fields."""
    actions = []
    for item, vector in zip(items, embeddings):
        actions.append({
            "_op_type": "update",
            "_index": index_name,
            "_id": item["_id"],
            "doc": {"image_features": vector.flatten().tolist()},
        })
    bulk(es_client, actions)

# Find documents where the image_features field does not exist.
query = {
    "size": 64,
    "query": {
        "bool": {
            "must_not": {
                "exists": {"field": "image_features"}
            }
        }
    },
}

response = es.search(index=index_name, body=query, scroll="5m")
scroll_id = response["_scroll_id"]

# Iterate over the results, extract the embeddings, and update the Elasticsearch docs.
while True:
    hits = response["hits"]["hits"]
    if not hits:
        break

    # Batch processing. Note that the batch size is 64.
    paths = [os.path.join(DATASET_DIR, hit["_source"]["imPath"]) for hit in hits]
    embeddings = extract_embeddings_batch(paths)
    bulk_update_embeddings(es, hits, embeddings)

    # Scroll to the next page of Elasticsearch results.
    response = es.scroll(scroll_id=scroll_id, scroll="5m")
```

3. Similarity Search and Recommendations with Elasticsearch

Now that the documents in the Elasticsearch index have embedding vectors, we can perform vector search for recommendations. The code snippet below defines two useful methods:

- knn_search: finds the 10 most similar products given an item ID. The apply_filter parameter decides whether the kNN search should be restricted to the subset of the vector space corresponding to the same category as the query item.
- make_recommendations: calls knn_search and displays the results using a display_knn method (its definition is in this notebook from the GitHub repository; a minimal sketch is also shown after the query breakdown below).

```python
item_ids = [item["ID"] for item in metadata]

def knn_search(item_id, k=10, num_candidates=100, apply_filter=False):
    # Query Elasticsearch to get all the fields of the referenced item.
    res = es.get(index=index_name, id=item_id)
    ref_item = res["_source"]

    # Build the knn query.
    knn_query = {
        "knn": {
            "field": "image_features",
            "query_vector": ref_item["image_features"],
            "k": k,
            "num_candidates": num_candidates,
        }
    }

    # Add a filter on the category if apply_filter is enabled.
    if apply_filter:
        knn_query["knn"]["filter"] = {
            "term": {"category": ref_item["category"][0]}
        }

    # Run the knn query.
    res = es.search(index=index_name, query=knn_query)
    knn_items = res["hits"]["hits"]
    return ref_item, knn_items

def make_recommendations(item_id, apply_filter=False):
    ref_item, knn_items = knn_search(item_id=item_id, apply_filter=apply_filter)
    display_knn(knn_items, ref_item=ref_item)
```

Zoom on the knn query:

- field: the name of the field used for vector search.
- query_vector: the vector of the query item.
- k: the final number of nearest neighbors to return as top hits.
- num_candidates: the number of nearest-neighbor candidates to consider per shard.
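As a side note, Keras can also build an equivalent embedding model in one call by dropping the classification head and requesting global average pooling directly. This is just an alternative to the submodel above (not the approach used in the repository); for ResNet50 it also produces 2048-dimensional vectors.

```python
from tensorflow.keras.applications import resnet50

# Same ImageNet weights, no classification head (include_top=False),
# with a global average pooling layer on top of the last convolutional block.
embedding_model = resnet50.ResNet50(
    weights="imagenet",
    include_top=False,
    pooling="avg",
)
```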
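The display_knn helper is defined in the companion notebook; for readers who want a self-contained script, here is a minimal sketch of what such a helper could look like. It assumes matplotlib is installed and that DATASET_DIR points at the unzipped images, and it may differ from the implementation in the repository.

```python
import os
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image

def display_knn(knn_items, ref_item=None):
    """Show the query item (if provided) followed by its nearest neighbors."""
    items = ([("query", ref_item)] if ref_item else []) + [
        (f"score={hit['_score']:.3f}", hit["_source"]) for hit in knn_items
    ]
    plt.figure(figsize=(3 * len(items), 3))
    for i, (label, item) in enumerate(items):
        ax = plt.subplot(1, len(items), i + 1)
        img = image.load_img(os.path.join(DATASET_DIR, item["imPath"]))
        ax.imshow(img)
        ax.set_title(f"{item['title']}\n{label}", fontsize=8)
        ax.axis("off")
    plt.show()
```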
For more details about kNN search with Elasticsearch, check the documentation.

3.1. Simple kNN Search

Now we can use the make_recommendations method to search and display similar items.

Image by the author

The example above worked well without enabling category filtering. However, in the next example, the recommended items include items that do not belong to the same category as the query item. This is because CNN embeddings capture low- and mid-level features such as edges, textures, and colors; they do not explicitly encode global, semantic information the way Vision Transformers do. Without an additional filtering step, the kNN search may return visually similar items from different categories, reducing recommendation relevance.

Image by the author.

To mitigate this limitation, we can use a filtered kNN search by filtering on the category of the query item.

3.2. Filtered kNN Search

By applying a category filter, we restrict the kNN search to items within the same category as the query item. This constraint significantly improves recommendation quality by ensuring that retrieved items are not only visually similar but also semantically relevant.

Image by the author.

Refer to the Filtered kNN with Elasticsearch documentation for more details.

Bonus Part: Image-based Search Engine

Once the recommendation engine is ready, setting up a search engine is straightforward. All the items in the data store already have their embeddings; we just need to compute the embedding vector of the query image.

```python
query_img_path = ["src_img.png"]  # list containing the path of the query image
query_vector = extract_embeddings_batch(query_img_path)

def image_based_search(query_vector, k=10, num_candidates=100):
    # Build the knn query.
    knn_query = {
        "knn": {
            "field": "image_features",     # use the image_features field for the knn search
            "query_vector": query_vector,  # the vector of the source image
            "k": k,
            "num_candidates": num_candidates,
        }
    }

    # Execute the knn query.
    res = es.search(index=index_name, query=knn_query)
    knn_results = res["hits"]["hits"]
    return knn_results

# Pass the query vector as a plain Python list.
knn_results = image_based_search(query_vector[0].tolist())
```

Example of search results.

The result is not that great, but there is room for improvement. Here again, the result contains items that are not from the same category as the query image. One way to mitigate this is to apply a classification model to determine the category of the item in the image, and then run a filtered kNN search using the predicted category, as sketched below.

Note: the categories (or labels) of your classification model should match the categories in your data store.
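As a rough illustration of that idea, the sketch below combines a hypothetical predict_category function with the filtered kNN query from section 3.2. predict_category is an assumption, not part of the repository: it stands for any classifier whose labels match the category values stored in the index.

```python
def predict_category(img_path):
    """Hypothetical classifier: returns a category label that matches
    the 'category' values stored in the Elasticsearch index."""
    raise NotImplementedError("plug in your own classification model here")

def categorized_image_search(img_path, k=10, num_candidates=100):
    # Embed the query image and predict its category.
    query_vector = extract_embeddings_batch([img_path])[0].tolist()
    category = predict_category(img_path)

    # Filtered knn query: restrict the search to the predicted category.
    knn_query = {
        "knn": {
            "field": "image_features",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            "filter": {"term": {"category": category}},
        }
    }

    res = es.search(index=index_name, query=knn_query)
    return res["hits"]["hits"]
```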
Conclusion

Building an image-based recommendation system involves three simple steps: image embedding computation, vector search, and recommendation. In this story, we saw how to implement these steps using CNNs and Elasticsearch. You can try it out with other embedding models and vector databases; in general, the principle remains the same. The relevance of the recommendations depends on the quality of the embeddings, which in turn depends on the embedding model (whether you use a transformer-based model or a simple CNN). You should also consider the dimension of the embedding vectors and the scalability of the vector database: high-dimensional vectors and large numbers of images are computationally expensive and may lead to recommendation latency if the vector search engine is not scaled accordingly.

Thanks for reading.

🙏 If you found this story useful, encourage me to produce more content like it by:
👉 Following me on Medium,
👉 Connecting with me on LinkedIn and GitHub.

References

[1] D. Hubel and T. Wiesel (1959). Receptive Fields of Single Neurons in the Cat's Striate Cortex. Journal of Physiology.
[2] D. Hubel (1959). Single-unit activity in striate cortex of unrestrained cats. Journal of Physiology, 147(2):226–238.
[3] Y. LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324.
[4] ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
[5] A. Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
[6] K. Simonyan and A. Zisserman (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
[7] C. Szegedy et al. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv [cs.CV], 11 Dec 2015.
[8] K. He et al. (2016). Deep Residual Learning for Image Recognition. Microsoft Research. In CVPR, 2016.
[9] A. Vaswani et al. (2017). Attention Is All You Need. Google Brain.
[10] J. Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Google Research. In NAACL, 2019.
[11] T. Brown et al. (2020). Language Models are Few-Shot Learners. OpenAI.
[12] A. Dosovitskiy et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Google Research.
[13] Vision Transformers vs. Convolutional Neural Networks (CNNs).