Been working on hybrid search (lexical + vector) for a while and accidentally discovered something fun: when you use good embeddings, you can literally search with emojis.
Not as a gimmick - it actually works because the embedding model (BGE-M3, 1024 dimensions) learned semantic relationships between concepts and their emoji representations.
Try it yourself
These are live search engines running on real e-commerce data:
Type 🔑 (key emoji) → get actual keys: https://search.opensolr.com/dedeman?q=🔑
Type 🚲 (bike) → get bicycles and accessories: https://search.opensolr.com/dedeman?q=🚲
Type 🖨️📄 (printer + paper) → get printer supplies: https://search.opensolr.com/b2b?q=🖨️📄
This one's my favorite - type "cute domestic pet earrings" on a jewelry store: https://search.opensolr.com/rueb?q=cute+domestic+pet+earrings
(it finds cat and dog earrings even though the product titles are in a completely different language)
How it actually works
The pipeline is:
- Crawl website → extract text with Trafilatura
- Generate 1024D embeddings via BGE-M3
- Store in Solr with both text + vectors
- At query time: run lexical search + KNN vector search
- Combine scores (hybrid approach)
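The query-time half of the pipeline can be sketched in a few lines. This is a runnable toy, not the production system: `embed()` is a deterministic stand-in for BGE-M3, the token-overlap count stands in for Solr's BM25 lexical score, and the score combination mirrors the `vector + lexical/(lexical + 6)` function query shown further down the post.

```python
import math

def embed(text, dim=8):
    # Stand-in for BGE-M3 (which produces 1024-d vectors): a tiny
    # deterministic bag-of-words hash, L2-normalized.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class HybridIndex:
    def __init__(self):
        self.docs = []  # (text, vector) pairs; Solr stores both fields

    def add(self, text):
        self.docs.append((text, embed(text)))

    def search(self, query, top_k=3):
        qvec = embed(query)
        qtoks = set(query.lower().split())
        scored = []
        for text, vec in self.docs:
            lexical = len(qtoks & set(text.lower().split()))  # stand-in for BM25
            vector = cosine(qvec, vec)                        # stand-in for KNN
            # Same shape as the Solr function query in the post:
            # vector score + squashed lexical score
            scored.append((vector + lexical / (lexical + 6), text))
        scored.sort(reverse=True)
        return [t for _, t in scored[:top_k]]
```

The point of the toy is the last few lines: the lexical score is squashed into a bounded range before being added to the vector score, so one side can't completely drown out the other.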
The emoji thing works because BGE-M3 was trained on multilingual + multimodal data. The model learned that 🔑 and "key" and "Schlüssel" (German) and "cheie" (Romanian) are all semantically close.
So when someone searches 🚲, the embedding is close to "bicycle", "bike", "Fahrrad", "bicicletă", etc.
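You can check this claim directly by comparing cosine similarities between an emoji embedding and word embeddings. A sketch (the `FlagEmbedding` package and model name follow BAAI's published usage; the model call is left commented because it needs a multi-GB model download):

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# With the real model (pip install FlagEmbedding):
#   from FlagEmbedding import BGEM3FlagModel
#   model = BGEM3FlagModel("BAAI/bge-m3")
#   vecs = model.encode(["🔑", "key", "Schlüssel", "cheie", "printer"])["dense_vecs"]
#   for word, v in zip(["key", "Schlüssel", "cheie", "printer"], vecs[1:]):
#       print(word, round(cosine(vecs[0], v), 3))
# Expectation: the three "key" words score noticeably higher against 🔑
# than the unrelated "printer" does.
```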
The weird part
Cross-language search just... works. The Romanian e-commerce site has products in Romanian, but you can search in English or with emojis and it finds relevant stuff. No translation layer, no language detection preprocessing - the embeddings handle it.
Same with conceptual queries. "things to wear around neck" finds necklaces, pendants, chains - even though no product has "things to wear around neck" in the title.
Stack details for the curious
- Embeddings: BGE-M3 (BAAI), 1024 dimensions
- Inference: Running on RTX 4000 Ada, ~2-5ms per query
- Search: Solr 9.6 with dense vector support
- Crawling: Custom PHP + Python (Playwright for JS-heavy sites, Trafilatura for extraction)
- Extra features: VADER for sentiment, langid for language detection, custom price extraction
Query latency is ~40-50ms total including embedding generation.
Hybrid vs pure vector
Pure vector search is cool but has issues:
- Exact matches sometimes rank lower than "similar" results
- Product codes/SKUs get weird results
- Users expect "nike shoes" to prioritize exact Nike matches
Hybrid fixes this. Lexical handles exact matches, vectors handle the "I don't know the exact word but I know what I want" queries.
The Solr query can be seen in the debug view (bottom-right button on the result page), which shows the actual vector and lexical query functions.
vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]
lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
pf="title^1100 description^900" ...}
q = {!func}sum(
product(1, query($vectorQuery)),
product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
)
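For anyone who wants to assemble the same request from application code, here's a sketch of building those Solr parameters in Python. The field names and boosts come straight from the query above; the `v=$userQuery` local param (Solr's standard parameter dereferencing) is my assumption for how the user's text reaches edismax, since the post elides that part with "...".

```python
def build_hybrid_params(user_text, query_vector, top_k=250):
    # Vector literal in the form Solr's {!knn} parser expects.
    vec_literal = "[" + ",".join(f"{v:.6f}" for v in query_vector) + "]"
    return {
        "vectorQuery": f"{{!knn f=embeddings topK={top_k}}}{vec_literal}",
        # Boosts copied from the post; v=$userQuery is an assumption.
        "lexicalQuery": ('{!edismax qf="title^550 description^450 uri^1 text^0.1" '
                         'pf="title^1100 description^900" v=$userQuery}'),
        "userQuery": user_text,
        # Function query combining both scores, as in the debug view.
        "q": ("{!func}sum("
              "product(1, query($vectorQuery)),"
              "product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6))))"),
    }

params = build_hybrid_params("bike", [-0.032, 0.009, -0.049])
# requests.get("https://host/solr/core/select", params=params)  # actual call
```

In production the `query_vector` is the BGE-M3 embedding of `user_text`, generated at query time.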
Bonus: AI-generated hints
Added an experimental feature where the search can explain results. Search "measure 🔥" on a technical documentation site and it tells you which specific device to use for measuring temperature/fire:
https://search.opensolr.com/fluke?q=measure+🔥
It pulls context from indexed PDFs and generates a recommendation. Uses a local LLM (running on same GPU).
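The hint step is basically retrieval-augmented generation: take the top hits (snippets from the indexed PDFs), stuff them into a prompt, and ask the local LLM for a short recommendation. A minimal sketch of the prompt-building half; `llm()` is a placeholder for whatever local model actually serves the feature:

```python
def build_hint_prompt(query, snippets, max_snippets=3):
    # Keep the prompt small: only the top few retrieved snippets.
    context = "\n".join(f"- {s}" for s in snippets[:max_snippets])
    return (
        f"User searched for: {query}\n"
        f"Relevant document snippets:\n{context}\n"
        "In one or two sentences, recommend which product fits and why."
    )

prompt = build_hint_prompt(
    "measure 🔥",
    ["Infrared thermometer, non-contact temperature measurement"],
)
# answer = llm(prompt)  # placeholder: local model on the same GPU
```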
Anyway, thought some of you might find the emoji thing interesting. The cross-language aspect was unexpected - I didn't build it for that, it just emerged from using multilingual embeddings.
Happy to answer questions about the setup or hybrid search in general.