Been working on hybrid search (lexical + vector) for a while and accidentally discovered something fun: when you use good embeddings, you can literally search with emojis.
Not as a gimmick - it actually works because the embedding model (BGE-M3, 1024 dimensions) learned semantic relationships between concepts and their emoji representations.
Try it yourself
These are live search engines running on real e-commerce data:
Type 🔑 (key emoji) → get actual keys: https://search.opensolr.com/dedeman?q=🔑
Type 🚲 (bike) → get bicycles and accessories: https://search.opensolr.com/dedeman?q=🚲
Type 🖨️📄 (printer + paper) → get printer supplies: https://search.opensolr.com/b2b?q=🖨️📄
This one's my favorite - type "cute domestic pet earrings" on a jewelry store: https://search.opensolr.com/rueb?q=cute+domestic+pet+earrings
(it finds cat and dog earrings even though the product titles are in a completely different language)
How it actually works
The pipeline is:
- Crawl website → extract text with Trafilatura
- Generate 1024D embeddings via BGE-M3
- Store in Solr with both text + vectors
- At query time: run lexical search + KNN vector search
- Combine scores (hybrid approach)
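The query-time half of the pipeline can be sketched in a few lines. This is a runnable toy, not the production system: `embed()` is a deterministic stand-in for BGE-M3, the token-overlap count stands in for Solr's BM25 lexical score, and the score combination mirrors the `vector + lexical/(lexical + 6)` function query shown further down the post.

```python
import math

def embed(text, dim=8):
    # Stand-in for BGE-M3 (which produces 1024-d vectors): a tiny
    # deterministic bag-of-words hash, L2-normalized.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class HybridIndex:
    def __init__(self):
        self.docs = []  # (text, vector) pairs; Solr stores both fields

    def add(self, text):
        self.docs.append((text, embed(text)))

    def search(self, query, top_k=3):
        qvec = embed(query)
        qtoks = set(query.lower().split())
        scored = []
        for text, vec in self.docs:
            lexical = len(qtoks & set(text.lower().split()))  # stand-in for BM25
            vector = cosine(qvec, vec)                        # stand-in for KNN
            # Same shape as the Solr function query in the post:
            # vector score + squashed lexical score
            scored.append((vector + lexical / (lexical + 6), text))
        scored.sort(reverse=True)
        return [t for _, t in scored[:top_k]]
```

The point of the toy is the last few lines: the lexical score is squashed into a bounded range before being added to the vector score, so one side can't completely drown out the other.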
The emoji thing works because BGE-M3 was trained on multilingual + multimodal data. The model learned that 🔑 and "key" and "Schlüssel" (German) and "cheie" (Romanian) are all semantically close.
So when someone searches 🚲, the embedding is close to "bicycle", "bike", "Fahrrad", "bicicletă", etc.
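You can check this claim directly by comparing cosine similarities between an emoji embedding and word embeddings. A sketch (the `FlagEmbedding` package and model name follow BAAI's published usage; the model call is left commented because it needs a multi-GB model download):

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# With the real model (pip install FlagEmbedding):
#   from FlagEmbedding import BGEM3FlagModel
#   model = BGEM3FlagModel("BAAI/bge-m3")
#   vecs = model.encode(["🔑", "key", "Schlüssel", "cheie", "printer"])["dense_vecs"]
#   for word, v in zip(["key", "Schlüssel", "cheie", "printer"], vecs[1:]):
#       print(word, round(cosine(vecs[0], v), 3))
# Expectation: the three "key" words score noticeably higher against 🔑
# than the unrelated "printer" does.
```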
The weird part
Cross-language search just... works. The Romanian e-commerce site has products in Romanian, but you can search in English or with emojis and it finds relevant stuff. No translation layer, no language detection preprocessing - the embeddings handle it.
Same with conceptual queries. "things to wear around neck" finds necklaces, pendants, chains - even though no product has "things to wear around neck" in the title.
Stack details for the curious
- Embeddings: BGE-M3 (BAAI), 1024 dimensions
- Inference: Running on RTX 4000 Ada, ~2-5ms per query
- Search: Solr 9.6 with dense vector support
- Crawling: Custom PHP + Python (Playwright for JS-heavy sites, Trafilatura for extraction)
- Extra features: VADER for sentiment, langid for language detection, custom price extraction
Query latency is ~40-50ms total including embedding generation.
Hybrid vs pure vector
Pure vector search is cool but has issues:
- Exact matches sometimes rank lower than "similar" results
- Product codes/SKUs get weird results
- Users expect "nike shoes" to prioritize exact Nike matches
Hybrid fixes this. Lexical handles exact matches, vectors handle the "I don't know the exact word but I know what I want" queries.
The Solr query can be seen in the debug view (bottom-right button on the result page), which shows the actual vector and lexical query functions.
vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]
lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
pf="title^1100 description^900" ...}
q = {!func}sum(
product(1, query($vectorQuery)),
product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
)
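For anyone who wants to assemble the same request from application code, here's a sketch of building those Solr parameters in Python. The field names and boosts come straight from the query above; the `v=$userQuery` local param (Solr's standard parameter dereferencing) is my assumption for how the user's text reaches edismax, since the post elides that part with "...".

```python
def build_hybrid_params(user_text, query_vector, top_k=250):
    # Vector literal in the form Solr's {!knn} parser expects.
    vec_literal = "[" + ",".join(f"{v:.6f}" for v in query_vector) + "]"
    return {
        "vectorQuery": f"{{!knn f=embeddings topK={top_k}}}{vec_literal}",
        # Boosts copied from the post; v=$userQuery is an assumption.
        "lexicalQuery": ('{!edismax qf="title^550 description^450 uri^1 text^0.1" '
                         'pf="title^1100 description^900" v=$userQuery}'),
        "userQuery": user_text,
        # Function query combining both scores, as in the debug view.
        "q": ("{!func}sum("
              "product(1, query($vectorQuery)),"
              "product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6))))"),
    }

params = build_hybrid_params("bike", [-0.032, 0.009, -0.049])
# requests.get("https://host/solr/core/select", params=params)  # actual call
```

In production the `query_vector` is the BGE-M3 embedding of `user_text`, generated at query time.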
Bonus: AI-generated hints
Added an experimental feature where the search can explain results. Search "measure 🔥" on a technical documentation site and it tells you which specific device to use for measuring temperature/fire:
https://search.opensolr.com/fluke?q=measure+🔥
It pulls context from indexed PDFs and generates a recommendation. Uses a local LLM (running on same GPU).
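The hint step is basically retrieval-augmented generation: take the top hits (snippets from the indexed PDFs), stuff them into a prompt, and ask the local LLM for a short recommendation. A minimal sketch of the prompt-building half; `llm()` is a placeholder for whatever local model actually serves the feature:

```python
def build_hint_prompt(query, snippets, max_snippets=3):
    # Keep the prompt small: only the top few retrieved snippets.
    context = "\n".join(f"- {s}" for s in snippets[:max_snippets])
    return (
        f"User searched for: {query}\n"
        f"Relevant document snippets:\n{context}\n"
        "In one or two sentences, recommend which product fits and why."
    )

prompt = build_hint_prompt(
    "measure 🔥",
    ["Infrared thermometer, non-contact temperature measurement"],
)
# answer = llm(prompt)  # placeholder: local model on the same GPU
```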
Anyway, thought some of you might find the emoji thing interesting. The cross-language aspect was unexpected - I didn't build it for that, it just emerged from using multilingual embeddings.
Happy to answer questions about the setup or hybrid search in general.