Illustration by Author
There’s a very specific kind of nerd joy in hearing a random track in a cafe, pulling out your phone, and having an app instantly tell you not only what it is, but also recommend five more tracks in the same vibe. That magic is no longer reserved for Shazam or Google. With modern audio embeddings and a good vector database, you can build something dangerously close to it from your laptop.
Shazam’s classic architecture is built on audio fingerprinting, not neural embeddings. The rough idea is actually quite simple:
- Compute a spectrogram of the audio.
- Pick the most prominent peaks to form a constellation map.
- Convert these peak patterns into compact hashes (audio fingerprints).
- Perform fast hash lookups against a giant fingerprint index to find a match.
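To make the constellation-map idea concrete, here’s a toy sketch (nowhere near Shazam’s real implementation): pair each spectrogram peak with a few later peaks, and hash the `(freq1, freq2, time_delta)` triples. Because the hashes encode *relative* timing, a short clip matches the full track no matter where in the song it starts.

```python
import hashlib

def fingerprint(peaks, fan_out=3):
    """Toy fingerprinting: pair each peak with a few later peaks and
    hash (freq1, freq2, time_delta) triples.
    `peaks` is a time-sorted list of (time_bin, freq_bin) tuples."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            h = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:10]
            hashes.append((h, t1))  # hash plus the anchor peak's offset
    return hashes

# The same peak pattern, shifted in time, produces the same hashes —
# which is exactly why a mid-song clip can still match.
track = [(0, 40), (1, 80), (2, 60), (3, 90), (4, 30)]
clip = [(10, 40), (11, 80), (12, 60), (13, 90), (14, 30)]
track_hashes = {h for h, _ in fingerprint(track)}
clip_hashes = {h for h, _ in fingerprint(clip)}
print(track_hashes == clip_hashes)  # → True
```

A real system would also use the stored time offsets to verify that matching hashes line up along a consistent time shift.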
This is insanely fast and robust to noise. But it’s also extremely literal. It’s built to answer:
“Is this exact recording in the database?”
It’s not built to answer:
- “Give me tracks that sound like this.”
- “Find covers / remixes / live versions of this.”
- “Find songs in the same mood / timbre / instrumentation.”
Those are semantic questions. They need something richer than discrete hashes: dense vector embeddings.
The modern stack for “Shazam‑but‑smarter” looks like this:
Song audio you dig
↓
Embedding model
↓
Embedding vector
↓
Vector DB (Qdrant) + ANN search (HNSW)
↓
Top‑K similar songs + metadata
Audio embeddings in practice
Instead of fingerprints, we feed audio into a pre‑trained model and get back a dense vector that encodes timbre, phonetics, texture, rhythm, and sometimes even higher‑order semantics.
One very practical model here: Wav2Vec2‑large‑XLSR‑53. Architecturally, it’s a BERT‑like transformer encoder with a 1024‑dim hidden state, trained on multilingual speech/audio.
The trick is:
- You feed in audio at 16 kHz.
- The model outputs a sequence of hidden states over time.
- You pool over time (e.g., mean pooling) to get a single 1024‑dim embedding for the clip.
Two clips that sound similar → vectors close together (high cosine similarity).
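The pooling step and the similarity claim can be sketched with NumPy alone. The arrays below are random stand‑ins for model hidden states (real ones come from running the checkpoint over audio), but they show the mechanics: mean‑pool `(time_steps, hidden_dim)` into one vector, then compare clips by cosine similarity.

```python
import numpy as np

def pool_embedding(hidden_states: np.ndarray) -> np.ndarray:
    """Mean-pool a (time_steps, hidden_dim) sequence into one clip embedding."""
    return hidden_states.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-ins for per-frame hidden states of three clips (1024-dim frames).
clip_a = rng.normal(size=(49, 1024))
clip_b = clip_a + rng.normal(scale=0.1, size=(49, 1024))  # lightly perturbed "similar" clip
clip_c = rng.normal(size=(49, 1024))                      # unrelated clip

emb_a, emb_b, emb_c = map(pool_embedding, (clip_a, clip_b, clip_c))
print(cosine_similarity(emb_a, emb_b) > cosine_similarity(emb_a, emb_c))  # → True
```

The similar pair lands close together; the unrelated pair’s similarity hovers near zero. That gap is what the vector database will exploit.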
Why Qdrant is a good fit for this
At this point, you have vectors. You could technically dump them into any vector DB or even FAISS. But Qdrant brings several very specific properties that matter a lot once you go beyond proof‑of‑concept.
1. Rust + HNSW = Fast, Predictable Latency
Qdrant is an open‑source vector DB written in Rust, using HNSW (Hierarchical Navigable Small World graphs) under the hood for approximate nearest neighbor search.
Key points that matter for audio: