Text Search Can't See. Multimodal Embeddings Can.
I typed "cozy introvert heaven" into a search box and got back a reading nook bathed in warm light, a person curled up with a book in a rainy cafe, and a minimalist Japanese bedroom with soft neutral tones.
No tags. No keywords. No one labeled these images "introvert heaven." The search worked because the text and the images were embedded in the same vector space, and the math said they belong together.
This is what multimodal embeddings do. I built Vibe Search to understand how.
What changed between v1 and v2
Gemini Embedding 1 was text-only. If you wanted to search images, you had to describe them with words first, then search those descriptions. The system never looked at the image. It looked at what someone wrote about the image.
That works fine when captions are accurate. It falls apart when the thing you're searching for is a feeling. Nobody tags a photo "the calm before everything changes" or "existential dread meets golden hour." But those are real moods that real photos capture.
Gemini Embedding 2 changed the game. It embeds the actual pixels alongside text, in the same 768-dimensional space. A photo of storm clouds and the phrase "dramatic weather" land near each other without anyone writing a caption. The image itself participates in retrieval.
That's not an incremental upgrade. It's a different kind of search.
The build
Vibe Search is a deliberately small testbed: 152 items (122 Unsplash photos across 18 categories, 30 meme templates). I kept it small to isolate retrieval behavior, not to benchmark scale.
The pipeline is three steps:
Step 1: Embed once. A Python script downloads images, sends each one (pixels + caption) to the Gemini embedContent API, and saves 768-dimensional vectors to a static JSON file.
Step 2: Embed the query. When you type a vibe, the app sends your text to the same model. When you upload a photo, it sends the pixels. Either way, you get back a 768-dim vector in the same space as the dataset.
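Either query type boils down to one payload shape. Here is a sketch of building that payload, modeled on the Gemini REST embedContent format; treat the exact field names and structure as assumptions to verify against the current API docs, not Vibe Search's verbatim code:

```javascript
// Build an embedContent-style request body for either a text query or an
// uploaded image. Field names mirror the Gemini REST API's "parts" structure,
// but check them against the current API reference before relying on this.
function buildEmbedRequest({ text, imageBase64, mimeType }) {
  const parts = []
  if (text) parts.push({ text })
  if (imageBase64) parts.push({ inlineData: { mimeType, data: imageBase64 } })
  return { content: { parts } }
}

// A typed vibe and an uploaded photo produce the same envelope:
const textBody = buildEmbedRequest({ text: "cozy introvert heaven" })
const imageBody = buildEmbedRequest({ imageBase64: "AAAA", mimeType: "image/jpeg" })
```

Whichever branch runs, the response is a 768-dim vector in the same space as the dataset.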
Step 3: Rank client-side. A few lines of cosine similarity, running entirely in the browser:
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
No vector database. No backend ranking. The embeddings JSON loads once, then every search is pure math on the client. For 152 items this is instant. For 10,000+ items you'd want a vector DB or an approximate-nearest-neighbor index, and the retrieval quality would be essentially the same.
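Ranking the whole dataset is then a map and a sort. A minimal sketch, reusing the cosine function above and assuming each item carries an `embedding` field:

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Score every item against the query vector, sort descending, keep the top k.
function rank(queryEmbedding, items, k = 10) {
  return items
    .map(item => ({ ...item, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```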
The result: type a mood, get photos. Upload a photo, get visually similar results. Click any result to explore its neighbors with zero API calls (its embedding is already loaded).
Where it breaks: the meme problem
Memes were the real test, and where the model hit its limits.
I searched "post-rain loneliness" on the Memes tab and got Hide the Pain Harold as the top result. Technically, the model saw a man in visible emotional pain. But the meme isn't about literal pain. It's about masking sadness with a fake smile at work, pretending everything is fine when it's not. That cultural context lives in how the internet uses the image, not in the pixels.
The fix was telling the model what the meme means, not what it looks like. I rewrote every meme caption from visual descriptions ("Older man smiling through visible emotional pain") to usage-context descriptions ("When you are smiling on the outside but dying on the inside. Masking sadness with a fake smile. Pretending to be happy at work.").
After re-embedding with rich captions, "when the deploy breaks on Friday" correctly returned "This Is Fine" as the #1 result. And "post-rain loneliness" correctly returned nothing, because no meme in the dataset actually captures that feeling.
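Returning nothing isn't free, because cosine similarity always produces a ranking; it takes a minimum-score cutoff. A sketch of that filter (the 0.5 threshold is an illustrative value, not Vibe Search's actual setting):

```javascript
// Drop ranked results whose similarity falls below a floor, so a weak query
// returns an empty list instead of the least-bad match.
function filterByThreshold(rankedResults, minScore = 0.5) {
  return rankedResults.filter(result => result.score >= minScore)
}
```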
The insight: multimodal embeddings are powerful for visual-semantic similarity, but cultural meaning still lives in usage context, not pixels. The model can see what's in the image. It can't see what the internet decided the image means.
Why this matters
Every app with a search bar is sitting on the same assumption: users describe what they want in words, and the system matches those words against a text index. Multimodal embeddings break that assumption. Users can now show what they want (upload a photo), describe a feeling (type a vibe), or explore by similarity (click something close).
The biggest shift isn't that search got better. It's that captions are no longer the only bridge between what users mean and what the system can retrieve.
Image-first e-commerce, visual bookmarking, mood-based playlists, "find me something that feels like this." These aren't hypothetical anymore. The embedding model exists, and as you just saw, the client-side math is trivial.
Build it yourself
If you want to try this with your own images:
- Pick 100-200 images from Unsplash, your own photos, or any collection
- Embed them once using the Gemini embedContent API (image + caption together)
- Save vectors + metadata to a JSON file (filename, caption, category, embedding)
- Run cosine similarity in-browser against the user's query embedding
- Test where captions help and where pixels are enough. That's where the interesting learning happens.
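Concretely, one record in the vectors file might look like this (field names follow the list above; the values are illustrative, not real dataset entries):

```javascript
// A single dataset entry: metadata plus its precomputed embedding.
const record = {
  filename: "rain-cafe.jpg", // illustrative, not an actual file in the dataset
  caption: "Person reading by a rainy window in a cafe",
  category: "cozy",
  embedding: new Array(768).fill(0), // 768 floats from the embedding model in practice
}
```

Stringify an array of these records and you have the entire search index.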
The entire Vibe Search dataset is a 2.4MB JSON file. The search runs in your browser. There is no infrastructure to manage.
Want to see more projects like this? Browse all builds for interactive tools, dashboards, and case studies with source and build times. Or learn more about how I work.