RAG & Vector DB Interview: Embeddings, Cosine Similarity, Dimensions, Models Compared

Learn about embeddings and the metrics used for measuring similarity, and discover effective chunking strategies. This knowledge is essential for optimizing data retrieval and understanding vector databases.

12 audio · 6:04

Nortren

What is a text embedding and how does it represent meaning?

0:27
A text embedding is a fixed-length vector of numbers that represents the meaning of text in a high-dimensional space. Texts with similar meaning produce vectors close together, measured by cosine similarity or dot product. Modern embeddings come from neural models trained on massive corpora, typically transformer encoders that learn to map semantically related sentences near each other. A typical embedding has between 384 and 3072 dimensions, with each dimension capturing some abstract feature of meaning learned during training.
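A minimal sketch of that similarity computation, using made-up 4-dimensional vectors in place of real model outputs (actual embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model outputs.
cat = [0.9, 0.1, 0.0, 0.3]
kitten = [0.8, 0.2, 0.1, 0.3]
stock = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(cat, kitten))  # close to 1: similar meaning
print(cosine_similarity(cat, stock))   # much smaller: unrelated meaning
```

The values themselves are invented; the point is that semantically close texts map to vectors with a small angle between them.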

What is the difference between cosine similarity and dot product for embeddings?

0:30
Cosine similarity measures the angle between two vectors regardless of their magnitude, returning a value between negative one and one. Dot product multiplies vectors element-wise and sums the result, factoring in both angle and magnitude. For unit-normalized vectors the two metrics produce identical rankings, since cosine equals the dot product divided by the product of magnitudes. Dot product is faster to compute because it skips the normalization step, which is why most production systems normalize embeddings once at ingest and use dot product at query time.
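The ranking equivalence can be checked directly with toy vectors (values invented for illustration): normalize once at ingest, then rank by plain dot product at query time.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [0.2, 0.9, 0.4]
docs = [[0.1, 0.8, 0.5], [0.9, 0.1, 0.2], [0.3, 0.7, 0.6]]

# Normalize documents once at ingest; the query is normalized per request.
q = normalize(query)
unit_docs = [normalize(d) for d in docs]

rank_by_dot = sorted(range(len(docs)), key=lambda i: dot(q, unit_docs[i]), reverse=True)
rank_by_cos = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(rank_by_dot == rank_by_cos)  # True: same ordering, cheaper query path
```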

Why are most production embeddings normalized to unit length?

0:29
Normalization to unit length lets you use dot product instead of cosine similarity without changing the ranking, which is faster in vector databases. It also makes distances comparable across texts, since two unit vectors cannot have inflated similarity from large magnitudes. OpenAI text-embedding-3 models, Cohere embeddings, and most sentence-transformers return normalized vectors by default. If your model does not normalize, divide each vector by its L2 norm before insertion, or your database may behave unexpectedly with dot product search.
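A small guard along these lines (the tolerance value is an arbitrary choice) can sit on the ingest path to catch models that do not normalize:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def ensure_unit_length(vec, tol=1e-6):
    """Return the vector normalized to unit length, unless it already is."""
    n = l2_norm(vec)
    if abs(n - 1.0) < tol:
        return vec  # model already returns unit vectors
    return [x / n for x in vec]

raw = [3.0, 4.0]            # L2 norm is 5.0: not normalized
unit = ensure_unit_length(raw)
print(unit)                  # [0.6, 0.8], norm ~= 1.0
```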

What is the difference between sparse and dense embeddings?

0:30
Dense embeddings are low-dimensional vectors, typically 384 to 3072 floats, where every dimension has a nonzero value and captures abstract semantic features. Sparse embeddings are high-dimensional vectors, often the size of the vocabulary, where most values are zero and nonzero entries correspond to specific terms with weights from algorithms like BM25 or learned methods like SPLADE. Dense embeddings excel at semantic similarity, while sparse embeddings handle exact keyword matches and rare terms better. Hybrid search combines both for best results.
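A sketch of the two representations side by side; the sparse weights here are invented, where in practice BM25 or SPLADE would produce them:

```python
# Dense: every dimension filled (toy 6-dim vector; real models use 384+).
dense = [0.12, -0.40, 0.33, 0.08, -0.21, 0.55]

# Sparse: vocabulary-sized and mostly zero, so only nonzero term weights
# are stored. Weights are made up for illustration.
passage_sparse = {"myocardial": 2.7, "infarction": 2.4, "patient": 0.6}

def sparse_dot(a, b):
    """Dot product of two sparse term->weight maps."""
    return sum(w * b[t] for t, w in a.items() if t in b)

query_sparse = {"myocardial": 1.0, "infarction": 1.0}
print(sparse_dot(query_sparse, passage_sparse))  # exact rare-term overlap scores high
```

Hybrid search typically computes both a dense score and a sparse score, then merges the two rankings, for example with reciprocal rank fusion or a weighted sum.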

How do you choose embedding dimensions for a RAG system?

0:34
Higher dimensions improve retrieval quality but cost more in storage, memory, and search latency. A 1536-dimensional vector takes six kilobytes per item in float32, so a million items consumes six gigabytes before index overhead. Most teams start with a strong small model at 768 or 1024 dimensions, measure recall on a domain evaluation set, and only move to 3072 dimensions if the smaller model misses queries. Matryoshka embeddings let you truncate a longer vector to fewer dimensions with minimal quality loss, giving flexibility without re-embedding.
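The storage arithmetic above, as a quick back-of-the-envelope helper:

```python
def storage_bytes(num_vectors, dims, bytes_per_float=4):
    """Raw vector storage in float32, before any index overhead."""
    return num_vectors * dims * bytes_per_float

print(storage_bytes(1, 1536))                    # 6144 bytes, ~6 KB per item
print(storage_bytes(1_000_000, 1536) / 1e9)      # ~6.1 GB for a million items
print(storage_bytes(1_000_000, 3072) / 1e9)      # doubling dims doubles storage
```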

What are the most popular embedding models for production RAG in 2026?

0:28
Top hosted choices are OpenAI text-embedding-3-small and text-embedding-3-large, Cohere embed-v3 for multilingual workloads, and Voyage AI for high-recall English retrieval. Open-source leaders on the MTEB benchmark include the BGE family from BAAI, the E5 family from Microsoft, NV-Embed from NVIDIA, and the Stella and Jina models. For self-hosting on a budget, sentence-transformers like all-MiniLM-L6-v2 remain widely used despite being older. Choice depends on language coverage, dimension cost, and latency targets.

What is the difference between OpenAI text-embedding-3-small and text-embedding-3-large?

0:34
text-embedding-3-small produces 1536-dimensional vectors and costs roughly six times less than text-embedding-3-large per token. text-embedding-3-large produces 3072-dimensional vectors with measurably higher accuracy on the MTEB benchmark, scoring around 64.6 versus 62.3. Both support Matryoshka truncation, so you can shrink the large model to 256 or 1024 dimensions and still beat the small model at the same size. Choose small for cost-sensitive production at scale, and large when retrieval quality is the bottleneck and storage cost is acceptable.

What is Matryoshka representation learning and why does it matter?

0:32
Matryoshka representation learning trains an embedding model so that the first N dimensions of its output remain useful when the rest are discarded. A 3072-dimensional vector can be truncated to 1024, 512, or even 256 dimensions with graceful quality loss, instead of needing separate models for each size. OpenAI text-embedding-3 and many open-source models support Matryoshka, letting you store full vectors for high-recall reranking and short vectors for fast first-stage retrieval. The technique was introduced by Kusupati and colleagues in 2022.
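The truncation itself is trivial; the one easy-to-miss step is renormalizing after cutting, as in this sketch (valid only for Matryoshka-trained models, and shown with a toy 8-dim vector):

```python
import math

def truncate_matryoshka(vec, dims):
    """Keep the first `dims` dimensions and renormalize to unit length.

    Only meaningful for Matryoshka-trained embeddings, where the leading
    dimensions carry most of the information.
    """
    head = vec[:dims]
    n = math.sqrt(sum(x * x for x in head))
    return [x / n for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.01, 0.01, 0.01, 0.01]  # toy stand-in vector
short = truncate_matryoshka(full, 4)
print(len(short))  # 4
print(math.sqrt(sum(x * x for x in short)))  # 1.0 after renormalization
```

This is how a system can store one full-length vector per document yet serve fast first-stage retrieval from a truncated copy.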

How do you evaluate embedding model quality for retrieval?

0:29
Build a domain-specific evaluation set of query-document pairs where you know which documents are relevant, then measure recall at k, precision at k, and normalized discounted cumulative gain. Public benchmarks like MTEB and BEIR give a general signal but rarely match your domain, so a few hundred hand-labeled examples from your real traffic beat any leaderboard score. Always test the full pipeline, since chunking strategy, reranking, and metadata filters can change which embedding model wins. A good model on news data may underperform on legal or medical text.
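Recall at k, the most common of these metrics, reduces to a few lines; the document ids below are hypothetical:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical evaluation case: system ranking vs hand-labeled relevant ids.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d7"}
print(recall_at_k(retrieved, relevant, 3))  # 1.0: both relevant docs in top 3
print(recall_at_k(retrieved, relevant, 1))  # 0.0: top result is not relevant
```

Averaging this over a few hundred labeled queries gives the per-model score to compare.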

What is the MTEB benchmark and what does it measure?

0:30
The Massive Text Embedding Benchmark, or MTEB, is a public leaderboard that evaluates embedding models across more than 50 tasks in eight categories including retrieval, classification, clustering, and semantic similarity. It was introduced by Muennighoff and colleagues in 2022 and is hosted on Hugging Face. MTEB lets practitioners compare models using a single composite score, but high ranking does not guarantee strong performance on a specific domain. Treat MTEB as a starting filter, not a final answer for production model selection.

When should you fine-tune an embedding model versus use an off-the-shelf one?

0:31
Fine-tune when off-the-shelf models miss domain terminology, when you have at least a few thousand query-document pairs for training, and when retrieval quality is the bottleneck of your system. Domains like legal, medical, code, and scientific literature often benefit because pretraining corpora underweight these vocabularies. Use off-the-shelf models when you lack training data, when your domain matches general web text, or when iteration speed matters more than the last few percent of recall. Fine-tuning adds operational cost, since you must retrain when corpus distribution shifts.

What is the difference between symmetric and asymmetric embedding tasks?

0:30
Symmetric tasks compare two pieces of text of similar style and length, like duplicate question detection or sentence similarity. Asymmetric tasks compare a short query against a longer document, which is the standard RAG retrieval pattern. Models trained for symmetric similarity may underperform on asymmetric retrieval because queries and passages have different distributions, so many embedding models use separate query and passage encoders, or special prefixes such as "query: " and "passage: " prepended to the input. Always check which mode your model expects before deployment.
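For example, E5-family models expect literal "query: " and "passage: " prefixes on their inputs. A small formatting helper like the one below covers that convention; other models use instruction strings or no prefix at all, so always check the model card:

```python
def format_for_e5(text, is_query):
    """Prepend the prefix E5-family embedding models expect.

    The prefix lets one encoder handle short queries and long passages
    differently. This convention is specific to E5; other model families
    differ.
    """
    prefix = "query: " if is_query else "passage: "
    return prefix + text

print(format_for_e5("side effects of aspirin", is_query=True))
print(format_for_e5("Aspirin is a nonsteroidal anti-inflammatory drug...", is_query=False))
```

Forgetting the prefix is a common silent failure: retrieval still runs, just with noticeably worse recall.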