LLM Engineer Interview Questions: Embeddings, Vector Search, and Cosine Similarity Explained


15 audio · 5:03

Nortren

What are embeddings in the context of LLMs?

0:17
Embeddings are dense numeric vectors that represent text in a way that preserves semantic meaning. Texts with similar meanings produce vectors that are close in the high-dimensional embedding space. Embeddings are the foundation of semantic search, retrieval-augmented generation, clustering, classification, and recommendation systems.

What is the difference between sparse and dense embeddings?

0:23
Sparse embeddings, such as TF-IDF or BM25-weighted term vectors, have one dimension per vocabulary word, and most values are zero. They capture exact word matches well. Dense embeddings have a few hundred to a few thousand dimensions, all nonzero, and capture semantic meaning rather than exact words. Dense embeddings handle paraphrasing and synonyms, while sparse embeddings handle precise terminology.
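A toy sketch of the contrast (the vocabulary is made up, and the "dense embedding" is just random values standing in for a real model's output):

```python
import numpy as np

# Toy vocabulary of 10 words; a sparse count vector has one slot per word.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "big", "red"]

def sparse_count_vector(text):
    """One dimension per vocabulary word; most entries stay zero."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

sparse = sparse_count_vector("the cat sat on the mat")
print("sparse:", sparse)  # nonzero only for words that actually occur

# A dense embedding (illustrative random values here) has far fewer
# dimensions, and essentially every component is nonzero.
rng = np.random.default_rng(0)
dense = rng.normal(size=8)
print("dense:", dense)
```

A real vocabulary has tens of thousands of dimensions, which is why sparse vectors are stored in compressed form while dense vectors are stored as plain arrays.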

What is cosine similarity and how is it used in vector search?

0:18
Cosine similarity measures the angle between two vectors, ranging from negative one to positive one, where one means identical direction. It ignores vector magnitude and focuses only on direction, which is useful when embeddings differ in magnitude or are not normalized. It is the most common similarity metric in semantic search and RAG systems.
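In NumPy the definition is a one-liner, and the magnitude-invariance is easy to check:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))   # 1.0: same direction, magnitude ignored
print(cosine_similarity(a, -a))      # -1.0: opposite direction
```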

What is the difference between cosine similarity, dot product, and Euclidean distance?

0:21
Cosine similarity measures the angle between vectors and ignores magnitude. Dot product accounts for both angle and magnitude. Euclidean distance measures the straight-line distance in vector space. For normalized embeddings, cosine similarity and dot product are equivalent. The choice depends on whether magnitude is meaningful for your embeddings; most modern embedding models are designed to be used with cosine similarity.
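A small numeric comparison of the three metrics, including the equivalence of cosine and dot product after L2-normalization:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice the magnitude

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
euc = np.linalg.norm(a - b)
print(cos, dot, euc)       # cosine is 1.0 even though dot and distance differ

# After L2-normalization, cosine similarity and dot product coincide.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(an, bn), cos)
```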

What is embedding dimensionality and how does it affect performance?

0:19
Embedding dimensionality is the number of components in each embedding vector, typically ranging from 384 to 4096 for modern models. Higher dimensions can capture more nuance but increase storage cost and search latency. Many production systems use dimension reduction techniques like Matryoshka embeddings to allow flexible tradeoffs between quality and cost.

What are Matryoshka embeddings?

0:21
Matryoshka embeddings are trained so that prefixes of the full embedding vector are themselves usable as lower-dimensional embeddings. For example, the first 256 dimensions of a 1536-dimensional Matryoshka embedding are still meaningful, allowing applications to choose dimensionality at query time. This enables flexible tradeoffs between accuracy, storage, and latency without retraining.
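The mechanics at query time are just truncate-and-renormalize (the vector here is random, standing in for a real Matryoshka embedding):

```python
import numpy as np

rng = np.random.default_rng(42)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)          # pretend this is a Matryoshka embedding

def truncate(vec, dim):
    """Keep the first `dim` components and re-normalize for cosine search."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

short = truncate(full, 256)
print(short.shape)                    # (256,) -- 1/6 the storage and compute
```

With a genuinely Matryoshka-trained model, the truncated vector preserves most of the retrieval quality; with an ordinary model, truncating like this degrades results badly.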

How do you choose an embedding model for a RAG system?

0:20
Choose an embedding model based on language support, domain, dimension, cost, and licensing. Evaluate candidates on your actual data using metrics like recall at K and mean reciprocal rank. The MTEB leaderboard provides standardized benchmarks across many tasks. For production, also consider inference speed, batch efficiency, and whether you can self-host the model.
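The two evaluation metrics mentioned are straightforward to compute yourself (the document IDs below are hypothetical):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

# One query: the relevant doc "d2" is retrieved at rank 2.
print(recall_at_k(["d1", "d2", "d3"], {"d2"}, k=2))   # 1.0
print(mrr([(["d1", "d2", "d3"], {"d2"})]))            # 0.5
```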

What is the MTEB benchmark?

0:24
MTEB stands for Massive Text Embedding Benchmark. It is a standardized evaluation suite covering many languages and dozens of tasks, including retrieval, classification, clustering, and semantic similarity. The MTEB leaderboard is the main reference for comparing embedding models in 2026, with both proprietary models from OpenAI, Cohere, and Voyage and open models from BAAI, Jina, and Nomic.

What is a vector database?

0:22
A vector database is a specialized database designed to store, index, and search high-dimensional vectors efficiently. Unlike traditional databases that use B-tree or hash indices for exact lookups, vector databases use approximate nearest neighbor algorithms to find similar vectors at scale. Examples include Pinecone, Weaviate, Milvus, Qdrant, Chroma, and the pgvector extension for PostgreSQL.
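The baseline a vector database improves on is exact brute-force search, sketched here with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # normalize rows

def exact_search(query, vectors, k=5):
    """Exact top-k by cosine similarity: compare against every vector."""
    q = query / np.linalg.norm(query)
    sims = vectors @ q                 # dot product == cosine when normalized
    return np.argsort(-sims)[:k]       # indices of the k most similar docs

query = rng.normal(size=64)
print(exact_search(query, docs))
```

This linear scan is fine for thousands of vectors; the ANN indices below exist because it stops being fine at millions.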

What is approximate nearest neighbor search?

0:22
Approximate nearest neighbor search, or ANN, finds vectors close to a query vector without examining every vector in the database. ANN trades a small amount of accuracy for dramatic speedups, often 100 to 1000 times faster than brute-force exact search. ANN is essential for vector databases at scale because exact nearest neighbor search over millions of high-dimensional vectors is too slow for interactive queries.

What is HNSW and how does it work?

0:21
HNSW stands for Hierarchical Navigable Small World. It is a graph-based ANN algorithm that builds a multi-layer graph where higher layers have fewer nodes and longer connections. Search starts at the top layer and descends, navigating to closer and closer neighbors. HNSW offers excellent recall and low latency and is used by most production vector databases including Qdrant, Weaviate, and Milvus.
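The core navigation idea can be sketched as greedy search over a single-layer neighbor graph; real HNSW stacks several such layers, with sparse upper layers providing long-range entry points (the data here is random):

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(200, 16))

# Build a navigable graph: connect each node to its M nearest neighbors.
# (Real HNSW adds hierarchy and long-range links on top of this.)
M = 8
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
neighbors = np.argsort(dists, axis=1)[:, 1:M + 1]   # skip self at position 0

def greedy_search(query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_d = np.linalg.norm(points[current] - query)
    while True:
        best, best_d = current, current_d
        for n in neighbors[current]:
            d = np.linalg.norm(points[n] - query)
            if d < best_d:
                best, best_d = n, d
        if best == current:            # local minimum: no neighbor is closer
            return current, current_d
        current, current_d = best, best_d

query = rng.normal(size=16)
idx, d = greedy_search(query)
```

Greedy descent can stop at a local minimum, which is exactly why HNSW adds the hierarchical layers and a beam-search parameter (ef) to boost recall.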

What is IVF indexing and when would you use it?

0:19
IVF stands for Inverted File index. It clusters vectors into partitions and only searches the most relevant partitions for each query. IVF is faster than HNSW for very large datasets and uses less memory but typically has slightly lower recall. It is often combined with product quantization, called IVF-PQ, for billion-scale vector search.
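A minimal IVF sketch: cluster the vectors with a few Lloyd iterations, then probe only the nearest partitions (random stand-in data; a real system would use a trained quantizer):

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(500, 32))

# Partition the vectors with a few iterations of Lloyd's k-means.
n_clusters = 10
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
for _ in range(5):
    assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids, axis=-1), axis=1)
    for c in range(n_clusters):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
# Final assignment against the final centroids.
assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids, axis=-1), axis=1)

def ivf_search(query, nprobe=2):
    """Search only the nprobe partitions whose centroids are nearest."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assign, order))[0]
    d = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argmin(d)]

query = rng.normal(size=32)
approx = ivf_search(query, nprobe=2)
exact = np.argmin(np.linalg.norm(vectors - query, axis=1))
# With nprobe == n_clusters the search degenerates to exact brute force.
assert ivf_search(query, nprobe=n_clusters) == exact
```

The nprobe knob is the recall/latency tradeoff: more probed partitions means better recall and slower queries.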

What is vector quantization in vector databases?

0:20
Vector quantization compresses embeddings by mapping them to a small set of representative vectors, dramatically reducing memory usage and search cost. Product quantization splits vectors into subvectors and quantizes each independently. Scalar quantization reduces precision from float32 to int8 or even binary. Quantization can shrink memory by 4 to 32 times with modest accuracy loss.
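Scalar quantization is the simplest case to show concretely; here float32 values are mapped onto 256 int8 levels for a 4x memory saving:

```python
import numpy as np

rng = np.random.default_rng(3)
vec = rng.normal(size=256).astype(np.float32)

# Scalar quantization: one shared scale maps float32 onto int8 levels.
scale = np.abs(vec).max() / 127.0
q = np.round(vec / scale).astype(np.int8)      # 4x smaller than float32
restored = q.astype(np.float32) * scale

print("max error:", np.abs(vec - restored).max())
print("memory: %d -> %d bytes" % (vec.nbytes, q.nbytes))
```

Product quantization goes further by splitting the vector into subvectors and replacing each with a learned codebook index, which is how billion-scale indices fit in RAM.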

How do you handle multilingual embeddings?

0:20
For multilingual embeddings, choose a model trained on multiple languages, such as multilingual-e5, BGE-M3, or Cohere multilingual embed. These models map text from different languages into a shared embedding space, so a query in one language can retrieve documents in another. Test on your actual languages, since quality varies significantly even within multilingual models.

What is embedding drift and how do you handle it?

0:16
Embedding drift happens when the embedding model is updated, producing different vectors for the same text. This breaks any system that mixes old and new embeddings, since they live in different vector spaces. The standard solution is to re-embed all documents whenever you change the embedding model, treating it as a versioned migration.
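A minimal sketch of the versioned-migration idea: tag every stored vector with the model version that produced it, and never compare across versions (the store, version strings, and vectors here are all hypothetical):

```python
store = []  # (model_version, doc_id, vector)

def add(version, doc_id, vector):
    store.append((version, doc_id, vector))

def search(version, query):
    """Compare only against vectors from the same model version;
    old and new embeddings live in different vector spaces."""
    candidates = [(doc_id, vec) for ver, doc_id, vec in store if ver == version]
    if not candidates:
        raise ValueError(f"no vectors for model version {version}; re-embed first")
    return max(candidates, key=lambda c: sum(x * y for x, y in zip(c[1], query)))[0]

add("embed-v1", "doc-a", [1.0, 0.0])
add("embed-v2", "doc-a", [0.0, 1.0])   # same text, new model, different space
print(search("embed-v2", [0.1, 0.9]))  # only v2 vectors are considered
```

In practice the version tag becomes a filter in the vector database, and cutover happens only once the full corpus has been re-embedded under the new version.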