RAG & Vector DB Interview: HNSW, IVF, Product Quantization, ANN Search Explained

Explore sophisticated search techniques such as HNSW, IVF, and hybrid search. Understanding these will enhance your ability to implement efficient retrieval systems in real-world applications.

Nortren

What is approximate nearest neighbor (ANN) search and why is it used?

0:29
Approximate nearest neighbor search, or ANN, finds vectors close to a query vector without guaranteeing exact top-k results, trading a small amount of recall for massive speedups. Exact search requires comparing the query to every vector in the index, which is linear in dataset size and becomes impractical beyond a few hundred thousand vectors. ANN algorithms like HNSW, IVF, and DiskANN build index structures that achieve logarithmic or sublinear query time while holding 95 to 99 percent recall in production settings.
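The exact baseline that ANN algorithms avoid can be shown in a few lines. This is an illustrative pure-Python sketch (the function name `exact_knn` is my own, not from any library); it makes the O(n·d) cost visible, since every query touches every vector.

```python
import math

def exact_knn(query, vectors, k):
    """Exact top-k by brute force: compare the query to all n vectors, O(n * d).
    This is the linear scan that ANN indexes exist to avoid."""
    return sorted(range(len(vectors)),
                  key=lambda i: math.dist(query, vectors[i]))[:k]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [5.0, 5.0]]
print(exact_knn([0.0, 0.1], vectors, k=2))  # → [0, 2]
```

At a few hundred vectors this is instant; at a hundred million it is not, which is the gap HNSW, IVF, and DiskANN close.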

What is HNSW and how does it work?

0:28
Hierarchical Navigable Small World, or HNSW, is a graph-based ANN algorithm that organizes vectors into a multi-layer graph where each layer contains a progressively smaller random subset of points. Search starts at the top layer with few nodes, greedily walks toward the query, and descends to denser layers, narrowing the search region at each step. The hierarchy gives logarithmic search time while the small-world connections ensure the greedy walk converges quickly. HNSW is the default index in Qdrant, Weaviate, Milvus, and pgvector.
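The greedy descent described above can be sketched on a toy layered graph. This is a simplified illustration, not a real HNSW implementation: each layer is a hand-built adjacency dict, and the walk moves to any neighbor closer to the query until none improves, then drops to the next denser layer.

```python
import math

def dist(a, b):
    return math.dist(a, b)

def greedy_search(layers, points, query, entry):
    """Greedy descent through HNSW-style layers: at each layer, walk to the
    neighbor closest to the query until no neighbor improves, then descend."""
    current = entry
    for graph in layers:  # top (sparse) layer first, bottom (dense) layer last
        improved = True
        while improved:
            improved = False
            for nb in graph.get(current, []):
                if dist(points[nb], query) < dist(points[current], query):
                    current = nb
                    improved = True
    return current

points = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (4.0, 0.0), 3: (5.0, 0.0)}
layers = [
    {0: [2], 2: [0]},                       # sparse top layer: long-range hops
    {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # dense bottom layer: short hops
]
print(greedy_search(layers, points, (5.0, 0.1), entry=0))  # → 3
```

The top layer covers most of the distance in one hop; the bottom layer refines the answer locally, which is where the logarithmic search time comes from.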

What do the HNSW parameters M, ef_construction, and ef_search control?

0:29
M controls the maximum number of connections per node, typically 16 to 64, where higher M improves recall but increases memory and index build time. ef_construction controls how many candidate neighbors are considered when building each node's connections, with higher values producing a better graph at the cost of longer indexing. ef_search is the query-time parameter that sets how many candidates the search keeps in its priority queue, directly trading latency for recall. You can tune ef_search per query without rebuilding the index.
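The ef mechanism can be illustrated with a best-first search that keeps at most `ef` candidates. This is a hedged pure-Python sketch of the idea (the function `ef_search` and the toy graph are my own, not a library API): with a small ef the search can get stuck on a locally good node, while a larger ef explores more of the graph and finds the true nearest neighbor.

```python
import heapq, math

def ef_search(graph, points, query, entry, ef, k=1):
    """Best-first search with a bounded candidate set of size ef: larger
    ef explores more of the graph, trading latency for recall."""
    def d(i):
        return math.dist(points[i], query)
    visited = {entry}
    frontier = [(d(entry), entry)]     # min-heap of nodes still to expand
    results = [(-d(entry), entry)]     # max-heap keeping the ef best seen
    while frontier:
        dist_c, node = heapq.heappop(frontier)
        if dist_c > -results[0][0] and len(results) >= ef:
            break  # nothing left in the frontier can improve the result set
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                if len(results) < ef or d(nb) < -results[0][0]:
                    heapq.heappush(frontier, (d(nb), nb))
                    heapq.heappush(results, (-d(nb), nb))
                    if len(results) > ef:
                        heapq.heappop(results)  # drop the worst of ef + 1
    return sorted((-nd, n) for nd, n in results)[:k]

# 1-D points with a deliberate trap: the edge 0 -> 1 looks best immediately,
# but the true nearest neighbor (node 3) is only reachable via node 2.
points = {0: (0.0,), 1: (10.0,), 2: (4.0,), 3: (9.5,)}
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(ef_search(graph, points, (9.0,), entry=0, ef=1))  # small ef stops at node 1
print(ef_search(graph, points, (9.0,), entry=0, ef=3))  # larger ef reaches node 3
```

This mirrors why ef_search is the cheap knob: it changes only query-time exploration, while M and ef_construction change the graph itself and require a rebuild.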

What is an IVF index and when should you use it?

0:27
Inverted File Index, or IVF, partitions vectors into clusters using k-means, then searches only the few clusters closest to the query. The two key parameters are nlist, the number of clusters, typically the square root of the dataset size, and nprobe, the number of clusters to search at query time. IVF uses less memory than HNSW and indexes faster on billion-scale datasets, but HNSW usually beats it on recall at equivalent latency. Use IVF when memory is tight or when you need fast batch indexing.
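A minimal IVF can be sketched in pure Python, assuming a small k-means and brute-force scans inside each probed cell (function names `build_ivf` and `ivf_search` are my own, not from FAISS or any library):

```python
import math, random

def build_ivf(vectors, nlist, iters=5, seed=0):
    """Toy IVF build: k-means into nlist cells; each cell stores the ids
    of the vectors assigned to it (the 'inverted file')."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iters):
        cells = [[] for _ in range(nlist)]
        for i, v in enumerate(vectors):
            nearest = min(range(nlist), key=lambda c: math.dist(v, centroids[c]))
            cells[nearest].append(i)
        for c, members in enumerate(cells):
            if members:  # recompute each centroid as the mean of its cell
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, nprobe, k):
    """Probe only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

vectors = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.9),
           (10.0, 10.0), (10.5, 9.8), (9.9, 10.2)]
centroids, cells = build_ivf(vectors, nlist=2)
print(ivf_search((10.0, 10.0), vectors, centroids, cells, nprobe=1, k=2))
```

Note that setting nprobe equal to nlist degenerates to an exact scan, which is a handy way to sanity-check recall loss from probing fewer cells.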

What is product quantization (PQ) in vector search?

0:30
Product Quantization, or PQ, compresses vectors by splitting each into M subvectors and encoding each subvector with a short code from a learned codebook, typically 8 bits per subvector. A 1024-dimensional float32 vector shrinks from 4096 bytes to around 64 bytes, roughly a 64x reduction. Distance computations use precomputed lookup tables over the codebooks, making search fast even with compressed storage. PQ trades some recall for massive memory savings, making billion-vector indexes fit in a single machine's RAM.
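The mechanics of PQ, including subvector codebooks and the distance lookup tables, can be sketched as follows. This is a deliberately simplified illustration: real PQ trains each codebook with k-means, whereas this sketch just samples training subvectors as codewords (the names `train_pq`, `pq_encode`, and `pq_adc` are my own).

```python
import math, random

def train_pq(train, m, ksub, seed=0):
    """Toy PQ trainer: split vectors into m subvectors; the codebook for each
    subspace is ksub sampled subvectors (real PQ runs k-means per subspace)."""
    rng = random.Random(seed)
    dsub = len(train[0]) // m
    return [rng.sample([v[j * dsub:(j + 1) * dsub] for v in train], ksub)
            for j in range(m)]

def pq_encode(v, books):
    """A vector becomes m small codeword ids (1 byte each when ksub <= 256)."""
    dsub = len(v) // len(books)
    return [min(range(len(book)),
                key=lambda c: math.dist(v[j * dsub:(j + 1) * dsub], book[c]))
            for j, book in enumerate(books)]

def pq_adc(query, code, books):
    """Asymmetric distance: precompute, per subspace, the squared distance from
    the query subvector to every codeword; scoring a code is then m lookups."""
    dsub = len(query) // len(books)
    tables = [[math.dist(query[j * dsub:(j + 1) * dsub], w) ** 2 for w in book]
              for j, book in enumerate(books)]
    return math.sqrt(sum(tables[j][c] for j, c in enumerate(code)))

train = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 1.0, 0.0)]
books = train_pq(train, m=2, ksub=2)
code = pq_encode(train[0], books)
print(code, pq_adc(train[0], code, books))  # exact codewords here, so distance 0.0
```

The lookup tables are the key trick: once built for a query, scoring each compressed vector costs m table reads and additions rather than a full d-dimensional distance.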

What is scalar quantization and how does it differ from product quantization?

0:30
Scalar quantization compresses each dimension of a vector independently, typically from 32-bit float to 8-bit integer, giving a 4x memory reduction with minimal recall loss. Product quantization compresses groups of dimensions jointly using a learned codebook, achieving 16x to 64x compression but with higher recall loss and more complex training. Scalar quantization is simpler, requires no training, and is a safe default in Qdrant and Milvus. Use product quantization only when memory pressure is extreme and some recall sacrifice is acceptable.
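Scalar quantization is simple enough to show end to end. This illustrative sketch (function names are my own) maps each dimension's observed min/max range onto the 256 levels of a byte, which is exactly the 4x reduction from float32:

```python
def sq_train(vectors):
    """Learn per-dimension min/max ranges from the data (no codebook needed)."""
    dims = list(zip(*vectors))
    return [min(d) for d in dims], [max(d) for d in dims]

def sq_encode(v, lo, hi):
    """float32 -> uint8: map each dimension's range onto 0..255, 4x smaller."""
    return bytes(round(255 * (x - l) / (h - l)) if h > l else 0
                 for x, l, h in zip(v, lo, hi))

def sq_decode(code, lo, hi):
    """Recover an approximation; the per-dimension error is at most half a step."""
    return [l + (h - l) * b / 255 for b, l, h in zip(code, lo, hi)]

vectors = [[0.0, -1.0], [1.0, 1.0], [0.5, 0.0]]
lo, hi = sq_train(vectors)
code = sq_encode(vectors[2], lo, hi)
print(len(code), sq_decode(code, lo, hi))  # 2 bytes instead of 8 for two float32s
```

Unlike PQ there is no codebook to learn, only per-dimension ranges, which is why it is treated as a safe default.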

What is binary quantization and when does it work well?

0:31
Binary quantization reduces each vector dimension to a single bit, storing only the sign of each float, giving a 32x memory reduction and enabling distance computation via fast bitwise XOR and population count. It works well when embeddings have high dimensionality and well-distributed values, since random projections preserve relative distances in Hamming space. Binary quantization pairs with a rescoring step that re-ranks top candidates using full precision vectors. Modern embedding models from OpenAI and Cohere are explicitly designed to support binary quantization.
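The sign-bit trick and the XOR-plus-popcount distance can be shown in a few lines (an illustrative sketch; the helper names are my own):

```python
def binarize(v):
    """One bit per dimension: 1 where the value is positive, packed into an int."""
    bits = 0
    for x in v:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """XOR then population count: differing sign bits approximate distance."""
    return bin(a ^ b).count("1")

query     = [0.9, -0.1, 0.4, -0.7]
similar   = [1.0, -0.2, 0.1, -0.9]   # same sign pattern as the query
different = [-1.0, 0.5, -0.3, 0.8]   # opposite sign pattern
print(hamming(binarize(query), binarize(similar)))    # → 0
print(hamming(binarize(query), binarize(different)))  # → 4
```

In practice the Hamming scan only produces a candidate list; the rescoring step mentioned above then re-ranks those candidates with the full-precision vectors.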

What is the difference between HNSW and IVF for vector search?

0:27
HNSW builds a layered graph for logarithmic-time greedy search, giving excellent recall and low latency but high memory use, typically 1.5x to 2x the raw vector size. IVF partitions vectors into clusters and searches only the nearest ones, using less memory and indexing faster but with lower recall at the same latency. HNSW dominates for small-to-medium datasets up to a few hundred million vectors, while IVF combined with product quantization wins at billion-plus scale where memory becomes the hard constraint.

What is a flat index and when should you use exact search?

0:28
A flat index stores vectors without any approximation structure and compares the query against every vector, guaranteeing exact top-k results. It is the baseline for correctness and the only choice when recall must be 100 percent, such as for deduplication, legal discovery, or benchmarking an ANN index against ground truth. Flat search is practical up to a few hundred thousand vectors on CPU or a few million on GPU. Beyond that, linear scan latency becomes unacceptable and you must switch to HNSW or IVF with quantization.

What are the trade-offs between recall, latency, and memory in ANN indexes?

0:31
The three metrics form a triangle: you cannot optimize all three at once. Higher recall requires considering more candidates, which costs latency, or denser graph connections, which cost memory. Lower latency requires aggressive pruning or quantization, which costs recall. Lower memory requires compression like product quantization, which also costs recall. Production systems typically target 95 to 99 percent recall under 50 millisecond latency, then tune memory to fit hardware. The ann-benchmarks project makes these trade-offs explicit across algorithms.

What is DiskANN and when should you use disk-based vector search?

0:27
DiskANN is a graph-based ANN algorithm from Microsoft designed to serve billion-scale vector indexes from SSD instead of RAM, reducing memory cost by 10x or more. It uses a modified graph structure called Vamana that keeps good recall with lower connectivity than HNSW, and a compressed in-memory quantized copy for initial routing before fetching full vectors from disk. Use DiskANN when your dataset has billions of vectors and you cannot afford to keep them all in memory. Turbopuffer and Azure use DiskANN-style designs.

How do you tune HNSW for higher recall versus lower latency?

0:29
Increase ef_search at query time to raise recall, since a larger priority queue explores more of the graph before returning results, at the cost of proportionally higher latency. Increase M and ef_construction at build time for a better underlying graph, which lifts the ceiling on recall but requires reindexing. Typical production settings use M of 16 to 32, ef_construction of 100 to 200, and ef_search between 40 and 200 tuned per workload. Measure recall against a flat-index ground truth before deploying the tuned values.
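The final measurement step is a one-liner worth writing down. Recall@k is simply the overlap between the ANN result and the flat-index ground truth (the ids below are made-up illustration data):

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the exact top-k that the tuned ANN index also returned."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

# ANN returned ids [7, 2, 9, 4]; flat-index ground truth is [7, 9, 2, 5]
print(recall_at_k([7, 2, 9, 4], [7, 9, 2, 5], k=4))  # → 0.75
```

Averaging this over a few hundred held-out queries gives the recall number to compare against the 95 to 99 percent production target while sweeping ef_search.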