RAG & Vector DB Interview: Hybrid Search, BM25, Rerankers, ColBERT, RRF Explained

Explore sophisticated search techniques such as HNSW, IVF, and hybrid search methods. Understanding these will enhance your ability to implement efficient retrieval systems in real-world applications.

12 audio · 5:52

Nortren

What is BM25 and why is it still used in modern RAG systems?

0:33
BM25, or Best Matching 25, is a probabilistic ranking function from the 1990s that scores documents by term frequency, inverse document frequency, and document length normalization. Despite its age, BM25 remains competitive because it handles exact keyword matches, rare terms, product names, and identifiers that dense embeddings often smooth over. Modern RAG systems combine BM25 with dense retrieval in hybrid search, letting each method cover the other's weaknesses. It is the default scoring method in Elasticsearch, OpenSearch, and most hybrid search implementations.
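The scoring described above can be sketched in a few lines. This is a toy version, assuming documents are pre-tokenized lists and using the Lucene-style smoothed IDF; real engines add many refinements:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Toy BM25: score one document against a query over a tiny corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                       # term frequency in this doc
        df = sum(1 for d in corpus if term in d)   # document frequency in corpus
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # Lucene-style smoothed IDF
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl) # length normalization
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    ["vector", "search", "with", "hnsw"],
    ["bm25", "ranks", "documents", "by", "keyword", "overlap"],
    ["dense", "embeddings", "capture", "semantics"],
]
scores = [bm25_score(["bm25", "keyword"], d, corpus) for d in corpus]
```

Only the second document contains the query terms, so it is the only one with a nonzero score; this exact-match behavior is precisely what dense embeddings tend to smooth over.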

What is hybrid search and how does it combine dense and sparse retrieval?

0:29
Hybrid search runs both a dense vector query and a sparse lexical query like BM25 in parallel, then fuses the two result lists into a single ranking. Dense retrieval catches semantic matches where users phrase queries differently from documents, while sparse retrieval catches exact keyword matches on names, identifiers, and rare terms. Fusion uses either weighted score combination or Reciprocal Rank Fusion, which combines ranks rather than raw scores. Hybrid search typically lifts recall 5 to 15 percent over either method alone.

What is Reciprocal Rank Fusion (RRF) and how does it work?

0:31
Reciprocal Rank Fusion, or RRF, combines multiple ranked result lists into one by summing the reciprocal of each document's rank in each list, with a constant k usually set to 60. A document ranked first in both lists scores 1 over 61 plus 1 over 61. RRF requires no score calibration between retrievers, which makes it robust when combining systems with wildly different score distributions like BM25 and cosine similarity. It was introduced by Cormack and colleagues in 2009 and is the default fusion method in Elasticsearch, Weaviate, and Qdrant.
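A minimal RRF implementation, assuming each input is a ranked list of document IDs with rank 1 first:

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) across ranked lists (ranks start at 1)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # ranked output of the dense retriever
sparse = ["d3", "d9", "d1"]  # ranked output of BM25
fused = rrf([dense, sparse])
```

Here `d3` is ranked first by both retrievers, so it scores 1/61 + 1/61 and tops the fused list. Note that the raw BM25 and cosine scores never appear: only ranks matter, which is why no calibration is needed.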

What is the difference between a bi-encoder and a cross-encoder?

0:30
A bi-encoder processes the query and document independently, producing two vectors that are compared with cosine similarity or dot product, which makes it fast enough to index millions of documents offline. A cross-encoder processes the query and document together through a single transformer pass, producing a direct relevance score that captures fine-grained interactions between terms. Cross-encoders are substantially more accurate but require a full model inference per pair, so they only work on a pre-filtered candidate set of dozens to hundreds of items, not millions.
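The two call patterns can be sketched as follows. `bi_encode` and `cross_encode` are random stand-ins for real models (e.g. a sentence-transformer and a cross-encoder); the point is only to show where the cost lands, not to produce meaningful scores:

```python
import numpy as np

rng = np.random.default_rng(0)

def bi_encode(text):
    """Stand-in for a bi-encoder: one vector per text, computable offline."""
    vec = rng.random(8)
    return vec / np.linalg.norm(vec)

def cross_encode(query, doc):
    """Stand-in for a cross-encoder: one joint forward pass per (query, doc) pair."""
    return rng.random()  # a real model returns a learned relevance score

# Bi-encoder: documents embedded once, then a single cheap matrix product at query time.
doc_vecs = np.stack([bi_encode(d) for d in ["doc one", "doc two", "doc three"]])
query_vec = bi_encode("my query")
bi_scores = doc_vecs @ query_vec

# Cross-encoder: one full inference per pair, so only viable on a short candidate list.
ce_scores = [cross_encode("my query", d) for d in ["doc one", "doc two"]]
```

The asymmetry is visible in the shapes: the bi-encoder amortizes all document work into the index, while the cross-encoder's cost grows with every pair it scores.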

What is a reranker and when should you use one in RAG?

0:28
A reranker is a model that takes the top-k results from a first-stage retriever, typically 50 to 200 candidates, and reorders them by true relevance to the query. Rerankers are usually cross-encoders that deliver much higher precision than the bi-encoder used in first-stage retrieval, at the cost of higher latency per candidate. Use a reranker when precision at the top-3 or top-5 matters, such as in RAG where only a few chunks fit into the language model prompt. Typical lift is 10 to 30 percent on retrieval metrics.
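The two-stage flow described above might look like this sketch, where `retrieve` and `score_pair` are hypothetical stand-ins for a first-stage retriever and a cross-encoder (here faked with word overlap, purely for illustration):

```python
def rerank_pipeline(query, retrieve, score_pair, first_k=100, final_k=5):
    """Two-stage retrieval: cheap high-recall first stage, precise rerank on top."""
    candidates = retrieve(query, k=first_k)             # bi-encoder / hybrid recall stage
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True) # reorder by true relevance
    return [doc for _, doc in scored[:final_k]]         # only these reach the LLM prompt

# Toy stand-ins: retrieval returns fixed docs, "relevance" is word overlap with the query.
docs = ["bm25 keyword search", "dense vector recall", "reranking boosts precision"]
retrieve = lambda q, k: docs[:k]
score_pair = lambda q, d: len(set(q.split()) & set(d.split()))
top = rerank_pipeline("reranking precision", retrieve, score_pair, final_k=1)
```

The structure is the whole point: the expensive scorer only ever sees `first_k` candidates, and only `final_k` chunks survive into the prompt.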

What is ColBERT and how does late interaction work?

0:29
ColBERT is a retrieval model by Khattab and Zaharia from 2020 that stores one embedding per token rather than one per document, enabling late interaction matching. At query time, every query token is compared to every document token, and the maximum similarity per query token is summed to score the document. This preserves fine-grained token-level matching while remaining scalable because document embeddings are precomputed. ColBERT models often match cross-encoder quality at bi-encoder speed, making late interaction attractive for large-scale retrieval.
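The MaxSim operation described above can be sketched with NumPy on toy token embeddings (the dimensions and values here are made up; real ColBERT embeddings come from a trained model):

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT late interaction: for each query token embedding, take the max
    cosine similarity over all document token embeddings, then sum."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                 # (num_query_tokens, num_doc_tokens) similarity matrix
    return sim.max(axis=1).sum()  # best document match per query token, summed

rng = np.random.default_rng(0)
query = rng.random((4, 16))   # 4 query tokens, toy 16-dim embeddings
doc = rng.random((30, 16))    # 30 precomputed document token embeddings
score = maxsim_score(query, doc)
```

The document-side embeddings are fixed at index time; only the small query-by-document similarity matrix is computed per query, which is what keeps late interaction scalable.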

What is SPLADE and how does it differ from BM25?

0:27
SPLADE, or Sparse Lexical and Expansion model, is a learned sparse retrieval model that produces sparse vectors where each dimension corresponds to a vocabulary term, like BM25, but with learned weights and query expansion via masked language modeling. SPLADE can add terms not present in the original text, such as synonyms or related concepts, closing the vocabulary gap that hurts BM25. It combines the interpretability and exact-match strengths of sparse retrieval with some of the semantic understanding that dense embeddings provide.
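The expansion effect can be illustrated with a toy example. The weights below are hand-picked stand-ins for what a SPLADE-style encoder might emit; the mechanics of scoring via a shared sparse vocabulary are the real point:

```python
def sparse_dot(query_vec, doc_vec):
    """Score = dot product over the shared vocabulary dimensions of two sparse vectors."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

# What a SPLADE-style encoder might emit for the document "laptop battery drains fast":
# learned per-term weights PLUS expansion terms ("notebook", "power") absent from the text.
doc_vec = {"laptop": 1.8, "battery": 2.1, "drains": 1.2, "fast": 0.9,
           "notebook": 0.7, "power": 0.6}

# BM25 sees no literal overlap between this query and the original document text...
bm25_overlap = {"notebook", "power", "issue"} & {"laptop", "battery", "drains", "fast"}

# ...but the expanded sparse vector still matches it through the same inverted index.
query_vec = {"notebook": 1.5, "power": 1.1, "issue": 0.8}
splade_score = sparse_dot(query_vec, doc_vec)
```

This is the vocabulary-gap closure in miniature: BM25 scores zero, while the learned expansion terms give the document a nonzero sparse score.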

When should you use a reranker versus hybrid search alone?

0:29
Hybrid search improves recall cheaply by combining two retrievers, so use it as a baseline for almost every production RAG system. Add a reranker on top when you need higher precision in the final top-k passed to the language model, especially when context window cost or generation latency forces you to pass only three to five chunks. Rerankers add 20 to 200 milliseconds of latency per query and require a GPU or managed service, so the cost-benefit depends on traffic volume. Using both together is the 2026 production default.

What are the most popular rerankers for RAG in 2026?

0:28
Top hosted choices are Cohere Rerank, Voyage AI rerankers, and Jina Reranker, which offer high quality with simple APIs and per-query pricing. Open-source leaders include the BGE reranker family from BAAI, mixedbread rerankers, and Jina cross-encoders, which can be self-hosted for cost control at scale. ColBERT-style late-interaction rerankers such as ColBERTv2 offer a middle ground of quality and speed. Choice depends on language coverage, latency target, and whether on-premises deployment is required.

How does reranker latency affect production RAG systems?

0:30
Reranker latency scales linearly with the number of candidates reranked, since cross-encoders score each query-document pair independently. Reranking 100 candidates through a base-sized cross-encoder takes 50 to 200 milliseconds on a GPU and several seconds on CPU, which often doubles end-to-end query latency. Strategies to manage this include limiting candidates to 50 or fewer, batching pairs in a single model call, distilling smaller rerankers, and caching reranked results for common queries. Managed reranker APIs also offer low-latency inference.
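The batching strategy can be sketched as follows, with `model_batch_score` as a hypothetical stand-in for a real cross-encoder's batched predict call:

```python
def rerank_batched(query, candidates, model_batch_score, batch_size=32):
    """Score (query, doc) pairs in batches: one model call per batch of pairs
    instead of one call per pair, amortizing per-call overhead."""
    scores = []
    for i in range(0, len(candidates), batch_size):
        batch = [(query, doc) for doc in candidates[i:i + batch_size]]
        scores.extend(model_batch_score(batch))
    return scores

candidates = [f"doc {i}" for i in range(100)]
calls = []
def model_batch_score(pairs):
    calls.append(len(pairs))        # record how many pairs each model call handles
    return [0.0] * len(pairs)       # a real cross-encoder returns relevance scores

scores = rerank_batched("q", candidates, model_batch_score)
# 100 candidates at batch_size=32 -> 4 model calls instead of 100
```

The total compute is still linear in the candidate count, which is why capping candidates at 50 or fewer remains the most effective lever; batching only reduces the per-call overhead around it.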

What is the difference between learned sparse and lexical sparse retrieval?

0:28
Lexical sparse retrieval like BM25 uses vocabulary terms as dimensions and hand-crafted statistical weights like term frequency and inverse document frequency, with no training required. Learned sparse retrieval like SPLADE or uniCOIL uses the same sparse vocabulary structure but learns weights from query-document pairs, often adding expansion terms not in the original text. Learned sparse matches exact keywords like lexical methods, while also closing vocabulary gaps like dense retrieval. It can be served through the same inverted-index infrastructure as BM25.

How do you choose weights for dense and sparse score fusion in hybrid search?

0:30
If you use Reciprocal Rank Fusion, no weight tuning is needed since RRF combines ranks rather than scores. For weighted score fusion, normalize each retriever's scores to a comparable range, then try alpha values from 0.2 to 0.8 on a labeled evaluation set to find the sweet spot. Dense weight tends to dominate for semantic queries and natural-language questions, while sparse weight helps on queries with proper nouns, product identifiers, or technical terms. Many systems tune weights per query intent using a classifier.
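Weighted fusion with min-max normalization can be sketched as follows (the example scores are made up, and `alpha` is the dense-side weight you would tune on a labeled set):

```python
def minmax(scores):
    """Normalize raw scores to [0, 1] so BM25 and cosine scores become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """fused = alpha * dense + (1 - alpha) * sparse, after per-retriever normalization."""
    dense_n, sparse_n = minmax(dense_scores), minmax(sparse_scores)
    docs = set(dense_n) | set(sparse_n)
    fused = {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

dense = {"d1": 0.91, "d2": 0.88, "d3": 0.40}   # cosine similarities
sparse = {"d2": 12.4, "d4": 9.1, "d1": 2.0}    # raw BM25 scores
ranking = weighted_fusion(dense, sparse, alpha=0.6)
```

Without the normalization step, the raw BM25 scores (here up to 12.4) would swamp the cosine similarities regardless of alpha, which is exactly the calibration problem RRF sidesteps by using ranks.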