RAG & Vector DB Interview: Advanced RAG, HyDE, Multi-Query, Self-Query, GraphRAG

This section covers advanced retrieval patterns for RAG, including HyDE, multi-query, self-query, GraphRAG, and related techniques. Grasping these patterns is vital for anyone looking to excel in retrieval-augmented generation interviews.

What is HyDE and how does it improve RAG retrieval?

HyDE, or Hypothetical Document Embeddings, is a technique where a language model first generates a hypothetical answer to the query, then that generated text is embedded and used as the retrieval query instead of the original question. This works because hypothetical answers tend to be semantically closer to real answer passages in embedding space than short questions are, improving retrieval on queries that differ stylistically from documents. HyDE was introduced by Gao and colleagues in 2022 and adds one extra language model call per query.
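A minimal sketch of the HyDE pipeline. The `generate_hypothetical` and `embed` functions are toy stand-ins for the extra LLM call and a real embedding model, not any particular library's API:

```python
import math

def generate_hypothetical(query: str) -> str:
    # Stand-in for the one extra LLM call, e.g. prompting
    # "Write a short passage that answers: {query}".
    return f"A short passage that answers the question: {query}"

def embed(text: str) -> dict:
    # Toy bag-of-words vector standing in for a dense embedding model.
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query: str, corpus: list, k: int = 2) -> list:
    # The key move: embed the hypothetical answer, not the raw question.
    qvec = embed(generate_hypothetical(query))
    return sorted(corpus, key=lambda d: cosine(qvec, embed(d)), reverse=True)[:k]
```

The only change from a standard pipeline is which text gets embedded as the query; indexing is untouched.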

What is multi-query retrieval and when should you use it?

Multi-query retrieval generates several paraphrased versions of the user's query using a language model, runs retrieval for each, and merges the results. It improves recall when the original query phrasing differs from document phrasing or when the question has multiple valid interpretations. The cost is additional language model calls and retrieval operations, typically three to five times the baseline. Use multi-query when retrieval misses relevant documents that exist in the corpus, verified by inspection of failed cases on an evaluation set.
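The merge step can be sketched as follows; `paraphrase` stands in for the LLM call that generates variants, and the overlap-based retriever is a toy substitute for a real vector search:

```python
def paraphrase(query: str, n: int = 3) -> list:
    # Stand-in for the LLM call that generates query variants.
    return [query, f"explain {query}", f"{query} overview"][:n]

def lexical_retrieve(query: str, corpus: list, k: int = 2) -> list:
    # Toy retriever: rank documents by word overlap with the query.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def multi_query_retrieve(query: str, corpus: list, k: int = 2) -> list:
    # Run retrieval per variant and merge, deduplicating documents.
    seen, merged = set(), []
    for variant in paraphrase(query):
        for doc in lexical_retrieve(variant, corpus, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```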

What is self-query retrieval and how does it work?

Self-query retrieval uses a language model to parse the user's natural-language query into a structured query that includes both a semantic search component and explicit metadata filters. For example, a query like "action movies from the 1990s" becomes a vector search on "action movies" with a filter on year between 1990 and 1999. This handles queries that mix semantic intent with structured constraints, which naive vector search cannot express. It requires a schema description of filterable metadata fields.
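In a real system the parse is an LLM call prompted with the metadata schema; in this toy sketch a regex for decades stands in, just to show the output shape (semantic part plus structured filters):

```python
import re

def self_query_parse(query: str) -> dict:
    # Stand-in for the LLM parsing step: extract a decade constraint
    # into a metadata filter, leaving the semantic part for vector search.
    filters = {}
    m = re.search(r"\b(19\d0|20\d0)s\b", query)
    if m:
        start = int(m.group(1))
        filters["year"] = (start, start + 9)
        query = re.sub(r"\s*from the (19\d0|20\d0)s\b", "", query)
    return {"semantic": query.strip(), "filters": filters}
```

The vector store then runs similarity search on `semantic` restricted to rows matching `filters`.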

What is GraphRAG and how does it differ from standard RAG?

GraphRAG, introduced by Microsoft Research in 2024, builds a knowledge graph from the corpus by extracting entities and relationships with a language model, then uses the graph structure for retrieval rather than or in addition to vector similarity. It answers questions that require reasoning across multiple documents, like "summarize the main themes of this corpus," which vector search handles poorly. GraphRAG is expensive to build because it requires many language model calls during ingest, but it enables answering global questions standard RAG cannot.
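A toy illustration of the graph index idea (real GraphRAG also builds community summaries over the graph; the triples below would normally be extracted by an LLM during ingest):

```python
from collections import defaultdict

# Hypothetical triples an LLM might extract from the corpus at ingest time.
triples = [
    ("Anthropic", "makes", "Claude"),
    ("Dario Amodei", "co-founded", "Anthropic"),
    ("Daniela Amodei", "co-founded", "Anthropic"),
]

graph = defaultdict(list)
for head, rel, tail in triples:
    graph[head].append((rel, tail))
    graph[tail].append((f"inverse of {rel}", head))

def neighbors(entity: str) -> list:
    # Graph retrieval: pull all facts connected to an entity, something
    # vector similarity over isolated chunks cannot guarantee to surface.
    return graph.get(entity, [])
```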

What is agentic RAG and how does it extend basic RAG?

Agentic RAG uses a language model as an autonomous agent that decides when to retrieve, what to query for, and whether retrieved results are sufficient, rather than running a fixed retrieve-then-generate pipeline. The agent can issue multiple targeted queries, use other tools like calculators or APIs, and iterate until it produces a complete answer. This handles complex multi-step questions but increases latency and cost due to multiple language model calls and the risk of loops. It is most valuable for research, analysis, and open-ended tasks.
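The control flow can be sketched as a loop with a hard step cap; `retrieve`, `decide`, and `generate` are injected stand-ins for tool and LLM calls, not a specific framework's API:

```python
def agent_answer(question: str, retrieve, decide, generate, max_steps: int = 4):
    # Agent loop: keep issuing targeted queries until the agent judges
    # the gathered evidence sufficient, or the step cap is hit.
    evidence = []
    for _ in range(max_steps):  # hard cap guards against infinite loops
        next_query = decide(question, evidence)
        if next_query is None:  # agent decides the evidence is sufficient
            break
        evidence.extend(retrieve(next_query))
    return generate(question, evidence)
```

The step cap is the practical answer to the loop risk mentioned above.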

What is query rewriting and why is it used in RAG?

Query rewriting uses a language model to reformulate the user's query before retrieval, improving recall on queries that are too short, ambiguous, or stylistically different from documents. Techniques include expansion to add synonyms, decomposition to split complex queries, and clarification using conversation history in chat applications. Query rewriting is particularly valuable in conversational RAG where pronouns and references to previous turns must be resolved before retrieval can find relevant context. It adds one language model call per query.
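The conversational case can be illustrated with a toy coreference fix; a real system would make an LLM call with the chat history, whereas this stand-in only resolves "it" against the last mentioned topic:

```python
def rewrite_query(query: str, history: list) -> str:
    # Toy stand-in for LLM query rewriting: resolve the pronoun "it"
    # against the most recent topic in the conversation history.
    padded = f" {query.lower()} "
    if history and " it " in padded:
        topic = history[-1]
        return padded.replace(" it ", f" {topic} ").strip()
    return query
```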

What is multi-hop retrieval in RAG?

Multi-hop retrieval handles questions whose answer requires combining information from multiple documents that are not individually sufficient. The system retrieves an initial set of documents, uses them to refine the query or extract intermediate facts, then retrieves again based on the refined understanding. For example, answering "who founded the company that makes Claude" requires first retrieving that Anthropic makes Claude, then retrieving that Dario and Daniela Amodei founded Anthropic. Multi-hop patterns are central to agentic RAG and research-oriented applications.
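A two-hop sketch of that Claude example; `retrieve` and `extract_bridge` are injected stand-ins (the bridge extraction would be an LLM call over the first-hop passages):

```python
def multi_hop(question: str, retrieve, extract_bridge):
    # Hop 1: retrieve on the original question.
    first = retrieve(question)
    # Extract the intermediate fact, e.g. "Anthropic" (an LLM call in practice).
    bridge = extract_bridge(question, first)
    # Hop 2: retrieve again on the bridge entity.
    second = retrieve(bridge)
    return first + second
```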

What is contextual retrieval and how does it work?

Contextual retrieval, introduced by Anthropic in 2024, prepends a document-level context string to each chunk before embedding and indexing, rather than embedding chunks in isolation. The context, generated by a language model, explains how the chunk relates to the whole document, like "This chunk is from section 3 of the Acme annual report discussing revenue." This preserves document context that standard chunking destroys, improving retrieval on queries that depend on section context. It adds ingest cost but improves retrieval quality substantially.
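The ingest-side change is small; in this sketch, `describe` stands in for the LLM call that situates each chunk within its document:

```python
def contextualize_chunks(doc_title: str, chunks: list, describe) -> list:
    # Prepend an LLM-generated context string to each chunk; the
    # concatenation (not the bare chunk) is what gets embedded and indexed.
    out = []
    for chunk in chunks:
        context = describe(doc_title, chunk)
        out.append(f"{context}\n{chunk}")
    return out
```

Retrieval and generation are unchanged; only the indexed text carries the extra context.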

What is parent document retrieval and when is it useful?

Parent document retrieval stores small chunks for precise embedding-based retrieval but returns the larger parent document or section to the language model after matches are found. This solves the trade-off between precision in retrieval, which favors small chunks, and sufficient context in generation, which favors large passages. It is standard in LlamaIndex and LangChain for document-heavy RAG on reports, textbooks, and legal documents where answers reference surrounding material that small chunks lose.
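A minimal sketch of the pattern, with a toy overlap scorer standing in for embedding search: chunks carry a back-pointer to their parent, and retrieval returns the parent text:

```python
def build_index(parents: dict) -> list:
    # parents maps a parent id to its full text; each small chunk
    # keeps a back-pointer to the parent it came from.
    index = []
    for pid, text in parents.items():
        for chunk in text.split(". "):
            index.append({"chunk": chunk, "parent": pid})
    return index

def parent_retrieve(query: str, index: list, parents: dict, k: int = 1) -> list:
    # Match on the small chunks (precise), return the parents (context-rich).
    q = set(query.lower().split())
    scored = sorted(index, key=lambda e: len(q & set(e["chunk"].lower().split())),
                    reverse=True)
    seen, out = set(), []
    for entry in scored[:k]:
        if entry["parent"] not in seen:
            seen.add(entry["parent"])
            out.append(parents[entry["parent"]])
    return out
```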

What is RAG Fusion and how does it combine queries?

RAG Fusion generates multiple query variations with a language model, retrieves results for each in parallel, then fuses the results using Reciprocal Rank Fusion before passing to the generator. This combines the recall benefits of multi-query retrieval with the score-agnostic fusion of RRF, often improving retrieval quality without score calibration headaches. RAG Fusion was popularized in 2023 and is implemented in most RAG frameworks. It costs more than single-query retrieval but typically improves recall by 10 to 20 percent on hard queries.
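The fusion step itself is simple. This is the standard RRF formula, score(d) = sum over rankings of 1/(k + rank), with k = 60 as the commonly used constant:

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    # rankings: one ranked list of doc ids per query variant.
    # RRF needs only ranks, never raw scores, so no calibration is required.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```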

What is routing in modular RAG?

Routing uses a classifier or language model to direct queries to different retrieval strategies, data sources, or prompt templates based on query intent. For example, factual questions might go to a FAQ index, code questions to a codebase index, and complex reasoning to an agentic workflow. Routing improves efficiency by avoiding unnecessary retrieval for conversational turns that do not need external knowledge, and quality by matching each query to its best-suited retrieval method. LangChain and LlamaIndex both support routing as a first-class pattern.
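A toy keyword router illustrating the shape of the decision; real routers use a trained classifier or an LLM routing prompt, and the index names here are purely illustrative:

```python
def route(query: str) -> str:
    # Hypothetical destinations: a codebase index, a FAQ index,
    # and an agentic workflow for everything complex.
    q = query.lower()
    if any(w in q for w in ("def ", "traceback", "compile", "function")):
        return "codebase_index"
    if q.endswith("?") and len(q.split()) <= 12:
        return "faq_index"
    return "agentic_workflow"
```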

What is Corrective RAG or CRAG?

Corrective RAG, introduced by Yan and colleagues in 2024, evaluates retrieved documents with a lightweight relevance classifier and takes different actions based on confidence. High-confidence retrievals proceed to generation directly, ambiguous cases trigger web search for additional context, and low-confidence cases rewrite the query and retry. This reduces hallucinations caused by irrelevant retrieved context, a major failure mode in naive RAG. CRAG adds classifier overhead but substantially improves answer quality on queries where the corpus is incomplete.
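The three-way branch can be sketched as follows; every callable is an injected stand-in (`grade` plays the role of the lightweight relevance classifier):

```python
def corrective_rag(query: str, retrieve, grade, web_search, rewrite, generate):
    # Sketch of the CRAG control flow over the classifier's verdict.
    docs = retrieve(query)
    confidence = grade(query, docs)
    if confidence == "high":
        context = docs                          # use retrievals as-is
    elif confidence == "ambiguous":
        context = docs + web_search(query)      # supplement with web results
    else:                                       # low: rewrite query and retry
        context = retrieve(rewrite(query))
    return generate(query, context)
```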

What is Self-RAG?

Self-RAG, introduced by Asai and colleagues in 2023, fine-tunes a language model to emit special reflection tokens that control when to retrieve, rate the relevance of retrieved passages, and critique its own answer. The model decides adaptively whether retrieval is needed, which chunks to use, and whether its draft answer is well-grounded. This beats static retrieval pipelines on many benchmarks but requires a specially fine-tuned model with the reflection tokens learned during training, limiting deployment to open-weight models or custom fine-tunes.
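The control flow driven by those reflection tokens looks roughly like this; `model` stands in for the fine-tuned LM, and the token strings are illustrative, not the paper's exact vocabulary:

```python
def self_rag(question: str, model, retrieve):
    # Sketch of the Self-RAG loop: the model's reflection tokens decide
    # whether to retrieve, which passages count as relevant, and whether
    # the draft answer is sufficiently grounded.
    if model(f"[Retrieve?] {question}") == "[Retrieve]":
        passages = [p for p in retrieve(question)
                    if model(f"[Relevant?] {question} :: {p}") == "[Relevant]"]
    else:
        passages = []
    draft = model(f"[Answer] {question} :: {passages}")
    if model(f"[Supported?] {draft}") != "[Supported]":
        draft = model(f"[Answer] {question} :: {passages}")  # regenerate once
    return draft
```

In the real system these decisions come from single tokens sampled by the fine-tuned model, not separate prompts; the sketch only shows where each decision sits in the pipeline.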