Preparing for an LLM engineer interview requires a firm understanding of complex concepts and techniques surrounding large language models. This topic is designed to equip you with the essential knowledge needed to navigate the interview process successfully. By focusing on key areas such as transformer architecture, tokenization, and advanced retrieval techniques, you'll gain valuable insights that can set you apart in a competitive job market.
Inside this learning material, you will find structured sections that cover critical topics including fine-tuning methods, prompt engineering strategies, and LLM evaluation techniques. Each section delves into the intricacies of machine learning frameworks, ensuring you are well-versed in both foundational and advanced concepts. This comprehensive approach will help you build confidence as you prepare for your interviews and enhance your overall skill set in the field.
Utilizing an audio format and spaced repetition learning methods, this material ensures effective retention of knowledge. By engaging with the content, you will reinforce your understanding and improve recall during interviews. Dive in and take your first step towards mastering LLM engineering interviews!
Prepare yourself for LLM engineer interviews with a comprehensive study of crucial concepts, from transformer architecture to advanced retrieval techniques. Gain confidence in your understanding of modern LLM foundations and best practices for production. This topic equips you with vital knowledge to excel in interviews and enhance your career in machine learning.
Retrieval-Augmented Generation, or RAG, is a technique that combines an LLM with an external knowledge source. Instead of relying only on what the model learned during training, RAG retrieves relevant documents at query time and adds them to the prompt. This reduces hallucinations, enables citing sources, and lets the model use information that postdates its training cutoff.
A typical RAG pipeline has four stages. Ingestion: documents are loaded, cleaned, and chunked into passages. Indexing: chunks are embedded and stored in a vector database with metadata. Retrieval: at query time, the user's question is embedded and used to find the most similar chunks. Generation: retrieved chunks are added to the prompt and sent to the LLM with instructions to answer based on them.
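The four stages above can be sketched end to end in a few lines. This is a toy illustration, not a production implementation: the bag-of-words `embed` function and the naive sentence split stand in for a real embedding model and a real chunker, and the in-memory list stands in for a vector database.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion + indexing: split documents into chunks, store vectors with metadata.
docs = {"doc1": "RAG retrieves relevant documents at query time. It reduces hallucinations."}
index = []
for doc_id, text in docs.items():
    for i, chunk in enumerate(text.split(". ")):
        index.append({"vector": embed(chunk), "text": chunk, "doc_id": doc_id, "chunk": i})

# Retrieval: embed the query and rank chunks by similarity.
query = "when does RAG retrieve documents"
ranked = sorted(index, key=lambda e: cosine(embed(query), e["vector"]), reverse=True)

# Generation: the top chunks are prepended to the LLM prompt.
prompt = "Answer using:\n" + "\n".join(e["text"] for e in ranked[:2]) + "\nQ: " + query
```

In an interview, the key point is the separation of concerns: indexing is offline and amortized, while retrieval and generation happen per query.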
Use RAG when knowledge changes frequently, when you need to cite sources, when your data is too large to fit in model weights, or when you need to control which information the model uses. Use fine-tuning when you need consistent output style, domain-specific reasoning patterns, or when you want a smaller model to replicate a larger one. Many production systems combine both.
Chunking is the process of splitting documents into smaller passages before embedding them. Chunking matters because retrieval quality depends heavily on chunk size and boundaries. Chunks that are too small lose context, while chunks that are too large dilute relevance and may exceed embedding model token limits. Good chunking is one of the highest-leverage decisions in a RAG system.
Common strategies include fixed-size chunking by character or token count, recursive chunking that respects document structure like paragraphs and sections, semantic chunking that splits on topic shifts using embeddings, and document-specific chunking that uses native structure like markdown headers or HTML elements. Production systems often combine recursive and semantic strategies.
Chunk overlap is the practice of including the last few tokens of one chunk at the start of the next. Typical overlaps are 10 to 20 percent of chunk size. Overlap helps preserve context across boundaries so that a query about a sentence near a chunk edge can still match. It comes at the cost of duplicated content and slightly larger storage.
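Fixed-size chunking with overlap reduces to simple index arithmetic: each chunk starts `chunk_size - overlap` tokens after the previous one. A minimal sketch, with illustrative sizes (real systems tune these empirically, as discussed below):

```python
def chunk_tokens(tokens, chunk_size=8, overlap=2):
    # Each new chunk repeats the last `overlap` tokens of the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(20)]
chunks = chunk_tokens(tokens)  # 3 chunks; adjacent chunks share 2 tokens
```

With 20 tokens, a chunk size of 8, and an overlap of 2, this yields three chunks whose boundaries share two tokens, which is the 10 to 20 percent overlap range mentioned above.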
Chunk size depends on the embedding model's optimal input length, the granularity of your queries, and the structure of your documents. Common starting points are 256 to 512 tokens for question answering and 1024 to 2048 tokens for summarization. The best size is found empirically by measuring retrieval quality on a representative evaluation set.
Hybrid retrieval combines dense vector search with sparse keyword search like BM25. Dense retrieval captures semantic similarity, while sparse retrieval captures exact term matches. Combining them with techniques like reciprocal rank fusion gives better results than either alone, especially for queries containing rare terms, product names, or technical jargon.
BM25 is a classical sparse retrieval algorithm from the 1990s that ranks documents based on term frequency and inverse document frequency, with adjustments for document length. Despite being decades old, BM25 remains a strong baseline and is essential for matching exact terms. Modern RAG systems combine BM25 with dense embeddings rather than replacing it.
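BM25's scoring function is compact enough to implement directly. The sketch below uses the common IDF variant `log((N - df + 0.5) / (df + 0.5) + 1)` and the standard `k1` and `b` defaults; production search engines use tuned variants of the same formula, and tokenization here is assumed to be done already.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    # docs is a list of token lists; a real system would tokenize consistently.
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term-frequency saturation (k1) and document-length normalization (b).
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["gpu", "memory", "error"], ["the", "cat", "sat"], ["gpu", "gpu", "driver"]]
scores = bm25_scores(["gpu", "error"], docs)
```

Note how the rare term "error" contributes more than the common term "gpu": inverse document frequency is what makes BM25 strong on rare, exact terms.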
Reranking is a second-stage process that takes the candidates returned by initial retrieval and rescores them using a more accurate but slower model, typically a cross-encoder. Initial retrieval finds 50 to 100 candidates quickly, then reranking narrows them to the top 5 to 10 most relevant. Reranking dramatically improves retrieval quality at modest cost.
What is the difference between a bi-encoder and a cross-encoder?

A bi-encoder embeds the query and each document independently, then compares them with cosine similarity. This is fast because document embeddings can be precomputed. A cross-encoder takes the query and document together as input and outputs a relevance score directly. Cross-encoders are more accurate, but their scores cannot be precomputed, making them suitable only for reranking small candidate sets.
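The retrieve-then-rerank pattern from the two answers above can be sketched with toy stand-ins: `bi_encoder_score` and `cross_encoder_score` below are placeholder heuristics, not real models, chosen only to show where each stage runs and over how many documents.

```python
def bi_encoder_score(query, doc):
    # Stand-in for fast bi-encoder similarity: raw token overlap.
    return len(set(query.split()) & set(doc.split()))

def cross_encoder_score(query, doc):
    # Stand-in for a cross-encoder that scores query and doc jointly;
    # here, overlap normalized by document length.
    doc_tokens = doc.split()
    return len(set(query.split()) & set(doc_tokens)) / len(doc_tokens)

def retrieve_then_rerank(query, corpus, k_first=3, k_final=1):
    # Stage 1: cheap scoring over the whole corpus.
    candidates = sorted(corpus, key=lambda d: bi_encoder_score(query, d), reverse=True)[:k_first]
    # Stage 2: expensive scoring over the small candidate set only.
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)[:k_final]

corpus = [
    "reranking improves retrieval quality",
    "reranking is slow but accurate for large candidate sets",
    "cats sleep most of the day",
]
best = retrieve_then_rerank("reranking retrieval quality", corpus)
```

The structure is what matters: the expensive model only ever sees `k_first` documents, so its cost stays constant as the corpus grows.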
Reciprocal Rank Fusion, or RRF, is a method for combining ranked lists from multiple retrievers. Each document's score is the sum, over the lists, of one divided by its rank plus a constant, commonly 60. RRF is simple, nearly parameter-free, and works well in practice without tuning retriever weights. It is the standard way to merge dense and sparse retrieval results.
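RRF is short enough to write out in full. A minimal sketch, using the conventional constant k = 60 and assuming each input list is ordered best-first:

```python
def rrf(ranked_lists, k=60):
    # Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank).
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # ranking from vector search
sparse = ["d1", "d4", "d3"]  # ranking from BM25
fused = rrf([dense, sparse])
```

Here "d1" wins overall because it ranks highly in both lists, even though it tops neither, which is exactly the behavior you want from a fusion method.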
What metadata should you store with chunks in a vector database?
Store metadata that supports filtering, citation, and debugging. Common fields include the source document ID, file name, page number, section heading, chunk index within the document, creation date, author, and any access control tags. Metadata enables filtered search such as "only retrieve from documents updated in the last month" and lets you cite sources in answers.
Metadata filtering restricts vector search to documents matching specific criteria, like a date range, language, or department. It is essential for multi-tenant applications where each user should only retrieve their own documents. Most vector databases support filters either pre-search, narrowing the candidate pool first, or post-search, filtering after vector matching.
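A pre-search filter can be sketched as filtering on metadata before scoring vectors. The index layout, field names like `tenant`, and the dot-product similarity below are all illustrative assumptions; real vector databases expose this through their own filter syntax.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(index, query_vec, filters, top_k=2):
    # Pre-search filtering: narrow the candidate pool by metadata first,
    # then run vector scoring only on the survivors.
    candidates = [e for e in index
                  if all(e["meta"].get(k) == v for k, v in filters.items())]
    return sorted(candidates, key=lambda e: dot(query_vec, e["vector"]), reverse=True)[:top_k]

index = [
    {"vector": [1.0, 0.0], "meta": {"tenant": "a", "doc_id": "d1"}},
    {"vector": [0.9, 0.1], "meta": {"tenant": "b", "doc_id": "d2"}},
    {"vector": [0.0, 1.0], "meta": {"tenant": "a", "doc_id": "d3"}},
]
hits = filtered_search(index, [1.0, 0.0], {"tenant": "a"})
```

Note that "d2" is excluded despite being the second-closest vector: in a multi-tenant system, the filter is a correctness requirement, not an optimization.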
How do you handle document updates in a RAG system?
For document updates, track each chunk's source document ID and version. When a document changes, delete the old chunks and reindex the new content. For frequent updates, schedule incremental sync jobs and use webhooks where possible. Always design ingestion to be idempotent so reruns produce the same state without duplicates.
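The delete-then-reinsert pattern above is what makes reindexing idempotent. A minimal sketch against an in-memory index (a real system would issue the equivalent delete-by-metadata and upsert calls to its vector database):

```python
def upsert_document(index, doc_id, version, chunks):
    # Idempotent reindex: drop every chunk for this doc_id, then insert the new set.
    # Rerunning with the same inputs always yields the same final state.
    index[:] = [e for e in index if e["doc_id"] != doc_id]
    for i, text in enumerate(chunks):
        index.append({"doc_id": doc_id, "version": version, "chunk": i, "text": text})

index = []
upsert_document(index, "d1", 1, ["old text"])
upsert_document(index, "d1", 2, ["new text", "more text"])
upsert_document(index, "d1", 2, ["new text", "more text"])  # rerun: no duplicates
```

Deleting by `doc_id` rather than matching individual chunks is deliberate: it handles documents whose chunk count changed between versions.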
---