RAG & Vector DB Interview: RAG Evaluation, RAGAS, Faithfulness, Retrieval Metrics

Understand RAG evaluation metrics, common pitfalls, and production-level considerations such as latency and caching. These insights are critical for deploying robust RAG systems in practical scenarios.

12 audio · 5:37

How do you evaluate a RAG system?

0:31
RAG evaluation requires measuring both retrieval quality and generation quality separately and end-to-end. For retrieval, use recall at k, precision at k, and mean reciprocal rank on a labeled query-document evaluation set. For generation, measure faithfulness to retrieved context, answer relevance to the query, and optionally answer correctness against reference answers. Frameworks like RAGAS, TruLens, and DeepEval automate these measurements. Build a domain-specific evaluation set of at least 100 to 500 query-answer pairs from real user traffic for meaningful results.
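The separate-then-end-to-end measurement described above can be sketched as a small harness with pluggable scorers. All names below are illustrative, not taken from RAGAS or any other framework:

```python
def evaluate_rag(eval_set, retrieve, generate, retrieval_metric, generation_metric):
    """Score retrieval and generation separately over a labeled eval set.

    eval_set: list of {'query', 'relevant_ids', 'reference_answer'} dicts.
    retrieve: query -> ranked list of document ids.
    generate: (query, ranked_ids) -> answer string.
    retrieval_metric / generation_metric: pluggable scorers returning [0, 1].
    """
    retrieval_scores, generation_scores = [], []
    for ex in eval_set:
        ranked = retrieve(ex["query"])
        retrieval_scores.append(retrieval_metric(ranked, ex["relevant_ids"]))
        answer = generate(ex["query"], ranked)
        generation_scores.append(generation_metric(answer, ex["reference_answer"]))
    n = len(eval_set)
    return {"retrieval": sum(retrieval_scores) / n,
            "generation": sum(generation_scores) / n}
```

In practice the retrieval scorer would be recall at k and the generation scorer an LLM-judged faithfulness or correctness score.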

What is RAGAS and what metrics does it provide?

0:28
RAGAS, or Retrieval-Augmented Generation Assessment, is an open-source framework for evaluating RAG systems using reference-free metrics computed by language models. Its main metrics are faithfulness, which checks whether the answer is grounded in retrieved context, answer relevance, which measures how well the answer addresses the question, context precision, which measures the proportion of retrieved chunks that are relevant, and context recall, which measures whether all necessary information was retrieved. RAGAS requires a judge language model to compute most metrics.

What is faithfulness in RAG evaluation?

0:27
Faithfulness measures whether the generated answer is fully grounded in the retrieved context, with no information added from the model's parametric memory or fabricated. It is computed by extracting claims from the answer and verifying each against the retrieved passages, typically using a language model as judge. Low faithfulness means the model is hallucinating even with context provided, a serious production issue. Faithfulness is often more important than answer correctness in RAG because the whole point is grounding answers in verifiable sources.
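The claim-level computation reduces to a supported-claims fraction. In the sketch below, `supports` is a hypothetical stand-in for the LLM judge call:

```python
def faithfulness(answer_claims, context, supports):
    """Fraction of claims extracted from the answer that are grounded in context.

    answer_claims: atomic claims extracted from the generated answer.
    supports: (claim, context) -> bool; in practice an LLM judge call,
    but any callable works for testing.
    """
    if not answer_claims:
        return 0.0
    verdicts = [supports(claim, context) for claim in answer_claims]
    return sum(verdicts) / len(verdicts)
```

For example, with a toy substring-based `supports`, an answer containing one unsupported claim out of three scores 2/3.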

What is answer relevance and how is it measured?

0:25
Answer relevance measures how well the generated answer addresses the actual question the user asked, separate from whether the answer is correct or grounded. It is often computed by having a language model generate hypothetical questions that the answer could address, then measuring their similarity to the original question. Low answer relevance means the model is answering a different question than asked, often due to misleading retrieved context or prompt issues. High faithfulness with low relevance means the model is technically grounded but off-topic.
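The reverse-question approach can be sketched with plain cosine similarity over embedding vectors; the vectors themselves would come from whatever embedding model the system uses:

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def answer_relevance(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question's embedding and
    embeddings of questions an LLM generated back from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)
```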

What is the difference between context precision and context recall in RAGAS?

0:24
Context precision measures the proportion of retrieved chunks that are actually relevant to the query, indicating whether the retriever returns useful results or drowns the generator in noise. Context recall measures whether all necessary information for the ground-truth answer is present in the retrieved chunks, requiring reference answers to compute. Precision tells you if you retrieved too much irrelevant content, and recall tells you if you retrieved too little relevant content. Both must be high for RAG to work well.
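Both metrics reduce to proportions over judge verdicts. A simplified sketch (note that RAGAS's actual context precision additionally weights verdicts by rank):

```python
def context_precision(chunk_is_relevant):
    """Proportion of retrieved chunks judged relevant to the query.
    chunk_is_relevant: one boolean verdict per retrieved chunk."""
    if not chunk_is_relevant:
        return 0.0
    return sum(chunk_is_relevant) / len(chunk_is_relevant)

def context_recall(claim_in_context):
    """Proportion of ground-truth answer claims found in the retrieved chunks.
    claim_in_context: one boolean verdict per reference-answer claim."""
    if not claim_in_context:
        return 0.0
    return sum(claim_in_context) / len(claim_in_context)
```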

What is retrieval recall at k and why does it matter?

0:30
Retrieval recall at k is the proportion of relevant documents that appear in the top-k retrieved results for a query. For example, recall at 10 of 0.8 means 80 percent of relevant documents are in the top 10. It is the most important retrieval metric because if relevant documents are not retrieved, no amount of reranking or generation can recover them. Measure recall on a labeled evaluation set with at least several dozen queries. Typical production targets are recall at 10 above 0.9 and recall at 5 above 0.8.
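A minimal implementation of the metric as defined above:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = set(ranked_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)
```

With five relevant documents of which four land in the top 10, this returns 0.8, matching the example in the answer.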

What is precision at k in retrieval evaluation?

0:26
Precision at k is the proportion of the top-k retrieved documents that are actually relevant to the query. It measures whether the retriever ranks relevant documents near the top, which matters when only the top few are passed to the generator. Precision at 3 is particularly important in RAG because context budgets often limit the generator to the top three to five chunks. Precision is less critical than recall because irrelevant documents can be filtered by a reranker, but relevant documents missed at the recall stage cannot be recovered.
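A minimal sketch; this version divides by the number of documents actually retrieved rather than a fixed k, so short result lists are not penalized (some definitions always divide by k):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return len(set(top) & set(relevant_ids)) / len(top)
```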

What is Mean Reciprocal Rank (MRR) and when is it used?

0:27
Mean Reciprocal Rank measures how high the first relevant document appears in the ranking, computed as the average of 1 divided by the rank of the first relevant result across queries. An MRR of 1.0 means the first result is always relevant, while 0.5 means the first relevant result sits, on average, around position 2. MRR is useful when users typically look at only the top result or two, such as in FAQ bots or known-item search. For RAG, recall at k is usually more informative because the system uses multiple retrieved chunks.
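The averaging described above in a few lines (queries with no relevant result contribute 0):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant result across queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```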

What is NDCG and how does it evaluate retrieval?

0:28
Normalized Discounted Cumulative Gain, or NDCG, evaluates ranked retrieval results by assigning higher weight to relevant documents at higher ranks, using a logarithmic discount for position. It handles graded relevance judgments where some documents are more relevant than others, not just binary relevant-or-not. NDCG at 10 is a standard metric in information retrieval benchmarks like BEIR and MTEB. It is the most rigorous retrieval metric when you have labeled relevance grades, though binary recall at k is often sufficient for RAG evaluation in practice.
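The log-discounted computation can be sketched as follows. This uses the linear gain formulation (gain equals the relevance grade); some implementations use the exponential form 2^rel − 1 instead:

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: graded relevance discounted by log2 of position."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; pushing a highly relevant document down the ranking lowers the score smoothly rather than in binary steps.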

What is BEIR and what does it measure?

0:30
BEIR, or Benchmarking Information Retrieval, is a heterogeneous benchmark with 18 datasets across domains like scientific papers, biomedical literature, news, and fact checking. It evaluates retrieval models in zero-shot settings, since models are not fine-tuned on each dataset, measuring NDCG at 10 as the primary metric. BEIR revealed that dense retrievers often underperform BM25 in out-of-domain settings, motivating hybrid search as a robust default. It is the standard benchmark for evaluating general-purpose retrieval models.

How do you build a RAG evaluation dataset?

0:30
Collect real user queries from production logs, then label whether the system's retrieved documents and generated answers are correct. For faster coverage, use a language model to generate synthetic question-answer pairs from your corpus, then verify or correct a sample manually. Include diverse query types: factual, multi-hop, comparison, and ambiguous. Target at least 100 examples for initial evaluation and 500 or more for production decision making. Re-run evaluations whenever you change retrieval, chunking, embedding, or prompt configuration.
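A small sanity check over an assembled eval set can enforce the size and query-type coverage targets above. The schema and thresholds here are illustrative:

```python
from collections import Counter

def check_eval_set(examples, min_total=100,
                   required_types=("factual", "multi_hop", "comparison", "ambiguous")):
    """Report size and query-type coverage of a labeled eval set.
    Each example: {'query': str, 'type': str, 'reference_answer': str}."""
    counts = Counter(ex["type"] for ex in examples)
    return {
        "total": len(examples),
        "big_enough": len(examples) >= min_total,
        "missing_types": [t for t in required_types if counts[t] == 0],
    }
```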

What is LLM-as-a-judge and what are its limitations?

0:31
LLM-as-a-judge uses a language model to evaluate outputs of another language model, scoring qualities like faithfulness, relevance, or correctness. It is far cheaper and more scalable than human evaluation and is the backbone of RAGAS and similar frameworks. Limitations include systematic biases, such as preferring longer or more confident answers, sensitivity to prompt wording, and less reliability on judgments requiring domain expertise. Use LLM-as-judge for relative comparisons of system variants, periodically calibrate against human-labeled samples, and do not trust absolute scores without verification.
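The periodic calibration step can be as simple as checking raw agreement between judge verdicts and human labels on a sampled subset; a minimal sketch:

```python
def judge_agreement(judge_verdicts, human_verdicts):
    """Fraction of a calibration sample where the LLM judge's binary verdict
    matches the corresponding human label."""
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("need equal-length, non-empty verdict lists")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

If agreement drifts below a chosen threshold, the judge prompt or model needs revisiting before its scores are used for decisions.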