RAG & Vector DB Interview: RAG Evaluation, RAGAS, Faithfulness, Retrieval Metrics

How do you build a RAG evaluation dataset?

Collect real user queries from production logs, then label whether the system's retrieved documents and generated answers are correct. For faster coverage, use a language model to generate synthetic question-answer pairs from your corpus, then manually verify or correct a sample. Include diverse query types: factual, multi-hop, comparison, and ambiguous. Target at least 100 examples for initial evaluation and 500 or more for production decision-making. Re-run evaluations whenever you change retrieval, chunking, embedding, or prompt configuration.
docs.ragas.io
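The workflow above can be sketched in plain Python. This is a minimal, hypothetical example: the corpus chunks, `make_synthetic_example` helper, and record fields (`question`, `ground_truth`, `contexts`, `needs_review`) are all illustrative assumptions, and the function that would call a language model is stubbed out with a placeholder.

```python
import random

# Hypothetical corpus chunks; in practice these come from your document store.
CORPUS = [
    "RAGAS computes faithfulness by checking answer claims against the retrieved context.",
    "Recall@k measures how often a gold document appears in the top-k retrieved results.",
    "Chunk size affects both retrieval recall and generation faithfulness.",
]

# Diverse query types, as recommended above.
QUERY_TYPES = ["factual", "multi-hop", "comparison", "ambiguous"]


def make_synthetic_example(chunk: str, query_type: str) -> dict:
    """Placeholder for an LLM call that generates a question-answer pair
    from a corpus chunk. A real pipeline would prompt a language model here."""
    return {
        "question": f"[{query_type}] What does this passage state? :: {chunk[:40]}...",
        "ground_truth": chunk,
        "contexts": [chunk],
        "query_type": query_type,
    }


def build_eval_dataset(corpus, n_per_type=2, review_fraction=0.5, seed=0):
    """Generate synthetic examples, then flag a random sample for manual review."""
    rng = random.Random(seed)
    dataset = [
        make_synthetic_example(rng.choice(corpus), qt)
        for qt in QUERY_TYPES
        for _ in range(n_per_type)
    ]
    # Verify or correct a sample manually, per the guidance above.
    for example in rng.sample(dataset, int(len(dataset) * review_fraction)):
        example["needs_review"] = True
    return dataset


dataset = build_eval_dataset(CORPUS)
print(len(dataset), sum(1 for ex in dataset if ex.get("needs_review")))
```

At production scale you would raise `n_per_type` until the dataset passes the 100-to-500-example threshold mentioned above, and version the dataset alongside your retrieval and prompt configuration so every config change triggers a re-run.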