Question

What is LLM-as-a-judge and what are its limitations?

Accepted Answer

LLM-as-a-judge uses a language model to evaluate the outputs of another language model, scoring qualities such as faithfulness, relevance, or correctness. It scales more cheaply than human evaluation and is the backbone of RAGAS and similar frameworks. Its limitations include systematic biases (such as a preference for longer or more confident-sounding answers), sensitivity to prompt wording, and lower reliability on judgments that require domain expertise. Use LLM-as-a-judge for relative comparisons between system variants, calibrate it periodically against human-labeled samples, and do not trust absolute scores without verification.
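To make the calibration step concrete, here is a minimal sketch of how judge verdicts could be checked against a human-labeled sample. It assumes you already have two parallel lists of categorical labels for the same items; the function names (`agreement_rate`, `cohens_kappa`) and the "pass"/"fail" labels are hypothetical, chosen for illustration, and not part of RAGAS or any other framework. Cohen's kappa is one common choice of chance-corrected agreement statistic, not the only one.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative, made-up labels for eight items (not real evaluation data).
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]

raw = agreement_rate(judge, human)
kappa = cohens_kappa(judge, human)
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```

In this made-up sample, both disagreements are cases where the judge passed an answer the human failed. A skew like that is the kind of pattern worth inspecting for the biases mentioned above, for example by checking whether the disputed answers are unusually long. Low agreement on a fresh human-labeled sample suggests the judge prompt or rubric needs revisiting before its scores are used.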