How do you evaluate embedding model quality for retrieval?
RAG & Vector DB Interview: Embeddings, Cosine Similarity, Dimensions, Models Compared
Build a domain-specific evaluation set of query-document pairs where you know which documents are relevant, then measure recall at k, precision at k, and normalized discounted cumulative gain (NDCG). Public benchmarks like MTEB and BEIR give a general signal but rarely match your domain, so a few hundred hand-labeled examples from your real traffic beat any leaderboard score. Always test the full pipeline, since chunking strategy, reranking, and metadata filters can change which embedding model wins. A good model on news data may underperform on legal or medical text.
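The three metrics above can be sketched directly; this is a minimal illustration assuming binary relevance labels (each document is simply relevant or not), with made-up document IDs for the toy example:

```python
import math

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant docs that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def ndcg_at_k(retrieved, relevant, k):
    # Binary-relevance NDCG: each hit is discounted by log2 of its rank,
    # then normalized by the best possible (ideal) ranking.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Toy example: one query's retrieved ranking vs. its hand-labeled relevant set.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 5))  # 2 of 5 results are relevant
print(ndcg_at_k(retrieved, relevant, 5))       # penalized for hits ranked low
```

In practice you would average each metric over all queries in the evaluation set, and rerun the same loop per candidate embedding model with the full pipeline (chunking, reranking, filters) held fixed.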