How does reranker latency affect production RAG systems?
RAG & Vector DB Interview: Hybrid Search, BM25, Rerankers, ColBERT, RRF Explained
Reranker latency scales roughly linearly with the number of candidates reranked, because a cross-encoder scores each query-document pair independently. Reranking 100 candidates with a base-sized cross-encoder typically takes 50 to 200 milliseconds on a GPU and several seconds on a CPU, which can double end-to-end query latency. Common mitigations are capping the candidate set at 50 or fewer, batching all pairs into a single model call, distilling a smaller reranker, and caching reranked results for frequent queries. Managed reranker APIs are another option for low-latency inference.
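As a rough illustration of three of these strategies (capping candidates, batching pairs into one call, and caching results for repeated queries), here is a minimal Python sketch. The names `CachedReranker`, `toy_overlap_scorer`, and the `max_candidates` and `cache_size` parameters are hypothetical and chosen only for this example. The scorer is a token-overlap stand-in, not a real cross-encoder. In practice the `score_batch` callable would wrap whatever model or API is actually in use.

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: takes a batch of (query, document) pairs
# and returns one relevance score per pair, in the same order.
BatchScorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def toy_overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in for a cross-encoder: fraction of query tokens found in the document."""
    scores = []
    for query, doc in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        scores.append(len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0)
    return scores


class CachedReranker:
    """Caps the candidate set, scores all pairs in one batched call, caches results."""

    def __init__(self, score_batch: BatchScorer, max_candidates: int = 50,
                 cache_size: int = 1024) -> None:
        self._score_batch = score_batch
        self._max_candidates = max_candidates
        self.model_calls = 0  # counts batched scorer invocations (cache misses)
        # Cache key is (query, tuple of candidates), so both must be hashable.
        self._rerank_cached = lru_cache(maxsize=cache_size)(self._rerank)

    def _rerank(self, query: str, candidates: Tuple[str, ...]) -> Tuple[str, ...]:
        pairs = [(query, doc) for doc in candidates]
        self.model_calls += 1
        scores = self._score_batch(pairs)  # one call for the whole batch
        ranked = sorted(zip(scores, candidates), key=lambda item: item[0], reverse=True)
        return tuple(doc for _, doc in ranked)

    def rerank(self, query: str, candidates: Sequence[str], top_k: int = 5) -> List[str]:
        # Cost grows with the number of pairs, so truncate before scoring.
        capped = tuple(candidates[: self._max_candidates])
        return list(self._rerank_cached(query, capped)[:top_k])


if __name__ == "__main__":
    reranker = CachedReranker(toy_overlap_scorer, max_candidates=50)
    docs = [
        "cats sleep a lot",
        "reranker latency on gpu",
        "bm25 is a lexical ranking function",
    ]
    print(reranker.rerank("reranker latency", docs, top_k=2))
    print(reranker.rerank("reranker latency", docs, top_k=2))  # served from cache
    print("scorer calls:", reranker.model_calls)
```

The cache here is keyed on the exact query string and candidate tuple, so it only helps when the same query returns the same first-stage candidates. A production system would likely need query normalization and an expiry policy, both of which are left out of this sketch.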
docs.cohere.com